Artificial Intelligence Software and Hardware Platforms
Published in Mazin Gilbert, Artificial Intelligence for Autonomous Networks, 2018
Rajesh Gadiyar, Tong Zhang, Ananth Sankaranarayanan
To help developers better leverage their hardware, vendors have provided a series of inference optimizations for GPUs or CPUs (e.g., Nvidia’s cuDNN library and Intel’s MKL-DNN library). As an example, the latest versions of cuDNN include improvements for small-batch inference obtained by splitting the work along an additional dimension. This reduces the amount of computation per thread block and enables launching significantly more blocks, increasing GPU occupancy and performance [21].
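A minimal CUDA sketch of this idea follows (it is not cuDNN's actual kernel; the kernel name, problem sizes, and the split factor kSplit are illustrative assumptions): splitting the reduction dimension K across an extra grid dimension launches many more, smaller thread blocks for a small batch, which raises SM occupancy.

```cuda
#include <cuda_runtime.h>

// Hypothetical small-batch matrix-vector product with a split-K grid:
// y[b][m] += sum over one K-slice of A[m][k] * x[b][k].
__global__ void small_batch_matvec_splitk(const float* A, const float* x,
                                          float* y, int M, int K, int kSplit) {
    int m      = blockIdx.x * blockDim.x + threadIdx.x;  // output row
    int b      = blockIdx.y;                             // batch index (small)
    int kBegin = blockIdx.z * kSplit;                    // this block's K-slice
    int kEnd   = min(kBegin + kSplit, K);
    if (m >= M) return;

    float partial = 0.f;
    for (int k = kBegin; k < kEnd; ++k)
        partial += A[m * K + k] * x[b * K + k];
    atomicAdd(&y[b * M + m], partial);                   // combine partial sums
}

int main() {
    int M = 4096, K = 4096, batch = 2, kSplit = 512;
    float *A, *x, *y;
    cudaMallocManaged(&A, sizeof(float) * M * K);
    cudaMallocManaged(&x, sizeof(float) * batch * K);
    cudaMallocManaged(&y, sizeof(float) * batch * M);
    // (initialise A and x here; y must start at zero for the atomic accumulation)
    cudaMemset(y, 0, sizeof(float) * batch * M);

    dim3 block(256);
    // Without the split, only ceil(M/256) * batch = 32 blocks would run;
    // splitting K into 8 slices launches 256 blocks and fills more SMs.
    dim3 grid((M + block.x - 1) / block.x, batch, (K + kSplit - 1) / kSplit);
    small_batch_matvec_splitk<<<grid, block>>>(A, x, y, M, K, kSplit);
    cudaDeviceSynchronize();

    cudaFree(A); cudaFree(x); cudaFree(y);
    return 0;
}
```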
Stream Processing Programming with CUDA, OpenCL, and OpenACC
Published in Vivek Kale, Parallel Computing Architectures and APIs, 2019
Each SM can execute one or more thread blocks; however, a thread block cannot run on multiple SMs. The grid is a set of loosely coupled thread blocks (expressing coarse-grained data parallelism), and a thread block is a set of tightly coupled threads (expressing fine-grained data/thread parallelism).
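As a concrete illustration (a hypothetical example, not from the chapter), the kernel below is launched as a grid of loosely coupled thread blocks, each of which owns one tile of the output (coarse-grained parallelism), while the tightly coupled threads inside a block each handle one element of that tile (fine-grained parallelism).

```cuda
#include <cuda_runtime.h>

// Each thread block owns one contiguous tile of the output;
// each thread in the block handles one element of that tile.
__global__ void scale_vector(const float* in, float* out, float alpha, int n) {
    int tile_start = blockIdx.x * blockDim.x;         // which tile (coarse-grained)
    int i = tile_start + threadIdx.x;                 // which element (fine-grained)
    if (i < n) out[i] = alpha * in[i];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));

    dim3 block(256);                                  // tightly coupled threads
    dim3 grid((n + block.x - 1) / block.x);           // loosely coupled blocks
    scale_vector<<<grid, block>>>(in, out, 2.0f, n);  // scheduler maps each block to one SM
    cudaDeviceSynchronize();

    cudaFree(in); cudaFree(out);
    return 0;
}
```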
High-performance attribute reduction on graphics processing unit
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2020
CUDA programs need to explicitly manage the computing and storage resources of a GPU. In NVIDIA GPUs, threads are organised hierarchically and are the basic unit of instruction execution. Groups of 32 threads form a bundle (called a warp), which is the basic unit of thread scheduling. Threads in a warp are executed in Single Instruction Multiple Threads (SIMT) mode, i.e., these threads share a single multithreaded instruction unit. Furthermore, a number of threads constitute a thread block, and a number of thread blocks form a grid. For example, a thread block can contain up to 1024 threads on a Kepler GPU. A kernel thus consists of a grid of one or more thread blocks. The threads in a thread block run concurrently and can cooperate amongst themselves through barrier synchronisation and a per-block shared memory space private to that block.
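A small illustrative example of such cooperation (an assumed sketch, not taken from the article): the threads of one block sum a tile of an array in per-block shared memory, with __syncthreads() barriers separating the reduction steps.

```cuda
#include <cuda_runtime.h>

// Each 256-thread block reduces one 256-element tile of `in` into one
// element of `out`, cooperating through shared memory and barriers.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float tile[256];                 // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // barrier: tile fully loaded

    // Tree reduction within the block; every step is separated by a barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // one partial sum per block
}

int main() {
    const int n = 1 << 16;
    const int blocks = (n + 255) / 256;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    block_sum<<<blocks, 256>>>(in, out, n);     // one partial sum per thread block
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}
```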
A GPU-Accelerated Filtered Density Function Simulator of Turbulent Reacting Flows
Published in International Journal of Computational Fluid Dynamics, 2020
M. Inkarbekov, A. Aitzhan, A. Kaltayev, S. Sammak
Here, blockDim is the number of threads in a block along a specific direction, threadIdx is the index of a thread within its thread block, and blockIdx is the index of a thread block within the thread grid.
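For illustration (a hypothetical kernel, not the simulator's actual code), these built-in variables are typically combined as follows to obtain a global (i, j, k) index on a 3D grid, one cell per thread:

```cuda
#include <cuda_runtime.h>

// Maps each CUDA thread to one cell (i, j, k) of an NX x NY x NZ grid.
__global__ void index_cells(float* field, int NX, int NY, int NZ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index, x direction
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // global index, y direction
    int k = blockIdx.z * blockDim.z + threadIdx.z;  // global index, z direction
    if (i < NX && j < NY && k < NZ)
        field[(k * NY + j) * NX + i] = 0.0f;        // linearised cell address
}

int main() {
    const int NX = 128, NY = 128, NZ = 64;
    float* field;
    cudaMallocManaged(&field, sizeof(float) * NX * NY * NZ);

    dim3 block(8, 8, 8);                            // threads per block, per direction
    dim3 grid((NX + block.x - 1) / block.x,
              (NY + block.y - 1) / block.y,
              (NZ + block.z - 1) / block.z);        // thread blocks per direction
    index_cells<<<grid, block>>>(field, NX, NY, NZ);
    cudaDeviceSynchronize();
    cudaFree(field);
    return 0;
}
```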
An adaptive approach for compression format based on bagging algorithm
Published in International Journal of Parallel, Emergent and Distributed Systems, 2023
Cui Huanyu, Han Qilong, Wang Nianbin, Wang Ye
In this section, we first give the criteria for selecting the sparse matrix types and arrange the data set so that each matrix is used separately on a single GPU. Matrices are selected according to the following criteria:

1. The matrix must fit into the global memory of a single GPU graphics card. In terms of data size, the DIA or ELL compression format of the given sparse matrix (8 bytes per element for double-precision calculations times the DIA/ELL data length) must occupy less than 80% of the available GPU global memory.

2. The values of the non-zero elements in the sparse matrix are not complex.

3. The matrix is square, i.e., the number of rows equals the number of columns. If the matrix is not square, then for matrices with long rows and short columns, SpMV performance gradually increases and then stabilises as the number of columns grows, but increasing the number of columns further precipitates a performance decline; for matrices with short rows and long columns, SpMV performance decreases faster as the number of rows decreases. Therefore, to exclude performance changes caused by anything other than the matrix type, only square matrices are selected.

4. The matrix may be symmetric or asymmetric. This criterion is mainly relevant to the ELL and HYB sparse-matrix compression formats.

5. The total number of rows in the matrix should be at least equal to the warp concurrency, i.e., the ratio of the total number of threads across all streaming multiprocessors (SMs) to the warp size. Each SM can process 2 thread blocks in parallel and each thread block contains 1024 threads, so an SM can process 2048 threads simultaneously (a device-query sketch of this calculation follows below).

6. The matrices cover a variety of matrix types, including diagonal and irregular matrices. These two types are selected mainly for the DIA and CSR compression formats, to determine whether they handle the corresponding type of matrix effectively and to test the prediction accuracy of the adaptive compression format.

The matrix data are taken from the University of Florida sparse matrix collection at https://sparse.tamu.edu/, in mtx file format. Most mtx files contain the rows, columns, and values of the non-zero elements of a sparse matrix; row and column indices are integers, and the non-zero values may be complex, integer, decimal, or zero.
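As a rough illustration of the warp-concurrency criterion (an assumed sketch, not part of the article's toolchain), the host code below queries the device and computes the minimum row count as the total number of resident threads across all SMs divided by the warp size:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Warp concurrency = (threads resident on all SMs) / warp size.
    int threads_per_sm   = prop.maxThreadsPerMultiProcessor;  // e.g. 2048 = 2 blocks x 1024 threads
    int total_threads    = prop.multiProcessorCount * threads_per_sm;
    int warp_concurrency = total_threads / prop.warpSize;

    printf("SMs: %d, threads/SM: %d, warp size: %d\n",
           prop.multiProcessorCount, threads_per_sm, prop.warpSize);
    printf("Minimum matrix rows (warp concurrency): %d\n", warp_concurrency);
    return 0;
}
```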