High-Performance Computing
Published in Dale A. Anderson, John C. Tannehill, Richard H. Pletcher, Munipalli Ramakanth, Vijaya Shankar, Computational Fluid Mechanics and Heat Transfer, 2020
Dale A. Anderson, John C. Tannehill, Richard H. Pletcher, Munipalli Ramakanth, Vijaya Shankar
The principal difference between the two is a result of each being optimized for a different purpose, as shown in Figure 11.3. The multicore CPU architecture consists of general-purpose cores accessing large amounts of data through one or two resident levels of cache, optimized for low latency (the time taken to fetch data from main memory), with control logic allowing out-of-order execution of instructions. The GPU, by contrast, has a very large number of cores with very little cache, optimized for throughput in data-parallel computation. Essentially, multicore CPUs (tens of cores) are general-purpose processors with fast access to memory, while GPUs are massively parallel processors (thousands of cores) with strong arithmetic capability but slow access to main memory. Very impressive performance has been demonstrated for a variety of applications on GPUs (see GPU Gems: Fernando 2004; Pharr 2005; Nguyen 2008). Programming languages for the GPU have evolved substantially since their origins roughly a decade ago, and many easily accessible references, for example, Kirk and Hwu (2017), are available for learning. CUDA and OpenCL are the leading GPU programming platforms, with CUDA proprietary to NVIDIA Corporation and OpenCL an open standard.
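To make the data-parallel model concrete, here is a minimal OpenCL C kernel sketch (illustrative, not taken from the chapter): each work-item computes one array element, so the device can keep thousands of threads in flight to hide its slow access to main memory. The kernel name saxpy and its arguments are assumptions for illustration.

```c
// Minimal OpenCL C kernel sketch (illustrative): each work-item handles
// exactly one array element, so the computation is fully data parallel.
__kernel void saxpy(const float a,
                    __global const float* x,
                    __global float* y)
{
    size_t i = get_global_id(0);  // unique index of this work-item
    y[i] = a * x[i] + y[i];       // independent per-element arithmetic
}
```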
Integration of Graphics Processing Cores with Microprocessors
Published in Tomasz Wojcicki, Krzysztof Iniewski, VLSI: Circuits for Emerging Applications, 2017
Deepak C. Sekar, Chinnakrishnan Ballapuram
OpenCL: This is a standard that provides a framework for parallelizing programs on heterogeneous systems [6]. Programs written using OpenCL can take advantage not only of multiple CPU cores and GPU cores but also of other heterogeneous processors in the system. OpenCL’s main goal is to use all resources in the system and offer superior portability. It uses a data- and task-parallel computational model and abstracts the underlying hardware. Data management is similar to CUDA, in that the application has to explicitly manage data transfers between main memory and device memory.
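A minimal host-side sketch of this explicit data management, assuming a cl_context and cl_command_queue have already been created (the function round_trip and its parameters are illustrative, not part of the text):

```c
#include <CL/cl.h>

/* Sketch of OpenCL's explicit, CUDA-like data management. Assumes ctx and
 * queue were created elsewhere; round_trip is a hypothetical helper. */
void round_trip(cl_context ctx, cl_command_queue queue,
                float* host_data, size_t n)
{
    cl_int err;
    /* Allocate device memory. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, &err);
    /* Host -> device copy; CL_TRUE makes the call block until complete. */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                         n * sizeof(float), host_data, 0, NULL, NULL);
    /* ... enqueue kernels that operate on buf here ... */
    /* Device -> host copy of the results. */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0,
                        n * sizeof(float), host_data, 0, NULL, NULL);
    clReleaseMemObject(buf);
}
```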
Tools and Methodologies for FPGA-Based Design
Published in Juan José Rodríguez Andina, Eduardo de la Torre Arnanz, María Dolores Valdés Peña, FPGAs, 2017
Juan José Rodríguez Andina, Eduardo de la Torre Arnanz, María Dolores Valdés Peña
OpenCL ensures code portability between different computing devices, although performance portability is not guaranteed. It is clear that the computation model underlying the code and the hardware architecture on which it is executed play a crucial role in the resulting performance. In fact, if the code is not written carefully enough, performance can degrade to the point of being worse than that achieved on a single processor.
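One common way such degradation occurs is through the memory access pattern. The following OpenCL C sketch (illustrative, not from the chapter) contrasts a coalesced pattern, where adjacent work-items read adjacent elements, with a strided one that can splinter into many separate memory transactions on GPU hardware:

```c
// Illustrative kernels: the same copy operation with different access
// patterns. Adjacent work-items reading adjacent elements coalesce into
// wide memory transactions on most GPUs.
__kernel void copy_coalesced(__global const float* in,
                             __global float* out)
{
    size_t i = get_global_id(0);
    out[i] = in[i];
}

// Adjacent work-items reading elements `stride` apart may each trigger a
// separate memory transaction; on some devices this runs slower than a
// single well-written CPU loop. `in` must hold global size * stride floats.
__kernel void copy_strided(__global const float* in,
                           __global float* out,
                           const uint stride)
{
    size_t i = get_global_id(0);
    out[i] = in[i * stride];
}
```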
Fast and Energy-Efficient Block Ciphers Implementations in ARM Processors and Mali GPU
Published in IETE Journal of Research, 2022
W. K. Lee, Raphael C.-W. Phan, B. M. Goi
Parallel Work Items. One of the important aspects of optimizing OpenCL programs is the load distribution between the processing elements of a device. OpenCL organizes its thread pool into work-groups and work-items; each work-group contains a certain number of work-items. OpenCL offers a function, clEnqueueNDRangeKernel(), to execute a kernel on a device, with parameters to control the number of work-items and the number of work-groups. Following the suggestion from the Mali OpenCL Developer Guide [9], we set the global_work_size parameter to the total number of parallel threads required to perform the computation and set the local_work_size parameter to NULL. This allows the OpenCL driver to determine the most efficient work-group size for the kernel. This technique works well only for algorithms that do not share data among work-items. Block ciphers operating in CTR mode fall into this category: referring to Figure 5, all encryption processes can be executed in parallel because there is no data dependency between different counter blocks.
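A sketch of the launch described above, with hypothetical names queue, ctr_kernel, and num_blocks (one work-item per counter block); passing NULL for local_work_size lets the driver choose the work-group size:

```c
#include <CL/cl.h>

/* Sketch of the kernel launch described in the text. queue, ctr_kernel,
 * and num_blocks are hypothetical; one work-item encrypts one counter
 * block, so work-items share no data. */
cl_int launch_ctr_kernel(cl_command_queue queue, cl_kernel ctr_kernel,
                         size_t num_blocks)
{
    size_t global_work_size = num_blocks;  /* total parallel threads */
    return clEnqueueNDRangeKernel(
        queue,
        ctr_kernel,
        1,                  /* one-dimensional index space            */
        NULL,               /* no global offset                       */
        &global_work_size,  /* total work-items required              */
        NULL,               /* NULL: driver picks the work-group size */
        0, NULL, NULL);     /* no event dependencies                  */
}
```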
Implementation of the parallel mean shift-based image segmentation algorithm on a GPU cluster
Published in International Journal of Digital Earth, 2019
Fang Huang, Yinjie Chen, Li Li, Ji Zhou, Jian Tao, Xicheng Tan, Guangsong Fan
CUDA and OpenCL are two general-purpose programming interfaces for GPU platforms. As early as 2006, NVIDIA proposed a programming interface for NVIDIA GPUs, namely CUDA (Harish and Narayanan 2007). CUDA is a general-purpose computing architecture based on a new parallel programming model and instruction set, and it can be used to solve many complex computational tasks (Chen et al. 2013; Yan et al. 2014; Deng et al. 2015). However, because GPUs from different vendors are not compatible with each other, parallel algorithms implemented in CUDA cannot run directly on other vendors’ GPUs. In addition, CUDA applications cannot take full advantage of heterogeneous systems that may contain a mixture of CPUs, GPUs, digital signal processing (DSP) units, field-programmable gate arrays (FPGAs), and other components. To overcome the limitations of the CUDA platform, Apple released a heterogeneous programming framework called OpenCL (Open Computing Language; Yamagiwa 2012) in 2008. As an open and royalty-free industry standard framework, OpenCL has been supported by many vendors. OpenCL applications can run on CPUs, GPUs, DSPs, and FPGAs because most vendors of these processors support the OpenCL standard. OpenCL has been developed through its latest version (2.X) by a technical team with developers from IBM, Intel, NVIDIA, and Apple, and is currently maintained by the Khronos Group, an international organization dedicated to developing open standard APIs.
Block-structured compressible Navier–Stokes solution using the OPS high-level abstraction
Published in International Journal of Computational Fluid Dynamics, 2016
Satya P. Jammy, Gihan R. Mudalige, Istvan Z. Reguly, Neil D. Sandham, Mike Giles
To evaluate performance at various grid sizes, the Shu–Osher test case is scaled up to larger grids. The simulations are run until t = 1.8 s, and the total runtime of the solver on the various architectures is reported in Table 2. The CPU and GPU simulations are performed on an Intel CPU and an NVIDIA GPU, respectively, with details given in Table 3. From the total runtimes, it can be concluded that the MPI parallelisation outperforms the OpenMP parallelisation by a factor of 1.5. Of the two GPU implementations, CUDA performs better than OpenCL.