Parallel Computing Architecture Basics
Published in Vivek Kale, Parallel Computing Architectures and APIs, 2019
Over the years, parallel-processing paradigms have evolved together with the architectures that support them. There are three main categories of parallel architecture: shared memory, distributed memory, and parallel accelerators. In a shared memory parallel architecture, the main memory is shared by all processing elements (e.g., the cores of a CPU). In a distributed memory parallel architecture, by contrast, the main memory is distributed across nodes, and each processing element can directly access only its local portion. Over the past decade, the third category, parallel accelerators, such as general-purpose graphics processing units (GPGPUs) and the Intel Xeon Phi, has evolved rapidly.
Hardware implementation
Published in Tomoyoshi Shimobaba, Tomoyoshi Ito, Computer Holography, 2019
In addition to FPGA and GPU architectures, computer holography has been implemented on Xeon Phi (Figure 6.16) and Greatly Reduced Array of Processor Elements with Data Reduction (GRAPE-DR) processors. Intel's Xeon Phi integrates many x86-based cores on a single chip. Reference [174] evaluated the performance of a Xeon Phi processor, comparing diffraction and CGH calculations performed on Xeon Phi, CPU, and GPU. The GRAPE-DR is a many-core processor with 512 processor elements; its performance in CGH calculations has been evaluated in Reference [175].
Parallelizable adjoint stencil computations using transposed forward-mode algorithmic differentiation
Published in Optimization Methods and Software, 2018
J.C. Hückelheim, P.D. Hovland, M.M. Strout, J.-D. Müller
This new method is particularly relevant given the current trend toward massively parallel computation at low clock speeds [4]. Driven by the stall in achievable processor clock rates [27] and by growing concern for energy-efficient computing [12], simulation programs increasingly rely on shared-memory parallelism to run efficiently on new, more power-efficient multicore and many-core architectures [23,30] such as the Xeon Phi or the Nvidia Tesla. A widely used programming model for these machines is OpenMP [10]. The reverse differentiation of distributed-memory parallel code using MPI has been studied for a long time [20,25,28,34] and is mature enough for practical use [32]. However, AD tools have largely ignored shared-memory parallelism, focusing instead on forward- or parallel vector-mode differentiation [5–7], or have resorted to conservative parallelization of the adjoint code using critical sections or atomic operations, thus reducing its scalability [15,16].