Parallel Systems
Published in Vivek Kale, Parallel Computing Architectures and APIs, 2019
Memory can either be shared or distributed (Figure 6.2a and b). Shared memory typically follows one of two architectures: uniform memory access (UMA) or nonuniform memory access (NUMA). Regardless of the specific architecture, shared memory is generally accessible to all processors in a system, and multiple processors may operate independently while continuing to share the same memory. Figure 6.3a illustrates the UMA architecture: multiple CPUs are capable of accessing one memory resource. This is also referred to as a symmetric multiprocessor (SMP). In contrast, NUMA is often built by physically linking two or more SMPs: one SMP is capable of directly accessing the memory of another SMP (see Figure 6.3b).
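This single address space is what makes thread-level programming models such as OpenMP natural on UMA and NUMA machines. The fragment below is a minimal sketch, assuming a C compiler with OpenMP support (e.g. gcc -fopenmp); every thread updates a slice of the same array precisely because all processors see one memory, as the excerpt describes.

```c
/* Minimal shared-memory sketch: all threads operate on one shared array,
 * mirroring the single address space of a UMA/SMP (or NUMA) machine.
 * Assumes an OpenMP-capable compiler, e.g. gcc -fopenmp. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];            /* one array, visible to every thread */

    /* Each thread writes a disjoint slice of the same shared array. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;
    }

    printf("a[N-1] = %.1f, up to %d threads available\n",
           a[N - 1], omp_get_max_threads());
    return 0;
}
```

On a NUMA machine the same code runs unchanged; only the latency of each access depends on which node physically holds the touched pages.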
Distributed Systems
Published in Vivek Kale, Agile Network Businesses, 2017
Memory can either be shared or distributed (Figure 4.1b). Shared memory typically follows one of two architectures: uniform memory access (UMA) or non-uniform memory access (NUMA). Regardless of the specific architecture, shared memory is generally accessible to all the processors in the system, and multiple processors may operate independently while continuing to share the same memory. Figure 4.1b illustrates the UMA architecture: multiple CPUs are capable of accessing one memory resource. This is also referred to as a symmetric multiprocessor (SMP). In contrast, NUMA is often built by physically linking two or more SMPs; one SMP is capable of directly accessing the memory of another SMP.
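The distributed-memory alternative mentioned in the excerpt gives each processor its own private address space, so data must be exchanged explicitly. The sketch below is illustrative only and assumes an MPI installation such as MPICH or Open MPI (build with mpicc, run with at least two ranks, e.g. mpirun -np 2).

```c
/* Minimal distributed-memory sketch: each MPI process owns a private
 * address space, so rank 1 must receive an explicit copy of rank 0's
 * value. Assumes an MPI installation (e.g. MPICH or Open MPI). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double value = 0.0;
    if (rank == 0) {
        value = 3.14;                    /* exists only in rank 0's memory */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 cannot read rank 0's memory directly; it receives a copy. */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %.2f\n", value);
    }

    MPI_Finalize();
    return 0;
}
```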
Parallel Architectures
Published in Pranabananda Chakraborty, Computer Organisation and Architecture, 2020
In fact, multiprocessors are classified by the organisation of their memory systems (shared memory and distributed shared memory) as well as by the interconnection networks (dynamic or static) being used. Centrally shared memory multiprocessors, also known as UMA (uniform memory access) machines, use a limited number of processors located relatively close together and have a single address space. They are often called tightly coupled multiprocessors, or sometimes referred to as parallel processing systems. They are mostly not scalable, or scalable only to a very limited extent, constrained by the bandwidth of the shared memory. If all the CPUs in such a system are identical and each CPU is allowed to execute either OS code or user programs, the system is called a symmetric multiprocessor (SMP). Any communication between the processors in this system usually takes place through the shared memory. When a message is sent from one processor to another, the delay experienced is short and the data rate is high, since the CPU chips are likely to be placed on the same printed circuit board and connected by wires etched into the board. Although these systems can be employed for general-purpose multiuser applications, they tend to be used to work on a single program (or problem) that has already been subdivided into a series of subtasks for parallel execution, using different resources simultaneously to achieve maximum speedup. The other type is distributed shared memory (DSM) multiprocessors, which are often called loosely coupled multiprocessors, sometimes known as scalable shared memory architectures, and may also be referred to as distributed computing systems. These multiprocessors, on the other hand, use the NUMA (nonuniform memory access) mechanism to access physically separated memories (local, global, and remote) that can be addressed as one logically shared address space. However, multiprocessor systems are best suited for general-purpose multiuser applications where the major thrust is on programmability. Shared-memory multiprocessors can form a very cost-effective approach, but the latency of remote memory accesses is considered a major shortcoming. Lack of scalability is also a serious limitation of such systems.
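On a real NUMA machine the local-versus-remote memory distinction described above can be made visible from user code. The sketch below is illustrative only and assumes a Linux system with libnuma installed (link with -lnuma); the node number and buffer size are arbitrary choices, not values from the text.

```c
/* NUMA-awareness sketch: place a buffer on a chosen memory node, making it
 * local to that node's CPUs and remote (slower) for the others.
 * Assumes Linux with libnuma; build with: cc numa_demo.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    int    last_node = numa_max_node();      /* highest NUMA node id */
    size_t bytes     = 64UL * 1024 * 1024;   /* 64 MiB, illustrative size */

    /* Allocate memory physically backed by node 0. */
    double *buf = numa_alloc_onnode(bytes, 0);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return EXIT_FAILURE;
    }

    buf[0] = 42.0;                           /* touch the memory */
    printf("nodes 0..%d, buffer placed on node 0, buf[0] = %.1f\n",
           last_node, buf[0]);

    numa_free(buf, bytes);
    return 0;
}
```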
Implementation of the parallel mean shift-based image segmentation algorithm on a GPU cluster
Published in International Journal of Digital Earth, 2019
Fang Huang, Yinjie Chen, Li Li, Ji Zhou, Jian Tao, Xicheng Tan, Guangsong Fan
Although parallelization of the serial mean shift algorithm achieved a high speedup ratio on heterogeneous platforms, the time consumption of this algorithm on a single-GPU system is still a major barrier to making it really useful (Li and Xiao 2009; Zhou, Zhao, and Ma 2010), especially when the data volume reaches a certain level. In particular, when detecting changes in multi-temporal RS images, a single GPU is simply not powerful enough to finish the task in a timely manner. Because of the excellent cost-to-performance ratio of GPU-based heterogeneous systems, many researchers have in recent years carried out research on GPU clusters. For instance, Zhang et al. (2010) used an MPI + OpenMP + CUDA hybrid programming model to accelerate high-resolution molecular dynamics simulations of proteins on an eight-node GPU cluster and achieved very good speedup. To solve data localization problems in a large-scale parallel fast Fourier transform (FFT) algorithm, Chen, Cui, and Mei (2010) implemented the Peking University FFT (PKUFFT) algorithm, which transformed 512 GB of 3D data on a 16-node GPU cluster; in comparisons with the FFTW and Intel MKL libraries, they achieved speedups of 24.3 and 7, respectively. In a study on parallel programming interfaces for GPU clusters, Fan, Qiu, and Kaufman (2008) proposed a common programming framework for GPU clusters called Zippy. Zippy used the GA (Nieplocha et al. 2006), Cg, and OpenGL libraries together with CUDA to provide non-uniform memory access and a two-level parallel mechanism that resolves data inconsistencies in GPU memory across the cluster. In 2009, Lawlor developed the cudaMPI library for general-purpose computing using the MPI + CUDA hybrid programming model (Lawlor 2009). This library provided an application programming interface (API) similar to the MPI communication interface, but was available for NVIDIA GPU clusters only. Kim et al. (2012) proposed a common programming framework for CPU/GPU heterogeneous clusters using MPI + OpenCL hybrid programming to reduce the complexity of GPU cluster programming and to solve the problems of poor maintainability and poor portability of GPU cluster applications. SnuCL, by packaging MPI communication functions within an OpenCL API, provides users with an API that supports GPU cluster communication (Kim et al. 2012). SnuCL scales well on small and medium-sized GPU clusters.
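As a rough illustration of the MPI + CUDA hybrid model these studies rely on, the sketch below shows only the skeleton: each MPI rank binds to one GPU on its node and would process one tile of the image there. It is not the authors' implementation; it assumes MPI and the CUDA runtime are installed, that the file is compiled with nvcc together with an MPI wrapper, and the mean shift kernel itself is deliberately omitted.

```c
/* Skeleton of an MPI + CUDA hybrid program (illustrative sketch only).
 * Each MPI rank selects one GPU on its node; per-tile kernel work and
 * data movement are indicated by comments rather than implemented. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Bind this rank to a GPU; with several ranks per node, round-robin
     * over the devices visible on that node. */
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > 0) {
        cudaSetDevice(rank % ngpus);
    }

    /* Each rank would now copy its image tile to the GPU (cudaMemcpy),
     * launch the mean shift kernel, and copy the segmented tile back;
     * those steps are omitted in this sketch. */
    printf("rank %d of %d bound to GPU %d (of %d on this node)\n",
           rank, size, ngpus > 0 ? rank % ngpus : -1, ngpus);

    /* Per-rank results would then be gathered (e.g. MPI_Gatherv) and
     * stitched into the final segmented image on rank 0. */
    MPI_Finalize();
    return 0;
}
```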