Shared Memory Architecture
Published in Vivek Kale, Parallel Computing Architectures and APIs, 2019
A shared memory system is composed of multiple processors and memory modules connected via some interconnection network. Shared memory multiprocessors are usually bus based or switch based. In all cases, each processor has equal access to the global memory shared by all processors. Communication among processors is achieved by writing to and reading from the shared memory, and synchronization among processors is achieved using locks and barriers. The chapter discussed the main challenges of shared memory systems, namely contention and cache coherence. Local caches are typically used to alleviate the contention bottleneck; however, scalability remains the main drawback of a shared memory system. Moreover, the introduction of caches creates consistency problems among the caches and between the caches and memory. Cache coherence schemes can be subdivided into snooping protocols and directory-based protocols.
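As a minimal sketch of the communication and synchronization pattern the excerpt describes (not taken from the chapter, which discusses the hardware level), the following POSIX threads program lets several threads communicate through a shared counter, using a lock for mutual exclusion and a barrier so that no thread proceeds until all have arrived. The thread count and variable names are illustrative.

/* Build on Linux with: gcc -pthread example.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                 /* shared memory all threads read and write */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    pthread_mutex_lock(&lock);           /* lock: mutual exclusion on the shared data */
    counter++;
    pthread_mutex_unlock(&lock);

    pthread_barrier_wait(&barrier);      /* barrier: wait until every thread has updated */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("counter = %ld\n", counter);  /* deterministically NTHREADS */
    pthread_barrier_destroy(&barrier);
    pthread_mutex_destroy(&lock);
    return 0;
}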
Interprocess Communication Primitives in POSIX/Linux
Published in Ivan Cibrario Bertolotti, Gabriele Manduchi, Real-Time Embedded Systems, 2017
Ivan Cibrario Bertolotti, Gabriele Manduchi
If processes are created to carry out collaborative work, it is necessary that they share memory segments in order to exchange information. With threads, every memory segment other than the stack is shared, so it suffices to use static variables to exchange information; the memory allocated for a child process, by contrast, is by default separate from the memory used by the calling process. We have in fact seen in Chapter 2 that in operating systems supporting virtual memory (e.g., Linux), different processes access different memory pages even when using the same virtual addresses, and that this is achieved by setting the appropriate values in the Page Table at every context switch. The same mechanism can, however, be used to provide controlled access to segments of shared memory by setting appropriate values in the page table entries corresponding to the shared memory pages, as shown in Figure 2.8 in Chapter 2. The definition of a segment of shared memory is done in Linux in two steps: a segment of shared memory of a given size is created via the system routine shmget(); then a region of the virtual address space of the process is “attached” to the shared memory segment via the system routine shmat().
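The short C sketch below illustrates the two steps on Linux (error handling trimmed, not an excerpt from the book): shmget() creates the segment, shmat() attaches it to the caller's address space, and a child created with fork() inherits the attachment, so parent and child see each other's writes.

#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* Step 1: create a shared memory segment of a given size. */
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);

    /* Step 2: attach a region of the virtual address space to it. */
    int *shared = (int *)shmat(shmid, NULL, 0);
    *shared = 0;

    if (fork() == 0) {                /* child: writes into the shared page */
        *shared = 42;
        shmdt(shared);
        _exit(0);
    }
    wait(NULL);                       /* parent: observes the child's update */
    printf("value written by child: %d\n", *shared);

    shmdt(shared);
    shmctl(shmid, IPC_RMID, NULL);    /* remove the segment */
    return 0;
}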
Efficient dynamic detection of data races for multi-core software
Published in Amir Hussain, Mirjana Ivanovic, Electronics, Communications and Networks IV, 2015
Definition 1. Given two access events e_t and e_u to a shared memory location from two distinct thread segments t and u, respectively, if the two events are not synchronized (i.e., neither C_t ⊑ C_u′ nor C_u′ ⊑ C_t′) and at least one of the events is a write, there exists a data race between e_t and e_u.
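As a sketch only (the paper's primed clocks suggest a more refined segment-level comparison), the "not synchronized" test of Definition 1 reduces to a pointwise vector-clock comparison; the fixed clock width N and the names below are illustrative, not taken from the paper.

#include <stdbool.h>

#define N 8                           /* number of threads (assumed fixed here) */

typedef struct { unsigned c[N]; } vclock_t;

/* a ⊑ b : every component of a is less than or equal to that of b */
static bool vc_leq(const vclock_t *a, const vclock_t *b)
{
    for (int i = 0; i < N; i++)
        if (a->c[i] > b->c[i])
            return false;
    return true;
}

/* Race between accesses from segments t and u: the segments are unordered
 * in both directions and at least one of the accesses is a write. */
static bool is_race(const vclock_t *Ct, const vclock_t *Cu,
                    bool t_writes, bool u_writes)
{
    bool unsynchronized = !vc_leq(Ct, Cu) && !vc_leq(Cu, Ct);
    return unsynchronized && (t_writes || u_writes);
}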
Development and Optimisation of a DNS Solver Using Open-source Library for High-performance Computing
Published in International Journal of Computational Fluid Dynamics, 2021
Hamid Hassan Khan, Syed Fahad Anwer, Nadeem Hasan, Sanjeev Sanghi
Direct Numerical Simulation of turbulence is a highly expensive computational task; therefore, parallel computation across multiple processes reduces the computational burden. The architecture of parallel computers is categorised as shared- or distributed-memory systems. The shared-memory system employs interprocessor communication through global shared memory, as shown in Figure 3(a). Programming on the shared-memory architecture using OpenMP is simple, with no explicit data exchange required because all processors access the same memory. The OpenMP directives are used for loop parallelisation, and the other parts remain sequential. However, the disadvantage of the shared-memory system is poor scalability, because the memory access time increases as processors are added. In contrast, the distributed-memory system, consisting of multiple nodes (computers), is the most common architecture for simulating computational fluid dynamics problems. The distributed-memory system shown in Figure 3(b) comprises autonomous processors, each with an individual memory bus, while data are explicitly exchanged between processors over a message-passing network. Despite the parallelisation complexity of distributed memory, its superior scalability makes the architecture more attractive for high-performance computing. Programming on distributed-memory systems is typically implemented with the Message Passing Interface (MPI).
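A minimal sketch of the shared-memory approach described above (not taken from the solver; array names are illustrative): a single OpenMP directive parallelises the loop, and all threads read and write the same arrays in global shared memory.

/* Build with: gcc -fopenmp example.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];   /* shared by all threads */

    #pragma omp parallel for          /* iterations split among the threads */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];           /* every thread sees the same a, b, c */

    printf("threads available: %d\n", omp_get_max_threads());
    return 0;
}

The distributed-memory counterpart would instead decompose the arrays across MPI ranks and exchange boundary data explicitly with calls such as MPI_Send and MPI_Recv.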
Implicit discrete ordinates discontinuous Galerkin method for radiation problems on shared-memory multicore CPU/many-core GPU computation architecture
Published in Numerical Heat Transfer, Part B: Fundamentals, 2021
During the past decades, computation devices have gone through revolutionary developments. With per-chip clock speeds approaching their limit, increasing the number of processing units has become the major means of improving hardware performance. Modern computers, including not only workstations and computation nodes of high-performance computing (HPC) clusters but also personal computers such as laptops and desktops, are all equipped with multiple central processing unit (CPU) cores (typically in the range of 4–32). All the CPU cores within the same machine have the same access to all the data and share the global memory, forming a shared-memory system. Aside from CPUs, modern computers are usually equipped with one or more graphics processing units (GPUs) as accelerators. Since each GPU has thousands of processing threads, it is essentially a many-core architecture. The prevailing architecture for modern computers is therefore a shared-memory, multicore CPU/many-core GPU heterogeneous one.
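A small sketch (not from the article) that queries the two levels of parallelism this paragraph describes, assuming the CUDA runtime for the GPU side; it can be built with nvcc -Xcompiler -fopenmp.

#include <cuda_runtime.h>
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Shared-memory CPU side: threads visible to OpenMP on this machine. */
    printf("CPU threads visible to OpenMP: %d\n", omp_get_max_threads());

    /* Many-core GPU side: devices and their streaming multiprocessors. */
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int d = 0; d < ndev; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d: %s, %d multiprocessors\n",
               d, prop.name, prop.multiProcessorCount);
    }
    return 0;
}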
A GPU-Accelerated Filtered Density Function Simulator of Turbulent Reacting Flows
Published in International Journal of Computational Fluid Dynamics, 2020
M. Inkarbekov, A. Aitzhan, A. Kaltayev, S. Sammak
Here, l is the particle id, and XP and YP hold the locations of the particles. MY_ID_I and MY_ID_J are the functions that return the values of I and J corresponding to XP and YP, respectively. Lines 4–5 are the implementation of Equation (14), where UP is the interpolated value for particle l, U contains the modes of the interpolating function (e.g. velocity) and P is the Legendre function. To calculate each value of UP, it is necessary to load the values of U, XP and YP from global memory. To reduce the number of global memory accesses, XP and YP are read from global memory into Shx and Shy, which are defined in local memory located in the L1 cache. It can also be noticed that the values of U are read from global memory a number of times equal to the average number of particles per DG element. For more efficient access, one can take advantage of shared memory: in this case, the Nm values of U are copied into shared memory, so that all the threads of the block have fast access to these data.
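The kernel below is a hedged sketch of this shared-memory staging, not the authors' code: one thread per particle, with the Nm modal coefficients of U for the block's DG element copied once into shared memory so every thread of the block reads them from fast on-chip storage instead of global memory. It assumes particles are grouped so that one block handles the particles of one element; the names U, XP, UP and Nm follow the excerpt, the rest is illustrative, and a 1D Legendre basis stands in for the real 2D one.

#include <cuda_runtime.h>

__device__ double legendre(int n, double x)      /* P_n(x) by the Bonnet recurrence */
{
    double p0 = 1.0, p1 = x;
    if (n == 0) return p0;
    for (int k = 2; k <= n; k++) {
        double p2 = ((2.0 * k - 1.0) * x * p1 - (k - 1.0) * p0) / k;
        p0 = p1; p1 = p2;
    }
    return p1;
}

__global__ void interpolate(const double *U,     /* modal coefficients, Nm per element */
                            const double *XP,    /* particle coordinates (element-local) */
                            double *UP,          /* interpolated value per particle */
                            int Nm, int npart)
{
    extern __shared__ double Ush[];              /* Nm coefficients of this block's element */
    int elem = blockIdx.x;                       /* one DG element per block (assumption) */
    for (int m = threadIdx.x; m < Nm; m += blockDim.x)
        Ush[m] = U[elem * Nm + m];               /* single global read per coefficient */
    __syncthreads();

    int l = blockIdx.x * blockDim.x + threadIdx.x;   /* particle id */
    if (l < npart) {
        double xp = XP[l];                       /* per-thread copy, kept in a register */
        double up = 0.0;
        for (int m = 0; m < Nm; m++)
            up += Ush[m] * legendre(m, xp);      /* all reads of U now hit shared memory */
        UP[l] = up;
    }
}

A launch such as interpolate<<<nElem, particlesPerElement, Nm * sizeof(double)>>>(U, XP, UP, Nm, npart) would size the dynamic shared buffer; the one-element-per-block particle layout is an assumption made for the sketch.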