Lattice Boltzmann Method
Published in Young W. Kwon, Multiphysics and Multiscale Modeling, 2015
For LBM calculations conducted on the CPU, the AoS layout is most appealing because it lets the CPU read sequential memory locations when fetching the data for a particular lattice point. This makes memory transfers efficient and uses the memory cache hierarchy effectively. In contrast, most LBM implementations on the GPU use the SoA approach. With SoA, when data are loaded from memory within a kernel, each thread in a given warp reads from consecutive memory locations, as illustrated in Figure 3.36. When loads are coalesced in this fashion, the data are transferred from memory in a single transaction; a similar condition holds for store operations. This conforms to the guidance to ensure coalesced memory access. As a rule of thumb, using the AoS approach on the GPU penalizes achievable performance by a factor of approximately two.
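To make the layout difference concrete, the following CUDA sketch (not taken from the book; the lattice size, the kernel name copy_soa, and the launch configuration are illustrative) indexes a D2Q9-style distribution array in SoA order, f[q * N + site], so that consecutive threads in a warp read consecutive addresses and their loads coalesce:

```cuda
#include <cuda_runtime.h>

// Illustrative sizes only: N lattice sites, Q = 9 distributions per site.
constexpr int N = 1 << 20;
constexpr int Q = 9;

// SoA layout: f[q * N + site]. Thread "site" and thread "site + 1" read
// adjacent addresses for every q, so each warp's accesses coalesce into a
// single memory transaction. The AoS layout f[site * Q + q] would instead
// place neighbouring threads Q floats apart.
__global__ void copy_soa(const float* __restrict__ f_in,
                         float* __restrict__ f_out)
{
    int site = blockIdx.x * blockDim.x + threadIdx.x;
    if (site >= N) return;
    for (int q = 0; q < Q; ++q)
        f_out[q * N + site] = f_in[q * N + site];
}

int main()
{
    float *f_in = nullptr, *f_out = nullptr;
    cudaMalloc(&f_in,  N * Q * sizeof(float));
    cudaMalloc(&f_out, N * Q * sizeof(float));

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    copy_soa<<<blocks, threads>>>(f_in, f_out);
    cudaDeviceSynchronize();

    cudaFree(f_in);
    cudaFree(f_out);
    return 0;
}
```

Swapping the indexing to f[site * Q + q] in the same kernel is the AoS variant the excerpt warns about: the arithmetic is identical, but each warp then issues strided accesses that the hardware must split into multiple transactions.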
Shared Memory Architecture
Published in Vivek Kale, Parallel Computing Architectures and APIs, 2019
Most desktop processors have a three-level cache hierarchy, consisting of a first-level (L1) cache, a second-level (L2) cache, and a third-level (L3) cache. All these caches are integrated onto the chip area. For the L1 cache, split caches (i.e., an instruction cache to store instructions and a separate data cache to store data) are typically used; for the remaining levels, unified caches are standard.
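This organisation can be inspected directly on many systems. The following host-side C++ sketch (not part of the chapter; it assumes a Linux machine exposing the sysfs cache interface) lists each cache level of core 0 together with its type, making the split L1 and unified L2/L3 structure visible:

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Read the first line of a sysfs attribute file; empty string if absent.
static std::string read_line(const std::string& path)
{
    std::ifstream f(path);
    std::string s;
    std::getline(f, s);
    return s;
}

int main()
{
    const std::string base = "/sys/devices/system/cpu/cpu0/cache/index";
    for (int i = 0; ; ++i) {
        std::string dir   = base + std::to_string(i) + "/";
        std::string level = read_line(dir + "level");
        if (level.empty()) break;                       // no more cache indices
        std::cout << "L" << level << " "
                  << read_line(dir + "type") << " "     // Data / Instruction / Unified
                  << read_line(dir + "size") << "\n";
    }
    return 0;
}
```

On a typical desktop processor this prints separate L1 Data and L1 Instruction entries followed by unified L2 and L3 entries, matching the hierarchy described above.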
Write energy reduction of STT-MRAM based multi-core cache hierarchies
Published in International Journal of Electronics Letters, 2019
Figures 8 and 9 depict the static and dynamic energy of the STT-MRAM-based L2 cache. At lower technology nodes, more than 90% of the total energy consumption is static. The read energy of STT-MRAM is almost equal to that of SRAM, but its total write energy is much higher. We use a write buffer to minimise the total number of writes to the L1 and L2 caches; as a result, the total dynamic energy is reduced and comes very close to that of SRAM. The major benefit of using STT-MRAM is that more than 90% of the static energy of the L2 cache is saved, as shown in Figure 8. This makes STT-MRAM-based cache memory a very attractive option for optimising total energy. However, due to the large write energy and latency, the dynamic energy of the STT-MRAM-based cache is 12.53% and 26% higher than that of the SRAM-based cache hierarchy for the PARSEC and SPLASH-2 benchmarks, respectively. Our design mitigates the impact of STT-MRAM's high write latency and energy on dynamic power and reduces the overall energy consumption compared with a pure SRAM-based design. Figure 10 shows the normalised execution time of the PARSEC and SPLASH-2 benchmarks on the four-core x86 architecture. We implemented the cache architecture with two schemes: a baseline using SRAM-based L1 and L2 caches with write buffers, and a second using STT-MRAM-based L1 and L2 caches with SRAM-based write buffers. The execution times obtained for the STT-MRAM-based cache hierarchy are normalised to those of the SRAM-based cache hierarchy. Observing Figure 10, the performance of the SRAM- and STT-MRAM-based cache hierarchies is similar for most of the benchmarks. The normalised execution time is improved by 19.33% compared with the SRAM-based cache hierarchy.
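The intuition behind the write buffer can be sketched with a toy model (this is illustrative only and is not the simulator or buffer organisation used in the paper): repeated writes to the same line coalesce in a small SRAM buffer, so only one expensive write reaches the STT-MRAM array when a line is finally evicted or flushed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iostream>
#include <vector>

// Toy FIFO write buffer: dirty line addresses are held in SRAM and only
// written to the STT-MRAM array on eviction or flush.
struct WriteBuffer {
    std::size_t capacity;                 // number of line-sized entries
    std::deque<std::uint64_t> lines;      // buffered dirty line addresses
    std::uint64_t sttmram_writes = 0;     // writes that reach the STT-MRAM array

    explicit WriteBuffer(std::size_t cap) : capacity(cap) {}

    void write(std::uint64_t line_addr) {
        if (std::find(lines.begin(), lines.end(), line_addr) != lines.end())
            return;                       // coalesced: line already buffered
        if (lines.size() == capacity) {   // buffer full: evict oldest line
            ++sttmram_writes;
            lines.pop_front();
        }
        lines.push_back(line_addr);
    }

    void flush() { sttmram_writes += lines.size(); lines.clear(); }
};

int main() {
    WriteBuffer wb(8);
    std::vector<std::uint64_t> trace = {0, 1, 0, 2, 1, 0, 3, 2, 1, 0};
    for (auto a : trace) wb.write(a);
    wb.flush();
    std::cout << trace.size() << " CPU writes -> "
              << wb.sttmram_writes << " STT-MRAM writes\n";
    return 0;
}
```

In this toy trace, ten processor-side writes collapse to four writes of unique lines into the STT-MRAM array, which is the effect that keeps the dynamic energy of the hybrid hierarchy close to that of SRAM.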
Optimization strategies for GPUs: an overview of architectural approaches
Published in International Journal of Parallel, Emergent and Distributed Systems, 2023
Alessio Masola, Nicola Capodieci
A typical GPU has three levels of data caches. A third-level LLC (Last Level Cache) is usually found in integrated SoCs. For instance, in recently released NVIDIA embedded GPUs the L3 is partitioned between the CPU and the iGPU and is mostly used as a victim cache [9]. In Intel GPUs, the L3 is a general-purpose low-latency memory that is part of the overall coherence domain with the CPU host [10]; hence it can be dynamically configured to be used as a cache shared among the computing clusters and/or the CPU. The application developer's ability to reconfigure a level of GPU cache is also observable in NVIDIA devices, although at a higher level within the cache hierarchy, i.e. the L1 data cache. More specifically, a developer targeting NVIDIA hardware can configure a variable percentage of each compute cluster's L1 to act as a scratch-pad memory, meaning that a variable portion of the on-chip cache can be directly indexed, as opposed to the transparent way in which the hardware mediates memory accesses through caching. From a programming point of view, relying only on a larger L1 is definitely easier, whereas explicitly orchestrating memory accesses within the shared (scratch-pad) memory requires a more involved coding style. In this context, Li et al. [11] published a study in which performance and power metrics are observed over a restricted set of simulated benchmarks, with the goal of understanding the trade-off between hardware- and software-managed memory accesses. Depending on the specific task at hand, proper usage of shared memory leads to better performance and significantly lower power consumption. It is also important to highlight that shared memory not only requires significantly more implementation effort, but also has to be sized carefully. This is because the shared memory available in each compute cluster is divided among the blocks of threads executing there in parallel, and this per-block usage determines what is known in the literature as the occupancy factor. Therefore, carelessly sized shared memory can limit the number of parallel threads scheduled simultaneously within the same compute cluster [12].
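The trade-off can be sketched in CUDA (an illustrative example, not taken from the article; the kernel name, tile size, and carveout percentage are arbitrary): the statically declared shared-memory tile consumes part of the per-cluster budget, the carveout attribute hints how the unified L1/shared storage should be split, and the occupancy query reports how many blocks of this kernel can be resident per compute cluster.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block stages TILE floats in shared (scratch-pad) memory before
// using them; this footprint counts against the per-SM shared-memory budget.
constexpr int TILE = 256;

__global__ void scale_with_tile(const float* in, float* out, int n, float s)
{
    __shared__ float tile[TILE];          // statically sized scratch-pad
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();
    if (i < n) out[i] = s * tile[threadIdx.x];
}

int main()
{
    // Hint to the runtime that half of the unified L1/shared storage should
    // be carved out as shared memory for this kernel (a hint, not a guarantee).
    cudaFuncSetAttribute(scale_with_tile,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);

    // Occupancy consequence of the kernel's resource usage: how many blocks
    // of TILE threads can be resident on one multiprocessor at a time.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  scale_with_tile, TILE, 0);
    std::printf("resident blocks per SM: %d\n", blocks_per_sm);
    return 0;
}
```

Enlarging the tile (or requesting dynamic shared memory per block) lowers the reported blocks per multiprocessor, which is exactly the occupancy limitation the excerpt warns about when shared memory is sized carelessly.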