Caching Technologies
Published in Weidong Wu, Packet Forwarding Technologies, 2007
The 2 × 2 SEs are assumed to operate at 200 MHz. Each visit to an SE thus takes 5 ns (which is feasible because the SEs employ on-chip SRAM, whose access time can be as low as 1 ns when its size is small, as is the case for SEs). Each SE cache also incorporates a victim cache to hold blocks evicted from the cache by conflict misses. A victim cache is a small, fully associative cache [16] intended to retain replaced blocks so that they are not lost; entries in the victim cache are replaced according to a conventional replacement mechanism. When a packet is checked against an SE cache, its corresponding victim cache is examined simultaneously. A hit, if any, occurs in either the cache itself or its victim cache, but never in both. The victim cache normally contains four to eight entries and can effectively improve hit rates by avoiding most conflict misses.
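As an illustration of the lookup path described above, the following host-side C++ sketch models a direct-mapped SE cache probed in parallel with a small, fully associative victim cache; a line evicted by a conflict is saved in the victim cache and swapped back into the main cache on a victim hit. The 64-set main cache, the 4-entry victim cache, and FIFO replacement are illustrative assumptions, not parameters taken from the book.

#include <cstdint>
#include <cstdio>

// Illustrative sizes only: a 64-set direct-mapped SE cache backed by a
// 4-entry, fully associative victim cache with FIFO replacement.
constexpr int SETS        = 64;
constexpr int VICTIM_WAYS = 4;

struct Line { uint32_t tag; bool valid; };

struct SECache {
    Line primary[SETS]       = {};  // direct-mapped main cache
    Line victim[VICTIM_WAYS] = {};  // small fully associative victim cache
    int  fifo                = 0;   // next victim-cache slot to replace

    // Probe the main cache and the victim cache simultaneously; a block can
    // live in at most one of them, so at most one of the two probes can hit.
    bool lookup(uint32_t addr) {
        uint32_t set = addr % SETS, tag = addr / SETS;

        if (primary[set].valid && primary[set].tag == tag)
            return true;                                  // main-cache hit

        for (int w = 0; w < VICTIM_WAYS; ++w) {
            if (victim[w].valid && victim[w].tag == tag) {
                // Victim-cache hit: swap the block back into the main cache
                // and keep the displaced main-cache line in the victim cache.
                Line displaced = primary[set];
                primary[set]   = { tag, true };
                victim[w]      = displaced;
                return true;
            }
        }

        // Miss in both: fetch into the main cache; the conflict-evicted line
        // (if any) is saved in the victim cache instead of being discarded.
        if (primary[set].valid) {
            victim[fifo] = primary[set];
            fifo = (fifo + 1) % VICTIM_WAYS;
        }
        primary[set] = { tag, true };
        return false;
    }
};

int main() {
    SECache c;
    // Addresses 0 and 64 map to the same set and would normally thrash a
    // direct-mapped cache; the victim cache turns the re-references into hits.
    uint32_t trace[] = { 0, 64, 0, 64, 0 };
    for (uint32_t a : trace)
        printf("addr %3u -> %s\n", a, c.lookup(a) ? "hit" : "miss");
    return 0;
}

Run on the short trace above, the sketch reports two cold misses followed by three hits, which is exactly the conflict-miss behavior the victim cache is meant to absorb.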
Memory Organization for Low-Energy Embedded Systems
Published in Christian Piguet, Low-Power Processors and Systems on Chips, 2018
Another class of logical partitioning techniques exploits buffer insertion along the I-cache, the D-cache, or both to realize a form of cache parallelization. Such schemes can be regarded as a partitioning solution because the buffers and the caches are part of the same level of the hierarchy. In this kind of architecture, data and instructions are explicitly replicated, so redundancy is an intrinsic feature of these approaches; energy is saved by reducing the average cost of a memory access through a higher cache-hit ratio. One solution uses the buffer as a victim cache that is accessed on a main-cache miss. On a buffer hit, the line is moved into the cache and returned to the CPU, while the line it replaces in the cache is moved to the victim cache. On a buffer miss, the lower level of the hierarchy is accessed and the fetched datum is also copied into the main cache, while the replaced cache line is again moved to the victim cache. In practice, the victim cache serves as an overflow buffer for the main cache.

A similar approach has been introduced by Bahar et al. [2], where buffers are used for speculation: every cache access is marked with a confidence level derived from the processor state; blocks fetched on high-confidence misses are placed in the main cache, while those fetched on low-confidence misses are kept in the buffers. Other techniques place a small associative buffer (e.g., 32 entries), called the noncritical buffer, in parallel with the L1 cache to "protect" the cache from being filled with noncritical (i.e., potentially polluting) data; noncritical data are identified at run time by monitoring the issue rate of the core. An alternative solution filters the data to be stored in the main cache through a small, highly associative cache close to the L1 cache. Unlike the victim cache (where data are kept before being discarded), this annex cache stores the data read from memory, which are copied into the main cache only on subsequent references to those data.
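To make the contrast with the victim cache concrete, the toy C++ sketch below models the annex-cache fill policy described above: a first reference fills only the annex, and a block is promoted into the main cache only when it is referenced again. Unbounded sets stand in for the real finite, highly associative structures and the replacement policy is omitted; these are simplifying assumptions for illustration only, not details from the chapter.

#include <cstdint>
#include <cstdio>
#include <unordered_set>

// Toy model of an annex-cache fill policy: the main cache is only filled
// with blocks that have already proven to be reused at least once.
struct AnnexFiltered {
    std::unordered_set<uint32_t> mainCache;   // L1-like main cache
    std::unordered_set<uint32_t> annexCache;  // small filter cache next to L1

    const char* access(uint32_t block) {
        if (mainCache.count(block)) return "main hit";
        if (annexCache.count(block)) {
            // Second reference: promote the block, so the main cache only
            // receives data that is actually reused.
            annexCache.erase(block);
            mainCache.insert(block);
            return "annex hit -> promoted to main";
        }
        // First reference: fill the annex only; the main cache is untouched,
        // which shields it from streaming / single-use (polluting) data.
        annexCache.insert(block);
        return "miss -> filled into annex";
    }
};

int main() {
    AnnexFiltered c;
    uint32_t trace[] = { 7, 8, 7, 7, 9 };  // block 7 is reused, 8 and 9 are not
    for (uint32_t b : trace)
        printf("block %u: %s\n", b, c.access(b));
    return 0;
}

In this trace the single-use blocks 8 and 9 never reach the main cache, whereas a victim cache would instead be populated with whatever the main cache evicts.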
Optimization strategies for GPUs: an overview of architectural approaches
Published in International Journal of Parallel, Emergent and Distributed Systems, 2023
Alessio Masola, Nicola Capodieci
A GPU typically features up to three levels of data caches; a third-level LLC (Last-Level Cache) is usually observed in integrated SoCs. For instance, in recently released NVIDIA embedded GPUs, the L3 is partitioned between the CPU and the iGPU and is mostly used as a victim cache [9]. In Intel GPUs, the L3 is a general-purpose, low-latency memory that is part of the overall coherence domain with the CPU host [10]; hence, it can be dynamically configured as a cache shared among the computing clusters and/or the CPU. The application developer's ability to reconfigure a level of GPU cache is also observable in NVIDIA devices, although at a higher level of the cache hierarchy, i.e. the L1 data cache. More specifically, a developer targeting NVIDIA hardware can configure a variable percentage of each compute cluster's L1 to act as a scratch-pad memory, meaning that a variable portion of the on-chip cache can be indexed directly, as opposed to the transparent way in which the hardware mediates memory accesses through caching.

From a programming point of view, relying only on a larger L1 is definitely easier, whereas explicitly orchestrating memory accesses within the shared (scratch-pad) memory requires a more involved coding style. In this context, Li et al. [11] published a study in which performance and power metrics are observed on a restricted set of simulated benchmarks, with the goal of understanding the trade-off between hardware- and software-managed memory accesses. Depending on the specific task at hand, proper use of shared memory leads to better performance and significantly lower power consumption. It is also important to highlight that shared memory not only requires significantly more implementation effort, but must also be sized carefully at run time. This is because the shared memory of a compute cluster is divided among the thread blocks executing there in parallel, and the amount requested per block contributes to what is known in the literature as the block occupancy factor. A carelessly sized shared-memory allocation can therefore limit the number of threads that can be scheduled simultaneously within the same compute cluster [12].
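The CUDA sketch below illustrates both points for NVIDIA hardware: cudaFuncSetAttribute with cudaFuncAttributePreferredSharedMemoryCarveout requests a larger scratch-pad share of the unified L1/shared-memory storage (a hint honored on Volta-class and later GPUs), and the kernel stages data in __shared__ memory explicitly instead of relying on the transparent L1. The kernel itself, the tile size, and the 75% carveout value are illustrative assumptions and are not taken from the article.

#include <cuda_runtime.h>
#include <cstdio>

constexpr int TILE = 256;  // elements staged per thread block (assumption)

// Each block stages a tile of the input in shared memory (explicitly indexed
// scratch-pad), then every thread reads its neighbours from the tile instead
// of issuing extra global-memory loads that would otherwise go through L1/L2.
__global__ void smooth3(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2];            // +2 halo elements
    int g = blockIdx.x * TILE + threadIdx.x;    // global index
    int l = threadIdx.x + 1;                    // local index inside the tile

    tile[l] = (g < n) ? in[g] : 0.f;
    if (threadIdx.x == 0)        tile[0]        = (g > 0)     ? in[g - 1] : 0.f;
    if (threadIdx.x == TILE - 1) tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.f;
    __syncthreads();                            // tile is now fully populated

    if (g < n) out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.f;
}

int main() {
    // On Volta-and-later parts the L1/shared split is configurable per kernel:
    // ask the driver to prefer a larger shared-memory carveout, expressed as a
    // percentage of the unified L1/shared storage. This is a hint, not a
    // guarantee.
    cudaFuncSetAttribute(smooth3, cudaFuncAttributePreferredSharedMemoryCarveout, 75);

    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    // Each block statically requests TILE * sizeof(float) bytes of shared
    // memory; the larger this request, the fewer blocks fit on one compute
    // cluster at a time, i.e. the lower the occupancy.
    smooth3<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[1] = %f\n", out[1]);  // (0 + 1 + 2) / 3 = 1.0

    cudaFree(in);
    cudaFree(out);
    return 0;
}

Because the per-block shared-memory request is deducted from the fixed budget of the compute cluster, enlarging TILE directly reduces how many blocks can be resident at once, which is the occupancy effect discussed above.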