Ten Advanced Optimizations of Cache Performance

Hritvik Taneja
5 min read · Feb 25, 2019


This blog covers ten advanced optimizations for caches. The content is based on the book Computer Architecture: A Quantitative Approach.

Compiler Optimizations to Reduce Miss Rate

Compiler optimizations are one of the more obvious ways to reduce the miss rate. A basic rule of thumb is to access memory contiguously. Since 2-D arrays are stored in memory in row-major order, we should not access them in column-major order, because doing so reduces spatial locality. One such compiler optimization is known as the blocking technique. This optimization improves temporal locality in the accesses to a 2-D array. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices, or blocks. Consider the following code for matrix multiplication:

for (i = 0; i < N; i = i + 1)
{
    for (j = 0; j < N; j = j + 1)
    {
        r = 0;
        for (k = 0; k < N; k = k + 1)
        {
            r = r + y[i][k] * z[k][j];
        }
        x[i][j] = r;
    }
}

Since the cache cannot hold all three N-by-N matrices, these scattered accesses lead to misses in the first-level cache. In the worst case, 2N³ + N² memory words are accessed for N³ operations. To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B by B. The two inner loops now compute in steps of size B rather than over the full length of x and z.

for (jj = 0; jj < N; jj = jj + B)
{
    for (kk = 0; kk < N; kk = kk + B)
    {
        for (i = 0; i < N; i = i + 1)
        {
            for (j = jj; j < min(jj + B, N); j = j + 1)
            {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                {
                    r = r + y[i][k] * z[k][j];
                }
                x[i][j] = x[i][j] + r;
            }
        }
    }
}

The total number of memory words accessed is 2N³/B + N², an improvement of roughly a factor of B. Blocking thus exploits a combination of spatial and temporal locality: y benefits from spatial locality and z benefits from temporal locality.

(Figure: performance of the blocked and unblocked versions for different block sizes and matrix sizes.)
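As a rough, illustrative calculation with assumed values N = 512 and B = 64: the unblocked version touches about 2·512³ + 512² ≈ 269 million words, while the blocked version touches about 2·512³/64 + 512² ≈ 4.5 million, roughly a 60× reduction.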

Small and Simple First-Level Caches to Reduce Hit Time and Power

The pressure of both a fast clock cycle and power limitations encourages limited size for first-level caches. Likewise, using a lower level of associativity can reduce both hit time and power consumption, because fewer cache lines must be read and compared on each access.
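To see where the hit-time and power savings come from, consider the sketch below. It is a minimal, illustrative model with assumed parameters (a 32 KiB cache, 64-byte blocks, 4-way set associative): an n-way set-associative lookup must read and compare n tags (and read n data blocks) in parallel, so a smaller cache and lower associativity mean less work per access.

#include <stdio.h>

/* Illustrative address breakdown for a cache lookup; all sizes are assumptions. */
int main(void)
{
    unsigned cache_bytes = 32 * 1024;  /* 32 KiB first-level cache (assumed) */
    unsigned block_bytes = 64;         /* 64-byte blocks (assumed) */
    unsigned ways        = 4;          /* associativity (assumed) */

    unsigned sets        = cache_bytes / (block_bytes * ways);   /* 128 sets */
    unsigned offset_bits = 6;                                    /* log2(64) */
    unsigned index_bits  = 0;
    for (unsigned s = sets; s > 1; s >>= 1)
        index_bits = index_bits + 1;                             /* log2(sets) */

    printf("sets = %u, offset bits = %u, index bits = %u\n",
           sets, offset_bits, index_bits);
    /* Every access reads 'ways' tags and 'ways' data blocks in parallel and
       performs 'ways' comparisons; lower associativity means fewer of each. */
    printf("tag reads and compares per access = %u\n", ways);
    return 0;
}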

Way Prediction to Reduce Hit Time

In way prediction, extra bits are kept in the cache to predict the way, or block within the set, of the next cache access. This prediction means the multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle, in parallel with reading the cache data. A miss results in checking the other blocks for matches in the next clock cycle.
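A minimal software sketch of the idea follows (the set count, associativity, block size, and data structures are illustrative, not a real hardware design): each set keeps prediction bits naming the way to try first, a correct prediction completes with a single tag comparison, and a misprediction costs an extra cycle to check the remaining ways and retrain the predictor.

#define WAYS        4
#define SETS        64
#define BLOCK_BYTES 64

typedef struct
{
    unsigned long tag[WAYS];
    int           valid[WAYS];
    int           predicted_way;       /* the extra prediction bits kept per set */
} cache_set_t;

static cache_set_t cache[SETS];

/* Toy lookup: returns the number of cycles taken in this model,
   1 on a correct way prediction, 2 otherwise; *hit reports hit or miss. */
int lookup(unsigned long addr, int *hit)
{
    unsigned long block = addr / BLOCK_BYTES;
    cache_set_t  *set   = &cache[block % SETS];
    unsigned long tag   = block / SETS;

    int w = set->predicted_way;
    if (set->valid[w] && set->tag[w] == tag)
    {
        *hit = 1;                      /* only one tag comparison this cycle */
        return 1;
    }
    for (w = 0; w < WAYS; w = w + 1)   /* next cycle: check the other ways */
    {
        if (set->valid[w] && set->tag[w] == tag)
        {
            set->predicted_way = w;    /* retrain the predictor */
            *hit = 1;
            return 2;
        }
    }
    *hit = 0;                          /* true miss */
    return 2;
}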

Pipelined Cache Access to Increase Cache Bandwidth

The critical timing path in a cache hit is the three-step process of addressing the tag memory using the index portion of the address, comparing the read tag value to the address, and setting the multiplexor to choose the correct data item if the cache is set associative. This optimization is simply to pipeline cache access so that the effective latency of a first-level cache hit can be multiple clock cycles, giving fast clock cycle time and high bandwidth but slow hits.
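The trade-off is easier to see with a toy timing model (purely illustrative, with an assumed three-cycle hit): a new access can start every cycle, so bandwidth stays at one access per cycle even though each individual hit now takes three cycles instead of one.

#include <stdio.h>

/* Toy model of a 3-stage pipelined cache hit: stage 1 indexes the tag
   memory, stage 2 compares the tag, stage 3 steers the data through the
   multiplexor. One new access enters the pipeline each cycle. */
int main(void)
{
    for (int a = 0; a < 5; a = a + 1)
    {
        int start  = a;        /* access a enters the pipeline in cycle a   */
        int finish = a + 2;    /* its data is available two cycles later    */
        printf("access %d: cycles %d to %d (hit latency = 3 cycles)\n",
               a, start, finish);
    }
    /* In steady state one access completes per cycle, which is the
       bandwidth gain; the cost is the slower, multi-cycle hit. */
    return 0;
}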

Nonblocking Caches to Increase Cache Bandwidth

For pipelined computers that allow out-of-order execution, the processor need not stall on a data cache miss; for example, it could continue fetching instructions from the instruction cache while waiting for the data cache to return the missing data. A nonblocking cache, or lockup-free cache, escalates the potential benefits of such a scheme by allowing the data cache to continue to supply cache hits during a miss. This “hit under miss” optimization reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the processor.
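One common way to realize this is with miss status handling registers (MSHRs) that track the outstanding misses; the sketch below is an illustrative software model with assumed sizes and a stubbed-out tag check, not a real design. A miss allocates an MSHR and returns immediately, so later hits (and further misses to already-pending blocks) can still be serviced.

#define NUM_MSHRS   8
#define BLOCK_BYTES 64

typedef struct
{
    int           in_use;
    unsigned long block_addr;           /* block address of the outstanding miss */
} mshr_t;

static mshr_t mshrs[NUM_MSHRS];

/* Stand-in for the real tag check; always misses in this sketch. */
static int cache_probe(unsigned long block) { (void)block; return 0; }

/* Returns 1 if the request was handled without blocking (a hit, or a miss
   that obtained an MSHR); returns 0 if every MSHR is busy and the cache
   must finally stall the processor. */
int cache_access(unsigned long addr)
{
    unsigned long block = addr / BLOCK_BYTES;

    if (cache_probe(block))
        return 1;                        /* hit under miss: still serviced */

    for (int i = 0; i < NUM_MSHRS; i = i + 1)   /* miss to an already-pending block */
        if (mshrs[i].in_use && mshrs[i].block_addr == block)
            return 1;

    for (int i = 0; i < NUM_MSHRS; i = i + 1)   /* allocate a free MSHR, keep going */
        if (!mshrs[i].in_use)
        {
            mshrs[i].in_use = 1;
            mshrs[i].block_addr = block;
            return 1;
        }

    return 0;                            /* all MSHRs busy: must stall */
}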

Multi-banked Caches to Increase Cache Bandwidth

Rather than treating the cache as a single monolithic block, we can divide it into independent banks (as is done in DRAM) that can support simultaneous accesses. To spread the accesses across all the banks, a mapping of addresses to banks that works well is to spread the addresses of the block sequentially across the banks, known as sequential interleaving.
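A sketch of that sequential interleaving follows (the bank count and block size are assumptions): consecutive block addresses map to consecutive banks, so a run of sequential accesses is spread across all banks and can proceed in parallel.

#include <stdio.h>

#define NUM_BANKS   4
#define BLOCK_BYTES 64

/* Sequential interleaving: block address modulo the number of banks. */
static unsigned bank_of(unsigned long addr)
{
    unsigned long block = addr / BLOCK_BYTES;
    return (unsigned)(block % NUM_BANKS);
}

int main(void)
{
    /* Eight consecutive blocks map to banks 0, 1, 2, 3, 0, 1, 2, 3. */
    for (unsigned long addr = 0; addr < 8 * BLOCK_BYTES; addr += BLOCK_BYTES)
        printf("block at address %4lu -> bank %u\n", addr, bank_of(addr));
    return 0;
}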

Critical Word First and Early Restart to Reduce Miss Penalty

This technique is based on the observation that the processor normally needs just one word of the block at a time.
Critical word first: Request the missed word first from memory and send it to the processor as soon as it arrives; let the processor continue execution while filling the rest of the words in the block.
Early restart: Fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the processor and let the processor continue execution.
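The two differ only in the order in which the block's words are transferred, as the toy model below shows (an eight-word block and the miss position are assumptions): critical word first starts the transfer at the missed word and wraps around, while early restart transfers in normal order and simply resumes the processor when the missed word arrives.

#include <stdio.h>

#define WORDS_PER_BLOCK 8

/* Print the order in which words of a block arrive when the processor
   missed on word 'critical'. */
static void show_fill_order(int critical, int critical_word_first)
{
    for (int i = 0; i < WORDS_PER_BLOCK; i = i + 1)
    {
        int word = critical_word_first ? (critical + i) % WORDS_PER_BLOCK : i;
        printf("word %d%s\n", word,
               word == critical ? "  <- processor restarts here" : "");
    }
}

int main(void)
{
    printf("critical word first (miss on word 5):\n");
    show_fill_order(5, 1);
    printf("early restart (miss on word 5):\n");
    show_fill_order(5, 0);
    return 0;
}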

Merging Write Buffer to Reduce Miss Penalty

Write-through caches rely on write buffers, as all stores must be sent to the next lower level of the hierarchy. When the buffer is being filled, the addresses can be checked to see whether the address of the new data matches the address of a valid write-buffer entry. If so, the new data are combined with that entry.
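A minimal sketch of write merging follows (the entry count, block size, word size, and per-word valid bits are assumptions for illustration): a new store first looks for a valid buffer entry covering the same block and, if it finds one, merges into it instead of consuming a new slot.

#include <string.h>

#define ENTRIES     4
#define BLOCK_BYTES 64
#define WORD_BYTES  8
#define WORDS       (BLOCK_BYTES / WORD_BYTES)

typedef struct
{
    int           valid;
    unsigned long block_addr;          /* which block this entry covers */
    unsigned long data[WORDS];
    unsigned char word_valid[WORDS];   /* per-word valid bits for merging */
} wb_entry_t;

static wb_entry_t write_buffer[ENTRIES];

/* Returns 1 if the store was placed (merged or in a free entry),
   0 if the buffer is full and the processor must stall. */
int write_buffer_store(unsigned long addr, unsigned long value)
{
    unsigned long block = addr / BLOCK_BYTES;
    unsigned      word  = (unsigned)((addr % BLOCK_BYTES) / WORD_BYTES);

    for (int i = 0; i < ENTRIES; i = i + 1)        /* try to merge first */
    {
        if (write_buffer[i].valid && write_buffer[i].block_addr == block)
        {
            write_buffer[i].data[word] = value;
            write_buffer[i].word_valid[word] = 1;
            return 1;
        }
    }
    for (int i = 0; i < ENTRIES; i = i + 1)        /* otherwise use a free entry */
    {
        if (!write_buffer[i].valid)
        {
            memset(&write_buffer[i], 0, sizeof write_buffer[i]);
            write_buffer[i].valid = 1;
            write_buffer[i].block_addr = block;
            write_buffer[i].data[word] = value;
            write_buffer[i].word_valid[word] = 1;
            return 1;
        }
    }
    return 0;                                      /* buffer full: stall */
}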

Hardware Prefetching of Instructions and Data to Reduce Miss Penalty or Miss Rate

This approach prefetches items before the processor requests them. Both instructions and data can be prefetched, either directly into the caches or into an external buffer that can be more quickly accessed than main memory.
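A simple hardware scheme is next-line (sequential) prefetching into a small stream buffer; the sketch below is an illustrative software model of that policy only, with stubbed-out cache and memory helpers, not a description of any particular prefetcher.

#include <stdio.h>

/* Stand-in helpers so the sketch compiles; a real cache would back these. */
static int  cache_lookup(unsigned long block) { (void)block; return 0; }
static void cache_fill(unsigned long block)   { printf("  fill cache with block %lu\n", block); }
static void memory_fetch(unsigned long block) { printf("  fetch block %lu from memory\n", block); }

/* One-entry stream buffer holding the prefetched next block. */
static unsigned long stream_block = 0;
static int           stream_valid = 0;

void access_block(unsigned long block)
{
    printf("access block %lu\n", block);
    if (cache_lookup(block))
        return;                          /* ordinary cache hit */

    if (stream_valid && stream_block == block)
        cache_fill(block);               /* found in the stream buffer: cheap */
    else
    {
        memory_fetch(block);             /* demand miss: go to memory */
        cache_fill(block);
    }
    memory_fetch(block + 1);             /* prefetch the next sequential block */
    stream_block = block + 1;
    stream_valid = 1;
}

int main(void)
{
    access_block(10);                    /* miss: fetch block 10, prefetch 11 */
    access_block(11);                    /* served from the stream buffer */
    return 0;
}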

Compiler-Controlled Prefetching to Reduce Miss Penalty or Miss Rate

An alternative to hardware prefetching is for the compiler to insert prefetch instructions to request data before the processor needs it. There are two flavors of prefetch: a register prefetch loads the value into a register, while a cache prefetch loads data only into the cache and not into a register.
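On GCC and Clang, a cache prefetch can be expressed with the __builtin_prefetch intrinsic; in the sketch below the prefetch distance of 8 iterations and the arrays are illustrative choices, not a recommendation.

#include <stdio.h>

#define N 1024

static double a[N], b[N];

int main(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i = i + 1)
    {
        a[i] = i;
        b[i] = 2.0 * i;
    }

    for (int i = 0; i < N; i = i + 1)
    {
        if (i + 8 < N)
        {
            /* Cache prefetch: hint that a[i+8] and b[i+8] will be read soon.
               The data is brought into the cache, not into a register. */
            __builtin_prefetch(&a[i + 8], 0, 1);
            __builtin_prefetch(&b[i + 8], 0, 1);
        }
        sum = sum + a[i] * b[i];
    }
    printf("sum = %f\n", sum);
    return 0;
}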

