Understanding Cache Memory: From Basics to Advanced — Part 1

Cache and Carry
May 24, 2023


Introducing Cache and the Memory Hierarchy

In the realm of technology, user demands are often challenging to fulfill. Users want unlimited amounts of fast memory in their devices. This is in itself a paradox: fast memory is expensive, and slow memory is cheap. Is there a way to make slow memory appear fast?

This is where the concept of cache memory comes into play. Acting as a small, high-speed component situated close to the CPU, cache memory serves as a mediator, making main memory appear faster to meet user demands. But how does cache memory accomplish this feat? Let’s delve into the intricacies of this solution…

Memory Hierarchy

It takes around 100 CPU cycles to access main memory, which goes against our desire for a small CPI (cycles per instruction). How can we reduce this latency? It is not feasible to store all of a program's data in the register file, as that would turn the small, fast registers back into large, slow memory. Memories cannot be both large and fast, so there needs to be something in between to reduce the effects of memory latency.

This is the job of the cache: a small amount of fast memory that stores copies of instructions and data from frequently used memory locations.

Cache Basics

Caches are founded on two locality principles: temporal and spatial.

Temporal locality says that recently used items tend to be used again soon (think of a loop counter), and thus guides the cache replacement policy, i.e. what to evict when the cache is full.

Spatial locality, on the other hand, suggests that if an item is used, data close to it in memory is likely to be used soon, e.g. array traversal. Therefore, multiple adjacent pieces of data should be fetched in one transaction.
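As a small illustration (a generic example, not tied to any particular program), a simple array sum exhibits both kinds of locality:

#include <stddef.h>

/* Summing an array: 'total' and 'i' are touched on every iteration
   (temporal locality), while a[0], a[1], ... are adjacent in memory
   (spatial locality), so one fetched cache block serves several
   iterations. */
long sum(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}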

Cache is organised into blocks of one or more words; a block is the smallest unit of memory that can be stored in the cache. A 'cache hit' occurs when the CPU finds the required block in the cache; otherwise a 'cache miss' occurs and the block must be fetched from memory.

A few metrics are useful for understanding the efficiency of caches. Hit rate is the fraction of memory accesses that are found in the cache (the miss rate is the remaining fraction). Hit time is the time to access the cache, including the time to decide whether the access is a hit or a miss. The miss penalty is the time to fetch the missing block from the next level, place it in the cache, and transfer the data to the CPU.
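These metrics are commonly combined into the average memory access time, AMAT = hit time + miss rate x miss penalty. A minimal sketch with made-up example numbers:

#include <stdio.h>

int main(void)
{
    /* Example figures, chosen purely for illustration. */
    double hit_time     = 1.0;    /* cycles to access the cache            */
    double miss_rate    = 0.05;   /* fraction of accesses that miss        */
    double miss_penalty = 100.0;  /* cycles to fetch the block from memory */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);   /* 1 + 0.05 * 100 = 6.0 cycles */
    return 0;
}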

Cache Levels

The cache hierarchy is described by its distance from the CPU. L1 caches are the closest to the CPU, and thus are the smallest and fastest. L1 caches are usually split into separate instruction and data caches. Why? Within the CPU, instructions and data are accessed at different stages of the pipeline, so it makes sense to split them up to allow fast accesses.

The L2 cache is larger and may be shared between several CPU cores; historically it was even external to the chip. If an L3 cache is implemented, the L2 cache is used to hide the latency between the faster CPU and L1 caches on one side, and the slower L3 cache and main memory on the other.

The L3 cache, if implemented, is even larger. It has a higher latency than the lower-level caches, but still much lower than main memory. The last-level cache is used to reduce the latency of memory fetches. It usually has large multi-word blocks (covered later) to reduce memory latency and make the most of the high memory bandwidth: starting a memory access is slow, but once started, memory can deliver a large amount of data, so large multi-word blocks take advantage of this. The last-level cache is also used to facilitate cache coherence (also covered later) between multiple CPUs.

Memory hierarchy diagram. By Avadlam3 — Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=52362401

Cache Architecture

Each cache entry is tagged using the address of the data word in main memory. The cache is then searched for the tag when reading or writing an address. This address can be either virtual or physical, depending on the cache implementation.

Caches take spatial locality further: since neighbouring memory is likely to be used soon, it makes sense to load it into the cache together. So entries in the cache are stored not as single words, but as lines (blocks of adjacent words).

Block sizes are usually a power of two. This reduces the complexity of the circuit, as offsets and cache-line indices can then be computed with bit masking and shifting rather than division.
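For instance (a sketch assuming a hypothetical 64-byte block size), the offset of an address within its block can be computed with a mask, and the block number with a shift, with no division hardware needed:

#include <stdint.h>

#define BLOCK_SIZE 64u               /* must be a power of two (2^6) */

/* Equivalent to addr % BLOCK_SIZE, but only a bitwise AND. */
static inline uint64_t block_offset(uint64_t addr)
{
    return addr & (BLOCK_SIZE - 1);
}

/* Equivalent to addr / BLOCK_SIZE, but only a shift by 6. */
static inline uint64_t block_number(uint64_t addr)
{
    return addr >> 6;
}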

Cache Placement

Fully Associative Cache

Within a fully-associative cache, each cache line can hold a copy of any memory location. To access a cache line, the processor has to compare the tags of every cache line with the tag of the requested address.

Some caches are implemented like this, when the cache is really small (a few dozen entries). However, it is generally impractical: despite the low miss rates, the cache would be costly and slow, as the CPU needs to compare against every entry.
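A rough lookup sketch (hypothetical structure, 32 lines) makes the cost clear: every line's tag must be compared against the requested tag. Hardware performs these comparisons in parallel, which is exactly what makes large fully associative caches expensive.

#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 32                 /* fully associative caches stay small */

struct cache_line {
    bool     valid;
    uint64_t tag;
};

/* Returns the index of the matching line, or -1 on a miss. */
int fa_lookup(const struct cache_line lines[NUM_LINES], uint64_t tag)
{
    for (int i = 0; i < NUM_LINES; i++)
        if (lines[i].valid && lines[i].tag == tag)
            return i;
    return -1;
}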

Direct-mapped Cache

Within a direct-mapped cache, each memory location is mapped to exactly one cache location. The mapping is in principle arbitrary, but usually cache address = memory address % number of cache blocks. As a result, multiple memory locations can map to the same cache address.

A direct-mapped cache that uses modulo to determine the destination block is mapped linearly to memory. As a result, it is a good choice for memory that is accessed sequentially, as any block fetched is likely to be fully used. What kind of memory is accessed sequentially? Instruction memory!

However, direct-mapped caches only work well if the memory addresses used by the program are evenly distributed with respect to the bits used for direct mapping. If they are not (and this is often the case), some cache entries are heavily used and repeatedly evicted, while others are hardly used at all or remain empty. This is known as cache thrashing.
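A quick sketch (hypothetical parameters: 64-byte blocks, 256 blocks, i.e. a 16 KiB cache) shows how two addresses exactly one cache-size apart land on the same line and evict each other on every alternating access:

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 64u
#define NUM_BLOCKS 256u              /* 16 KiB direct-mapped cache */

/* cache index = block number % number of cache blocks */
static uint64_t dm_index(uint64_t addr)
{
    return (addr / BLOCK_SIZE) % NUM_BLOCKS;
}

int main(void)
{
    uint64_t a = 0x10000;
    uint64_t b = a + (uint64_t)BLOCK_SIZE * NUM_BLOCKS;

    /* Both print index 0: alternating accesses to a and b thrash. */
    printf("index(a) = %llu, index(b) = %llu\n",
           (unsigned long long)dm_index(a),
           (unsigned long long)dm_index(b));
    return 0;
}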

Direct-mapped cache is simple and provides fast accesses; however, it also gives higher miss rates due to cache conflicts.

Set Associative Cache

This problem can be solved using set associative caches. A set associative cache of M blocks is broken into N sets, where every set holds the same number of blocks (its ways). They are often described as n-way set associative.

It combines the good features of fully associative and direct-mapped caches, whilst largely avoiding their weaknesses.

Similarly to a direct-mapped cache, the set in which a given memory block is stored is its block address modulo N. Different methods are used to decide which block within the set holds the data; these include random, FIFO, and least recently used (LRU).
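A rough sketch of the lookup (hypothetical structures, 4 ways per set, LRU tracked with a simple timestamp): the cache only searches the few ways of one set, and on a miss the least recently used way is chosen as the victim.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 64u                 /* N sets, illustrative value */
#define WAYS      4u                 /* 4-way set associative      */

struct way {
    bool     valid;
    uint64_t tag;                    /* simplified: full block address */
    uint64_t last_used;              /* timestamp for LRU              */
};

struct cache_set {
    struct way ways[WAYS];
};

/* Returns the way holding the block (hit), or the LRU victim to refill
   (miss). Only one set is searched, not the whole cache. */
int sa_lookup(struct cache_set sets[NUM_SETS], uint64_t block_addr,
              uint64_t now, bool *hit)
{
    struct cache_set *s = &sets[block_addr % NUM_SETS];   /* set = M % N */
    int victim = 0;

    for (unsigned w = 0; w < WAYS; w++) {
        if (s->ways[w].valid && s->ways[w].tag == block_addr) {
            s->ways[w].last_used = now;                   /* refresh LRU */
            *hit = true;
            return (int)w;
        }
        if (s->ways[w].last_used < s->ways[victim].last_used)
            victim = (int)w;                              /* track oldest */
    }
    *hit = false;
    return victim;
}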

Note that a 1-way set associative cache is the same as a direct-mapped cache.

Modern processors generally use associativity levels of up to 24 for L2 and L3 caches. L1 caches are usually 8-way set associative.

Multi-word Blocks

A cache uses multi-word blocks to exploit spatial locality. Blocks of multiple words are fetched from memory together, because data close to the data needed now is likely to be needed soon; this reduces the miss rate.

When memory content is needed by the processor, the entire cache line is loaded into the L1 data cache. The memory address for each cache line is computed by masking the address value according to the cache line size. For a 64 byte cache line, the low 6 bits are zeroed. The discarded bits are used as the offset into the cache line. The remaining bits are used to locate the line in the cache and as the tag.
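A sketch of that split (assuming a hypothetical cache with 64-byte lines and 128 sets, i.e. 6 offset bits and 7 index bits):

#include <stdint.h>

#define LINE_SIZE  64u               /* 2^6 bytes: 6 offset bits */
#define NUM_SETS  128u               /* 2^7 sets:  7 index bits  */

struct decoded_addr {
    uint64_t offset;                 /* byte within the cache line           */
    uint64_t index;                  /* which set to look in                 */
    uint64_t tag;                    /* high bits, stored alongside the line */
};

static struct decoded_addr decode(uint64_t addr)
{
    struct decoded_addr d;
    d.offset = addr & (LINE_SIZE - 1);         /* low 6 bits       */
    d.index  = (addr >> 6) & (NUM_SETS - 1);   /* next 7 bits      */
    d.tag    = addr >> (6 + 7);                /* everything above */
    return d;
}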

When a word is updated, the processor still has to load the entire cache line. It is not possible for a cache to hold partial cache lines.

Due to this, larger block sizes generally lower the miss rate. However, in small caches the opposite can be true: with large blocks the cache holds fewer of them, so useful data is more likely to be evicted and replaced with less useful data. As a result, L1 caches usually have smaller block sizes compared to the last-level cache.

Larger block sizes also increase the transfer time between cache and main memory.

Write Policies

When the CPU writes to the cache, when do we write the modified data to the next level? How do we keep data in the cache and in memory consistent? This is required to maintain cache coherency. Two options are write-back and write-through.

Write-through

Write through involves updating memory whenever the cache is written. This is very simple, and ensures that all cache levels are consistent.

However, memory write bandwidth can become a bottleneck. This can be somewhat mitigated by using a write buffer: the processor writes data into the cache and the write buffer, and a memory controller then writes the contents of the write buffer to memory. This works as long as stores arrive much less often than memory can retire them; otherwise write buffer saturation occurs. The write buffer overflows when there are too many store instructions in a row, or when the CPU cycle time is too short relative to memory speed. The solution is to use a write-back cache or an L2 cache.

Write-back

This involves writing to the cache only, and updating memory when a modified (dirty) block is evicted.
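A toy comparison (illustrative numbers only) of how many memory writes each policy generates for a block that the CPU writes several times before it is evicted:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    int cpu_writes_to_block = 10;    /* CPU stores to the same block */

    /* Write-through: every store also goes to memory. */
    int wt_memory_writes = cpu_writes_to_block;

    /* Write-back: memory is written once, at eviction, and only if
       the block is dirty (has been modified). */
    bool dirty = cpu_writes_to_block > 0;
    int wb_memory_writes = dirty ? 1 : 0;

    printf("write-through: %d memory writes\n", wt_memory_writes);
    printf("write-back:    %d memory write(s)\n", wb_memory_writes);
    return 0;
}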

Conclusions

In this post we introduced the basics of caches, including why they are used and how they are implemented. Next, we will take a closer look at write policies and cache coherency.

Please let me know your thoughts in the comments!
