Why Every Developer Needs to Know About Cache Locality for Efficient Code
First, let’s briefly explore how the CPU accesses data from the various levels of memory; this sets the stage for understanding cache locality.
When the CPU needs to access data or instructions, it first checks the L1 cache, which is the fastest and closest to the CPU. If the data isn’t found there, it moves on to check the L2 and then the L3 caches, which are progressively larger but slower. If the data is not present in any of the caches (a “cache miss”), the CPU retrieves it from the main memory (RAM), which is slower. Once the data is fetched from main memory, it is stored in the cache for quicker access in the future. The CPU then uses the data to continue processing its tasks.
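If you want to see this hierarchy in action, the following is a minimal sketch of a pointer-chasing microbenchmark in C. It is illustrative rather than rigorous: the working-set sizes, hop count, and output format are arbitrary choices, and the exact latencies depend entirely on your hardware. The idea is that once the randomly traversed array no longer fits in a given cache level, the average access time jumps.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Combine two rand() calls so we have enough random bits for large arrays
// (good enough for a demo, not for serious statistics).
static size_t rnd(size_t bound) {
    return (((size_t)rand() << 16) ^ (size_t)rand()) % bound;
}

int main(void) {
    // Working sets from 16 KiB (fits in L1) up to 64 MiB (far beyond L3).
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 4) {
        size_t n = kb * 1024 / sizeof(size_t);
        size_t *next = malloc(n * sizeof *next);
        if (!next) return 1;

        // Build one random cycle (Sattolo's algorithm) so the hardware
        // prefetcher cannot guess the next address.
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rnd(i);
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        // Chase the pointers and time the average access.
        size_t idx = 0;
        const size_t hops = 20 * 1000 * 1000;
        clock_t t0 = clock();
        for (size_t h = 0; h < hops; h++) idx = next[idx];
        double ns = (double)(clock() - t0) / CLOCKS_PER_SEC * 1e9 / hops;

        // Printing idx stops the compiler from deleting the loop.
        printf("%8zu KiB working set: %6.1f ns per access (idx=%zu)\n",
               kb, ns, idx);
        free(next);
    }
    return 0;
}

On typical desktop hardware you would expect step-like increases in the reported latency roughly at the L1, L2, and L3 capacity boundaries.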
For more information, see this article: Cache Memory: CPU’s Best Friend.
What is Cache Locality?
Imagine your desk at work. If you have a bunch of documents that you frequently need to refer to, it’s much faster to keep them within easy reach on your desk rather than having to retrieve them from a filing cabinet each time. This idea of keeping frequently used items close by is similar to how cache locality works in computers.
Types of Cache Locality
Temporal Locality:
- What It Is: This is like having a few documents that you often need to access repeatedly over a short period. Instead of putting them away after each use, you keep them on your desk where you can easily grab them again.
- Why It Matters: In programming, this means that if your code is repeatedly accessing the same piece of data, keeping that data in a fast-access cache (like your desk) rather than retrieving it from slower main memory (like the filing cabinet) speeds things up.
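As a small illustration (the function and its parameters are invented for this sketch), a compact weights array that is reused on every outer iteration becomes cache-resident after the first pass, so all the repeated reads are fast:

#include <stddef.h>

// A small, frequently reused array benefits from temporal locality:
// after the first pass, `weights` stays resident in the cache.
double weighted_total(const double *samples, size_t n_samples,
                      const double *weights, size_t n_weights) {
    double total = 0.0;
    for (size_t i = 0; i < n_samples; i++)
        for (size_t j = 0; j < n_weights; j++)
            total += samples[i] * weights[j];  // warm cache hits after pass 1
    return total;
}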
Spatial Locality:
- What It Is: This is like having a stack of documents where you frequently need to read through them in sequence. If the documents are organized in a way that makes it easy to read through them one after another, you’re working efficiently.
- Why It Matters: In programming, this means if your code accesses memory locations sequentially (like reading through an array), it’s beneficial to have those locations close together in memory. This way, when you access one location, nearby locations are likely to be fetched into the cache as well, making future accesses faster.
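Here is a hedged sketch of the contrast (the names and the stride parameter are illustrative). On a machine with 64-byte cache lines, each line holds 16 ints, so the sequential loop uses every element it fetches, while a loop with stride 16 uses only one int per fetched line:

#include <stddef.h>

// Sequential traversal: every element of each fetched cache line is used.
long sum_sequential(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

// Strided traversal of the same data: each access pulls in a full cache
// line but uses a single int from it, so far more lines are fetched
// per useful element when n is large.
long sum_strided(const int *a, size_t n, size_t stride) {
    long s = 0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            s += a[i];
    return s;
}

Both functions add up the same n elements, but for arrays much larger than the cache, the strided version typically runs several times slower.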
Why is Cache Locality Important?
When your program requests data from memory, the access is either fast (the data is already in the cache) or slow (the data must be fetched from main memory). By taking advantage of cache locality, you ensure that frequently accessed data stays in the fast-access cache and that nearby data is pre-fetched along with it, reducing the time it takes to access information.
How to Improve Cache Locality:
Design for Temporal Locality:
- Practice: Keep frequently used data readily accessible. For example, in a game, if certain game states or variables are used often, keep them in a fast-access memory space.
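For example (a hypothetical sketch; GameState and its fields are invented here), a hot field can be read into a local variable once, updated there, and written back at the end; the compiler will typically keep the local in a register or L1 cache:

#include <stddef.h>

typedef struct {
    int score;
    // ... other, less frequently used fields ...
} GameState;

void apply_events(GameState *gs, const int *events, size_t n) {
    int score = gs->score;        // read the hot field once
    for (size_t i = 0; i < n; i++)
        score += events[i];       // work on the fast local copy
    gs->score = score;            // write back once at the end
}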
Design for Spatial Locality:
- Practice: Organize data in memory so that related pieces of data are stored close to each other. For example, if you have an array of data, access it sequentially rather than jumping around randomly.
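One common layout technique (sketched below with an invented Particle example) is struct-of-arrays: each field is stored contiguously, so a loop that touches only one field reads densely packed cache lines instead of skipping over unused fields:

#include <stddef.h>

// Array-of-structs: the fields of one particle are adjacent, so a loop
// that reads only `x` still drags y, z, and id through the cache.
typedef struct { float x, y, z; int id; } ParticleAoS;

// Struct-of-arrays: each field is stored contiguously, so a pass over
// only the x coordinates uses every byte of every fetched cache line.
typedef struct {
    float *x, *y, *z;
    int   *id;
} ParticlesSoA;

float sum_x(const ParticlesSoA *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];             // sequential, cache-dense access
    return s;
}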
Examples:
Array Traversal
Imagine a simple loop that processes an array of numbers:
for (int i = 0; i < N; i++) {
    process(array[i]);  // sequential access: the next elements share a cache line
}
Explanation: In this example, we traverse an array of N elements sequentially. Due to spatial locality, if array is stored contiguously in memory, accessing array[i] means that the data for array[i+1] is likely already in the cache, leading to faster access times.
Matrix Multiplication
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        result[i][j] = 0;
        for (int k = 0; k < P; k++) {
            // matrix1 is read along a row; matrix2 is read down a column
            result[i][j] += matrix1[i][k] * matrix2[k][j];
        }
    }
}
Explanation:
Spatial Locality:
- Matrix1 Access: Accessing matrix1[i][k] in this code is generally good for cache locality because it iterates through a row (i is fixed, k varies). In row-major order storage (common in many programming languages), elements accessed in this pattern are contiguous in memory, which benefits from spatial locality.
- Matrix2 Access: Accessing matrix2[k][j] can be problematic. If matrix2 is stored in row-major order, accessing it by columns (j is fixed, k varies) can lead to inefficient cache usage. This is because the elements of a column are not contiguous in memory, potentially causing more cache misses.
Cache Performance:
- Matrix1: The access pattern of matrix1 is efficient because it leverages spatial locality: nearby elements are accessed sequentially, which improves cache performance.
- Matrix2: The access pattern of matrix2 can cause inefficiencies because it leads to scattered memory accesses, since the columns of matrix2 are accessed in a non-sequential manner.
Optimization:
- Reordering Loops: To improve cache performance, you might optimize the loop order or use techniques like loop blocking (tiling), as sketched below. This helps ensure that elements of matrix2 that are accessed frequently are loaded into the cache more efficiently, reducing the time spent on cache misses.
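As a hedged sketch (assuming the same N, M, P dimensions as above and row-major double matrices), simply swapping the j and k loops, the classic "ikj" order, turns the matrix2 access into a row-wise sweep:

// "ikj" loop order: the inner loop now walks matrix2[k][j] along a row
// (k fixed, j varying), so both matrix2 and result are accessed
// sequentially, while matrix1[i][k] is reused M times from a register.
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++)
        result[i][j] = 0;                       // zero the output row first
    for (int k = 0; k < P; k++) {
        double a = matrix1[i][k];               // loop-invariant in the inner loop
        for (int j = 0; j < M; j++)
            result[i][j] += a * matrix2[k][j];  // row-wise: spatial locality
    }
}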
Loop tiling (also known as loop blocking) is a program optimization technique whose aim is to improve locality of reference. It modifies the memory access pattern of the loop in order to reuse data already present in the data cache before that data gets evicted and needs to be reloaded from the memory.
Reference: Loop tiling
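And a minimal sketch of what tiling might look like for the same multiplication. The block size B = 64 is a placeholder to tune against your cache sizes, and result is assumed to be zeroed beforehand:

#define B 64  // placeholder tile size; tune so the working tiles fit in cache

for (int ii = 0; ii < N; ii += B)
    for (int kk = 0; kk < P; kk += B)
        for (int jj = 0; jj < M; jj += B)
            // Multiply one pair of tiles. Elements loaded here are reused
            // up to B times before the tile is evicted from the cache.
            for (int i = ii; i < N && i < ii + B; i++)
                for (int k = kk; k < P && k < kk + B; k++) {
                    double a = matrix1[i][k];
                    for (int j = jj; j < M && j < jj + B; j++)
                        result[i][j] += a * matrix2[k][j];
                }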
The given matrix multiplication code demonstrates an example where accessing elements of matrix1 is cache-friendly, while accessing matrix2 can be less efficient due to non-sequential access patterns. By optimizing these access patterns, you can improve cache utilization and overall performance.
In Summary:
Cache locality helps your program run faster by making sure that frequently accessed data is kept close by in fast-access memory. By understanding and applying this concept, you can write more efficient code that takes full advantage of the computer’s caching system.