Data Oriented Design in the Frontend — Part 2: Data Locality

Pablo Weremczuk
8 min read · Apr 15, 2020

--

We are going to discuss the most important concept in DOD: Data Locality.

A Programming Phenomenon

First, we are going to see a programming phenomenon, and then we are going to see why it happens.

So, pretend we have a matrix and we want to visit every element in it. We are going to do it in two different ways: iterating row by row, and iterating column by column.

Those ways of iterating a matrix have names: Row Major Traversal and Column Major Traversal.

BUT! Since I find these names hard to remember, we are going to call these two ways of traversing the matrix "Cache-Friendly" and "Cache-Unfriendly".

There, easier to remember, isn't it? Now, back to business.

The code to traverse the matrices looks like this in JavaScript:

// ROW MAJOR TRAVERSAL (cache friendly)
for(let i = 0; i < MATRIX_SIZE; i++){
    for(let j = 0; j < MATRIX_SIZE; j++){
        result += arr[i][j];
    }
}

And for the second way:

// COL MAJOR TRAVERSAL (cache unfriendly)
for(let j = 0; j < MATRIX_SIZE; j++){
    for(let i = 0; i < MATRIX_SIZE; i++){
        result += arr[i][j];
    }
}
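If you want to reproduce the numbers yourself, here is a minimal benchmark sketch, assuming a square matrix built as an array of row arrays (the MATRIX_SIZE value and the setup code are my own illustration, not the exact code from the repo linked later in the article). You can run it in the browser console or in Node:

const MATRIX_SIZE = 4096;
// arr[i] is row i, arr[i][j] is the element at row i, column j.
const arr = Array.from({ length: MATRIX_SIZE }, () => new Array(MATRIX_SIZE).fill(1));

function time(label, fn) {
  const start = performance.now();
  const result = fn(); // log the sum so the JIT cannot throw the loop away
  console.log(`${label}: ${(performance.now() - start).toFixed(1)} ms (sum = ${result})`);
}

time("row major (cache friendly)", () => {
  let result = 0;
  for (let i = 0; i < MATRIX_SIZE; i++)
    for (let j = 0; j < MATRIX_SIZE; j++) result += arr[i][j];
  return result;
});

time("col major (cache unfriendly)", () => {
  let result = 0;
  for (let j = 0; j < MATRIX_SIZE; j++)
    for (let i = 0; i < MATRIX_SIZE; i++) result += arr[i][j];
  return result;
});

The only thing that changes between the two calls is the order of the loops.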

The code is almost identical; the change is so small that I had to check several times to be sure I didn't mix it up.

Even though the code is so similar, the Row Major Traversal version is 19 times FASTER!!

"Oh, that's because javascript is a horrible language". Well, look what happens when I try the same algorithm in C#, Java, and C.

When we iterate row by row instead of column by column, we get an order-of-magnitude performance gain!

"Oh, but who iterates an array like that?!?!"

Well, before answering that, let's first see WHY this happens.

When the Speed of Light is not Fast Enough

Casey Muratori explains this concept very well in his video Intro to C on Windows, but the video is a little old and I’d like to update the concepts he explains there.

The biggest problem today is the time you need to get information from memory into the CPU registers. Take a look at this image: it is a photo of a Gigabyte TRX40 Aorus Pro motherboard, designed to host a Ryzen Threadripper 3970.

An AMD Ryzen Threadripper is one of the most powerful CPUs you can buy today in a computer shop. WHAT A ROCKET! But there is a detail in the image I want you to notice: the distance between the CPU and the memory bank is 5.69 cm, which means around 11 cm for a round trip.

Distance from the center of the CPU to the center of the memory.

In a hypothetical scenario where we could send a ray of light from the CPU to the memory to fetch data (ignoring the time the memory needs to look up the data), we want to know how long the CPU has to wait to get the data into its registers to process it.

We can work this out by dividing the speed of light by the CPU clock frequency: 300,000 km/s divided by 4.5 GHz tells us that light covers 6.67 cm in one CPU cycle. This is NOT fast enough to bring the data in one CPU cycle! Our data is 11 cm away from the CPU!! The most powerful CPU you can buy cannot bring data from the memory fast enough.
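Here is the same back-of-the-envelope calculation as code (the 4.5 GHz clock is an assumed boost frequency, just for illustration):

// How far does light travel in one CPU cycle?
const SPEED_OF_LIGHT_CM_PER_S = 3e10; // 300,000 km/s expressed in cm/s
const CLOCK_HZ = 4.5e9;               // 4.5 GHz

const cmPerCycle = SPEED_OF_LIGHT_CM_PER_S / CLOCK_HZ;
console.log(cmPerCycle.toFixed(2) + " cm per cycle");                     // ~6.67
console.log((11 / cmPerCycle).toFixed(1) + " cycles for the round trip"); // ~1.6

Even in this best-case fantasy, the round trip alone costs more than one cycle.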

Of course, we don’t use light in a vacuum travelling in a straight line to move data from the memory to the CPU; we use electricity through copper traces that have to take long detours around the electronic components on the motherboard. Also, we have a LOT of additional delay waiting for the memory circuitry to look up the address we requested, collect the data, and send it back. To minimize this problem, the CPU uses a cache.

In this context, the cache is a small piece of memory that happens to be very fast. It is inside the CPU and has several levels and sizes.

In the following image, we can see how long the processor needs to wait for the data, depending on which cache level the data is located in. Think of “cycles” as the amount of time the CPU cannot work on your problem because it is waiting for data.

Latency from different levels of cache compared to memory.

When I was working on video games (back in the days when Flash was a thing), the square root was a very expensive operation.

The square root is a very common operation because you need it to calculate distances, so we came up with very weird tricks to avoid using it.
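A classic example of such a trick (a hypothetical sketch, not code from any specific game): if you only need to know whether two objects are within some radius, compare squared distances and skip the square root entirely.

// dist(a, b) < radius  is equivalent to  distSquared(a, b) < radius * radius
function distSquared(ax, ay, bx, by) {
  const dx = ax - bx;
  const dy = ay - by;
  return dx * dx + dy * dy;
}

function isWithinRadius(ax, ay, bx, by, radius) {
  return distSquared(ax, ay, bx, by) < radius * radius; // no Math.sqrt needed
}

Now, let's compare how many cycles the most expensive CPU operations take.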

We can see here that the floating-point square root takes "only" between 8 and 21 cycles. If we compare that with the roughly 200 cycles we spend bringing data from main memory, these operations are not so bad.

Actually, there are very few operations that are close to 200 cycles in the worst-case scenario!

The point here is that it makes more sense to optimize how the code accesses data than to keep looking for better ways to perform operations.

This was nicely explained in detail by Mike Acton in his CppCon talk “Data-Oriented Design and C++”.

Back to The Matrices

Let's see what this cache thing has to do with iterating matrices.

Imagine that we want to bring a single byte from memory. It turns out that the CPU doesn’t work like that. Instead of a single byte, it brings a small, fixed-size chunk of memory, typically 64 bytes. This chunk is called a cache line.

The logic behind this is that the CPU says "oh, you want this byte, maybe you want the following bytes too, so I'll bring them into the cache as well".

The difference here is that in the row-by-row approach you only get a cache miss when you step into a new cache line, and every byte of that line ends up being used, while in the column-by-column approach you can get a cache miss on every single element, because each access lands in a different row, far away in memory.

Scott Meyers explains this in great detail in his talk CPU Caches and Why You Care.

The bottom line here is that the fastest way to process elements is to put them as close as possible in memory and process them in batches.

Let’s look at some graphics comparing the difference between the two techniques.

Comparison between different languages

You can check the code for the tests I used to create this article on GitHub, in the folder 0001-traversing-a-matrix.

But Who Iterates a Matrix That Way?

Not many people. But if you read carefully, the reason we were 10 or 60 times slower is that the data we want to use is not close enough in memory.

But what happens if you have a lot of physics objects where you have to update the Y component of the speed based on gravity? We can do something like this.

class GameObject {
    Vector3f position;  // 4 x 3 bytes = 12 bytes
    Vector3f speed;     // 12 bytes
    int state;          // 4 bytes
    float mass;         // 4 bytes
    Sprite* sprite;     // 8 bytes (a pointer)
};

// ...
GameObject* gos = createALotOfGO(); // an array of 10000 game objects
// ...
for(int i = 0; i < 10000; i++){
    gos[i].speed.y += gravity * delta_time;
}

This is not a real-world example, but it looks A LOT LIKE ONE. In this case, we are pulling 40 bytes at a time to update only 4 of them; the other 36 bytes are fetched into the cache and discarded. This means that even on the most powerful computer we can buy today, we throw away 90% of the data we bring from memory, in a trivial loop.
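Just to give a taste of what "packing the data tightly" could look like (we will dig into the actual options in the next article), here is a hypothetical reworking where the field we iterate over lives in its own contiguous array. The names and sizes are my own illustration, not the article's benchmark code:

// Hypothetical sketch: keep the hot field in its own tightly packed array.
const COUNT = 10000;
const speedY = new Float32Array(COUNT); // 4 bytes per object, back to back in memory

function applyGravity(gravity, deltaTime) {
  for (let i = 0; i < COUNT; i++) {
    speedY[i] += gravity * deltaTime; // every byte of every cache line is useful work
  }
}

Now each cache line the CPU fetches is filled entirely with Y speeds, instead of 4 useful bytes surrounded by 36 bytes we don't care about.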

Since most of the software nowadays looks like that GameObject loop, we can say that most of the software is tens of times slower than it could be.

How Does This Compare With Other Optimizations?

Just to get an idea of how this compares with other techniques we use to achieve performance, I created some tests in C and compiled them both as a native binary under macOS and as WebAssembly using Emscripten.

The first test uses SIMD to process an addition on several elements. SIMD is a way to load several data units into unusually long registers and perform a single instruction over such a register, operating on all the data units at the same time. This is where it gets its name: Single Instruction, Multiple Data.

We can see that the code that uses SIMD is 3x faster.

Now, let’s see the performance gain with our old friend: multithreading.

The multi-threaded version is around twice as fast as the sequential version.

If you compare these graphs with the performance we get by using the cache efficiently, they look like a joke. We are comparing 60x against 3x.

Coding for cache efficiency gives us far more performance improvement than avoiding square roots, using SIMD, or using multiple threads.

You can check the code for the tests I used to create this article on GitHub, in the folder 0003-SIMD.

Conclusions and Takeaways

Wow! That was a lot of ground to cover! Let’s make a quick recap of what we saw today:

  • If you trash your cache, your software will be very slow.
  • The way to be FAST is to tightly pack the data together and process it in batches.
  • Most of the software nowadays should be tens of times faster.
  • Coding for cache efficiency is the most meaningful way to achieve performance.

In the next article, we are going to see what options we have to code for cache efficiency. See you then.
