Nomad Game Engine: Part 4.3 — AoS vs SoA

Optimizing data storage

In the next couple of blog posts, we’re going to take a dive off the deep end. I’ll say this up front as a warning: none of this is actually necessary for an ECS. In the vast majority of games the performance difference will be negligible, and it ultimately adds a lot of complexity to the inner workings of the component manager. If your goal is simply to build an ECS and get it working, skip the rest of the 4.x Nomad Engine blog posts.

That said, the information in these posts is *incredibly cool* and quite unusual. You will be introduced to the joys of template metaprogramming with very concrete examples, because we’re building a very concrete system: a game engine. If you want to have your mind expanded, read on!

The goal of this blog post is to set up the reasoning behind the next two posts by providing a high-level architecture and explanation.

Array of Structs vs Struct of Arrays

Let’s take a quick look back at the first iteration of our “generic component manager”. Its storage struct looked like this:

struct ComponentData {
    unsigned int size = 1;
    std::array<ComponentType, 1024> data;
};

Notice that our data is being stored as an array of structs (in this case, the struct is the ComponentType). For the rest of this blog post, we’re going to use an example component that we’ve encountered, but add a bit of extra “stuff” to it:

struct Transform {
    int x;
    float y;
    double z;
};

Let me say this up front: a real component would obviously not look like this. The only reason we’re doing this is so that our examples make more sense: most components will in fact mix member types, and this simulates that without making the names too complex.

What is “SoA” storage?

Struct of Arrays storage flips common programming convention on its head. Instead of storing an array of structs, we create a struct containing an array for every field of the original struct. This concept is best illustrated with an actual example.

// Array of structs
struct Transform {
    int x;
    float y;
    double z;
};
std::array<Transform, 1024> aos;

// Struct of arrays
struct soa {
    std::array<int, 1024> x;
    std::array<float, 1024> y;
    std::array<double, 1024> z;
};

Why is SoA storage better?

Let me clarify before answering: it’s not actually always better. Whether SoA or AoS storage wins depends mostly on how the component is used.

// Probably best as AoS storage
struct Transform {
    float x;
    float y;
    float z;
};

Array of Structs storage is better when a component’s fields are all generally accessed at the same time. For example, we might have a Transform component with x, y, and z values; each of these would likely be accessed together, as there aren’t many situations in which we would want only the x value of a component.

// Probably best as SoA storage
struct Health {
    float currentHealth;
    float maxHealth;
    float healthRegen;
    int shieldAmount;
    int shieldModifier;
    bool isImmune;
};

Struct of Arrays storage is superior when a component has multiple fields that might be accessed individually. If we take a look at the Health component defined above, it’s likely that different systems would each update different parts of the component. One system might be in charge of updating the component’s health based on its healthRegen value, and another system might handle immunity. When the entity takes damage, whichever system is in charge of that would need to access currentHealth, shieldAmount, and isImmune, but wouldn’t need to care about maxHealth or healthRegen. In situations like these, Struct of Arrays storage is better.

Here are some gifs of storage access patterns for each of these two scenarios to illustrate this point:

Example 1

We’re going to execute the following code on the Transform component and compare the memory access pattern between AoS and SoA:

For this example, we will assume that no entities are out of bounds (the most likely case, considering this code runs every tick). This means our memory accesses look like this:

Array of Structs memory accesses
Struct of Arrays memory accesses

You’ll notice that both of them essentially read memory sequentially. This is great for performance because it means fewer cache misses. Depending on the size of our various levels of cache, the AoS solution might be slightly better, but overall they both access data in a sequential manner, which means our processor will be able to make the most use of our cache.

Example 2

Now instead let’s take another look at the Health component we introduced earlier in the post. We’re going to implement the aforementioned “regen system” that increases entities’ health based on their healthRegen value. Here’s the code:

Now let’s take a look at the memory access pattern based on how we store the component:

Array of Structs memory accesses
Struct of Arrays memory accesses

You’ll notice here that the AoS storage version has a ton of empty space! This is because we’re only accessing two fields on the component, so the rest are being pulled into the cache for no reason. The SoA version continues to have great cache performance because only the component fields that are needed are pulled in.

Math time!

If the visualizations haven’t convinced you by themselves, let’s do some math!

Most computers have cache line sizes of 64 bytes. Let’s assume our Health component is as we defined it above, which means its members take up:

float(4 bytes) * 3 = 12 bytes
int(4 bytes) * 2 = 8 bytes
bool(1 byte) * 1 = 1 byte

I’ll address the actual space this struct takes up below (due to struct padding), but for now let’s assume the struct is 12 + 8 + 1 = 21 bytes.

Let’s assume we’re executing our code above for health regeneration. If we’re storing the struct using AoS storage, every cache line we pull in will contain 3 full components (64 / 21 ~= 3).

Let’s pretend our game has 10,000 entities on screen, all regenerating health every tick. Using AoS storage, we would pull in 10,000 / 3 ~= 3,333 cache lines.

If we were using SoA storage, we would instead load two tightly packed arrays of 10,000 floats each: one of currentHealth values and one of healthRegen values. A 64-byte cache line holds 64 / 4 = 16 floats, so each array takes 10,000 / 16 = 625 cache lines. Executing our code on all 10,000 entities therefore pulls in 625 * 2 = 1,250 cache lines, a bit over a third of the AoS count.

A note on struct padding

You’d assume from the math above that the Health struct takes up 12 + 8 + 1 = 21 bytes of space, but it really takes up 24, due to struct padding. This is actually an extra benefit of SoA storage: because each member is stored in its own tightly packed array, we don’t lose any space to padding.

Is this really necessary?

See the first paragraph of this blog post, but it’s a valid concern: will this actually give us that much of a performance increase?

Ultimately, it comes down to cache misses. If you are making lots of memory accesses (lots of entities on screen), cache-friendly data becomes more and more important. Let’s use this post from 2012 to see if we can ballpark some numbers in theory.

The Intel Core i7 Xeon 5500 series has an L2 cache access time of ~10 cycles, while an L3 cache access takes about 40 cycles, or about four times as long. With enough components (or large enough components), using Struct of Arrays storage instead of Array of Structs storage could significantly speed up some batch operations.

Note: don’t take the last paragraph to mean that sequential accesses vs. random accesses will make our code 4x faster. That isn’t the case, due to how DRAM and processors handle caching; see this SO post for a better explanation of the details.