Is CPU Cache ‘a thing’ in Unity3d?

C# for Mono is highly abstracted, with multiple layers of indirection between the code you write and the code that is executed on the platform: C# is compiled to Common Intermediate Language (CIL), and the CIL is then ahead-of-time (AOT) and/or just-in-time (JIT) compiled by the Mono runtime into platform-specific machine code, with per-architecture optimisations. Add the further consideration of Unity's own (black-box) IL2CPP, which replaces the Mono runtime and converts the Intermediate Language directly into C++ (which is then compiled to machine code), and things get even gnarlier (and this is a simplified description).

All things considered, you may find yourself in the camp of people who "learned to stop worrying, and love the bomb." After all, as Unity developers we have essentially become middleware programmers; shouldn't we just trust in the engine? However, if you have an enquiring mind like me, then you may occasionally wonder how much of the generic wisdom of game development holds true. In this particular case:

Can you take advantage of CPU caching by using contiguous memory?

Caveat: This is far from an exhaustive test; it targets desktop, and according to this post from Unity, IL2CPP may make it redundant. However, I find the results interesting and hope you do too:

using UnityEngine;

namespace ExpOne
{
    // Structs are value types, so:
    // - they are stored contiguously in arrays, for spatial locality
    // - they have a single level of indirection
    public struct Dat
    {
        // Reference types: accessing these will invalidate the cache / cause a read-miss
        public GameObject go;
        public Renderer r;

        // 'HOT' fields (value types)

        // movement
        public Vector3 up;
        public Vector3 pos;
        public Quaternion rot;
        public Vector3 scale;
        // health
        public float health;
        public bool isDead;
    }

    public class SimpleProcess : MonoBehaviour
    {
        public GameObject prefab;
        private Dat[] data;

        private int amt = 1000;

        // Use this for initialization
        void Start()
        {
            float seg = (Mathf.PI * 2) / amt;
            data = new Dat[amt];
            for (int i = 0; i < amt; i++)
            {
                GameObject instance = Instantiate(
                    prefab,
                    new Vector2(
                        Mathf.Sin(seg * i),
                        Mathf.Cos(seg * i)) * 2.0f,
                    Quaternion.identity) as GameObject;
                data[i] = new Dat() {
                    pos = instance.transform.position,
                    rot = Quaternion.identity,
                    scale = Vector3.one,
                    up = instance.transform.up,
                    health = Random.Range(33, 99),
                    isDead = false,
                    go = instance,
                    r = instance.GetComponent<Renderer>()
                };
            }
        }

        private bool m_allDead;
        void Update()
        {
            float timeStamp = Time.realtimeSinceStartup;

            // Keep 'HOT' field access together;
            // should make access faster because of locality of reference.
            // Iterate / 'manage' a group of objects to preserve:
            // - temporal locality
            // - spatial locality
            for (int i = 0; i < data.Length; i++)
            {
                data[i].pos += data[i].up * Random.Range(-0.1f, 0.1f);
                if (--data[i].health < 0) data[i].isDead = true;

                // 'unrolled' inner loop
                if (!m_allDead && !data[i].isDead)
                {
                    continue;
                }
                else if (!m_allDead && i == data.Length - 1)
                {
                    m_allDead = true;
                }
                else if (m_allDead && i < data.Length - 1)
                {
                    data[i].health = Random.Range(33, 99);
                    data[i].isDead = false;
                }
                else if (m_allDead && i == data.Length - 1)
                {
                    data[i].health = Random.Range(33, 99);
                    data[i].isDead = false;
                    m_allDead = false;
                }
            }

            // Now process the references...
            // - temporal locality
            for (int i = 0; i < data.Length; i++)
            {
                data[i].go.transform.position = data[i].pos;
                data[i].go.transform.rotation = data[i].rot;
                data[i].go.transform.localScale = data[i].scale;
            }

            for (int i = 0; i < data.Length; i++)
                data[i].r.enabled = !data[i].isDead;

            Debug.Log("frame took: " +
                (Time.realtimeSinceStartup - timeStamp) + " fps: " +
                (1.0f / (Time.realtimeSinceStartup - timeStamp)) +
                "@60");
        }
    }
}

TL;DR

Instead of operating directly on the GameObject's components as references (technically, copies of references), the data is separated into an array of structs; the data is operated on sequentially, and then the components are updated from the data. The 'naive' approach is presented at the bottom of the article for comparison.

In my simplistic tests, the results were as follows (naive on the left, optimised on the right):

The optimised code on the right runs around 50% faster than that on the left…

Why does this work?

The main feature of the optimised code is that the data is stored contiguously in memory, which makes it easy for the CPU to consume. Although the implementation is specific to each architecture, it is generally true that modern CPUs pull 'lines' of data out of main memory and into a cache. Cache memory is much faster to access than main memory, so the longer data can be kept in the cache, the faster the processing of that data should be.

Image from Wikipedia. Note that the data cache (right) is not the only cache: instructions are also cached (left), and in a variety of sizes ('levels').

Two ways we can aim to keep data in the cache are by storing data in the sequence it will be accessed (spatial/physical locality) and by writing our code to maximise repeated processing of the same data (temporal locality).
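To make the two ideas concrete outside of Unity, here is a minimal standalone C sketch (not the article's C# code; the `Hot` struct and array size are made up for illustration). It sums the same contiguous data twice: once in sequence, so each fetched cache line serves several consecutive elements, and once through a shuffled index table, so each access may land on a cold line. Both orders produce the same total; only the memory-access pattern differs.

```c
#include <stdlib.h>

#define N 100000

/* Hypothetical 'hot' data, stored contiguously like the Dat[] array. */
typedef struct { float health; int isDead; } Hot;

/* Sequential pass: each cache line fetched holds several Hot structs,
   so most accesses hit the cache (spatial locality). */
float sum_sequential(const Hot *d, int n)
{
    float total = 0.0f;
    for (int i = 0; i < n; i++)
        total += d[i].health;
    return total;
}

/* Shuffled pass: same work, but the access order jumps around memory,
   so each access may touch a cold cache line. */
float sum_shuffled(const Hot *d, const int *order, int n)
{
    float total = 0.0f;
    for (int i = 0; i < n; i++)
        total += d[order[i]].health;
    return total;
}

/* Build an index table and shuffle it (Fisher-Yates). */
void make_shuffled_order(int *order, int n)
{
    for (int i = 0; i < n; i++) order[i] = i;
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = order[i]; order[i] = order[j]; order[j] = t;
    }
}
```

On real hardware the sequential pass is typically measurably faster, even though both functions do exactly the same arithmetic.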

Spatial Locality

In the code sample above, the data is separated from the GameObject's components and stored in a struct, not a class. This is because:

  • structs store their fields contiguously in memory
  • an array stores structs contiguously in memory
  • structs are value types

Why is it important that structs are value types? Consider the diagram below which shows how an object reference is stored in memory:

From: http://www.codeproject.com/Articles/20481/NET-Type-Internals-From-a-Microsoft-CLR-Perspective — There are multiple layers of indirection to access members of an object…

As you can see, multiple levels of indirection prevent spatial locality (we have no power over where the referred-to objects are allocated).
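The same distinction can be shown in standalone C (the `Item` struct here is hypothetical, not the article's `Dat`): an array of structs is guaranteed to lay its elements back-to-back, exactly `sizeof` apart, whereas with reference semantics only the pointers are contiguous and each pointed-to object is a separate heap allocation that can land anywhere.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct { float x, y, z; float health; } Item;

/* In an array of structs, element i+1 starts exactly sizeof(Item)
   bytes after element i: the data itself is contiguous. */
int array_is_contiguous(const Item *a, int n)
{
    for (int i = 0; i + 1 < n; i++) {
        if ((const char *)&a[i + 1] - (const char *)&a[i]
                != (ptrdiff_t)sizeof(Item))
            return 0;
    }
    return 1;
}

/* With reference semantics, only the pointer table is contiguous;
   each pointed-to Item is a separate allocation, so iterating the
   table chases pointers to unpredictable addresses. */
Item **alloc_item_refs(int n)
{
    Item **refs = malloc(n * sizeof(Item *));
    for (int i = 0; i < n; i++)
        refs[i] = calloc(1, sizeof(Item));
    return refs;
}
```

This mirrors the C# situation: an array of structs behaves like the first layout, an array of class references like the second.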

Temporal Locality

Once the data held in the structs has been processed, we have to apply it back to the components, which are reference types; since we can't enforce spatial locality, we fall back on temporal locality.

for (int i = 0; i < data.Length; i++)
{
    data[i].go.transform.position = data[i].pos;
    //data[i].go.transform.rotation
    //data[i].go.transform.localScale
}

for (int i = 0; i < data.Length; i++)
    data[i].r.enabled = !data[i].isDead;

By processing each reference type separately, we keep them 'hot' in memory.
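This split-loop pattern is sometimes called loop fission. A minimal standalone C sketch (the `Obj` struct and field names are invented for illustration, not Unity API) shows that splitting one interleaved pass into one pass per field changes only the access pattern, not the result:

```c
#define N 1000

typedef struct { float pos; int visible; } Obj;

/* Interleaved: touches both kinds of work for every object in one pass. */
void update_interleaved(Obj *o, const float *pos, const int *dead, int n)
{
    for (int i = 0; i < n; i++) {
        o[i].pos = pos[i];
        o[i].visible = !dead[i];
    }
}

/* Fissioned: one pass per field, so each pass streams through one
   kind of data repeatedly and keeps it 'hot' in the cache. */
void update_fissioned(Obj *o, const float *pos, const int *dead, int n)
{
    for (int i = 0; i < n; i++)
        o[i].pos = pos[i];
    for (int i = 0; i < n; i++)
        o[i].visible = !dead[i];
}
```

Both functions produce identical output for identical input; whether fission actually helps depends on the sizes and layouts involved, which is why it is worth measuring rather than assuming.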


In this post I talked about how a contiguous memory layout can take advantage of CPU caching, seemingly in spite of the highly abstracted nature of C# code in Unity3d. I present these findings in the spirit of knowledge-sharing and freely acknowledge that this is neither a great way to structure your code nor a conclusive proof, especially with the continued development of IL2CPP.


For completeness, here is the ‘naive’ baseline implementation I was testing against:

//NaiveData.cs added to prefab...
using System.Collections.Generic;
using UnityEngine;

[RequireComponent(typeof(Renderer))]
public class NaiveData : MonoBehaviour
{
    public int health;
    public bool isDead;
    public Renderer r;
}

namespace ExpOne
{
    public class NaiveProcess : MonoBehaviour
    {
        public GameObject prefab;
        private List<GameObject> m_go;
        public int amt = 1000;

        void Start()
        {
            float seg = (Mathf.PI * 2) / amt;
            m_go = new List<GameObject>();
            for (int i = 0; i < amt; i++)
            {
                GameObject instance = Instantiate(
                    prefab,
                    new Vector2(Mathf.Sin(seg * i),
                        Mathf.Cos(seg * i)) * 2.0f,
                    Quaternion.identity) as GameObject;
                instance.GetComponent<NaiveData>().health =
                    Random.Range(33, 99);
                instance.GetComponent<NaiveData>().r =
                    instance.GetComponent<Renderer>();
                m_go.Add(instance);
            }
        }

        private bool allDead;
        void Update()
        {
            float timeStamp = Time.realtimeSinceStartup;
            for (int i = 0; i < m_go.Count; i++)
            {
                m_go[i].transform.position +=
                    m_go[i].transform.up * Random.Range(-0.1f, 0.1f);
                m_go[i].transform.rotation = Quaternion.identity;
                m_go[i].transform.localScale = Vector3.one;
                NaiveData data = m_go[i].GetComponent<NaiveData>();

                if (--data.health < 0)
                {
                    data.isDead = true;
                    data.r.enabled = false;
                }

                if (!allDead && !data.isDead)
                {
                    continue;
                }
                else if (!allDead && i == m_go.Count - 1)
                {
                    allDead = true;
                }
                else if (allDead && i < m_go.Count - 1)
                {
                    data.health = Random.Range(33, 99);
                    data.isDead = false;
                    data.r.enabled = true;
                }
                else if (allDead && i == m_go.Count - 1)
                {
                    data.health = Random.Range(33, 99);
                    data.isDead = false;
                    data.r.enabled = true;
                    allDead = false;
                }
            }

            Debug.Log("frame took: " +
                (Time.realtimeSinceStartup - timeStamp) + " fps: " +
                (1.0f / (Time.realtimeSinceStartup - timeStamp)) + "@60");
        }
    }
}