Intro to Jobs/Burst/DoD
I often talk about the performance gains I get on projects and people simply don’t believe them, especially when things get 100 times faster than the original code or more. And when I talk about the techniques I use to do this, people want practical examples because they can’t get their head around some aspect of what I’m talking about, or glaze over when I talk about cache misses and structure sizes. It all seems super technical and complex, but it’s really not that hard, and you don’t need to dig into assembly to get massive gains.
So in this post, we’re going to take a small piece of code and convert it through several versions, from a full Object Oriented implementation with the data scattered around, to a Data Oriented Design version where the data is well organized, to one which runs in parallel using Jobs/Burst in Unity.
All code can be found here
The Example
It’s always interesting trying to figure out the right example for something like this: if it’s from an actual code base, it’s often too specific for someone to understand and grok how to apply it to what they are doing. So I’ve tried to find something small enough and universal enough to understand, while rooted in the typical game patterns you see in Unity.
Monster.cs
public class Monster : MonoBehaviour
{
public float health = 100;
public float maxHealth = 100;
public float healthRegenRate = 1; // per second
public float stamina = 100;
public float maxStamina = 100;
public float staminaRegenRate = 1; // per second
public void Update()
{
if (health > 0 && health < maxHealth)
{
health += healthRegenRate * Time.deltaTime;
if (health > maxHealth)
health = maxHealth;
}
if (stamina > 0 && stamina < maxStamina)
{
stamina += staminaRegenRate * Time.deltaTime;
if (stamina > maxStamina)
stamina = maxStamina;
}
}
}
Our example is a component with an update loop. In this case, our monster has health and stamina that regenerate over time. This might be one of a dozen components in our AI stack, for instance.
Now to make the profiling numbers more apparent, we’re going to have to pretend this is a game with a lot of monsters: 200,000 in this case. This is where the ‘artificiality’ of the example comes in. Processing texture or audio data would be a more real-world use case, for instance, but that requires that the reader understand texture or audio data and our use case, whereas I hope this is more understandable for everyone. Our first demo just does:
Demo01.cs
public class Demo01 : MonoBehaviour
{
const int maxCreatures = 200000;
void Start()
{
for (int i = 0; i < maxCreatures; ++i)
{
GameObject go = new GameObject(i.ToString());
go.AddComponent<Monster>();
}
}
}
A quick in-editor profile shows:
Update
So the first thing to get rid of is the Update loop. Unity’s per-component Update callback is notoriously slower than calling a method yourself. So we’ll add a Poll method to Monster.cs that does the same work as Update:
public void Poll()
{
if (health > 0 && health < maxHealth)
{
health += healthRegenRate * Time.deltaTime;
if (health > maxHealth)
health = maxHealth;
}
if (stamina > 0 && stamina < maxStamina)
{
stamina += staminaRegenRate * Time.deltaTime;
if (stamina > maxStamina)
stamina = maxStamina;
}
}
Then our new demo simply keeps an array of monsters to call Poll on, and disables each component so that Unity won’t call its Update loop:
public class Demo02 : MonoBehaviour
{
const int maxCreatures = 200000;
Monster[] monsters = new Monster[maxCreatures];
void Start()
{
for (int i = 0; i < maxCreatures; ++i)
{
GameObject go = new GameObject(i.ToString());
var m = go.AddComponent<Monster>();
m.enabled = false; // disable so we don't call update!
monsters[i] = m;
}
}
void Update()
{
for (int i = 0; i < maxCreatures; ++i)
{
monsters[i].Poll();
}
}
}
Now let’s look at the in-editor timings:
Some minor points to note, because this is a somewhat artificial example:
- Using a List vs. an array, or using monsters.Length in the for loop, can have huge effects (30%) on the timings. Keep this in mind if you’re writing a large loop.
- We have created all these game objects and components in order, and we are polling them in order. In a real game, that list would likely end up out of order as things come and go. Further, how that data is allocated in memory is very engine-specific and not clear from the code.
To account for that second point, let’s try randomizing the order of the monsters in the list so that memory access is no longer linear:
We can change the start function to:
public bool randomAccess = false;
void Start()
{
List<Monster> data = new List<Monster>();
for (int i = 0; i < maxCreatures; ++i)
{
GameObject go = new GameObject(i.ToString());
var m = go.AddComponent<Monster>();
data.Add(m);
m.enabled = false; // disable so we don't call update!
}
monsters = data.ToArray();
if (randomAccess)
{
var rnd = new System.Random();
var randomized = data.OrderBy(item => rnd.Next()); // requires using System.Linq
monsters = randomized.ToArray();
}
}
Now we have a little toggle that lets us test with scrambled access to the elements of the array:
Here we see that by randomizing the access order, the timing is much slower. This is because we are forcing a cache miss on every access, as it’s unlikely that the memory for the next component is anywhere near the last one. But it’s still way faster than Update().
If, however, we get the list again from Unity using:
monsters = GameObject.FindObjectsOfType<Monster>();
Then we get back down to the original timings. What this tells us:
- Unity internally stores components of any given type in a contiguous array, and calls Update on them in order.
- Managing our own list must be done with care; otherwise it may become unordered and run slower than iterating in the original order.
- Either way, it’s still way faster than using Update()
- Having top down knowledge of exactly what is in our scene is much faster than having everything perform its own logic.
Organizing our memory
Even though the above example shows that Unity does organize component memory linearly, I still tend to organize data myself when I can instead of putting it on components all over the scene. In this example, we are going to replace the component with a struct that we can allocate in a linear array:
public class Demo03 : MonoBehaviour
{
const int maxCreatures = 200000;

struct MonsterData
{
public MonsterData(bool b)
{
health = 100;
maxHealth = 100;
healthRegenRate = 1;
stamina = 100;
maxStamina = 100;
staminaRegenRate = 1;
}
public float health;
public float maxHealth;
public float healthRegenRate; // per second
public float stamina;
public float maxStamina;
public float staminaRegenRate;

public void Update(float deltaTime)
{
if (health > 0 && health < maxHealth)
{
health += healthRegenRate * deltaTime;
if (health > maxHealth)
health = maxHealth;
}
if (stamina > 0 && stamina < maxStamina)
{
stamina += staminaRegenRate * deltaTime;
if (stamina > maxStamina)
stamina = maxStamina;
}
}
}

MonsterData[] data = new MonsterData[maxCreatures];

void Start()
{
for (int i = 0; i < maxCreatures; ++i)
{
data[i] = new MonsterData(true);
}
}

void Update()
{
float dt = Time.deltaTime;
for (int i = 0; i < maxCreatures; ++i)
{
data[i].Update(dt);
}
}
}
Now the data is fully under our control and access to it cannot become unordered. We see a small speedup vs. the ideal case of the previous code, but really this is just prepping us to have the data centralized so we can process it better using Jobs and Burst.
Jobs/Burst
The Unity job system makes it easy to run our data across multiple cores. And Burst is a new compiler which can heavily optimize code based on the restrictions that the job system imposes. Combined, they can speed up code by an incredible amount.
Jobs require that we work with structs of Plain Old Data (POD), like the MonsterData structure we introduced above. Essentially, all of the data needed for the job must be copied to the worker threads, so references to classes are not permitted. What follows is an exact equivalent of the previous code, rewritten to use Jobs/Burst. Let’s break it down rather than starting with the full listing:
First, rather than using an array of MonsterData, we’re going to use a NativeArray<MonsterData>:
NativeArray<MonsterData> monsterData;
and in the start function, we are going to preallocate it:
monsterData = new NativeArray<MonsterData>(maxCreatures, Allocator.Persistent);
There are multiple allocator types depending on how long you want to keep the array around. In our case, it is going to be kept around until we deallocate it.
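Since we use Allocator.Persistent, the array also needs to be disposed when we’re done with it, or Unity will report a leak. A minimal sketch (putting the cleanup in OnDestroy is my choice here, not part of the original listing):

void OnDestroy()
{
    // Persistent native allocations are not garbage collected,
    // so release the memory explicitly when this component goes away.
    if (monsterData.IsCreated)
        monsterData.Dispose();
}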
Now we’ll write the actual job:
[BurstCompile]
struct RegenJob : IJobParallelFor
{
public NativeArray<MonsterData> monsterDatas;
[ReadOnly] public float deltaTime;
// The code actually running on the job
public void Execute(int i)
{
var m = monsterDatas[i];
if (m.health > 0 && m.health < m.maxHealth)
{
m.health += m.healthRegenRate * deltaTime;
if (m.health > m.maxHealth)
m.health = m.maxHealth;
monsterDatas[i] = m;
}
if (m.stamina > 0 && m.stamina < m.maxStamina)
{
m.stamina += m.staminaRegenRate * deltaTime;
if (m.stamina > m.maxStamina)
m.stamina = m.maxStamina;
monsterDatas[i] = m;
}
}
}
In this case, I’m using an IJobParallelFor job. This is a job designed to break up an array of data and process chunks of it in parallel. The job has the native array of monster data that it works on, along with a read-only delta time, since we cannot call Time.deltaTime from inside a job. The Execute function gets an index into the array and performs the same logic we’ve been using all along.
Finally we modify our update function to create the job, copy the data it needs into the job, schedule it, and wait for the job to complete.
void Update()
{
var regenJob = new RegenJob()
{
monsterDatas = monsterData,
deltaTime = Time.deltaTime
};

var handle0 = regenJob.Schedule(monsterData.Length, 1000);
handle0.Complete();
}
Note that the Schedule call takes the array length and a batch count, which I have set to 1000 here. This means each job will handle 1000 elements of the array, so we are creating 200 jobs for our 200,000 elements. Adjusting this number may yield a better ratio of job count to job size.
Note that in a real-world use case, we would not call Complete on the job handle until we needed the data. For instance, we might kick off a ton of different jobs for all kinds of things, then call Complete in LateUpdate after the rest of the game has run, allowing everything to run in parallel and effectively making things even cheaper, as the work happens while other tasks are busy using the main thread.
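A minimal sketch of that pattern, assuming we hold onto the JobHandle in a field (the LateUpdate placement is illustrative, not part of the original demos):

JobHandle regenHandle;

void Update()
{
    var regenJob = new RegenJob()
    {
        monsterDatas = monsterData,
        deltaTime = Time.deltaTime
    };
    // Kick the job off, but don't block the main thread yet.
    regenHandle = regenJob.Schedule(monsterData.Length, 1000);
}

void LateUpdate()
{
    // By now the worker threads have likely finished; Complete just syncs the result.
    regenHandle.Complete();
}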
The full listing:
Demo04.cs
public class Demo04 : MonoBehaviour
{
const int maxCreatures = 200000;
NativeArray<MonsterData> monsterData;
struct MonsterData
{
public MonsterData(bool b)
{
health = 100;
maxHealth = 100;
healthRegenRate = 1;
stamina = 100;
maxStamina = 100;
staminaRegenRate = 1;
}
public float health;
public float maxHealth;
public float healthRegenRate; // per second
public float stamina;
public float maxStamina;
public float staminaRegenRate;
}

void Start()
{
monsterData = new NativeArray<MonsterData>(maxCreatures, Allocator.Persistent);
for (int i = 0; i < maxCreatures; ++i)
{
monsterData[i] = new MonsterData(true);
}
}

[BurstCompile]
struct RegenJob : IJobParallelFor
{
public NativeArray<MonsterData> monsterDatas;
[ReadOnly] public float deltaTime;
// The code actually running on the job
public void Execute(int i)
{
var m = monsterDatas[i];
if (m.health > 0 && m.health < m.maxHealth)
{
m.health += m.healthRegenRate * deltaTime;
if (m.health > m.maxHealth)
m.health = m.maxHealth;
monsterDatas[i] = m;
}
if (m.stamina > 0 && m.stamina < m.maxStamina)
{
m.stamina += m.staminaRegenRate * deltaTime;
if (m.stamina > m.maxStamina)
m.stamina = m.maxStamina;
monsterDatas[i] = m;
}
}
}

void Update()
{
var regenJob = new RegenJob()
{
monsterDatas = monsterData,
deltaTime = Time.deltaTime
};

var handle0 = regenJob.Schedule(monsterData.Length, 1000);
handle0.Complete();
}
}
Let’s check the timings on that (still in the editor):
Optimizing Further
Now, astute readers will notice that health and stamina are basically the same thing: they each have a current value, a max, and a regen rate, and perform the same operation. So let’s try making our regen job a generic job for any stat which regenerates, and run health and stamina as separate jobs. This makes the data needed for each job half as big.
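A sketch of what that generic job might look like (the RegenData and StatRegenJob names here are illustrative, not necessarily what the accompanying Demo05 code uses):

struct RegenData
{
    public float value;
    public float max;
    public float regenRate;
}

[BurstCompile]
struct StatRegenJob : IJobParallelFor
{
    public NativeArray<RegenData> stats;
    [ReadOnly] public float deltaTime;

    public void Execute(int i)
    {
        var s = stats[i];
        if (s.value > 0 && s.value < s.max)
        {
            s.value += s.regenRate * deltaTime;
            if (s.value > s.max)
                s.value = s.max;
            stats[i] = s;
        }
    }
}

In Update we would then schedule this same job type twice, once over a health array and once over a stamina array, and the two jobs can run concurrently.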
So we’ve taken a simple system from 35ms to 0.2ms without changing its design. That’s a 175x speedup. And in the Demo05 example, we saw that reducing the memory needed for each job has a significant effect on performance.
So what if we change the design and say that all of our monsters share the same regen rate and max stat values? That would remove two-thirds of the data needed in the jobs!
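A sketch of that variation, assuming the shared values become plain fields on the job rather than per-element data (again, names are illustrative):

[BurstCompile]
struct SharedRegenJob : IJobParallelFor
{
    // Only the current value is per-monster; max and regen rate are stored
    // once on the job instead of 200,000 times in the array.
    public NativeArray<float> values;
    [ReadOnly] public float max;
    [ReadOnly] public float regenRate;
    [ReadOnly] public float deltaTime;

    public void Execute(int i)
    {
        float v = values[i];
        if (v > 0 && v < max)
        {
            v += regenRate * deltaTime;
            if (v > max)
                v = max;
            values[i] = v;
        }
    }
}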
In the Build
Up until now, I’ve been doing all my timings in the editor. This is frowned upon because the editor can include significant debug code, but when doing something like this I usually jump from Demo01 to about Demo04 in one step, and the timings are so vastly different that it really doesn’t matter where I take them until I start digging in on smaller details.
Note that Update is faster in the actual build, but still massively slower than Demo02 with randomized access. Depending on the frame, either Demo04 or Demo05 could be faster; I believe this is contention from the increased number of jobs vs. the smaller data size. With some experimentation, you could likely get another significant speedup here, but hey, 207 times faster is not bad, and in most Unity games, which barely use the other cores, this can all run while other stuff is happening, reducing the cost to near zero.
Summary
So hopefully this shows a practical example of how you can make something over 100x faster by organizing and reducing its data size, while using Burst and Jobs. Usually when I come into contact with a new code base, I can find a few heavy systems that are centralized enough to easily convert to Burst/Jobs and greatly help the frame rate.
However, code like that used in this example usually shows up as “death by 1000 paper cuts” across an inefficient code base: instead of 200,000 monsters running one system, you have hundreds of small systems all with the same issues, with data that is heavily tangled together and scattered over the scene. And as that data is split across more components with larger sizes, the problem gets much worse. Organizing the data in a top-down system like I’ve done here can present certain challenges across gameplay code, which is what the new Entity Component System in Unity hopes to address.
ECS
Essentially, an entity in ECS is just an id which can be used to look up a bunch of data about an object, and a component is just a small collection of data. With it, you can easily write a query that gathers all Regen components from entities scattered across the world, then write a System which processes that data in parallel using Jobs/Burst. It will automatically gather that data into nice arrays for processing, and let you filter it in complex ways like you might with a database query (all entities with these components, but not this component, etc.).
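A rough sketch of what that could look like, written against one historical version of the Entities package (the API has been redesigned several times, so treat the exact types and method names as illustrative):

public struct RegenStat : IComponentData
{
    public float value;
    public float max;
    public float regenRate;
}

public class RegenSystem : SystemBase
{
    protected override void OnUpdate()
    {
        float dt = Time.DeltaTime;
        // The query gathers every entity with a RegenStat component into
        // contiguous chunks and processes them across worker threads with Burst.
        Entities.ForEach((ref RegenStat s) =>
        {
            if (s.value > 0 && s.value < s.max)
            {
                s.value += s.regenRate * dt;
                if (s.value > s.max)
                    s.value = s.max;
            }
        }).ScheduleParallel();
    }
}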
Entity System
- A database for looking up data from your scene’s entities, and gathering it into linear arrays for processing
Job System
- A system for processing arrays of data across multiple CPU cores
Burst Compiler
- A compiler that takes advantage of the restrictions placed by the Job system to compile much faster code
The entity system is still undergoing a lot of redesign, but Jobs and Burst are usable now, and can be very useful without Entities.