MT Strats: Thread-local Reduce

My last few posts have been about some of the features of Mana Engine. I wanted to spend a bit of time talking about some of the strategies Rob and I used to ensure thread safety.

So here begins what I hope to be a series of posts called Multi-threaded Strategies (or MT Strats for short). Each post will cover a strategy you can employ to tackle your own multi-threaded code.

There is some ground-work I’ll lay out first. All of the posts will assume that you’re programming using a task-based system. So instead of spawning hundreds of threads, your code would spawn one thread per logical core on the machine. Those threads would effectively be kept in a while loop, working on whatever tasks (basically lambdas/functors) are in the work queue.

This has the benefit of being able to split up work easily. Need to go wide on 8,000 objects? If you’re using 8 cores, each core iterates over a thousand of them. Lots of strategies become easier to do when you have this setup — assume this setup in all future MT Strats posts.

If you’re not familiar with this setup, a short Google search for “job queue c++” turns up lots of good results. Give one of them a gander to get up to speed on job queues.
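If you’d rather see the shape of it right here, below is a minimal toy sketch of such a job queue (my own illustration, not Mana Engine’s implementation): worker threads sleep on a condition variable and pull lambdas off a shared queue.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Toy job queue: workers sleep until work arrives, then run tasks.
class JobQueue
{
public:
    explicit JobQueue(unsigned workerCount)
    {
        for (unsigned i = 0; i < workerCount; ++i)
            workers.emplace_back([this] { WorkerLoop(); });
    }

    ~JobQueue()
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            quitting = true;
        }
        wake.notify_all();
        for (auto& t : workers)
            t.join();
    }

    void Push(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            jobs.push(std::move(job));
        }
        wake.notify_one();
    }

private:
    void WorkerLoop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex);
                wake.wait(lock, [this] { return quitting || !jobs.empty(); });
                if (quitting && jobs.empty())
                    return; // Drained the queue; time to exit.
                job = std::move(jobs.front());
                jobs.pop();
            }
            job(); // Run the task outside the lock.
        }
    }

    std::vector<std::thread> workers;
    std::queue<std::function<void()>> jobs;
    std::mutex mutex;
    std::condition_variable wake;
    bool quitting = false;
};
```

A real engine’s version would add things like waiting on a specific batch of tasks, but the worker loop itself really is about this simple.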

I will add that an essential feature is to use thread-local storage to store each worker thread’s index. So where the main thread is index 0, the workers are 1 through n. Being able to call GetThreadIndex() from anywhere will be key in lots of these strategies.

Because it bears repeating, these are the two things:

  1. Job-queue, with as many worker threads as logical cores (or fewer).
  2. GetThreadIndex() function.
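For what it’s worth, GetThreadIndex() can be sketched in just a few lines (this is my own hypothetical version; SetThreadIndex would be called once by each worker as it starts up):

```cpp
#include <cassert>
#include <cstdint>
#include <thread>

// Each worker stores its index in thread-local storage at startup.
// The main thread keeps the default value of 0.
static thread_local uint32_t g_threadIndex = 0;

void SetThreadIndex(uint32_t index)
{
    g_threadIndex = index;
}

uint32_t GetThreadIndex()
{
    return g_threadIndex;
}
```

Because the storage is thread-local, every thread reads its own copy with no synchronization at all.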

This first strategy is one of my favorites because it doesn’t require any of the typical multi-threading constructs like locks and atomics. Certainly, those things can help make the solution more robust, but you can get by without them.

I call it Thread-local Reduce. The gist of it is that each thread gets its own buffer to write to, and then sometime later, readers can access the data however they want (as long as nobody is actively writing).

As a simple example, let’s say you need to loop through a few thousand objects and, for each one that meets a condition, do something to it. In a single-threaded codebase, it might look like this:

void DoSomeWork(const Array<Object>& objects)
{
    // Iterate over the objects and do the work if they
    // meet the condition.
    for (auto&& obj : objects)
    {
        if (DoesMeetFilter(obj))
        {
            DoFilteredWork(&obj);
        }
    }
}

This is alright, but even if it’s single-threaded, there’s a better way to write it (that lets us pivot into multi-threaded code when we need to).

void DoSomeWork(const Array<Object>& objects)
{
    Array<const Object*> filteredObjects;

    // Iterate over the objects and filter them.
    for (auto&& obj : objects)
    {
        if (DoesMeetFilter(obj))
        {
            filteredObjects.PushBack(&obj);
        }
    }

    // Iterate through the filtered objects.
    for (auto* pObj : filteredObjects)
    {
        DoFilteredWork(pObj);
    }
}

There are some small benefits that may or may not affect performance. Because we have two tight loops instead of one looser loop, instruction cache locality is well preserved. Without profiling, I won’t pretend that actually means anything, but it’s something to consider.

The real meat of it comes in the next step: making it multi-threaded. For this example, and to keep the code here cleaner and more conceptual, let’s assume you also have some sort of parallel_for construct.

// Assume we have only 8 threads.
static constexpr uint32 maxThreads = 8;

void DoSomeWork(const Array<Object>& objects)
{
    // One array of pointers for each thread.
    Array<const Object*> filteredObjects[maxThreads];

    // Iterate over the objects and filter them.
    parallel_for(objects, [&](const Object& obj)
    {
        uint32 threadIndex = GetThreadIndex();
        if (DoesMeetFilter(obj))
        {
            // Only insert into this thread's array.
            filteredObjects[threadIndex].PushBack(&obj);
        }
    });

    // Iterate through the filtered objects.
    for (auto&& buffer : filteredObjects)
    {
        for (auto* pObj : buffer)
        {
            DoFilteredWork(pObj);
        }
    }
}

Of course, we can even go a step further and go wide when calling DoFilteredWork if we want. If we expect only a few objects to meet the filter, it may not be worth doing so. If we expect to find a lot of objects, or if DoFilteredWork is a heavier function, then there’s something to consider there.

This solution has so many practical applications. It can be used on a small scale to quickly execute a function, such as in the above case. It can also be used on a larger scale, as part of an architecture, where you could be storing off values to be operated on later at almost any time.
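As a hedged sketch of that larger-scale use (my own illustration, with an assumed kMaxThreads and DeferredWork type, not anything from Mana Engine): each thread records work into its own buffer via its thread index, and a single thread flushes everything later, at a point in the frame where nothing is writing.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

constexpr uint32_t kMaxThreads = 8; // assumed worker count

// Per-thread buffers of deferred work. Record() is safe to call from
// any thread because each thread touches only its own slot; Flush()
// must run while no thread is recording.
struct DeferredWork
{
    std::vector<std::function<void()>> perThread[kMaxThreads];

    // Called from any worker; writes only to that thread's buffer.
    void Record(uint32_t threadIndex, std::function<void()> work)
    {
        perThread[threadIndex].push_back(std::move(work));
    }

    // Called from one thread, once everyone is done writing.
    void Flush()
    {
        for (auto& buffer : perThread)
        {
            for (auto& work : buffer)
                work();
            buffer.clear();
        }
    }
};
```

The same shape works for command buffers, event lists, or anything else you want to collect during a wide pass and drain at a sync point.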

Here are some quick notes about doing this:

  1. A key here is making sure that no thread is reading from the thread-local buffers when the threads are writing to them. In a local scope like the above example this is easy to pull off. When the thread-local buffers live much longer, possibly throughout several frames, ensuring this gets harder and harder to do. Locks can be helpful here, at the cost of… well, locking.
  2. This solution can work with more than just arrays. You can use it with dictionaries and other containers too.
  3. Because only one thread writes to the containers, these containers don’t need to be thread-safe, making it a fairly low-tech (but highly effective) solution.
  4. I know I said that this doesn’t require any special multi-threading stuff and then later just tossed in a parallel_for like it was no big deal. If these posts go well, I’ll write one on making a parallel_for as one of these MT Strats posts as an apology.
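In the meantime, here is a deliberately naive sketch of a parallel_for (my own toy version, not the one a job-queue post would build): it splits the array into one contiguous chunk per hardware thread and spawns a thread per chunk. A proper version would push tasks into the job queue instead of spawning threads, but the chunking logic is the same.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Naive parallel_for: one contiguous chunk of items per thread.
template <typename T, typename Func>
void parallel_for(const std::vector<T>& items, Func func)
{
    const std::size_t threadCount =
        std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunkSize =
        (items.size() + threadCount - 1) / threadCount;

    std::vector<std::thread> threads;
    for (std::size_t begin = 0; begin < items.size(); begin += chunkSize)
    {
        const std::size_t end = std::min(begin + chunkSize, items.size());
        threads.emplace_back([&items, func, begin, end]
        {
            for (std::size_t i = begin; i < end; ++i)
                func(items[i]);
        });
    }
    for (auto& t : threads)
        t.join();
}
```

Note that this toy version has no GetThreadIndex(); in the job-queue version, each task would run on a worker that already knows its index, which is exactly what the Thread-local Reduce example above relies on.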

Alright, so there’s one of the many MT Strats Rob and I have in our toolbelt. Now it’s an MT Strat in your toolbelt, so go out and use it!