Solving multithreaded raytracing issues with C++11

Foster Brereton
9 min read · Sep 29, 2014

I’ve been told a good programmer should solve a problem at least three different ways. This makes sense, as I view most of my code as crap until I have revisited it at least twice. Applying this to the problem of concurrency in a ray-tracing engine turned up what I hope will be of some value to others, if for nothing else than to underscore a philosophy of pursuing the art of code.

I ran across a web site analyzing Andrew Kensler’s business card ray-tracer. While it is an impressive feat to get a C++ program off the ground in the space of a business card, to implement a full-blown ray-tracer in that span is just… magic. As is often the case with such demonstrations, I found myself inspired to implement my own naïve engine. My goal was to learn the basic techniques and to satisfy a long-standing personal curiosity. I started out with scratchapixel’s detailed lessons as a guide. These include the example scene (above) used to test my engine.

From a high level perspective, ray-tracing is a surprisingly simple concept. It involves casting rays from the pixel-plane of an image into a scene filled with objects, lights, textures, and other visual details. What comes back from a ray cast is a single color: some nuanced derivative of that ray being reflected, subdivided, refracted, and otherwise bounced around the objects in the scene. The process is then repeated for every pixel in the image.

In pseudocode, the main render loop might look something like this:

image_t  image(width, height);
pixel_t* pixel(&image[0]);

for (std::size_t y(0); y < height; ++y)
{
    for (std::size_t x(0); x < width; ++x)
    {
        *pixel++ = trace(ray, scene, ...);
    }
}

The details of trace are immaterial for this article; my goal is to improve the performance of the calls to trace, not trace itself. Timing a ray-tracing of scratchapixel’s sample scene (seen at the top of this article) shows:

real 0m1.701s    user 0m1.602s    sys 0m0.061s

So there’s the benchmark to beat: 1.7 seconds to ray-trace a 2048 x 1536 image, roughly 3.1 million pixels.

Thanks to MenuMeters I can observe what my 8-core MacBook Pro is doing under the hood, and the story is as you would expect: a single core is doing all the work (the blue bar), while the others sit idle.

7 cores are feeling a little left out.

Lame. We can do better, right? Fortunately, ray-tracing is highly parallelizable: each pixel is calculated independently of the others, allowing each call to trace to operate in its own thread.

The ease of std::async

Since the advent of C++11 we have been given language-native concurrency intrinsics. These tools make it easier to add threading support to an application… somewhat. I qualify the claim because, on one hand, these intrinsics undoubtedly make it easier to leverage threading primitives; on the other, incorporating them into your application in such a way that it actually becomes more performant can be tricky.

My first thought was to simply wrap trace in a call to std::async; the goal, after all, is to do the same thing being done now, only splitting the work over multiple cores. The intent of std::async is to execute a procedure on a separate thread, eventually returning some shared state back to the caller. The first half of that sounds like exactly what we want, and thanks to C++11 it is a snap to instantiate:

for (std::size_t y(0); y < height; ++y)
{
    for (std::size_t x(0); x < width; ++x)
    {
        std::async([=, &scene]()
        {
            *pixel = trace(ray, scene, ...);
        });

        ++pixel;
    }
}

I need but add a few lines of code to make this routine multithreaded? Very cool. The lambda’s capture-default copies the variables in by value, but we pass scene by reference: copying it around would be needless. (Performance, people!) As we break out the champagne to celebrate how easy it was to add multithreading to our application, we run a quick time test to see how much faster things have become:

real 1m8.297s    user 0m30.867s    sys 1m9.634s

What the what? This is a pessimization, the opposite of an optimization! Based on those timing results, the majority of the time is being spent in system overhead. Not good. The processor chart, too, confirms what the timer reveals:

What gives, std::async?

It is obvious there is work being shared across threads (as more than one bar is filled), but unfortunately not much of that work is the renderer’s: most of it is the system’s, and I am still not making the most of the cores available. The red portion of the bars reflects the obscene amount of system time spent using std::async this way. (It is interesting to note that every other core is being put to use here. My only guess is that the system overhead handicaps hyper-threading so severely that the logical cores have hit physical limitations.)

So what’s going on?

The burden of std::future

std::async creates a relationship between its thread and the thread that invoked it. This is the means by which the threaded routine transmits its resultant state to the calling thread. Briefly, std::async holds on to a std::promise and returns an associated std::future to the caller. The caveat to this relationship is that (specifically in this case) std::future’s destructor blocks the calling thread while it waits for the subthread to complete. Therefore, my repeated calls to std::async actually execute in serial, as each std::future returned is immediately destructed and blocks. Combine that with the extra overhead spent constructing support structures, and the performance hit is explained.

What about saving off the std::futures in a vector, and destructing the vector at the end of the routine? That would get the renderer out from under the serial curse, right? It seems an easy enough change, but the timer reveals:

real 0m55.225s    user 0m28.660s    sys 1m8.420s

Clearly still not performant: a smidgen faster than the previous case, but the system overhead is almost the same. Maybe it has something to do with the 3.1 million promises, futures, and who knows what else I just instantiated? Calling std::async per pixel puts an implicit synchronization burden on data structures that are intended for sharing state.

This course of solutions isn’t going to work.

So what other structures exist to facilitate concurrency? What about atomics? Might they be useful here, and if so, how?

Incorporating Atomics

The problem with synchronizing a value across threads comes when two or more threads try to modify it. On modern CPUs, modifying the simplest variable in memory requires three steps: load the variable from memory into the CPU, modify it, then save it back out. If there are two threads performing the same three step-process on the same value, the multiple loads and stores could overlap, and the variable’s final value could be anything when they are through.

An atomic variable guarantees that its value stays free of race conditions: concurrent attempts to write the same variable happen in serial, and the load-modify-store triplet becomes indivisible, or atomic. Well, that’s nice, but how do I use it?

By using an atomic variable to track which pixels are being ray-traced, I can call std::async once per core, reducing my instantiation counts from 3.1 million to cores (a number conveniently provided by C++11 via std::thread::hardware_concurrency). I will need to rejigger the loop to accommodate this shift in sync burden, but my guess is it will be well worth it:

std::size_t max = width * height;
std::size_t cores = std::thread::hardware_concurrency();
std::atomic<std::size_t> count(0);
std::vector<std::future<void>> future_vector;

while (cores--)
    future_vector.emplace_back(
        std::async([=, &scene, &count]()
        {
            while (true)
            {
                std::size_t index = count++;

                if (index >= max)
                    break;

                std::size_t x = index % width;
                std::size_t y = index / width;
                ...
                pixel[index] = trace(ray, scene, ...);
            }
        }));

Oofta, there’s a bit more boilerplate than before. Let’s see why.

I transformed the nested loops into a flat one. Now the domain of iteration is simply all the pixels, so I derive the x and y values of the ray from the index of the pixel currently being processed. I did that so I could use the atomic count to keep track of the latest pixel to be claimed for ray-tracing. The threads communicate with each other by modifying count: since it is atomic, we can be assured that its value will stay in sync across threads. (Note I captured count by reference in the lambda.)

Saving off count to index and using it instead of reading count directly is key. The reason is that count is also being used by other threads at the same time, so there is no guarantee its value will stay the same across reads. By saving off the value of count and post-incrementing it, the subthread gets a unique index into the pixel array which it can then go and ray-trace. In this way the threads keep from stepping on each other’s toes.

So I have cores threads now, all churning through pixels as fast as they can, and I am confident they are not stepping on each other’s toes. But how do I know when I am done? It turns out the std::vector of std::futures serves this purpose well. At the point the vector destructs the main thread will block, pending the completion of all the subthreads.

So from a timing perspective, how did we do? Let’s take a look:

real 0m0.837s    user 0m4.820s    sys 0m0.055s

Now we are getting somewhere! We have cut our rendering time from 1.7 to 0.8 seconds, less than half the original. Our system overhead is a small fraction of the total time spent doing work, which is much more appropriate. The CPU bars tell a similar story:

My office is 104 degrees now, but who cares!?

That’s some processing muscle. But can we do better? Well, maybe.

Eliminating Atomics

Combining std::atomic with std::async dramatically improved ray-tracing performance, and has made this whole endeavor worthwhile. Is it possible to squeeze a little more performance out of the setup? The reality is that std::atomic, while far cheaper to construct and manage than the promise/future machinery preceding it, still causes threads to block one another. If two threads simultaneously try to write count, one of them has to stop until the other is done.

Recall the whole purpose of std::atomic is to keep threads from stepping on each other’s toes. When one thread claims a pixel to ray-trace, we want to assert that no other thread will attempt to trace that same pixel. std::atomic accomplishes this for us in a first-come, first-serve sort of way. What if we predefine a range of pixels a thread owns as it is being spawned? That way we accomplish the same mutual exclusion without the need for the threads to communicate with one another. Here’s what that might look like:

std::size_t max = width * height;
std::size_t cores = std::thread::hardware_concurrency();
std::vector<std::future<void>> future_vector;

for (std::size_t i(0); i < cores; ++i)
    future_vector.emplace_back(std::async([=, &scene]()
    {
        for (std::size_t index(i); index < max; index += cores)
        {
            // internals are the same
        }
    }));

The atomics have been completely removed in favor of each subthread owning a unique set of pixels to process. Each thread starts at a different initial pixel and uses cores as a skip count, guaranteeing every pixel will be ray-traced exactly once. Is this better? In our case it is, but only slightly:

real 0m0.805s    user 0m4.658s    sys 0m0.040s

We have reduced the overall time taken by ~4% and shaved ~20% off the system overhead. An improvement, to be sure, but I harbor a concern about this method. Specifically, each subthread owns a unique subdivision of the image, and none knows about the existence of the others. What happens when, in the course of processing, one core finishes its image slice? That subthread terminates, and now I’m down a core when there may be work to be done elsewhere. Is what’s been gained (a small bump in speed) worth what might potentially be lost (speed lost to unequal distribution of work)?

Conclusion

In the end, then, I reverted the code back to the case where an atomic was managing the shared pool of pixels. That way I know each core will be used during the entire render, accepting a negligible hit from the subthreads blocking one another. The third solution, while promising from the outset, ended up not being worth the worst-case risk. C++11 made this process much more straightforward than it used to be, but it still required some understanding of how the constructs work to formulate a solution that actually improved performance. In all likelihood I will revisit this routine at some point or another with a new idea, in the hopes of eking out a handful more cycles.


Foster Brereton

Senior software developer, @Photoshop core team member, 15 year veteran at @Adobe, and advocate of the Oxford comma.