From 46s to 5s - Optimizing a 350 Line Raytracer in Rust

Carl Fredrik Samson
8 min readMar 4, 2019

--

Final output of our 350 line raytracer with samples_count = 16384

If you’re new to Rust, or any other language, and you’re like me, you probably start by porting some code from a familiar language. Depending on your background, this will most likely work pretty well, but in Rust — if you only port code from a language like C# or C++ naively, you will probably miss some of the features that makes Rust special.

I thought this was a fun example to showcase this and what impact this can have on a piece of code showing how small changes can have pretty big impact.

In this post I’ll talk about how to port a short raytracer written in C#/C++ codebase to Rust, then applying some simple optimizations by leveraging some features in Rust. Mark my emphasis on simple, we’re not trying to go all the way here and make it as fast as possible, but we’re making some small but significant changes using methods and tools that you’ll normally use in a Rust project.

This all started when reading this blog post.

If we track this code back to its source we find that this started with a piece of code authored by Andrew Kensler as a “postcard sized raytracer” written in C++ for a Pixar recruitment flier.

FABIEN SANGLARD originally deciphered the code on the postcard and wrote a blog post about it here, if you’re interested:

I highly recommend to read that post if you wonder how this code works and what it does. He also explains that we’re actually writing a pathtracer here and not a traditional raytracer.

Now back to our code. The blog post titled “Is C# a low-level language” Matt Warren ported the code line by line from C++ to C#.

After optimizing the C# code it ran only 10% slower than the C++ showing the similarities between C# and C++, however this also shows how fast DotNet Core has become lately.

I’ll explain briefly some key points about this code:

  1. It is written for brevity (it was to fit on the back side of a postcard remember) not for speed or readability. The code is very slow.
  2. The file-format .ppm was chosen since it’s a simple text file format with a very short header: P6 [height] [width] 255
  3. The parameter samples_count found in the start of the main function is what decides the final quality. A quality of two is displayed below. The image in the header of this page had a sample count of 16384, and took 5,5 hours to render on a 6 core i7 8700 CPU with our optimized solution.
  4. We’ll use a samples count of 2 for benchmarking since that runs under a minute for both C# and our Rust versions:
Output with samples_count = 2

How would this compare to Rust?

Let’s start with the line by line port of the code from C# to Rust (or as close as I could get):

The versions of the different compilers and their compiler options are:

DotNet Core 2.1

Compiled as a release build with TC (Tiered Compilation) set to ON.

Cargo 1.33.0

Compiled with --release flag.

And the corresponding runtime measurements on my computer (Macbook 2013, 2.4 GHz i7):

C#:

time: 47.28, mem: 22.67Mb

Rust:

time: 46.17s, mem: 5.10Mb

Rust is pretty similar to the optimized C# when we run the code ported line by line as similar as possible, but the memory usage is less than 1/4 of C#.

There’s a few things to note here:

  1. The code bears clear signs that it’s ported from C++. There’s a limited use of iterators (other than over ranges) and no use of enums or pattern matching.
  2. Look at the lines 144 and 145. This is rather inefficient since Rust will bounds check every time we index into the array, and we do that a lot during execution. C#’s JIT compiler might optimize this away realizing it’s a hot path and that the indexes are never out of bounds.

We could go “unsafe” and use get_unchecked and save a few seconds (I’ve tried it), and it would still be safe in this example since we know that we won’t access out of bounds but I wanted to keep the code safe. There is a much smarter way to do this with a small change as we’ll see below.

Optimizing part 1. Use iterators.

Let’s address the lowest hanging fruit first: the slow indexing into the global array of characters.

If you look at the original code you’ll see that we always use blocks of 4 numbers that represents our letters in the final image. Here it’s encoded as characters first, then collected as their u8 representations into a Vec<i32> (we collect directly to i32to avoid casting them later on). By grouping this in tuples of i32in our lazy_static function we allow our code to use one of Rust’s native constructs: Iterators.

This small change almost halved the execution time giving us the following result:

Time and memory:

time: 26.46s, mem: 5.12Mb

Good start for a simple change.

Optimizing part 2. Buffer the results if you can.

We can do better, with the tradeoff of using a little bit more memory. If you look at the outer loop, you’ll notice that if we visualize this we go from the lower right corner and up to the upper left corner.

What if we calculate the pixel positions first and then iterate through them secondly?

Our code in main will change like this:

This change had a small but noticable impact on our timings. The results are below, but more importantly this sets up all we need to use a secret weapon in our arsenal, which I’ll explain next.

The results from this change gives us:

time: 23.52s, mem: 11.96Mb

Our time went down a little bit, but our memory usage went up to by a factor of 2.5 though.

Optimizing part 3.

The secret weapon you ask? Have you guessed it? It’s called Rayon:

Rayon is an amazing library to have in your Rust toolbelt. Why? Well, you know how people say parallelizing code can be difficult? That’s not true if you can use Rayon.

Rayon is a paralleization library for Rust, and gives us superpowers if you already have a pice of code that uses an iterator to produce a result. This was was our secret master plan when we changed our outer loop to allow us to divide up the work that needs to be done.

You might think this is a big change but actually, all we need to do is this small change in our main function:

Have you spotted it? I left a hint in the code for you. The changes we made were that we pulled rayon::preludeinto scope and replaced iter() with par_iter() and just like that our code now uses all the cores on our computer to calculate the result. So how did this impact our results?

The result from this last change gives us:

time: 5.50s, mem: 111.50Mb

What happened there?

Well, we’re firing in all cylinders here. All our CPUs are used to calculate the result, but there is a tradeoff in increased memory usage, a rather huge one too.

I stopped here as I promised to keep this pretty simple, but if you have any obvious suggestions to show more simple changes that makes our code more Rusty and performant, please submit a PR to the repo, and I’ll try to update the article if I have time.

In this repository you’ll find the optimized version of the code — our last step in this post.

Conclusion.

We haven’t really touched the algorithm or done any major changes. Our line count stayed pretty much the same since we started on 329 LOC and ended up on 335.

Porting C# or C++ code directly to Rust without thinking about how you would have done that if you programmed it in Rust from the start will most likely not expose you to a lot of what makes Rust what it is. The functional aspects of Rust might seem foreign at first but used wisely it will enable some optimizations that will be much harder to do without (like our use of Rayon in this example).

Now, the code above is not an example of good Rust code by any means. It’s still a mix between the original C++ code sprinkled with some Rust constructs, but the goal here was to find a fun way to point to some interesting things to look out for and the impact that might have when coming in to Rust from a language like C#, Java or C++.

We’re finished for now, but I hope you had an interesting read and if you’re new to Rust and port over code from other languages you’ve got some pointers on where to take advantage of Rust to speed up your own code even more.

Bonus: Fixing the huge increase in memory usage.

This will expose you to iterators even more, but I’ll try to explain in the comments what’s happening so bear with me. We’re still in the same part of our main function as we left off:

Thanks to jabagawee for this PR

This is the first time we actually change the method we use to calculate our pixel positions, and there are other ways to accomplish nearly the same without these changes. If you’re interested you can have a look at this gist. However this kept the readability better and the changes were pretty minor. Our change has a significant effect on our stats though:

Our last and final results are now this:

time: 5.14, mem: 7.27Mb

So our memory usage is back to normal with the added bonus of even a slightly faster execution.

Updates:

2019–03–07: Changed header picture and merged a PR that removed the backwards “D”’s inside P and R in our rendering result. No effect on any measurements but the code now gives the exact same result as the C# and C++ versions.

2019–03–08: Added bonus section thanks to a PR that showed how our memory usage could get back to normal with a minor change.

📝 Read this story later in Journal.

🗞 Wake up every Sunday morning to the week’s most noteworthy Tech stories, opinions, and news waiting in your inbox: Get the noteworthy newsletter >

--

--