How I sped up my Rust program: From 30+ minutes to a few seconds

Photo by chuttersnap on Unsplash

Ever since 2017 I have been writing my hobby projects in Rust. It’s not only the language which I fell in love with but also its speed. ❤️ Rust is known for its amazing speed. But, just because the language is fast does not mean all you apps will be as well. Here is my journey of how I sped up a program from running for over half an hour to only a few seconds.

Starting point

Before getting into the details of how I was able to speed up my program, let’s establish a baseline. For this I just created a release build of my program, a parser generator, and measured the execution time with:

$ time cargo run --release

The total execution took 32 minutes! Way too long.

Now the first question to answer is: what part of my code is actually taking so long? Since I am on macOS, my profiling program of choice for Rust is the cargo-instruments plugin. (A good tool for Linux is supposed to be perf.)

The main information from cargo-instruments is that the biggest bottleneck is actually my main routine which translates the grammar definition into the actual parser. While that means that the program does not wast time outside my main functionality, it does not give me a good overview of where the bottlenecks are either.

To actually figure out which parts took too long, I inserted profiling statements to measured the time of every single routine. So let’s actually dive into my performance improvements.

1. Memory usage

While this was not the first thing I noticed, it was the thing which had the biggest impact on the overall performance of the program: Memory usage. In my case around 32GB — not a good value if your computer only has 8GB of RAM.

32 GB memory usage

If a program tries to use more memory than the computer has physically available, the computer will start to swap some of the RAM contents onto the disk and then load it, once it is needed. Which makes it really slow.

As a web developer, your dynamic language of choice probably contains a performance optimization to mitigate excessive memory consumption. Modern run times ensure that you only hold one copy of any given string in memory. If you have multiple variables with the same value, behind the scenes, the run time will only allocate the string once and then reference the value from everywhere else. Thereby reducing the memory consumption. The string will only be copied on mutation.

Why do I tell you this? This is an optimisation, I implemented for my program. So in my case, I have a central data structure which is call Grammar . When you add something to this object, e.g. a token, the object actually turns the token into an internal representation (here literally called InternalToken) by saving the string in a value array and only holding the array index as reference in the internal token.

Quick tangent: A token is a string with meaning in a programming language. if is a token, so is else , but also your_function_name and variable_name .

The central data structure.
Ensuring all token names are only allocated once and then referenced.

2. Memory allocations

Another thing I noticed when starting my program using Instruments was the allocation count: 12+ Million allocations. No wonder that my program is slow.

The first way I was able to reduce the number of allocations was by ensuring that arrays do not contain any duplicates. In my case, the parser generator calculates which input symbols and tokens (e.g. an if) may be read next and kept adding the same elements to an array. The maximum array length with duplicates was 2+ million elements. A huge wast when we considering that the whole language only has about 200 unique tokens.

This optimization can be achieved by literally a single if statement!

Let’s take inventory for a moment: the two optimizations resulted in an execution time cut down of 24 minutes to 8 minutes and a memory reduction from 32GB to about 1.5 GB. That is already pretty good.

3. Re-implement a standard function if you can save memory operations

Another source of a lot of memory allocations was a search for a struct in an array. When searching for a struct in an array, you have to have the struct you are searching the array for.

In my case, I did not have the struct but only its value and therefore needed to create it in order to search for it. Creating an instance of my struct included cloning a few elements which resulted in many unnecessary allocations.

The solution I ended up with was to re-implement the contains function of Vec for this purpose. You obviously should not just re-implement a standard function which is not in your hot path. In my case though, this search is executed a few hundred thousand times so re-implementing and thereby remove the allocation does make a lot of sense.

The new implementation allowed me to search all elements and check the relevant struct properties and thereby removed multiple thousand memory allocations.

Quick tip: The Rust standard library is implemented in Rust so you can therefore read the code and see how the function is implemented.

4. Caching

Performance is the art of not doing things.

The obvious performance optimization: add a cache. In parser generators you often have to calculate which input symbols may be read in next if you are parsing a correct code piece. Luckily a fair junk of that can be cached. This saved me another 60 seconds of execution time.

However, you should always profile your caches. I added around 5 caching layers for different parts of the calculations. Only 2 of them turned out to help the performance whereas 3 actually hindered performance. So caching is actually a two sided story. It can speed up your program dramatically but implemented for the wrong elements it can slow it down as just as much.

5. Bonus: Rust compile times

I will get into my final execution speed and memory usage in just a second. But before we get into that a word about compilation speeds. A parser generator generates parsers — as the name says. I am working on a PHP parser which in its current form generates 25k lines of Rust code.

When all these lines where generated into roughly 2 functions, the rust compiler used up 30GB of memory himself to compile the code. It was really slow (10+ minutes). So I switched the code generation to generate a few thousand small functions which call each other (there was no code duplication and therefore all code is still there). That generates around 5 thousand lines more code but results in the compiler using only around 1.3GB of memory. It also finishes within 2.5 minutes.

In other words, the smaller the functions the faster can Rust compile your source code. Even if that results in more lines of code.

Summary

Let’s rerun my parser generator and see what all these performance optimizations left us with:

$ time cargo run --release
2.53s user 0.13s system 97% cpu 2.748 total

The overall execution time is now about 2.5 seconds and the generator uses only around 150MB of memory. Overall it took me 5 evenings to find all bottlenecks and refactor the application to remove them.

I am very proud of the results and I think it’s time to celebrate 🎉

So summing up, if you notice performance problems in your Rust program, check the following:

  • Are you using an acceptable amount of memory? If you need more memory than the computer has available, you will slow down your program dramatically.
  • Are you allocating memory unnecessarily often? Try to remove all non-mandatory allocations in your hot path.
  • Can you optimize functionality in your hot path even if that means rewriting standard implementations? If that speeds up your hot path, go for it.
  • Can you cache parts of your calculations? If so does the cache actually speed up your program?
  • Bonus: Are you using small functions? The Rust compiler will compile a lot fast with smaller functions.

I hope this entry helps you to speed up your programs. Hopefully your programs run fast all by themselves. 😉