
Part II

Fearless Concurrency with Rust

Threading

Herbert Wolverson
The Pragmatic Programmers
Jul 7, 2022

In part 1 of this series, we discussed how Rust uses lifetimes and the borrow checker to save you from data races. We even compared a C++ program containing a threading bug with an equivalent Rust program, and saw that Rust won’t let the buggy version compile.

In the second part of this series, we’re going to look at ways in which Rust can help with a common design pattern: breaking up a large block of calculations, dividing them between CPUs, and combining the results. We’ll continue to use a deliberately inefficient function to find prime numbers in a set of inputs. This isn’t overly useful work, but it’s a good representation of common data-processing tasks. The function we’re using is the same as last time:
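
A minimal sketch of such a function (assuming simple trial division; the original listing may differ):

fn is_prime(n: u32) -> bool {
    // Deliberately inefficient: trial division by every number below n.
    (2..n).all(|divisor| n % divisor != 0)
}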

For each of the strategies we consider, we’ll require our program to:

  • Create a vector containing the numbers 2 through 2,000,000.
  • Filter the vector to contain only prime numbers, confirmed with the is_prime function.
  • Print the number of prime numbers it found and the time taken to perform the calculation.

I’m running all of these on my work PC. It has a 12th-generation Intel Core i7 with 12 physical cores (and 20 logical CPUs), 32 GB of RAM, and fast M.2 storage.

Single-Threaded Workload

As a baseline, we’ll start with a single-threaded version. We’re using Rust’s iterators to keep it as short and to the point as possible:
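
A minimal sketch of what that looks like (the exact listing may differ; it assumes the is_prime function shown above):

use std::time::Instant;

fn main() {
    let now = Instant::now();
    // Build the candidate list and keep only the primes.
    let candidates: Vec<u32> = (2..2_000_000).collect();
    let primes: Vec<u32> = candidates
        .into_iter()
        .filter(|n| is_prime(*n))
        .collect();
    println!(
        "Found {} primes in {:.2} seconds",
        primes.len(),
        now.elapsed().as_secs_f32()
    );
}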

Using a single core, the calculation takes 712 seconds in debug mode (cargo run) and 89 seconds in release mode (cargo run --release).

That serves as a good baseline, but we can definitely do better than that!

Dividing and Threading with Native Rust

Using only native Rust (no external dependencies), we can make use of some Rust features to help us divide the workload. Let’s hard-code the assumption that I have 12 CPUs (otherwise we’d have to use the num_cpus crate — or equivalent — to discover that in a portable way):
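
A sketch of the approach described below (the exact listing may differ; NUM_THREADS and the chunk-size arithmetic are assumptions chosen to produce 12 chunks):

use std::time::Instant;

const NUM_THREADS: usize = 12; // Hard-coded CPU count, as discussed above.

fn main() {
    let now = Instant::now();
    let candidates: Vec<u32> = (2..2_000_000).collect();

    // Divide the candidates into NUM_THREADS chunks, spawning a thread per chunk.
    let chunk_size = (candidates.len() / NUM_THREADS) + 1;
    let mut threads = Vec::new();
    for chunk in candidates.chunks(chunk_size) {
        // Take ownership of the chunk so the thread doesn't borrow `candidates`.
        let chunk = chunk.to_owned();
        threads.push(std::thread::spawn(move || {
            chunk
                .into_iter()
                .filter(|n| is_prime(*n))
                .collect::<Vec<u32>>()
        }));
    }

    // Drain the handles, waiting for each thread and combining the results.
    let mut primes = Vec::new();
    for handle in threads.drain(..) {
        primes.extend(handle.join().unwrap());
    }

    println!(
        "Found {} primes in {:.2} seconds",
        primes.len(),
        now.elapsed().as_secs_f32()
    );
}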

This code starts by creating a vector of the numbers 2 through 2,000,000. It then uses chunks, a method Vec inherits from slices, to divide the candidates into 12 chunks, and iterates over each chunk in turn.

chunk.to_owned() isn’t obvious: we “take ownership” of each chunk as we process it, converting it from a slice borrowed from candidates into a vector of actual numbers. In release mode, this is essentially free; the compiler optimizes away several memory copies. It’s still not as clear as I’d like.

Now that we have our own personal chunk, we spawn a thread that returns a Vec<u32> containing the prime numbers from that chunk, and we keep the JoinHandle from each spawn. A JoinHandle is a handle to the running thread; calling join on it waits for the thread to finish and yields a Result wrapping the thread’s return value.

At the end, we drain our vector of thread handles, join each thread, and combine the results. (drain removes each item from the vector as it is processed; since we won’t be reusing the vector, it’s a good way to avoid ownership issues.)

This strategy offers a significant speed-up. It finishes in 123 seconds in debug mode and only 15 seconds in release mode.

Using Rayon

Rayon is a popular Rust crate that implements a lot of threading features. It offers a worker-thread pool, work stealing (an idle thread will take work from the pending queue), parallel iterators, and many other features to make multi-threading easy. The downside is that you are adding a dependency to your program.

You can enable Rayon by adding it to your program’s Cargo.toml file:

[dependencies]
rayon = "1.5"

The big advantage of using Rayon is that it greatly simplifies your code. You can make some tiny changes to the original single-threaded version and benefit from an immediate speed-up. Here’s the Rayon version:
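
A sketch (the exact listing may differ; it reuses the is_prime function from before):

use rayon::iter::{IntoParallelIterator, ParallelIterator};
use std::time::Instant;

fn main() {
    let now = Instant::now();
    // The same filter as the single-threaded version, but over a parallel iterator.
    let primes: Vec<u32> = (2..2_000_000)
        .into_par_iter()
        .filter(|n| is_prime(*n))
        .collect();
    println!(
        "Found {} primes in {:.2} seconds",
        primes.len(),
        now.elapsed().as_secs_f32()
    );
}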

Notice that we’re importing two traits from Rayon: IntoParallelIterator and ParallelIterator. These are required to transform the range 2..2_000_000 into a parallel iterator and to perform the iteration in parallel. We don’t divide our candidates into chunks or worry about joining threads; Rayon takes care of all of that.

Rayon also retains all of the safety guarantees you expect from Rust.

Best of all? It’s even faster. It completed in 72 seconds in debug mode, and 8 seconds in release mode.

The Results

Let’s compare the results of the three strategies we tested:
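
The timings, gathered from the runs reported above:

Strategy            Debug (seconds)    Release (seconds)
Single-threaded     712                89
Chunked threads     123                15
Rayon               72                 8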

In terms of complexity, the single-threaded version is the shortest, followed by Rayon. The chunked version is more than twice the size of the single-threaded version.

Surprisingly, despite the added complexity, the chunk strategy is not the fastest. That prize goes to Rayon.

Rayon is only a little faster than the chunked version, but combined with the simplicity of using the library, we can safely say that for iterator-based data processing, Rayon is the clear winner.

Wrap-Up

Dividing workloads into threads with native Rust is a little complicated, but it works very well and offers a good performance boost over a single-threaded model. At 47 lines of code (including comments), it’s a lot more complicated than the single-threaded version, but you can be certain that it is fast and safe.

If adding a dependency on Rayon is acceptable to you, you can easily combine the simplicity of the single-threaded model with the sheer horsepower of a full-featured concurrency model. It’s even faster than the native Rust version and retains all of the safety features you expect.

In the final part of this series, we’ll look at the other concurrency system built into Rust: asynchronous execution.
