Rust parallelism for non-C/C++ developers
The bare minimum.
The majority of people coming to Rust have a C/C++ background, which allows them to transition easily into Rust parallelism since it is so similar. However, for many people coming from other languages, it is a challenge. In this post, we will walk through the standard Rust parallelism tools as well as the motivation behind them. This will require a hardware deep dive at the beginning, followed by an explanation of low-level tools like atomics, and ending with an explanation of high-level tools like Mutex. Finally, we will explain how Rust guarantees safety in multi-threaded applications.
In Rust, when you hear people talking about parallelism and concurrency, it is mostly about frameworks, because Rust as a language does not favor any specific parallelism or concurrency abstraction and instead provides the bare minimum: standard threads and several synchronization primitives. This bare minimum is what we will explore in this post.
Why parallelism is hard
First, we need to understand why parallelism is hard, and the reason is that the hardware, the OS, and the compilers are too complex. Since the 1970s, processor cores have not worked with memory directly; instead they go through a complex hierarchy of caches and write buffers.
We don’t even need to review the entire hierarchy to understand why it is hard. Let us ignore the caches and consider only the write buffers. Write buffers are absolutely essential for processor performance, since writes to memory are expensive and we want to batch them as much as possible.
Consider the following program that we run on two cores. The program has a critical section that we do not want to be executed on both cores simultaneously. One way to ensure that is to use intent flags, which the cores use to declare their intent to enter the critical section. Logically, if one of the cores has entered the critical section, then its intent flag is non-zero and the other core will not enter. However, if both cores wrote their intent flags into their write buffers without flushing them, then both will enter the critical section, because each will read the stale 0 value of the other's flag from memory.
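The scenario above can be sketched in Rust. This is a hypothetical illustration: Relaxed atomics stand in for the plain, buffered stores and loads, since safe Rust will not let us race on plain variables, and the failure is possible but not guaranteed on any given run.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static FLAG1: AtomicUsize = AtomicUsize::new(0);
static FLAG2: AtomicUsize = AtomicUsize::new(0);
static IN_CRITICAL: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let t1 = thread::spawn(|| {
        // Declare intent; Relaxed mimics a plain write sitting in the write buffer.
        FLAG1.store(1, Ordering::Relaxed);
        if FLAG2.load(Ordering::Relaxed) == 0 {
            // Critical section: with Relaxed ordering, both threads may get here.
            IN_CRITICAL.fetch_add(1, Ordering::Relaxed);
        }
    });
    let t2 = thread::spawn(|| {
        FLAG2.store(1, Ordering::Relaxed);
        if FLAG1.load(Ordering::Relaxed) == 0 {
            IN_CRITICAL.fetch_add(1, Ordering::Relaxed);
        }
    });
    t1.join().unwrap();
    t2.join().unwrap();
    // Zero, one, or even both threads may have entered the critical section.
    println!("entered: {}", IN_CRITICAL.load(Ordering::Relaxed));
}
```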
Another way of thinking about the problem is to think that the write buffer has reordered the operations by executing flag2 != 0 before flag1 = 1. Similarly, we can think of caches as also reordering operations.
Operations are also reordered by the compiler that does optimizations, like subexpression elimination and by the processor that does prefetching and speculation, among other things. As a result, the order of operations in the source code will be different from the order executed by a specific core. In fact, the same code can have a different order of operations when executed in parallel on two separate cores.
The order of operations would not be a problem if we were not using threads running on different cores to collaborate with each other. Collaborating threads require us to argue that operation X happened on thread A before operation Y on thread B, like in the example above. Multithreading requires us to be able to talk about causality between operations across threads. Without special tools that wouldn’t be possible.
Atomics

Atomics are low-level synchronization primitives that allow us to establish causality by restricting the order of operations. These primitives need to be processor-level, since in addition to restricting the compiler we also want to prevent the processor from cache-level reordering and other tricks. Atomics give two guarantees:
- we can perform read/write operations on them without fear of torn reads or writes;
- atomic operations have certain guarantees about the order in which they execute, relative to each other, even across the threads. In fact, atomics even enforce the order on non-atomic operations as we will see later.
Each operation on an atomic variable is required to have an ordering type:
- Ordering::Relaxed;
- Ordering::Acquire and Ordering::Release (or their joint alternative Ordering::AcqRel);
- Ordering::SeqCst, short for sequential consistency.
You will almost always use SeqCst, which applies the strongest constraints and is the easiest to reason about. Relaxed applies the weakest constraints and is extremely non-intuitive, so unless you are developing low-level high-performance code you should stay away from it. Acquire/Release is a middle ground in terms of cognitive complexity, but you will still almost never prefer it over SeqCst. However, understanding Acquire and Release is very helpful for understanding high-level synchronization primitives, like Mutex.
As we said before, Acquire and Release are hardware-level operations, because they generate special instructions for the hardware. One can use Acquire and Release in the following way:
Acquire can only be used with load operations, while Release can only be used with store operations. Acquire and Release have the following rules:
- Acquire: all memory accesses that happen after it in the code stay after it, as visible by all threads (remember that threads B and C can perceive the order of operations on memory done by thread A differently);
- Release: all memory accesses that happen before it in the code stay before it, as visible by all threads.
So in the following situation, if thread A executes code in which some writes to a surround an Acquire/Release pair, threads B and C can perceive the outside operations on a as moved inside the pair, or swapped with each other. However, they cannot see a = "Bye", which appears after the Acquire, happening before the Acquire.
Using Acquire and Release we can establish causality between the threads. For instance, if one thread sets a and then sets a flag with a Release store, then another thread that observes the flag with an Acquire load can assume that a must have been set to its new value. Using Acquire and Release we can also implement a fully-functional spinlock that guards a certain area of the code from concurrent access by several threads:
Note that, in addition to the above rules, Acquire and Release operations obey one more obscure rule, which prevents two spinlocks like the one above from interfering with each other. Rust inherits it from the C11 memory model.
Unfortunately, in many situations, Acquire and Release are still hard to argue about. Consider the following code:
In this code, it is possible for both messages to be printed, which means threads C and D have inconsistent views of which event happened first. In other words, with Acquire and Release there is no global order of operations; they merely create transitive causality. SeqCst, by contrast, establishes a global order of operations. If we replace the above orderings with SeqCst, then at most one message will be printed. More formally, SeqCst follows the rule:
All atomic operations that happen before/after a SeqCst operation stay before/after it on all threads. Ordinary non-atomic reads and writes may move down across an atomic read, or up across an atomic write.
SeqCst involves emitting a memory barrier (not to be confused with a memory fence) that prevents undesirable reordering. Unfortunately, this makes SeqCst more expensive than pure Acquire/Release; however, the cost is still negligible in the big picture, and it is therefore highly recommended to use SeqCst whenever possible.
Threads

In the code above we used threads without explaining what they are. In general, there are three things that people call threads:
- hardware threads, a.k.a hyperthreading;
- OS threads;
- green threads.
Hyperthreading is when the processor virtually splits each physical core into two virtual cores, allowing better load distribution. OS threads are created and managed by the OS; each thread executes its own piece of code, and threads take turns running on the virtual cores. Most OSes allow a practically unlimited number of threads, but starting one is expensive, since it requires allocating a stack. Green threads are implemented in user software and run on top of OS threads. Their advantages are that they work even in environments without OS thread support, and that they are much faster to spin up than OS threads.
Unfortunately, Rust has removed green threads and now allows only bare OS threads. This was done because green threads were not a zero-cost abstraction, and zero-cost abstractions are a cardinal rule of Rust that differentiates it from many other languages.
Zero cost abstractions. What you don’t use, you don’t pay for.
And further: What you do use, you couldn’t hand code any better.
- Bjarne Stroustrup
Green threads would require having a heavy runtime that every program would have to pay for, even if it does not use them.
OS threads, however, are not that expensive if one knows how to use them properly. Consider the previous example with the spinlock: the loop burns CPU while it waits for the lock to be released. We can fix that with std::thread::yield_now. Since the OS might have more threads than virtual cores, there might be another thread waiting to be scheduled; yield_now tells the OS that it can try running another thread on that virtual core while the first one waits for its lock.
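The wait loop with yielding might look like this (same spinlock shape as before):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static LOCKED: AtomicBool = AtomicBool::new(false);

fn lock() {
    while LOCKED
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        // Instead of burning the CPU, offer the rest of our time slice
        // to some other thread that is waiting to run.
        thread::yield_now();
    }
}

fn unlock() {
    LOCKED.store(false, Ordering::Release);
}

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|i| {
            thread::spawn(move || {
                lock();
                println!("thread {} is in the critical section", i);
                unlock();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```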
Mutex and RwLock
In the previous section, we talked about atomics, which are low-level synchronization primitives operating on the hardware level. Mutex and RwLock are high-level synchronization primitives, and they operate on the OS level. We will not cover the other synchronization primitives, like Barrier, in this post, since we have provided enough background to learn about them from their documentation.
Mutex and RwLock are similar to the spinlock that we looked at before, with one major difference: while a spinlock consumes CPU waiting for the lock, Mutex and RwLock release the current OS thread and do not burn CPU. As such, they have to operate on the OS level instead of the pure hardware level, similar to our modification of the spinlock that yields the thread. However, the major difference between the yielding spinlock and Mutex is that with Mutex the OS knows exactly when to wake up the waiting thread once the lock is released, whereas with the yielding spinlock the OS sporadically wakes up the waiting thread in the hope that the lock has been released. Also, the implementation of Mutex is platform-specific.
RwLocks are similar to Mutexes in that they guard a certain area of code from concurrent access, with several tradeoffs:
- Mutex locks the code for both reads and writes, while RwLock allows concurrent reads as long as there are no writes, mimicking the borrow checker;
- Mutex is a sync-maker. In the next section, we will see what the Send and Sync traits are, and we will revisit Mutex.
Safety Through Send and Sync
In the previous post, we saw how Rust provides single-threaded safety through the borrowing rules and lifetimes. The Send and Sync traits extend this safety to multithreaded applications.
The most important thing to understand about Rust safety is that it only prevents data races and nothing else. Data races happen when one thread writes into an area of memory while another thread reads from it or writes into it, which results in torn reads and writes.
Data races are especially nasty because they can cause undefined behavior. However, their cause is well-defined and therefore it can be detected automatically or prevented on a language level, the way Rust does it.
Race conditions, on the other hand, are semantic errors. For instance, we can incorrectly assume that one event always happens before another. Race conditions break domain logic invariants and are usually a sign of incorrect synchronization, or a lack of it. Rust cannot save us from making semantic-level mistakes; in fact, designing such a language would be infeasible. Deadlocks and livelocks are also semantic errors resulting from broken domain logic invariants, e.g. we assume that lock A is always taken while holding lock B, but somewhere in our code we take them in the opposite order, causing a deadlock. Therefore, the only thing we are going to talk about is how Rust prevents data races.
Send and Sync are meant to prevent data races. They are automatically derived, unsafe, marker traits:
- Automatic traits are not implemented by engineers explicitly; instead, the compiler derives them automatically. The Send trait marks structs that are safe to send between threads, and the Sync trait marks structs that are safe to share between threads. The compiler decides that a struct is Send if all of its fields are Send, and Sync if all of its fields are Sync.
- Unsafe traits require the unsafe keyword to implement.
- Marker traits do not have methods and are only used to express certain properties of the structs that implement them. The Eq trait is another example of a marker trait: it tells that a struct that already implements the equality operation can be used as if this operation were reflexive, symmetric, and transitive.
Most of the primitive types are Send and Sync, and as a result pretty much all of the standard types are Send and Sync too. However, Cell and RefCell are not Sync, because they implement interior mutability, which means operations on them, if executed concurrently, can cause data races. Also, Rc is not Send, because cloning it copies the pointer to the same data, so the threads would not even need to share the same copy in order to cause a data race; therefore Rust bans sending Rc across threads altogether. Interestingly, the way Cell and RefCell tell the compiler about their unsafety is by wrapping their internal fields in UnsafeCell, whose whole purpose is to prevent the automatic derivation of Sync. Rc does not use UnsafeCell for this and instead declares itself !Send and !Sync explicitly.
A reference to an object that implements Sync can be safely sent to another thread, and the other way around. In other words, &T: Send implies T: Sync, and T: Sync implies &T: Send. Sending an object between threads is quite common, while sharing is less common. Usually, when we want several threads to access the same object, we wrap it in a smart pointer, like Arc, which results in us sending copies of the pointer instead of actually sharing the object. To truly share an object we need to share a reference to it, like this:
This, however, will not work in most cases, since std::thread::spawn only takes closures with a 'static lifetime. The only way to actually borrow a variable from the stack is to use a scoped thread, e.g. from a third-party library like crossbeam.
Now, let us talk again about Mutex and RwLock. Formally, their implementations look like this:
```rust
impl<T: ?Sized + Send> Send for Mutex<T>
impl<T: ?Sized + Send> Sync for Mutex<T>
impl<T: ?Sized + Send> Send for RwLock<T>
impl<T: ?Sized + Send + Sync> Sync for RwLock<T>
```
It means that if we wrap an object of type T that implements Send but does not implement Sync into a Mutex, then Mutex<T> will become both Send and Sync; in other words, Mutex is a sync-maker. RwLock, though, is not a sync-maker: since several threads can have simultaneous read access to the underlying object, the object itself has to be Sync. Mutex, on the other hand, prevents any kind of simultaneous access, and therefore one can think of the object as being sent to the thread that currently holds the lock.
The following resources were used to compile the post:
- Concurrency chapter from Rustonomicon;
- Zero-Cost Async IO talk by Without Boats about the past and the future of Rust concurrency;
- atomic<> Weapons is a great talk by Herb Sutter that covers many nitty-gritty details of C++ concurrency;
- C++ Concurrency in Action is a great book by Anthony Williams that gives many useful examples that I was not able to include in this post;
- The Stack Overflow concurrency answers by Anthony Williams contain some excerpts from his book.
Finally, thanks to Gail Hernandez, Alex Skidanov, Bowen Wang, and everyone else on our Near Protocol team. For those interested in Near Protocol: we are building a sharded general-purpose blockchain with a huge emphasis on usability. If you like our write-ups, follow us on Twitter to learn when we post new content:
If you want to be more involved, join our Discord channel where we discuss all technical and non-technical aspects of Near Protocol, such as consensus, economics, and governance:
Jack O’Connor and Gregory Terzian pointed out that SeqCst only guarantees a global order for atomics: ordinary reads and writes may move down across an atomic read, or up across an atomic write. Also, SeqCst involves inserting a memory barrier, and not a memory fence. Josh Stone pointed out the &T: Send ↔ T: Sync inaccuracy.