Rust parallelism for non-C/C++ developers

The bare minimum.

The majority of people coming to Rust have a C/C++ background, which allows them to transition easily into Rust parallelism since it is so similar. For many people coming from other languages, however, it is a challenge. In this post, we will walk through the standard Rust parallelism tools and the motivation behind them. This requires a hardware deep dive at the beginning, followed by an explanation of low-level tools like atomics, and ends with an explanation of high-level tools like Mutex. Finally, we will explain how Rust guarantees safety in multithreaded applications.

In Rust, when you hear people talking about parallelism and concurrency, it is mostly about frameworks, because Rust as a language does not favor any specific parallelism or concurrency abstraction; instead, it provides the bare minimum: standard threads and several synchronization primitives. This bare minimum is what we will explore in this post.

Why parallelism is hard

First, we need to understand why parallelism is hard, and the reason is that the hardware, the OS, and the compiler are all very complex. Since the 1970s, processor cores have not worked with memory directly; instead, they go through a complex hierarchy of caches and write buffers.

Taken from Atomic <> Weapons

We don’t even need to review the entire hierarchy to understand why it is hard. Let us remove all the caches and consider only the write buffers. Write buffers are essential for processor performance, since writes to memory are expensive and we want to batch them as much as possible.

Taken from Atomic <> Weapons

Consider the following program that we run on two cores. The program has a critical section that we do not want to be executed on both cores simultaneously. One way to ensure that is to use intent flags, which the cores use to declare their intent to enter the critical section. Logically, if one of the cores has entered the critical section, then its intent flag is non-zero and the other core will not enter. However, if both cores wrote their intent flags without flushing their write buffers, they will both enter the critical section, because each will read a 0 value for the other core’s intent flag from memory.
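A minimal sketch of this flag protocol (the names are made up for illustration; the Relaxed operations here stand in for plain stores and loads, which the hardware is free to buffer and reorder):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Each core first raises its own intent flag, then checks the other core's
// flag before entering the critical section.
fn try_enter(my_flag: &AtomicUsize, other_flag: &AtomicUsize) -> bool {
    my_flag.store(1, Ordering::Relaxed);    // "flag1 = 1" may sit in a write buffer
    other_flag.load(Ordering::Relaxed) == 0 // "flag2 != 0" check may read stale memory
}

fn main() {
    let flag1 = AtomicUsize::new(0);
    let flag2 = AtomicUsize::new(0);
    // On a single core the protocol looks correct:
    assert!(try_enter(&flag1, &flag2));  // core 1 raises flag1 and enters
    assert!(!try_enter(&flag2, &flag1)); // core 2 sees flag1 != 0 and is refused
    // On two real cores, however, both stores can linger in write buffers,
    // both loads can then observe 0, and both cores enter simultaneously.
}
```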

Another way of thinking about the problem is that the write buffer has reordered the operations, executing flag2 != 0 before flag1 = 1. Similarly, we can think of caches as also reordering operations.

Operations are also reordered by the compiler, which performs optimizations like common subexpression elimination, and by the processor, which performs prefetching and speculation, among other things. As a result, the order of operations in the source code can differ from the order executed by a specific core. In fact, the same code can exhibit a different order of operations when executed in parallel on two separate cores.

The order of operations would not be a problem if threads running on different cores did not need to collaborate with each other. Collaborating threads require us to argue that operation X on thread A happened before operation Y on thread B, as in the example above. In other words, multithreading requires us to reason about causality between operations across threads, and without special tools that would not be possible.

Low-Level Primitives

Atomics are low-level synchronization primitives that give us causality by restricting the order of operations. These primitives need to work at the processor level since, in addition to restricting the compiler, we want to prevent the processor from doing cache-level reordering and other tricks. Atomics give two guarantees:

  • we can perform read/write operations on them without fear of torn reads or writes;
  • atomic operations have certain guarantees about the order in which they execute relative to each other, even across threads. In fact, atomics even enforce an order on non-atomic operations, as we will see later.

Each operation on an atomic variable is required to have an ordering type:

  • Ordering::Relaxed
  • Ordering::Acquire and Ordering::Release
    (or their joint alternative Ordering::AcqRel)
  • Ordering::SeqCst — short for sequential consistency

You will almost always use SeqCst, which applies the strongest constraints and is the easiest to reason about. Relaxed applies the weakest constraints and is extremely non-intuitive, so unless you are developing low-level, high-performance code you should stay away from it. Acquire/Release is a middle ground in terms of cognitive complexity, but you will still almost never prefer it over SeqCst. However, understanding Acquire/Release is very helpful for understanding high-level synchronization primitives like Mutex and RwLock.

Acquire/Release

As we said before, Acquire/Release are hardware-level operations: they generate special instructions for the processor. One can use Acquire/Release in the following way:
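A minimal illustration (the flag name is made up): a Release store publishes a value, and an Acquire load observes it:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical flag used to publish data from one thread to another.
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    // Release may only be used with stores:
    READY.store(true, Ordering::Release);
    // Acquire may only be used with loads:
    let is_ready = READY.load(Ordering::Acquire);
    assert!(is_ready);
}
```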

Acquire can only be used with load operations, while Release can only be used with store operations. Acquire and Release have the following rules:

Acquire — all memory accesses that happen after it in the code stay after it, as seen by all threads (remember that threads B and C can perceive the order of operations on memory done by thread A differently);
Release — all memory accesses that happen before it in the code stay before it, as seen by all threads.

So in the following situation, if thread A executes the code on the left, threads B and C may observe the operations on a that happen between the Acquire and the Release moved around and swapped, see below. However, they cannot see a = "Bye" happening before the Acquire.

Taken from Atomic <> Weapons

With Acquire/Release we can establish causality between the threads. For instance, in the following code, we can assume that if b is true then a must have been set to true also.
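A sketch of this situation, using hypothetical flags a and b: the Release store to b publishes the earlier write to a, so an Acquire load that sees b as true is guaranteed to see a as true as well:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static A: AtomicBool = AtomicBool::new(false);
static B: AtomicBool = AtomicBool::new(false);

fn main() {
    let writer = thread::spawn(|| {
        A.store(true, Ordering::Relaxed);  // happens before the Release store below
        B.store(true, Ordering::Release);
    });
    let reader = thread::spawn(|| {
        if B.load(Ordering::Acquire) {
            // The Acquire load observed the Release store, so the earlier
            // write to A is guaranteed to be visible here too.
            assert!(A.load(Ordering::Relaxed));
        }
    });
    writer.join().unwrap();
    reader.join().unwrap();
    assert!(A.load(Ordering::Relaxed) && B.load(Ordering::Relaxed));
}
```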

Using atomic Acquire/Release we can implement a fully-functional spinlock that guards a certain area of the code from concurrent access by several threads:
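A minimal spinlock sketch in this spirit (an illustration, not the exact code from the post): the Acquire on a successful lock and the Release on unlock keep the critical section from leaking outside the lock:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// A minimal spinlock: `locked == true` means some thread holds the lock.
struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    fn lock(&self) {
        // Spin until we flip `locked` from false to true. Acquire ensures the
        // critical section cannot be reordered above the successful lock.
        while self
            .locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Acquire)
            .is_err()
        {}
    }

    fn unlock(&self) {
        // Release ensures writes in the critical section stay before the unlock.
        self.locked.store(false, Ordering::Release);
    }
}

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let lock = Arc::new(SpinLock::new());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let lock = Arc::clone(&lock);
            thread::spawn(move || {
                for _ in 0..1000 {
                    lock.lock();
                    // Critical section: the load/store pair below is not itself
                    // atomic, but the lock guarantees mutual exclusion.
                    let v = COUNTER.load(Ordering::Relaxed);
                    COUNTER.store(v + 1, Ordering::Relaxed);
                    lock.unlock();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(COUNTER.load(Ordering::Relaxed), 4000);
}
```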

Note that, in addition to the above restrictions, Acquire/Release operations have one more obscure rule that prevents spinlocks like the one above from interfering with each other; Rust inherits it from the C11 memory model.

Sequential consistency

Unfortunately, in many situations, Acquire/Release are still hard to reason about. Consider the following code:

Taken from Atomic <> Weapons
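A sketch of this classic experiment (sometimes called independent reads of independent writes), with the orderings passed as parameters so the Acquire/Release and SeqCst variants can be compared; the names are made up:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// One round: two writers set x and y; two observers each wait for "their"
// flag and check whether the other flag is still false. Returns how many
// observers concluded "my write came first".
fn round(store: Ordering, load: Ordering) -> usize {
    let x = Arc::new(AtomicBool::new(false));
    let y = Arc::new(AtomicBool::new(false));
    let first = Arc::new(AtomicUsize::new(0));

    let mut handles = Vec::new();
    let (xw, yw) = (Arc::clone(&x), Arc::clone(&y));
    handles.push(thread::spawn(move || xw.store(true, store)));
    handles.push(thread::spawn(move || yw.store(true, store)));

    let (xc, yc, fc) = (Arc::clone(&x), Arc::clone(&y), Arc::clone(&first));
    handles.push(thread::spawn(move || {
        while !xc.load(load) {} // wait until x is set
        if !yc.load(load) {
            println!("x was set first!");
            fc.fetch_add(1, Ordering::Relaxed);
        }
    }));
    let (xd, yd, fd) = (Arc::clone(&x), Arc::clone(&y), Arc::clone(&first));
    handles.push(thread::spawn(move || {
        while !yd.load(load) {} // wait until y is set
        if !xd.load(load) {
            println!("y was set first!");
            fd.fetch_add(1, Ordering::Relaxed);
        }
    }));
    for h in handles {
        h.join().unwrap();
    }
    first.load(Ordering::Relaxed)
}

fn main() {
    // With Release stores and Acquire loads the memory model allows BOTH
    // messages to print: the observers may disagree on which write came first.
    round(Ordering::Release, Ordering::Acquire);

    // With SeqCst there is a single global order of operations, so at most
    // one observer can claim that its write came first.
    for _ in 0..100 {
        assert!(round(Ordering::SeqCst, Ordering::SeqCst) <= 1);
    }
}
```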

In this code, it is possible for both messages to be printed, which means threads C and D have inconsistent views of which event happened first. In other words, with Acquire/Release there is no global order of operations; Acquire/Release merely create transitive causality, while SeqCst establishes a global order of operations. If we replace the above Acquire/Release with SeqCst, then at most one message will be printed. More formally, SeqCst follows this rule:

All atomic operations that happen before/after a SeqCst operation stay before/after it on all threads. Ordinary non-atomic reads and writes, however, may still move down across an atomic read or up across an atomic write.

In Rust, SeqCst involves emitting a memory barrier (not to be confused with a memory fence) that prevents undesirable reordering. Unfortunately, SeqCst is more expensive than pure Acquire/Release; however, the cost is still negligible in the global picture, and therefore it is highly recommended to use SeqCst whenever possible.

High-Level Primitives

In the code above we used threads without explaining what they are. In general, there are three things that people call threads:

  • hardware threads, a.k.a hyperthreading;
  • OS threads;
  • green threads.

Hyperthreading is when the processor virtually splits each physical core into two virtual cores, allowing better load distribution. OS threads are created and managed by the OS; each thread executes its own piece of code, and they take turns running on the virtual cores. Most OSes allow a practically unlimited number of threads, but starting them is expensive, since each requires the allocation of a stack. Green threads are implemented in user software and run on top of OS threads. The advantages of green threads are that they work even in environments without OS thread support and that they are much faster to spin up than regular threads.

Unfortunately, Rust has removed green threads and now provides only bare OS threads. This was done because green threads were not a zero-cost abstraction, which is a cardinal rule of Rust that differentiates it from many other languages.

Zero cost abstractions. What you don’t use, you don’t pay for.
And further: What you do use, you couldn’t hand code any better.
- Bjarne Stroustrup

Green threads would require having a heavy runtime that every program would have to pay for, even if it does not use them.

OS threads, however, are not that expensive if one knows how to use them properly. Consider the previous example with the spinlock: the loop will burn CPU while it waits for the lock to be released. We can fix that by using yield_now:
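A sketch of the idea (a simplified single flag stands in for the spinlock): the waiting thread yields its time slice instead of spinning flat out:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let locked = Arc::new(AtomicBool::new(true));
    let l = Arc::clone(&locked);
    let waiter = thread::spawn(move || {
        // Instead of burning CPU in a tight loop, hint to the OS that another
        // thread may run on this virtual core while we wait for the lock.
        while l.load(Ordering::Acquire) {
            thread::yield_now();
        }
    });
    locked.store(false, Ordering::Release); // release the "lock"
    waiter.join().unwrap();
    assert!(!locked.load(Ordering::Acquire));
}
```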

Since the OS might have more threads than virtual cores, there might be another thread that waits to be scheduled. yield_now tells the OS that it can try running another thread on that virtual core, while the first one waits on its lock.

Mutex and RwLock

In the previous section, we talked about atomics, which are low-level synchronization primitives operating at the hardware level. Mutex and RwLock are high-level synchronization primitives, and they operate at the OS level. We will not cover Channel, CondVar, and Barrier in this post; the background provided here should be enough to learn about them from their documentation.

Mutex and RwLock are similar to the spinlock that we looked at before, with one major difference: while a spinlock consumes CPU waiting for the lock, Mutex and RwLock release the current OS thread and do not burn CPU. As such, they have to operate at the OS level instead of the pure hardware level, similar to our modification of the spinlock that yields the thread. However, the major difference between the yielding spinlock and Mutex is that with Mutex the OS knows to wake up the waiting thread once the lock is released, while with the yielding spinlock the OS will sporadically wake up the waiting thread in the hope that the lock has been released. Also, the implementation of Mutex is platform-specific.

RwLocks are similar to Mutexes because they guard a certain area of code from being accessed concurrently, with several tradeoffs:

  • Mutex locks the code from both reads and writes, while RwLock allows concurrent reads as long as there are no writes, mimicking the borrow checker;
  • Mutex is a sync-maker. In the next section, we will see what Send and Sync traits are and will revisit Mutex and RwLock.
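A minimal illustration of both (the names are made up): Mutex serializes all access, while RwLock admits concurrent readers:

```rust
use std::sync::{Arc, Mutex, RwLock};
use std::thread;

fn main() {
    // Mutex: exclusive access for both reads and writes.
    let counter = Arc::new(Mutex::new(0));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                *counter.lock().unwrap() += 1; // blocks without burning CPU
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*counter.lock().unwrap(), 4);

    // RwLock: many concurrent readers, writers get exclusive access.
    let config = Arc::new(RwLock::new(String::from("v1")));
    {
        let r1 = config.read().unwrap();
        let r2 = config.read().unwrap(); // two simultaneous reads are fine
        assert_eq!(&*r1, &*r2);
    } // read guards are dropped here, before writing
    *config.write().unwrap() = String::from("v2");
    assert_eq!(&*config.read().unwrap(), "v2");
}
```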

Safety Through Send and Sync

In the previous post, we saw how Rust provides single-threaded safety, through the borrowing rules and lifetimes. Send and Sync traits extend this safety into multithreaded applications.

The most important thing to understand about Rust safety is that it only prevents data races and nothing else. Data races happen when one thread writes into an area of memory while another thread reads from it or writes into it, which results in torn reads and writes.

Data races are especially nasty because they can cause undefined behavior. However, their cause is well-defined and therefore it can be detected automatically or prevented on a language level, the way Rust does it.

Race conditions, on the other hand, are semantic errors. For instance, we can incorrectly assume that one event always happens before another. Race conditions break domain-logic invariants and are usually a sign of incorrect synchronization or the lack of it. Rust cannot save us from making semantic-level mistakes; in fact, designing such a language would be infeasible. Deadlocks and livelocks are also semantic errors and the result of a broken domain-logic invariant, e.g. we assume that lock A is always taken while holding lock B, but somewhere in our code we take them in the opposite order, causing a deadlock. Therefore, the only thing we are going to talk about is how Rust prevents data races.

Send and Sync are meant to prevent data races. They are automatically derived, unsafe, marker traits.

  • Automatic traits are not implemented by engineers explicitly; instead, the compiler derives them automatically. The Send trait marks structs that are safe to send between threads; the Sync trait marks structs that are safe to share between threads. The compiler decides that a struct is Send/Sync if all of its fields are Send/Sync;
  • Unsafe traits require the unsafe keyword to implement;
  • Marker traits have no methods and are only used to express certain properties of the structs that implement them. For instance, Eq is another example of a marker trait: it tells us that a struct that already implements the equality operation can be used as if that operation were reflexive, symmetric, and transitive.

Most of the primitive types are Send/Sync, and as a result pretty much all types are Send/Sync, except Rc, Cell, and RefCell. Rc, Cell, and RefCell are not Sync because they implement interior mutability, which means operations on them, if executed concurrently, can cause data races. Also, Rc is not Send, because it copies a pointer to the same data, so threads do not even need to share the same copy to cause a data race; therefore Rust bans sending Rc across threads altogether. Interestingly, the way Cell and RefCell tell the compiler about their unsafety is by wrapping their internal fields in UnsafeCell, whose whole purpose is to prevent the automatic derivation of the Sync trait. Rc does not use UnsafeCell and instead declares itself !Send and !Sync explicitly.
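A quick illustration of the Rc restriction, contrasted with Arc, its thread-safe counterpart (the values here are arbitrary):

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

fn main() {
    // Rc uses non-atomic reference counting, so it is neither Send nor Sync.
    let rc = Rc::new(5);
    assert_eq!(*rc, 5);
    // The line below would not compile:
    // thread::spawn(move || *rc); // ERROR: `Rc<i32>` cannot be sent between threads safely

    // Arc uses atomic reference counting and is Send + Sync (for Send + Sync T):
    let arc = Arc::new(5);
    let handle = {
        let arc = Arc::clone(&arc);
        thread::spawn(move || *arc)
    };
    assert_eq!(handle.join().unwrap(), 5);
}
```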

A reference to an object that implements Sync is Send, and the other way around. In other words, &T: Send implies T: Sync, and T: Sync implies &T: Send. Sending an object between threads is quite common, while sharing is less so. Usually, when we want threads to access the same object, we wrap it in a smart pointer like Arc, which results in us sending copies of the pointer instead of actually sharing the object. To truly share an object, we need to share a reference to it, like this:

which will not work in most cases, since std::thread::spawn only takes closures with 'static lifetimes. To actually borrow a variable from the stack you need a scoped thread; historically these came from third-party libraries like crossbeam, and since Rust 1.63 the standard library provides std::thread::scope.
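A sketch using std::thread::scope, the scoped-thread API that has since been stabilized in the standard library (Rust 1.63+); crossbeam's scope works the same way:

```rust
use std::thread;

fn main() {
    let data = vec![1, 2, 3];
    let mut total = 0;

    // `thread::scope` joins every spawned thread before returning, which is
    // why borrowing `data` and `total` from the stack is allowed here.
    thread::scope(|s| {
        s.spawn(|| {
            total = data.iter().sum(); // borrows from the enclosing stack frame
        });
    }); // all scoped threads are guaranteed to have finished here

    assert_eq!(total, 6);
    assert_eq!(data.len(), 3); // `data` is still usable afterwards
}
```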

Now, let us talk again about Mutex vs RwLock. Formally their implementations look like this:

impl<T: ?Sized + Send> Send for Mutex<T>
impl<T: ?Sized + Send> Sync for Mutex<T>

and

impl<T: ?Sized + Send>        Send for RwLock<T>
impl<T: ?Sized + Send + Sync> Sync for RwLock<T>

It means that we can wrap an object T that implements Send but not Sync in a Mutex, and Mutex<T> will become both Send and Sync: Mutex is a sync-maker. RwLock, though, is not a sync-maker. Since several threads can have simultaneous read access to the underlying object, that object must itself be Sync. Mutex prevents any kind of simultaneous access, and therefore one can think of the object as being sent to the thread that currently holds the lock.
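A sketch of the sync-maker property, using Cell<i32> as an example of a type that is Send but not Sync:

```rust
use std::cell::Cell;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Cell<i32> is Send but not Sync: sharing &Cell between threads is
    // forbidden. Wrapping it in a Mutex yields a type that is Send AND Sync.
    let shared = Arc::new(Mutex::new(Cell::new(0)));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                let guard = shared.lock().unwrap();
                guard.set(guard.get() + 1); // exclusive access: no data race
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(shared.lock().unwrap().get(), 4);

    // An Arc<RwLock<Cell<i32>>> would not compile here: RwLock<T> is Sync only
    // if T is Sync, because concurrent readers could still mutate the Cell.
}
```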

Credits

The following resources were used to compile the post:

Finally, thanks to Gail Hernandez, Alex Skidanov, Bowen Wang, and everyone else on our Near Protocol team. For those interested in Near Protocol: we build a sharded general-purpose blockchain with a huge emphasis on usability. If you like our write-ups, follow us on Twitter to learn when we post new content:

If you want to be more involved, join our Discord channel where we discuss all technical and non-technical aspects of Near Protocol, such as consensus, economics, and governance:

Edits

Jack O’Connor and Gregory Terzian pointed out that SeqCst only guarantees a global order for atomics: ordinary reads and writes may move down across an atomic read, or up across an atomic write. Also, SeqCst involves inserting a memory barrier, not a memory fence.

Josh Stone pointed out the &T: Send ⇔ T: Sync inaccuracy.