Rust concurrency patterns: condvars and locks
I’ve spent quite a lot of time extolling the virtues of message-passing in concurrent Rust.
However, there are times when shared-state is the right approach, sometimes just because it’s the historical approach chosen in a module and you want to add something to it without refactoring the whole thing.
So today we’re going to talk about using shared-state, with the help of condvars and locks.
And to remain practical, we’ll use a real-world example from Servo, the HTTP cache used in Fetch.
What in the world is “Fetch”?
“Fetch” is the standardized set of algorithms used by the Web platform to integrate networking.
The standard, constantly evolving by the way, is available at https://fetch.spec.whatwg.org/
I can’t explain it better than the preface, so I’ll just add a screenshot of it below:
Note that fetch is not so much focused on low-level details of networking, instead it focuses on higher-level concepts that are potentially noticeable from a web-page.
So for example, fetch isn’t going to define “TCP”(did anyone know it stands for “transmission control protocol”? I didn’t until today), which is defined at https://tools.ietf.org/html/rfc793 by the “Internet Engineering Task Force”, aka IETF.
HTTP caching itself is also not defined in Fetch, but again by the IETF at https://tools.ietf.org/html/rfc7234.
However, since caching is noticeable to a web-page, as web developers can influence the behavior of HTTP caches through which the resources of a page travel, Fetch will deal with it.
So Fetch does mention how caching should be handled by user-agents, in the aptly named “http-network-or-cache-fetch” algorithm(leaving the low-level details of caching to the IETF).
So, that means Servo’s network stack is an implementation of Fetch, and Servo’s HTTP cache is an implementation of HTTP caching, used according to Fetch.
Servo’s HTTP cache
The cache is discussed in this previous article, and for now let’s just say that it’s implemented basically as a map of cache-key to cached resource, with a RwLock
around this map.
What uses this map? Instances of the Fetch algorithm, potentially running in parallel on a threadpool, from inside the net/resource_thread
component.
Fetches are started in response to messages received from the script
component, and more can be read about its implementation in this previous article.
What if there are multiple fetches for the same resource? For example if one fetch is currently fetching a resource from the network, and another one starts for the same resource?
This second fetch will hit the cache, and receive a “pending resource” on which it can wait, while the first fetch is completing it.
So the cache can already handle multiple fetches for the same resource, using a concept of “pending resources”.
The problem
However, when Servo added capabilities to fetch resources speculatively, a problem surfaced. It turns out that there is a brief moment when a concurrent fetch can miss the cache, even though another fetch for the same cache-key is about to update it with a “pending resource”.
- Fetch 1 has checked the cache, found nothing, and started a networking request for the resource.
- Fetch 1 has not received headers from the network, hence hasn’t put a “pending resource” in the cache yet.
- Fetch 2 checks the cache for the same key as the one used by fetch 1, finds nothing, and goes the network as well.
- Fetch 1 receives headers, puts a pending resource in the cache. It’s too late, fetch 2 already went to the network.
The solution
While a fetch for a given cache-key is waiting on the headers of a network request, we need other fetches for the same cache-key to wait for the cache to be updated with a “pending resource” using those headers, since the worst case is unnecessarily going to the network.
However, given that our cache is a single shared map, we don’t want to always just hold the lock around it until headers have been received, as that would mean all parallel fetches would have to wait in such a situation, regardless of their cache-key.
What was done was to add a secondary map of cache-key to “cache state”, with that state representing whether the cache was ready to read from or not.
The HttpState
is shared by all fetches, and you can see the RwLock<HttpCache>
and the newly introduced Mutex<HashMap<CacheKey, Arc<(Mutex<HttpCacheEntryState>, Condvar)>>>
.
As you can see, the map is of “cache-key” to an Arc
to a tuple of Mutex<HttpCacheEntryState>
, and a Condvar
. You can also see that the map as a whole is protected by a Mutex
as well.
How is this new map used to make relevant fetches wait on an update of the cache?
Let’s go through it step by step.
First, we need to actually get the “http-cache-state” that belongs to the relevant cache-key, based on the current request. This is done like:
So at this point, we essentially got hold of the relevant (Mutex<HttpCacheEntryState>, Condvar)
.
Also, we’ve already dropped the mutex around the map as a whole. It’s a good thing since each Fetch, regardless of what resource it relates to, will have to acquire that mutex at that point, and we don’t want to prevent those parallel fetches from making progress for long, unless they relate to the same cache-key.
Next comes the actual “maybe waiting” logic, using the condvar:
At this point, we acquire the lock on the “http-cache-state”, and check it in a while
loop until the state equals HttpCacheEntryState::ReadyToConstruct
.
You might wonder: if the current fetch is holding the lock and then busy-loops for a particular state, how is the state ever going to change since other fetches will not be able to acquire the lock and mutate it?
Now comes the magic of the condvar. When you call any of the wait_
methods of a Condvar
, passing along a lock guard(the result of acquiring a lock), then the condvar will atomically release this lock and block the current thread. Another thread can then acquire the lock, mutate the state, and finally signal on the condvar. In response to the signal, the condvar will then atomically re-aquire the lock, and wake-up the waiting thread, which will start running where it left off: inside the while loop.
Here’s the great thing about using a condvar: when your code breaks out of the while loop, the state can only be as you would expect it to be from the check at the top of the loop. And also, the thread is holding the lock at that point, so the state cannot change unexpectedly.
Essentially, using a condvar is a very robust way to ensure your thread will only continue if the protected and shared state is in an expected state, and if not, wait until it is.
And the entire logic is encoded right there in the while
loop.
What comes next is that the current fetch, while holding the lock, will check the cache. After having checked the cache, it will update, if necessary, the state protected by the lock.
As you see, while holding the lock on the “cache-state” for a given key, we check the actual cache, and then mutate the state if we’re about to continue to a network request. We only notify on the condvar if the state is worth notifying for.
We then drop the lock. That is actually important, because we want other fetches to be able to check the state, and either go into waiting mode or check the cache if appropriate.
The current fetch then potentially goes to the network, and finally re-acquires the lock on the sate, and potentially updates it. This will look similar to the initial lock acquisition, expect that it will not include a call to wait, instead it might notify on the condvar if appropriate.
(This is done in a function because the fetch algorithm branches at this point and the function can be called from different places).
If the current fetch went to the network, it will update the state to reflect that the cache is closer, or already at, a state where it can be checked again.
Note that because the initial wait uses a timeout, it is possible for more than one fetch to be going to the network in parallell(even for the same cache-key), hence the need to increment/decrement the PendingStore
count.
Also note that in both cases above we use notify_one
. That is because each fetch could put the state back into PendingStore
, in which case it would not make sense for other fetches to wake-up and check it as well. So we wake-up only one fetch, and it will notify the next one if appropriate.
parking_lot
, another implementation of condvar and locks, actually comes with a similar sequential wake-up behavior by default, even when calling notify_all
.
However, even if we were to use parking_lot
, it would in this case still make sense to use notify_one
, because the default sequential behavior of parking_lot
relates to acquiring/releasing the underlying lock(essentially it guarantees several threads will not all wake-up at once and start contending on the lock), whereas in our case, if one thread sets the state back to PendingStore
we don’t want other threads to wake-up at all until it is set back to ReadyToConstruct
. So it makes sense to sequentially wake-up threads “manually” using notify_one
, based on what state the actual underlying protected state is in.
Conclusion
If you can model your concurrency using message-passing and avoid shared-sate, it usually makes for a system that is easier to reason about. However, sometimes, even if just for historical reasons, shared-state can be a good fit. In such a case, condvars can give you a very clear model of “how your threads run based on changes in shared-state”. It’s clear because it’s encoded in an loop that doesn’t leave much to guesswork, and most importantly, the code “looks sequential”, whereas it actually regulates concurrent behavior. It’s actually very close to using channels with an event-loop.
Further reading
A highly recommended article, that, despite a title that doesn’t do it justice, touches upon concurrency as a whole and comes densely packed with advice about almost everything related to it. See in particular the “Alternatives to mutexes” section(to which I cannot directly link, scroll down about half-way).
Also, here is the full PR: https://github.com/servo/servo/pull/24318/