This article continues the series “Channels in Rust”. In the first part we discussed channel basics and typical usage patterns. Part two shows how channels work, what is inside them, and how to properly choose and use a channel for a particular task.
A few words about the author
I have been using Rust for mission-critical production since 2019, in both high-performance and real-time applications. Our company, Bohemia Automation, provides solutions for industrial IoT projects, mostly in the high-energy and other heavy-industry sectors.
Some numbers
High-load
Instead of a long introduction, let me present some numbers. Suppose we have a 16-core machine and are running a stress test, continuously sending small messages through a channel. There is a single receiver, which receives messages and instantly drops them.
All the channels are bounded (limited in capacity). At every test iteration, the number of senders is increased.
Channels used:
- The Rust standard ones
- Crossbeam
- RoboPLC: a lightweight “generic” channel implementation, typical of the ones I use for real-time applications.
I provide neither the source code nor the exact numbers, as we are interested in the dynamics only.
Look at chart one:
As can be seen, the simplest implementation (RoboPLC) degraded quickly as the number of senders grew to two. However, after about 4 senders it stopped degrading and delivered the same numbers (about 700k messages per second) up to the end of the test. Crossbeam was the winner; things were worse with the standard one.
But let us look at chart two: what happens when the number of senders is increased even further?
After about 20 senders (I remind you, we are on a 16-core machine) “a miracle” happens: crossbeam starts degrading, and quickly. After 25 producers, things become really problematic for the former leader.
(I stopped testing the std channel implementation after 20 senders, as its performance was really awful and continued to degrade with every step.)
Another “miracle”: the simplest implementation (RoboPLC) still delivers the same numbers with no degradation.
Real-time
Let us simplify the test and measure a tiny real-time application. It has only two senders and a single receiver, and all of them run on the same CPU core (a typical setup to get as many L1 cache hits as possible) with real-time thread priorities.
We do not really care about the specific system scheduler settings for the moment. Let us repeat the stress test. There is no chart this time, as the conditions do not change between iterations.
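For reference, such a setup can be arranged on Linux without touching the code at all, using standard utilities (the core number, the priority and the binary name are arbitrary examples):

```shell
# pin the whole process to CPU core 2 and run it under SCHED_FIFO, priority 80
# (requires appropriate privileges, e.g. root or CAP_SYS_NICE)
taskset -c 2 chrt -f 80 ./my-realtime-app
```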
The numbers:
- crossbeam: 65k messages per second, 7,882us average latency
- std: 173k messages per second, 2,971us average latency
- roboplc: 861k messages per second, 597us average latency
Some theory
So what happened? How can a simple channel deliver such amazing results under specific conditions?
To answer this question, let us review some channel theory.
A channel implementation is well described in the amazing book “Rust Atomics and Locks” by Mara Bos, Chapter 5. I definitely recommend it to everyone interested in how Rust works inside.
The overall channel algorithm can be represented with the following scheme:
It does not really matter:
- how the buffer is locked: it is usually a relatively cheap operation
- how the buffer is organized: an advanced super-optimized linked list or a standard VecDeque (by the way, some high-performance implementations just use the latter, without any shame)
It also does not really matter how a thread is stopped/parked/frozen. It is simply always the most expensive operation.
For comparison: on a modern CPU a standard mutex is locked in less than a microsecond (if nobody else is holding it), while thread parking/unparking can take up to 100us.
Spin loops and exponential backoff
Most channels (and everything else: mutexes, semaphores, etc.) avoid parking threads at all costs. For that, they use spin loops, usually with a variation of the exponential backoff algorithm.
Although exponential backoff is well described on Wikipedia, to make a long story short: it is an algorithm which increases the waiting time after every unsuccessful attempt. For example, we need to download a file, but the server is down. We perform the second attempt in a second, the third one in 2 seconds, the fourth one in 4 seconds and so on, and stop increasing the interval after it reaches, say, 32 seconds. At that point we either continue with the maximum interval or give up.
A typical algorithm for a channel looks like this:
- Take the lock and check for data (or a free slot).
- Spin a couple of CPU cycles with a hint. The hint does basically nothing by itself; however, it is always good taste to tell the CPU we are in a spin loop and let it optimize some internal things.
- Take the lock again: maybe there is some data now (for a consumer) or a free slot (for a producer).
- If not, exponentially increase the number of spin cycles and repeat.
A clean backoff implementation for synchronous tasks can be found in the crossbeam crate (although the one used for its channels is slightly different).
```rust
// the crossbeam crate exponential backoff code (with my comments)
fn snooze(&self) {
    // the values in such functions are chosen empirically,
    // to fit the majority of typical tasks
    if self.step.get() <= SPIN_LIMIT {
        for _ in 0..1 << self.step.get() {
            // CPU spin loop
            std::hint::spin_loop();
        }
    } else {
        // yielding to the OS
        std::thread::yield_now();
    }
    // after some limit the backoff is considered to be "completed" and
    // the caller should give up and park its thread
    if self.step.get() <= YIELD_LIMIT {
        self.step.set(self.step.get() + 1);
    }
}
```
As can be seen, at some point spin loops are mixed with thread yielding. Creativity has no limits: I have even found a backoff which used thread::sleep calls (obviously not the best choice, as a sleeping thread is actually parked by the OS). Sooner or later, however, parking becomes inevitable, so smart backoff implementations never spin for longer than about 10-20% of the time required to stop and resume the thread.
Exponential backoff in channels is a kind of gambling, but it works surprisingly well in high-load applications when used properly. The downside of spinning is that it usually does not respect any queue, so the latency of individual messages can be unpredictable.
Let us explain the numbers
Now that we know how channels work, we can explain the numbers provided at the beginning.
Both the Rust standard channels and crossbeam use a variation of the exponential backoff algorithm. RoboPLC does not, as it is a specialized lightweight channel for real-time applications. Both the standard channels and crossbeam started to degrade as soon as the number of senders exceeded the number of CPU cores in the system.
In the second test, where both the senders and the receiver were forcibly put on a single CPU core, the channels with backoff loops showed awful performance right from the start.
Although there are also spin loops inside channel locks, you should not really worry about them, as the locks in channel algorithms are usually held only briefly.
So the “miracle” can finally be explained: in both tests, as soon as the senders and the receiver ended up on the same CPU core, the spinning senders started to block the receiver, and the receiver started to block the senders. No magic and no rocket science.
Conclusion
- All channels serve different purposes. Always choose your channel wisely.
- Despite being relatively slow, the standard Rust channels can be considered a well-balanced choice.
- For “traditional” high-performance channels, it is good practice to limit the number of producers/consumers to the number of CPU cores in the system. Do not clone channel endpoints endlessly; use a pool of workers or another suitable pattern, as described in the first part of this article.
- In real-time applications, where threads are forcibly put on a limited number of CPU cores (especially on embedded machines with few CPUs), it is better to avoid high-performance channels: they are optimized for big machines and high throughput, not for small machines and low latency.
- Spin loops are not deadlocks; however, in certain circumstances they can seriously degrade the performance of your application.
- Bound your channels, but always try to give them enough capacity to let at least the senders avoid parking: CPU resources are usually much more expensive than RAM, and parked threads increase the latency of your data pipes.
- Always test your tools under real production conditions. Remember operation costs.
What about asynchronous channels?
Well, in asynchronous environments there are usually no special tricks. A channel method can either return Poll::Pending or Poll::Ready, and the only thing it can really do beyond that is pray for the asynchronous runtime it runs on.
Part 3?
Maybe; I am waiting for your feedback. Like, subscribe, share, comment, and enjoy Rust. It’s amazing.