Within the last couple of days of our global solitary retreat, I stumbled by chance upon the Single Writer Principle™ concept, first via an article on Kafka, which mentioned it in passing, and then via this excellent article zooming in on it. It was one of those “I’ve been looking for you for at least the past three weeks, where have you been hiding?” kind of moments.
So I’m not really a low-level concurrency expert; my focus is rather on high-level, business-logic types of design. Yet I’m amazed how often, when you do dig into the lower-level aspects of CPU caches and the like, you find that what “makes sense” from a high-level perspective also turns out to be “doing the right thing” from a lower-level perspective.
The “Single Writer Principle” is a great example of this kind of engineering serendipity.
For the lower-level details, and an excellent high-level description as well, head over to https://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html.
Here, I will instead provide a concrete example, in Rust, of a “mini system” implemented with that principle in mind. We’ll also look at how this kind of design makes your concurrent system very testable, and at the differences between blocking and non-blocking workflows.
We’ll do this by implementing a concurrent version of one half of the system described in the Kafka article, see below:
By the way, something else that I find amazing is how, despite their differences, distributed and concurrent systems look exactly the same from a high-level perspective. So, in the Kafka example, the “messaging” is implemented via Kafka, which is of course a completely different beast from, say, Crossbeam; however, the way you design those high-level components and workflows is basically the same.
Not only that, but they’re in fact simply “the same thing” at different scales, one might say, since a single node of a “service” in a distributed system is likely to itself be internally concurrent, and consist of various concurrent “services”. Think about that for a moment (not like you have anything else to do these days)…
Our very own “single-writer” system, in Rust
This system will consist of three “services”:
- The Basket service, which will request orders, and wait for them to be confirmed by
- The Order service, handling incoming order requests from the Basket service, validating them, and sending related payment requests to the
- Payment service, validating payment requests, and confirming them with the Order service, which will then confirm the order to the Basket service.
We’ll wrap the whole thing in a unit test, with the Basket service simply being represented by the “main thread” of it.
Later we’ll also add a “fraud service”, used by the payment service, to investigate blocking versus non-blocking workflows.
What’s a single-writer anyway?
The article mentions that:
Unfortunately most use queue based implementations underneath, which breaks the single writer principle, whereas the Disruptor strives to separate the concerns so that the single writer principle can be preserved for the common cases.
So in this example we’ll be using channels from Crossbeam, does that break the single writer principle? The quick answer is: I don’t know. I think the implementation is entirely lock-free, however I don’t know it well enough to be sure whether that means it’s absolutely free of writer contention in the common case like the Disruptor design linked-to above.
In this example we will not be cloning any senders, hence using the channels as SPSC queues. If we were to clone senders, I imagine the principle would almost certainly be broken.
I do know that in any case, we still get to keep all of the “business logic-ky” state inside the single-thread that runs an event-loop and operates on local state in response to receiving a message. So that gives us “single writer-ness” at least for that part.
And that’s a good start that this post will focus on.
So let’s get to it.
Full code examples are at https://github.com/gterzian/single-writer
Let’s go through the initial setup of all three services.
First of all, let’s have a look at the data being passed around services, which I hope speaks for itself:
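As a stand-in for the embedded listing, here is a minimal sketch of what such message types could look like. All names (`CustomerId`, `OrderId`, `BasketMsg`, `OrderResult`, `PaymentRequest`, `PaymentResult`) are hypothetical and chosen for illustration; the repository linked above is the reference and may well differ.

```rust
/// Hypothetical identifiers; the repository may use different types.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct CustomerId(pub u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct OrderId(pub u32);

/// Basket service -> Order service.
#[derive(Debug, PartialEq)]
pub enum BasketMsg {
    NewOrder(CustomerId),
    Shutdown,
}

/// Order service -> Basket service.
#[derive(Debug, PartialEq)]
pub enum OrderResult {
    Confirmed(CustomerId),
}

/// Order service -> Payment service.
#[derive(Debug, PartialEq)]
pub enum PaymentRequest {
    Pay(OrderId),
    Shutdown,
}

/// Payment service -> Order service.
#[derive(Debug, PartialEq)]
pub struct PaymentResult {
    pub order: OrderId,
    pub approved: bool,
}
```

One enum (or struct) per direction of communication keeps each channel's vocabulary small, and makes it obvious which service is allowed to say what to whom.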
And now, the “basket” service:
This runs on the main thread of the unit test, and it basically sends four orders, for four different customers, to the order service, and then it waits on confirmation for these orders and shuts down.
Note that this service uniquely owns a map of “customer -> order count”.
The “order” service:
This one selects on two channels, one from the basket service, and another one from the yet unknown “payment” service.
When it receives a new order request for a customer, it keeps track of it in a map of “order -> customer”, and sends a payment request to the payment service.
When it receives a result from the payment service, it confirms the order to the basket service.
And when it receives a “shutdown” message from the basket service, it, well, shuts down, after having sent a similar message to the payment service.
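The event loop described above can be sketched roughly as follows. This is not the repository's code: to stay dependency-free it uses `std::sync::mpsc` instead of Crossbeam, and since the standard library has no `select`, both message sources are merged into a single hypothetical `OrderMsg` inbox. That means cloning the sender, which the setup in this article avoids, so treat it purely as an illustration of the “local state plus messages” shape. All type and function names are made up for the example.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct CustomerId(pub u32);
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct OrderId(pub u32);

/// Everything the order service can receive (stand-in for a select on two channels).
pub enum OrderMsg {
    NewOrder(CustomerId), // from the basket service
    PaymentDone(OrderId), // from the payment service
    Shutdown,             // from the basket service
}

pub enum PaymentMsg {
    Pay(OrderId),
    Shutdown,
}

/// Runs the order service until it is told to shut down.
pub fn run_order_service(
    inbox: Receiver<OrderMsg>,
    to_payment: Sender<PaymentMsg>,
    to_basket: Sender<CustomerId>,
) {
    // Local state: only this thread ever reads or writes it.
    let mut pending: HashMap<OrderId, CustomerId> = HashMap::new();
    let mut next_id = 0;
    for msg in inbox {
        match msg {
            OrderMsg::NewOrder(customer) => {
                let order = OrderId(next_id);
                next_id += 1;
                pending.insert(order, customer);
                let _ = to_payment.send(PaymentMsg::Pay(order));
            }
            OrderMsg::PaymentDone(order) => {
                if let Some(customer) = pending.remove(&order) {
                    let _ = to_basket.send(customer);
                }
            }
            OrderMsg::Shutdown => {
                let _ = to_payment.send(PaymentMsg::Shutdown);
                break;
            }
        }
    }
}

/// Wires the three services together; the calling thread plays the basket.
pub fn demo() -> Vec<CustomerId> {
    let (order_tx, order_rx) = channel();
    let (pay_tx, pay_rx) = channel();
    let (basket_tx, basket_rx) = channel();
    let order = thread::spawn(move || run_order_service(order_rx, pay_tx, basket_tx));

    // Stand-in payment service: approves every request immediately.
    let payment_reply = order_tx.clone();
    let payment = thread::spawn(move || {
        for msg in pay_rx {
            match msg {
                PaymentMsg::Pay(order) => {
                    let _ = payment_reply.send(OrderMsg::PaymentDone(order));
                }
                PaymentMsg::Shutdown => break,
            }
        }
    });

    // Four orders for four customers, then wait for the four confirmations.
    for id in 0..4 {
        order_tx.send(OrderMsg::NewOrder(CustomerId(id))).unwrap();
    }
    let mut confirmed: Vec<CustomerId> = basket_rx.iter().take(4).collect();
    order_tx.send(OrderMsg::Shutdown).unwrap();
    order.join().unwrap();
    payment.join().unwrap();
    confirmed.sort_by_key(|c| c.0);
    confirmed
}
```

Calling `demo()` from a unit test and asserting on the returned customers mirrors the way the article wraps the whole system in a test, with the test thread acting as the basket service.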
Finally, the payment service:
Which currently simply approves every payment request immediately.
Taking a step back
So, remember that this article is about the “single writer principle”, and if we take a step back and look at this mini-system, we can see that it seems to follow that principle: each service uniquely owns the state it requires to run its own (very simple) business logic, and when a given service needs to “affect the world” of another service, it does this by sending a message.
For example, it’s not the “payment service” that “confirms an order”, it’s the “order service” that does that, in response to having received a “payment confirmation” from the “payment service”.
So, as a result, in a very practical kind of way, if you want to know what’s going on with a given “order”, all you need to do is add a println at the select of the “order service”, allowing you to inspect the flow of messages, both the requests for new orders and the payment results, which affect the local state of an “order”.
There is no concept of back-pressure between those services, meaning the unbounded channels used in their communication could grow without limit if services have different runtime characteristics, for example a sender that is consistently faster than its receiver. For techniques to handle that, see a previous article.
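To make the back-pressure point a bit more tangible, here is a small sketch of the difference a bounded channel makes. It uses `std::sync::mpsc::sync_channel` rather than the Crossbeam channels used in this article, and the helper `accepted_before_full` is a hypothetical name for the example. It shows only the basic mechanism, not the techniques from the previous article.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Tries to push `n` messages into a bounded channel of capacity `cap` that
/// nobody is draining, and returns how many were accepted before it was full.
pub fn accepted_before_full(cap: usize, n: usize) -> usize {
    // `_rx` is kept alive so the channel stays connected for the whole loop.
    let (tx, _rx) = sync_channel::<usize>(cap);
    let mut accepted = 0;
    for i in 0..n {
        match tx.try_send(i) {
            Ok(()) => accepted += 1,
            // The receiver is not keeping up: the sender finds out immediately.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}
```

With an unbounded channel the same loop would accept all `n` messages and simply use more memory; with a bound, a slow receiver becomes visible to the sender, which can then decide to wait, drop, or push back further upstream.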
To block or not to block, that is the question
As a little bonus, since we already have all this code set up, let’s take a look at a concrete example of a blocking, versus non-blocking, workflow.
So let’s imagine that our payment service needs to “check for fraud” when a customer makes their first order, but can allow already checked customers to do as many additional transactions as they want (what a great fraud detection algorithm).
And let’s imagine that this “check for fraud” operation is fairly expensive for some reason, which we’ll model with a simple “sleep” for now.
So let’s make a first attempt to add this operation to the workflow of the payment service: https://github.com/gterzian/single-writer/commit/753ca119042df779e99172a10185ee0c388a5fbf
As you can see, this “check for fraud” operation dramatically reduces the throughput of the payment service as a whole: while it is performing the check, it can’t handle other requests, even for orders that do not require it.
So there’s an opportunity here: we could move this operation into a dedicated “fraud service”, running in parallel to the payment service, allowing the payment service to handle “existing customers” at full speed, while the fraud service can focus on making the expensive check for “new customers”. This will not make the check itself any faster (a “new customer” will have to wait as long as before to see an order confirmed), but “existing customers” will no longer be blocked by a “new customer” ahead of them in the queue.
Let’s make a first attempt: https://github.com/gterzian/single-writer/commit/2f2c814b76b7c545280342d723b7ac3822f79b8e
This is an example of a “blocking workflow”: while the payment and the fraud service are running in parallel with each other, the payment service will “block” on the result of the fraud check.
The result is that we’re not getting any additional throughput, it’s basically the same as when we were running the fraud check operation inside the payment service.
The offending code is found at:
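In place of the embedded snippet, here is a rough sketch of the blocking pattern (see the commit linked above for the actual code). It uses `std::sync::mpsc` instead of Crossbeam, and every name in it (`run_blocking_payment`, `PaymentMsg`, the pre-seeded `checked` set) is an assumption made for the example.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct CustomerId(pub u32);

pub enum PaymentMsg {
    Pay(CustomerId),
    Shutdown,
}

/// Payment service that waits for the fraud service on every new customer.
pub fn run_blocking_payment(
    requests: Receiver<PaymentMsg>,
    to_fraud: Sender<CustomerId>,
    from_fraud: Receiver<CustomerId>,
    results: Sender<CustomerId>,
    mut checked: HashSet<CustomerId>, // customers that already passed a check
) {
    for msg in requests {
        match msg {
            PaymentMsg::Pay(customer) => {
                if !checked.contains(&customer) {
                    to_fraud.send(customer).unwrap();
                    // The offending line: nothing else is handled until the reply arrives.
                    let cleared = from_fraud.recv().unwrap();
                    checked.insert(cleared);
                }
                let _ = results.send(customer);
            }
            PaymentMsg::Shutdown => break,
        }
    }
}

/// Sends a new customer followed by an existing one; returns the confirmation order.
pub fn demo_blocking() -> Vec<CustomerId> {
    let (pay_tx, pay_rx) = channel();
    let (fraud_tx, fraud_rx) = channel::<CustomerId>();
    let (cleared_tx, cleared_rx) = channel::<CustomerId>();
    let (result_tx, result_rx) = channel();

    let fraud = thread::spawn(move || {
        for customer in fraud_rx {
            thread::sleep(Duration::from_millis(20)); // the "expensive" check
            let _ = cleared_tx.send(customer);
        }
    });
    let mut existing = HashSet::new();
    existing.insert(CustomerId(1));
    let payment = thread::spawn(move || {
        run_blocking_payment(pay_rx, fraud_tx, cleared_rx, result_tx, existing)
    });

    pay_tx.send(PaymentMsg::Pay(CustomerId(7))).unwrap(); // new customer
    pay_tx.send(PaymentMsg::Pay(CustomerId(1))).unwrap(); // existing customer
    let order: Vec<CustomerId> = result_rx.iter().take(2).collect();
    pay_tx.send(PaymentMsg::Shutdown).unwrap();
    payment.join().unwrap();
    fraud.join().unwrap();
    order
}
```

In this sketch the existing customer (1) is always confirmed after the new customer (7), even though its request needed no check at all: it sits in the queue behind the blocking `recv`.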
Note that such a blocking workflow can totally make sense, in some cases, from a business logic perspective. For example, if we had a rule that said something like: if we get a certain type of order, we must drop everything and check for fraud first, and not handle any other orders while that check is ongoing.
And now, for the non-blocking version: https://github.com/gterzian/single-writer/commit/884142de374937cd5f4b7e189c08a3035e25d4d9
This time, the payment service selects on both the payment request, and a new “fraud check result” channel, and when an order requires a fraud check, the service notes the pending check in a piece of local state, sends the request to the fraud service, which will perform the work in parallel and eventually send the result back to the payment service.
Importantly, while such a check is ongoing, the payment service can continue to process other orders.
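The non-blocking shape can be sketched along these lines. Again, this is not the code from the commit: it uses `std::sync::mpsc`, and because the standard library lacks `select`, the payment requests and the fraud results arrive through one hypothetical `PaymentMsg` inbox instead of two channels. In the demo, the calling thread plays both the order service and the fraud service, so the ordering is deterministic.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct CustomerId(pub u32);

/// Single inbox standing in for a select over two channels.
pub enum PaymentMsg {
    Pay(CustomerId),          // from the order service
    FraudCleared(CustomerId), // from the fraud service
    Shutdown,
}

pub fn run_non_blocking_payment(
    inbox: Receiver<PaymentMsg>,
    to_fraud: Sender<CustomerId>,
    results: Sender<CustomerId>,
    mut checked: HashSet<CustomerId>,
) {
    // Local state: number of payments waiting on an in-flight check, per customer.
    let mut pending: HashMap<CustomerId, usize> = HashMap::new();
    for msg in inbox {
        match msg {
            PaymentMsg::Pay(customer) => {
                if checked.contains(&customer) {
                    let _ = results.send(customer);
                } else {
                    // Note the pending check and move on; do not wait for the reply.
                    let waiting = pending.entry(customer).or_insert(0);
                    if *waiting == 0 {
                        let _ = to_fraud.send(customer);
                    }
                    *waiting += 1;
                }
            }
            PaymentMsg::FraudCleared(customer) => {
                if let Some(waiting) = pending.remove(&customer) {
                    checked.insert(customer);
                    for _ in 0..waiting {
                        let _ = results.send(customer);
                    }
                }
            }
            PaymentMsg::Shutdown => break,
        }
    }
}

/// Sends a new customer followed by an existing one; returns the confirmation order.
pub fn demo_non_blocking() -> Vec<CustomerId> {
    let (pay_tx, pay_rx) = channel();
    let (fraud_tx, fraud_rx) = channel();
    let (result_tx, result_rx) = channel();
    let mut existing = HashSet::new();
    existing.insert(CustomerId(1));
    let payment = thread::spawn(move || {
        run_non_blocking_payment(pay_rx, fraud_tx, result_tx, existing)
    });

    pay_tx.send(PaymentMsg::Pay(CustomerId(7))).unwrap(); // new: needs a check
    pay_tx.send(PaymentMsg::Pay(CustomerId(1))).unwrap(); // existing
    let mut order = Vec::new();
    // The existing customer is confirmed while the check for 7 is still outstanding.
    order.push(result_rx.recv().unwrap());
    // Only now does the simulated fraud service answer.
    let to_clear = fraud_rx.recv().unwrap();
    pay_tx.send(PaymentMsg::FraudCleared(to_clear)).unwrap();
    order.push(result_rx.recv().unwrap());
    pay_tx.send(PaymentMsg::Shutdown).unwrap();
    payment.join().unwrap();
    order
}
```

Compared with the blocking variant, the only structural change is that “waiting for the fraud check” became a piece of data (`pending`) rather than a blocked thread, which is what lets the existing customer overtake the new one.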
As I noted in a previous article, the blocking vs non-blocking dichotomy is equally important when using async/await and tasks. We’re using threads here, but if we were to use tasks to model the various services, if the “payment service” were to “await” a fraud check, it wouldn’t be able to handle other payment requests in the meantime either.
While the thread on the thread-pool where the task was running would not be blocked, that doesn’t help the payment service with processing other requests while the fraud check is ongoing. In the current example, “blocking a thread” also doesn’t block the core on which it is running, yet the fact that the OS can schedule other threads on it doesn’t give us any higher throughput when it comes to handling payment requests.
So in both cases, if you want a high throughput, you need to ensure it yourself in some way by making the business logic itself non-blocking.
Coming back to the single writer business, and testability
As you can see, in the final example, we’re still following the principle outlined earlier: the payment service simply owns a new piece of data, allowing it to “track” pending fraud requests, and manipulates that piece of data in response to receiving messages from other services.
Also, it’s worth noting how testable that system is. The current example basically runs the whole thing as an integration test, however each service could be tested independently, simply by mocking other services by having the test use their channels to send messages to a service (the test “input”), and then by making assertions about what that service outputs on the various channels meant for other services.
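As a minimal sketch of what testing one service in isolation might look like, here is the simplest payment service (the one that approves everything) driven entirely by a test that plays the order service. It uses `std::sync::mpsc` rather than Crossbeam, and the names (`run_payment_service`, `PaymentResult`, `payment_service_in_isolation`) are hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct OrderId(pub u32);

pub enum PaymentMsg {
    Pay(OrderId),
    Shutdown,
}

#[derive(Debug, PartialEq)]
pub struct PaymentResult {
    pub order: OrderId,
    pub approved: bool,
}

/// Simplest payment service: approves every request immediately.
pub fn run_payment_service(requests: Receiver<PaymentMsg>, results: Sender<PaymentResult>) {
    for msg in requests {
        match msg {
            PaymentMsg::Pay(order) => {
                let _ = results.send(PaymentResult { order, approved: true });
            }
            PaymentMsg::Shutdown => break,
        }
    }
}

/// The caller stands in for the order service: it owns the other end of both
/// channels, feeds the inputs, and collects everything the service emitted.
pub fn payment_service_in_isolation() -> Vec<PaymentResult> {
    let (req_tx, req_rx) = channel();
    let (res_tx, res_rx) = channel();
    let service = thread::spawn(move || run_payment_service(req_rx, res_tx));

    // Test "input".
    req_tx.send(PaymentMsg::Pay(OrderId(1))).unwrap();
    req_tx.send(PaymentMsg::Pay(OrderId(2))).unwrap();
    req_tx.send(PaymentMsg::Shutdown).unwrap();
    service.join().unwrap();

    // The service has exited and dropped its sender, so this drains and then ends.
    res_rx.iter().collect()
}
```

A unit test would then simply compare the returned vector with the expected results, for example two approved payments for orders 1 and 2, without any other service having to exist.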
Finally, here is the example as a whole: