Writing Async App in Scala
Part 3: Threading Model
In previous parts we avoided the question of which ExecutionContext we should use. Finally, we got to it! In this part I’m going to cover a lot of different topics around multi-threading. Let’s start right away!
How many threads should we use?
Short answer: as few as possible, and here is why:
- Smaller memory consumption: by default a thread allocates 1 megabyte of memory for its stack; the minimum differs between platforms, for example, on 64-bit Linux with OpenJDK 14 it’s 136K.
- Better resilience, as there is no struggle for CPU resources. If you have more threads than CPUs, it’s possible that more threads than you have CPUs will wake up at the same time and try to do something.
- (the most significant one) It brings peace to the soul of a control freak :D
What kinds of operations do we need to perform in our application:
- HTTP or whatever server performs some IO for incoming/outgoing requests.
- RPC/HTTP clients do some IO.
- Database driver does IO.
- And our application code orchestrates it all.
Now we need to decide how we want to split all these operations among the available CPUs.
NB: I left out one more kind of operation, namely local disk IO (logging, file cache, etc.). I hope I’ll be able to write a dedicated blog post about logging in the async world.
IO vs application code
What are the differences between the code of our application and all the other stuff? I mentioned IO not by accident. For example, what does an HTTP server do?
- Listens for incoming connections.
- Reads chunks of a request. This is an async part.
- Parses the HTTP request: this requires some CPU to split raw bytes into a nice data object with protocol version, method, headers, etc.
- Passes control to the application: this is our part.
- Writes back a response, which is also mostly async.
As you may see, an HTTP server mostly waits for bytes to be read or written. Therefore, it needs only very short time slots on the CPU. The same applies to all mostly-IO code: HTTP clients (which are essentially very similar to servers), database drivers, etc.
An application, on the other hand, may do slightly more (just as an example):
- Parse request body as JSON.
- Convert JSON to some domain object.
- Convert domain object to some other object in order to serialize it as JSON and send via RPC.
- Parse JSON of RPC response, mix up with domain object.
- Prepare a query to the database.
- Parse response from database (convert to domain model or even parse JSON, stored in DB).
- Collect everything together, serialize to JSON and send back to a client.
So, an application orchestrates requests to multiple servers, does parsing and serialization (which are CPU-bound operations), converts one representation to another.
Short introduction to ExecutionContext
ExecutionContext is a simple abstraction:
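Stripped to its essence, it is just "something that can run a task". Here is a simplified sketch of the real scala.concurrent trait (the actual one has a couple more members):

```scala
// Simplified shape of scala.concurrent.ExecutionContext:
// something that knows how to run a task and report its failures.
trait ExecutionContext {
  def execute(runnable: Runnable): Unit
  def reportFailure(cause: Throwable): Unit
}
```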
Usually, there is some kind of thread pool behind an ExecutionContext: one of the ExecutorService implementations provided by the JDK (FixedThreadPool, ForkJoinPool, etc.).
Any thread pool consists of 2 parts: a queue for incoming tasks and a pool of threads. In order to run the onComplete of a Future, we need to submit a task (a Runnable) to an ExecutorService. A submit consists of 2 operations: an offer to a BlockingQueue and a subsequent take from it by a worker thread.
One pool to rule them all
There is a temptation to use a single thread pool for all operations. Having the predefined global ExecutionContext in the Scala standard library strengthens this temptation. However, don’t be weak; there is something to consider. When we have a single thread pool, we may encounter this kind of problem:
Imagine a queue with different kinds of tasks in it. Mostly it contains short IO tasks, but then a task appears that needs to parse some big JSON, and all the short IO tasks have to wait for this parsing to finish. If you use a highly optimized HTTP server like netty, this is likely to happen, because in netty all IO for a given connection is performed in a single thread to avoid any synchronization.
On the other hand, if a thread pool is implemented properly (like in netty), data locality within a thread enables faster execution. For example, when an HTTP server reads a chunk of bytes, those bytes sit in a CPU cache; if the parsing code runs right after the read operation, it works with cached data as opposed to cold memory (for example, see the latencies of accessing caches and memory on an i7 Skylake).
I will continue this topic at the end.
Blocking code
Obviously, in async applications we should avoid blocking code. But what if our database doesn’t have a working async driver? What if the only way to use it is blocking JDBC?
It’s very important to remember not to mix non-blocking code with blocking code. As I showed in a previous part, the consequences may be tragic for responsiveness and overall throughput.
What to do? Use a separate thread pool! What size should it be*? The number of maximum SQL connections you’re willing to use from a single application instance. Is it safe? Let’s try to find out!
The first approach to write a DAO:
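A hypothetical sketch of that first approach (UserDao, runQuery and parseJson are made-up names, and the blocking JDBC call is abstracted as a plain function):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

case class User(id: Long, name: String)

// First approach: everything, including CPU-bound JSON parsing,
// runs on the dedicated 50-thread pool for blocking JDBC calls.
class UserDao(runQuery: Long => String, parseJson: String => User) {
  private val jdbcEc: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(50))

  def findUser(id: Long): Future[User] = Future {
    val blob = runQuery(id) // blocking JDBC call: mostly waiting for the DB
    parseJson(blob)         // CPU-bound parsing happens on the same wide pool!
  }(jdbcEc)
}
```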
Here we create a thread pool with 50 threads, and then we just submit a Future to this thread pool, where everything is done. Good? Kind of. It works unless… Do you see the part where we parse JSON from a blob column? This part may be problematic. Imagine our service gets a sudden spike and we receive a lot of requests in parallel. It’s quite possible that at the same time more threads than we have CPUs will get a response from the database and start parsing JSON in parallel. JSON parsing is a CPU-bound operation, so the threads will fight for CPU resources, causing elevated response times, etc. We don’t want that. What to do?
Let’s be a little bit smart about it:
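Here is a hedged sketch of what such a split could look like (again, UserDao, runQuery and parseJson are made-up names):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

case class User(id: Long, name: String)

class UserDao(runQuery: Long => String, parseJson: String => User) {
  // wide pool: threads here mostly sleep, waiting for the database
  private val jdbcEc: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(50))
  // narrow pool for CPU-bound work, capped at the number of CPUs
  private val cpuEc: ExecutionContext =
    ExecutionContext.fromExecutor(
      Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors()))

  def findUser(id: Long): Future[User] =
    Future(runQuery(id))(jdbcEc) // cheap for the CPU: just extract the blob
      .map(parseJson)(cpuEc)     // heavy parsing, bounded concurrency
}
```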
Here we split the query into 2 parts:
- Execute the query and extract the blob column. This should be very cheap in terms of CPU, so multiple concurrent executions won’t lead to a fight for the CPU.
- Parse the JSON into a domain object. This part is executed in another ExecutionContext, which should be limited to a number of threads less than or equal to the number of CPUs.
By doing so we increase the resilience of our application to high traffic. Note that this is not a final solution: for peak traffic you still need to scale up your cluster, but this approach buys you the time to do that, and your servers won’t die in the meantime.
* There are formulas to calculate a feasible number of threads in a blocking application. I wouldn’t worry about them as long as you have only a few dozen threads doing nothing but queries, without any heavy calculations.
ThreadLocal
If you don’t use ThreadLocal, just continue not using it, and you may skip this chapter. For everyone else…
Side note: here I’m talking about thread locals that are used as a context for the request. Other legitimate usages, like providing thread safety for non-thread-safe instances, are fine.
In a blocking application ThreadLocal is a convenient way to propagate aspects that aren’t directly related to the business code. There are some caveats, though. In an async application it gets ugly(er):
- In an async application one request is likely to be handled by multiple threads. At any point where execution goes async (meaning, starts waiting for some event), you have to capture your thread locals and restore them when the execution resumes. This can be a bit expensive.
- Such implicitness isn’t the best pattern to use. It’s error-prone and obscure (especially for newcomers).
It’s much better to have some kind of context object that you pass along with the request (you may pass it as an implicit argument, which is less implicit than a thread local).
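For illustration, a minimal sketch of such a context object (RequestContext and AuditLog are hypothetical names):

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical request context, passed along instead of a ThreadLocal
case class RequestContext(requestId: String, userId: Long)

object AuditLog {
  // every line carries the request id, no matter which thread runs it
  def info(msg: String)(implicit ctx: RequestContext): String =
    s"[${ctx.requestId}] $msg"
}

def handle(body: String)(implicit ctx: RequestContext, ec: ExecutionContext): Future[String] =
  Future(body.toUpperCase).map(result => AuditLog.info(s"done: $result"))
```

Because the context travels with the call, it survives every thread hop without any copy/restore machinery.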
Direct ExecutionContext
In guava there is a directExecutor which simply executes a Runnable immediately in the same thread. The same idea in Scala would look like:

```scala
implicit val directExecutionContext: ExecutionContext =
  ExecutionContext.fromExecutor((runnable: Runnable) => runnable.run())
```
Do we really need it? By definition, it doesn’t provide any concurrency. And still, I’d argue it’s a very important piece of our async application. Let’s understand why.
Let’s consider this completely foobar-ish example.
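Something in the spirit of this entirely made-up pipeline: a chain of trivial transformations, each one perfectly readable, each one submitted as a separate task:

```scala
import scala.concurrent.{ExecutionContext, Future}

case class Item(raw: String)

// Deliberately not optimal, but simple and readable in one place:
// each map takes the implicit ExecutionContext and submits a new task.
def process(input: String)(implicit ec: ExecutionContext): Future[Int] =
  Future(input)            // task 1
    .map(_.trim)           // task 2
    .map(Item(_))          // task 3
    .map(_.raw.length)     // task 4
```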
What I want to show here is that we don’t always write optimal code; sometimes we write something like this just because it’s simple and easy to read in one place. But there’s a micro-problem: each of these functions takes an implicit ExecutionContext and submits a task to a thread pool. Is it bad? Let’s benchmark it!
The code of the benchmark is here. We do a very simple benchmark:
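The real benchmark lives in the repository; in spirit, the measured piece is a chain of maps over an already-completed Future (a hedged stand-in, not the actual code):

```scala
import scala.concurrent.{ExecutionContext, Future}

// Transformations over an already-completed Future, so the benchmark
// measures mostly the submission and scheduling overhead of each map.
def benchmarkBody(implicit ec: ExecutionContext): Future[Int] =
  Future.successful(1)
    .map(_ + 1)
    .map(_ + 2)
    .map(_ + 3)
```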
And we measure this piece of code with different execution contexts: global, which is backed by a ForkJoinPool; a single-threaded ExecutionContext, backed by Executors.newSingleThreadExecutor; and the direct ExecutionContext. The result table is below (units are microseconds):
Benchmark Mode Cnt Score Error Units
direct avgt 5 4.056 ± 0.163 us/op
realGlobal avgt 5 62.999 ± 17.757 us/op
realSingleThread avgt 5 18.046 ± 0.722 us/op
smart avgt 5 1.693 ± 0.100 us/op
As you may see, direct wins easily, as it doesn’t do any context switching, queue operations, etc.
global loses to the single-threaded pool pretty significantly: both have queue operations, but in the single-threaded case there are no context switches, as everything happens in one thread.
But what is smart? It’s a very neat optimization; here is a Future[T] extension method:
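A sketch of such an extension (this is the general shape; the real implementation may differ in details):

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Success
import scala.util.control.NonFatal

// If the Future is already completed successfully, run the function
// right here instead of submitting a task to the ExecutionContext.
implicit class SmartFuture[T](future: Future[T]) {
  def smartMap[U](f: T => U)(implicit ec: ExecutionContext): Future[U] =
    future.value match {
      case Some(Success(value)) =>
        try Future.successful(f(value))
        catch { case NonFatal(e) => Future.failed(e) }
      case _ => future.map(f) // not resolved yet (or failed): normal path
    }
}
```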
In smartMap we check whether the Future is already resolved. If it is, we can execute the mapping function right away, without submitting it to an ExecutionContext. As you may see from the benchmark, there is a real cost to submitting a task to an ExecutionContext. Here is what effectively happens on submit (the real code is slightly more complicated; you’re welcome to explore it yourself):
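A rough model of that path, a blocking queue plus a worker thread (TinyExecutor is a made-up illustration, not the JDK code): a submit pays for a lock and a signal on offer, a lock and a possible park/unpark on take, plus a context switch to the worker thread.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Minimal model of a thread pool's submit path.
class TinyExecutor {
  private val queue = new LinkedBlockingQueue[Runnable]()
  private val worker = new Thread(() =>
    while (true) queue.take().run() // take: lock, park until a task arrives
  )
  worker.setDaemon(true)
  worker.start()

  def execute(task: Runnable): Unit =
    queue.put(task) // offer: lock + signal the waiting worker
}
```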
And all this complexity could be avoided by using smartMap.
Of course, there is a reason why the default behavior of Future is like this. In the general case such optimizations (direct executor and smartMap) could lead to starvation if operations on futures are too long. However, if we know the load profile of our application, we can apply such optimizations for a pretty good benefit.
I put smartMap here just as an example; in my real code I don’t use it, because we don’t have that many possibly-redundant operations, and the usage of non-standard extensions would lead to confusion: it would then become kind of necessary to avoid the standard ones as well, which complicates life much more than the gain we get from it.
How to use direct ExecutionContext
Usage of direct ExecutionContext brings some new considerations. Consider this example:
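A hedged illustration of the jump (RpcClient is a made-up client with its own internal IO pool):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

implicit val direct: ExecutionContext =
  ExecutionContext.fromExecutor((runnable: Runnable) => runnable.run())

// Made-up RPC client that completes its Futures on its own IO pool
class RpcClient {
  private val ioEc: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1))
  def call(payload: String): Future[String] = Future(s"echo: $payload")(ioEc)
}

// With `direct`, this map runs on whichever thread completed the Future,
// i.e. most likely on the RPC client's IO thread: heavy work here stalls its IO!
val result = new RpcClient().call("hi")
  .map(r => r + " [mapped on " + Thread.currentThread().getName + "]")
```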
With naive usage of direct we may encounter very undesired consequences: execution will jump from one ExecutionContext to another. This is all under the assumption that all real async operations have their own execution contexts; if you use a single shared ExecutionContext, there is no such problem.
To tackle this problem, we need to ensure that in all places where we use a non-application ExecutionContext, we move the Future back to the application ExecutionContext.
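One way to do it is to re-complete the Future via a Promise on the application pool (a sketch; shift and applicationEc are assumed names):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future, Promise}

val applicationEc: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

// Re-complete the Future on the application pool: everything chained
// after `shift` with the direct ExecutionContext then runs on
// application threads, not on the resource's internal threads.
def shift[T](future: Future[T]): Future[T] = {
  val p = Promise[T]()
  future.onComplete(p.complete)(applicationEc)
  p.future
}
```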
By doing this trick, the next transformation (submitted to direct) on a Future returned by the RpcClient will be executed in the application ExecutionContext. If you do it everywhere, you’ll be able to utilize the superiority of the direct ExecutionContext.
As a result of using directExecutionContext, we almost never have an implicit ec: ExecutionContext parameter, because in order to make sure the correct ExecutionContext is used, we have to completely separate async resources from application code. So, the ExecutionContext for the application code is used together with another ExecutionContext inside a resource, just as in the example above. And in all classes that implement the business logic of the application we can just import directExecutionContext and concentrate on the business-related code, not on the multi-threading aspects.
Neat side effect
It’s well known that stack traces are almost useless in async applications. Let’s find out how bad it is via a simple demo:
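A minimal demo in that spirit (made up for illustration): two nested async calls where the innermost one throws:

```scala
import scala.concurrent.{ExecutionContext, Future}

// Two levels of nesting; the innermost transformation throws.
// How much of this chain shows up in the stack trace depends on
// which ExecutionContext executes the callbacks.
def inner()(implicit ec: ExecutionContext): Future[Int] =
  Future(42).map(_ => throw new RuntimeException("boom"))

def outer()(implicit ec: ExecutionContext): Future[Int] =
  inner().map(_ + 1)
```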
Run with global, the stack trace is short and mostly consists of thread-pool internals; run with directExecutionContext, it has much more information. This example has only 2 nesting levels, but it’s already clear that with directExecutionContext we get the stack trace from the last async operation onwards, which is not as good as in a blocking app, but noticeably better than with a non-direct ExecutionContext.
The Final Glance
So, in the end, how many threads are we going to use? And how many ExecutionContexts? Options vary, but there are 2 opposite cases: a single thread pool/ExecutionContext for everything, or multiple pools, possibly one per IO entity, with all kinds of permutations in between. Let’s explore both extremes.
Single shared thread pool
A single thread pool for the application code and all IO code is possible. To achieve that, you need to use only non-blocking libraries with the ability to specify (override) a thread pool. In this case everything is pretty simple: you have only 2 ExecutionContexts, the direct one and an ExecutionContext backed by the single thread pool for async stuff.
This approach is the simplest in terms of code. There is no need to think about where to submit a task: for all async stuff use the real thread pool, for everything else use directExecutionContext. But, as I stated before, beware of long tasks that take too much CPU time, to prevent elevated response times for requests that are ready to be sent back but are waiting in the queue.
Pool per resource
As an extreme case we may assign a dedicated thread pool to each entity: an RPC client, an HTTP client or a database connection pool. The pros are separation of concerns, no single point of failure, and no interference between resources. The cons are too many threads, which may cause some fighting for resources. Still, if you know the distribution of your traffic well and know that it’s unlikely for different resources to be used at the same time, it could be a viable solution.
I wouldn’t recommend this approach, as I really don’t like to use more threads than necessary. If possible, group resources by usage patterns and share threads among them.
Another thing to consider is different requirements for different resources. For example, you may have an RPC client with 2 methods: one you call on a user-facing request, the other you call in the background, for example, to update some in-memory cache. If the background method transfers more data and parses bigger objects, you may consider moving it to a separate thread pool, so that it doesn’t suddenly slow down online user-facing requests.
The same applies to more important vs. less important resources. There might be a consideration that more resources should be allocated to the most important transactions, sometimes even on a per-user basis, to provide different SLAs.
In my real application I have this separation:
- Thread pool for the netty HTTP server and all netty-based HTTP clients.
- Thread pool for application code in which most of the business logic is happening along with parsing and serialization of JSON/protobuf/whatever.
- Thread pool for blocking JDBC queries.
- Single-threaded ScheduledExecutorService for timers and periodic background tasks, including slow HTTP client.
The proportion between the netty pool and the application pool is entirely based on the planned load of the service. But even a single netty thread can handle a lot of traffic.
I want to leave you with a few points, just to recap:
- Try to minimize the number of threads. Your application’s resilience depends on it.
- Don’t use Await. Never. If you have a Future, just propagate it further.
- Isolate blocking code. Use an async alternative if possible.
- Use the direct ExecutionContext in your application code.
All code is available on GitHub.