Is Your Thread Working as Hard as You Think It Is?
Asynchrony is NOT about multiple background threads; it is about making more efficient use of the current thread.
Question Everything
I use async-await in C# all day, every day, yet I rarely pay close attention to its effects on performance, some subtle, some very noticeable, depending on how we use async-await. I never felt the need to question or verify what I learned from the Microsoft documentation, which I think has done a fantastic job in recent years of staying up to date and high quality.
That changed recently, in a conversation with a friend of mine who, by the way, is a brilliant programmer. He shared some of the design ideas behind an API he built that achieves phenomenal performance. Apart from caching heavily, the way he organises async tasks caught my attention: all the async tasks are fired off in one go, then awaited; as each task finishes, its result is processed straight away. Basically, completed tasks are processed eagerly while we wait for the remaining tasks to complete.
This opened up a series of questions I wanted answers to: How much does eager async-await processing shrink overall processing time? What is the best approach to using async-await? Is there a best approach at all? Is awaiting tasks one by one really that bad? Is Task.WhenAll() really the most efficient for the majority of use cases? (It seems to be advertised as such in some of the Microsoft documentation.)
You know and I know there is only one way to find out — code it and test it!
Flight Simulator
I am not going to write a Flight Simulator here to test my hypothesis. Rather, I want to keep the complexity of the test code way down so we can see the essence of the async pattern. However, I still keep some attributes more or less realistic, just smaller in scale than production code, because these attributes greatly influence how we structure our code.
Okay, let's say we have three API requests. To simulate the async nature of HTTP, I use Task.Delay(). Simple enough: I take a measurement before and after the continuation, logging time and thread information.
A couple of things to note:
- The three API calls return different types. As we will see later, returning multiple types greatly influences how we use delegates, local functions, etc., to construct async tasks.
- The three calls last for different durations, because anticipating a call's duration and placing its result processing accordingly can also have a noticeable impact on performance.
After an API call returns, we process its result according to its type. I use Thread.Sleep() to simulate the fact that result processing is normally a synchronous operation. Again, it is important to note the different durations of these processing steps, which also affect performance depending on the order we place them in.
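The full benchmark project is linked in the references; a minimal sketch of this setup might look like the following. The method names, return values, and durations here are my own illustrative choices, not the article's exact benchmark figures.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    // Task.Delay stands in for the async wait of an HTTP call.
    public static async Task<int>    CallApi1() { await Task.Delay(200); return 42;   } // slowest call
    public static async Task<string> CallApi2() { await Task.Delay(100); return "ok"; }
    public static async Task<double> CallApi3() { await Task.Delay(50);  return 3.14; } // fastest call

    // Thread.Sleep stands in for synchronous, CPU-bound result processing.
    public static void ProcessResult1(int r)    => Thread.Sleep(100); // heaviest processing
    public static void ProcessResult2(string r) => Thread.Sleep(50);
    public static void ProcessResult3(double r) => Thread.Sleep(25);

    public static async Task Main()
    {
        // Measure before and after the continuation, logging time and thread.
        var sw = Stopwatch.StartNew();
        int result = await CallApi1();
        Console.WriteLine($"call 1 returned after {sw.ElapsedMilliseconds}ms on thread {Environment.CurrentManagedThreadId}");
        ProcessResult1(result);
        Console.WriteLine($"result 1 processed after {sw.ElapsedMilliseconds}ms on thread {Environment.CurrentManagedThreadId}");
    }
}
```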
If you know the async durations and know yourself, you need not fear the result of a hundred async battles
Let's run the first commonly used async approach. We kick off all async operations in 'one go', wait for them to finish, then process the results one at a time.
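A sketch of this first approach, reusing the simulated calls from the setup (names and durations are illustrative, not the article's exact benchmark):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    public static async Task<int>    CallApi1() { await Task.Delay(200); return 42;   } // slowest
    public static async Task<string> CallApi2() { await Task.Delay(100); return "ok"; }
    public static async Task<double> CallApi3() { await Task.Delay(50);  return 3.14; } // fastest
    public static void ProcessResult1(int r)    => Thread.Sleep(100);
    public static void ProcessResult2(string r) => Thread.Sleep(50);
    public static void ProcessResult3(double r) => Thread.Sleep(25);

    public static async Task Main()
    {
        var sw = Stopwatch.StartNew();

        // Kick off all three calls "in one go": the tasks start running immediately.
        Task<int>    t1 = CallApi1();
        Task<string> t2 = CallApi2();
        Task<double> t3 = CallApi3();

        // Await the slowest call first: nothing is processed until it finishes,
        // even though calls 2 and 3 completed long ago.
        ProcessResult1(await t1);
        ProcessResult2(await t2);
        ProcessResult3(await t3);

        // Roughly 200ms (slowest call) + 100 + 50 + 25ms of sequential processing.
        Console.WriteLine($"Total: {sw.ElapsedMilliseconds}ms");
    }
}
```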
In the code above, API call 1 takes the longest to complete, and we place await t1 as the first await. This means we wait for the longest call to complete before any result processing happens, even though API calls 2 and 3 may (should) have finished long ago! We can see this below, illustrated as the idle period:
Let's place the processing of API 3 and 2 before API 1, so we can start processing them while we wait for API 1.
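With the same illustrative helpers, the tuned version simply swaps the order of the awaits (the 488ms/388ms figures quoted below come from the article's own benchmark, not this sketch):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    public static async Task<int>    CallApi1() { await Task.Delay(200); return 42;   } // slowest
    public static async Task<string> CallApi2() { await Task.Delay(100); return "ok"; }
    public static async Task<double> CallApi3() { await Task.Delay(50);  return 3.14; } // fastest
    public static void ProcessResult1(int r)    => Thread.Sleep(100);
    public static void ProcessResult2(string r) => Thread.Sleep(50);
    public static void ProcessResult3(double r) => Thread.Sleep(25);

    public static async Task Main()
    {
        var sw = Stopwatch.StartNew();

        Task<int>    t1 = CallApi1();
        Task<string> t2 = CallApi2();
        Task<double> t3 = CallApi3();

        // Await the fastest calls first: their processing overlaps with
        // the slowest call, which is still in flight.
        ProcessResult3(await t3); // ~50ms wait + 25ms processing
        ProcessResult2(await t2); // completes at ~100ms, + 50ms processing
        ProcessResult1(await t1); // completes at ~200ms, + 100ms processing

        Console.WriteLine($"Total: {sw.ElapsedMilliseconds}ms");
    }
}
```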
Wow! We cut the time from 488ms down to 388ms, a 100ms saving, which is over a 20% reduction! Since we already know the duration of each API call and how long it takes to process each result, we can purposefully arrange (tune) the order of async operations and their result processing for greater performance. This is cheating! Or is it really?
In real production code, I suggest benchmarking every async operation, including its setup, the async wait, and the result processing, then graphing them out so we can visually reason about performance-tuning hypotheses. For example, if a task is really quick but its result processing takes a long time, we might try kicking it off as early as possible: it will finish earliest, buying us more time to process it while we wait for the longest call to complete. Then we go through the cycle of measure, tune, measure, tune, ... so as to reduce idle time as much as possible.
Is Task.WhenAll() really for All?
Before we start, please note that Task.WhenAll() does not start tasks, let alone start them simultaneously. The tasks have already started, one after another, by the time we line them up in Task.WhenAll().
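A quick sketch of that point: a task returned by an async method is already running ("hot") before it ever reaches Task.WhenAll(), which merely aggregates completion. The names here are my own.

```csharp
using System;
using System.Threading.Tasks;

class Program
{
    public static async Task<int> Work()
    {
        // Code before the first await runs synchronously,
        // as soon as Work() is called.
        Console.WriteLine("started");
        await Task.Delay(100);
        return 1;
    }

    public static async Task Main()
    {
        Task<int> t = Work();  // "started" has already been printed here
        await Task.Delay(50);  // t keeps running during this delay
        await Task.WhenAll(t); // WhenAll only waits; it does not start anything
        Console.WriteLine(t.Result);
    }
}
```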
Task.WhenAll() is one of the async APIs I use a lot, especially when the results of the async tasks are correlated and need to be processed together. A lot of articles praise Task.WhenAll()'s superior performance. However, the test code in some of those articles is too simple to be realistically comparable to a production situation. For this very reason, I distinguish an async operation from its result processing, and each async result is a different type. Let's see how Task.WhenAll() stands up in our test.
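A sketch of this test, with the same illustrative helpers as before; note that all processing is serialized after the slowest call has completed:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    public static async Task<int>    CallApi1() { await Task.Delay(200); return 42;   } // slowest
    public static async Task<string> CallApi2() { await Task.Delay(100); return "ok"; }
    public static async Task<double> CallApi3() { await Task.Delay(50);  return 3.14; } // fastest
    public static void ProcessResult1(int r)    => Thread.Sleep(100);
    public static void ProcessResult2(string r) => Thread.Sleep(50);
    public static void ProcessResult3(double r) => Thread.Sleep(25);

    public static async Task Main()
    {
        var sw = Stopwatch.StartNew();

        Task<int>    t1 = CallApi1();
        Task<string> t2 = CallApi2();
        Task<double> t3 = CallApi3();

        // Wait for the slowest task (~200ms), then process sequentially:
        // no processing overlaps with the waiting.
        await Task.WhenAll(t1, t2, t3);
        ProcessResult1(await t1);
        ProcessResult2(await t2);
        ProcessResult3(await t3);

        Console.WriteLine($"Total: {sw.ElapsedMilliseconds}ms");
    }
}
```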
What?! It is slower than our manual tuning!
Here is why: when the results of async operations depend on each other, e.g. to process result 1 we need result 2, and to process result 2 we need result 3, we have no choice but to take the penalty of waiting for the slowest task to finish and then processing sequentially.
If the results of the async operations are unrelated, we can use a more performant version of Task.WhenAll(): package each async operation and its result processing into a single unit of work, then line the units up in Task.WhenAll():
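A sketch of this packaging, with the same illustrative helpers; each unit awaits its own call and processes the result straight away:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    public static async Task<int>    CallApi1() { await Task.Delay(200); return 42;   } // slowest
    public static async Task<string> CallApi2() { await Task.Delay(100); return "ok"; }
    public static async Task<double> CallApi3() { await Task.Delay(50);  return 3.14; } // fastest
    public static void ProcessResult1(int r)    => Thread.Sleep(100);
    public static void ProcessResult2(string r) => Thread.Sleep(50);
    public static void ProcessResult3(double r) => Thread.Sleep(25);

    // Each unit packs one call and its result processing together,
    // so processing starts as soon as that call completes.
    public static async Task Unit1() => ProcessResult1(await CallApi1());
    public static async Task Unit2() => ProcessResult2(await CallApi2());
    public static async Task Unit3() => ProcessResult3(await CallApi3());

    public static async Task Main()
    {
        var sw = Stopwatch.StartNew();

        // Total time ≈ the longest unit (~200 + 100ms in this sketch), but the
        // units may run their processing on concurrent thread-pool threads.
        await Task.WhenAll(Unit1(), Unit2(), Unit3());

        Console.WriteLine($"Total: {sw.ElapsedMilliseconds}ms");
    }
}
```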
This approach achieves a similar result to our manual tuning approach. It does NOT, however, require us to know the wait time of each async operation beforehand. Each async result is processed eagerly straight after the async part completes, and the total time spent equals the longest duration of any async-plus-result-processing unit.
Looks like we have struck gold with this approach, as it reduces idle time almost automagically! It seems too good to be true, so let's stress test it and see what happens:
I ran 2 async calls and something caught my attention: the processing of results 1 and 2 started at the same time! This can only mean one thing: concurrent threads. And my log confirms it:
Let's increase the number of async calls to 10. What?! 7 concurrent threads were used!
What about 10000 async calls?
You may argue that surely no one is doing 10000 async operations in their production code, but what about a server farm where each machine could host hundreds of microservices?
How about we fix the number of async calls but vary the durations of the async part and the result processing?
What I found is that decreasing the durations does not decrease the number of concurrent threads used, while increasing the durations only increases the number of concurrent threads slightly.
So the number of concurrent threads increases in proportion to the number of tasks; the duration of the tasks, however, does not seem to affect concurrency much. Here the TaskScheduler is doing the right thing, as thread concurrency conceptually only makes sense for multitasking. If a task (async or not) takes a long time to complete, adding more threads will not shorten it, because a single task can only be attended by one thread!
In real production code, I suggest only using Task.WhenAll() with a small number of tasks, especially if your app runs on a server farm. When you use it, pack each async operation and its result processing into one unit. Although longer operations do not incur much extra thread concurrency, Task.WhenAll() still uses more concurrent threads than the manually tuned approach, while both achieve very similar results.
Process Tasks as They Complete
Stephen Toub blogs about a technique that eagerly processes async results. In theory this should be the fastest; let's find out:
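A sketch of the technique, following the pattern from Toub's post with the same illustrative calls: loop on Task.WhenAny(), remove the completed task, and process its result.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    public static async Task<int>    CallApi1() { await Task.Delay(200); return 42;   } // slowest
    public static async Task<string> CallApi2() { await Task.Delay(100); return "ok"; }
    public static async Task<double> CallApi3() { await Task.Delay(50);  return 3.14; } // fastest
    public static void ProcessResult1(int r)    => Thread.Sleep(100);
    public static void ProcessResult2(string r) => Thread.Sleep(50);
    public static void ProcessResult3(double r) => Thread.Sleep(25);

    public static async Task Main()
    {
        Task<int>    t1 = CallApi1();
        Task<string> t2 = CallApi2();
        Task<double> t3 = CallApi3();
        var tasks = new List<Task> { t1, t2, t3 };

        while (tasks.Count > 0)
        {
            // Each iteration re-registers a continuation on every remaining
            // task, and Remove() is an O(n) scan: the overheads Toub mentions.
            Task finished = await Task.WhenAny(tasks);
            tasks.Remove(finished);

            if      (finished == t1) ProcessResult1(t1.Result);
            else if (finished == t2) ProcessResult2(t2.Result);
            else                     ProcessResult3(t3.Result);
        }
    }
}
```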
Although in theory this approach sounds promising, in my tests it never gets as fast as the manual tuning or Task.WhenAll() approaches. As Toub points out in his article, "if the number of tasks is large here, this could result in non-negligible performance overheads". The overheads he refers to are the continuation registration in WhenAny() and the tasks.Remove() call.
I am not convinced. What if we combine each async operation with its result processing?
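A sketch of the combined variant: each unit packs a call together with its processing, as in the Task.WhenAll() experiment, and the WhenAny() loop now only tracks completion.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    public static async Task<int>    CallApi1() { await Task.Delay(200); return 42;   } // slowest
    public static async Task<string> CallApi2() { await Task.Delay(100); return "ok"; }
    public static async Task<double> CallApi3() { await Task.Delay(50);  return 3.14; } // fastest
    public static void ProcessResult1(int r)    => Thread.Sleep(100);
    public static void ProcessResult2(string r) => Thread.Sleep(50);
    public static void ProcessResult3(double r) => Thread.Sleep(25);

    // Each unit packs one call and its result processing together.
    public static async Task Unit1() => ProcessResult1(await CallApi1());
    public static async Task Unit2() => ProcessResult2(await CallApi2());
    public static async Task Unit3() => ProcessResult3(await CallApi3());

    public static async Task Main()
    {
        var units = new List<Task> { Unit1(), Unit2(), Unit3() };

        while (units.Count > 0)
        {
            // Processing already happened inside each unit;
            // WhenAny()'s bookkeeping cost remains in every iteration.
            Task done = await Task.WhenAny(units);
            units.Remove(done);
            await done; // observe any exception thrown by the unit
        }
    }
}
```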
Combining an async operation with its result processing is, according to my tests, faster than the first Task.WhenAny() approach, and the improvement becomes more pronounced as the length of the async operation and its result processing increases. However, it still cannot beat Task.WhenAll() or manual tuning.
The "process tasks as they complete" approach works much like the Task.WhenAll() approach in which we pack the async part and its result processing into one unit of work. The difference is that with Task.WhenAny() there is a tax on removing the completed task and re-registering continuations on every single loop iteration! So in production code, I suggest using it when you only care that any one of the tasks has completed and don't need to wait for all of them, or when the durations of the tasks cannot be known; but in the latter case you might as well just use Task.WhenAll(), because it is quicker.
Conclusion
Manual tuning gives us the best performance and incurs the least penalty on thread concurrency. However, it requires us to know the durations of the async operations beforehand.
Task.WhenAll() comes second when we pack each async operation and its result processing into a single unit of work. It does not require any advance knowledge of the durations of the async operations. It does, however, use more concurrent threads, and the number of concurrent threads can increase dramatically depending on the timing of our async operations as well as the underlying management of the TaskScheduler.
Processing tasks as they complete with Task.WhenAny() sounds promising in theory, but it is actually the slowest in practice due to all its housekeeping operations.
With manual tuning, we almost always want to keep the async phases (setup, the async wait, and result processing) separated so that we have full control over where to place each time block to best reduce idle time. With Task.WhenAll() and Task.WhenAny(), however, we almost always want to pack the async phases into a single unit of work whenever we can. But be aware of the number of concurrent threads; we just have to trust the TaskScheduler to do its job right!
I know it is a long read, so if you have made it this far, thank you! And as always, please leave a comment with your thoughts.
Reference
Benchmark project source code used in this article
Processing tasks as they complete
Asynchronous programming with async and await
Start Multiple Async Tasks and Process Them As They Complete
When to use Task.Delay, when to use Thread.Sleep?
Using async/await and Task.WhenAll to improve the overall speed of your C# code