Threads Vs Queues

And the story of scaling a single-threaded application (6M/sec)

Omar Elgabry
OmarElgabry's Blog
11 min read · Apr 15, 2019


Threads have been a constant, unavoidable pain in large-scale applications. We spend lots of time creating and managing them, and resolving conflicts. We have to take every single detail into consideration: memory management, race conditions, deadlocks, scheduling, joining threads, and the list goes on.

And so, there has been an intriguing question:

Can we get rid of them? Are there any alternatives?

Yes, … Queues.

Even though it might not be possible in all cases, moving away from threads can dramatically improve performance and the simplicity of the code.

Benefits?

Queues are cheaper. They reduce the memory cost of storing thread stacks in the application’s memory space. Trapping into the kernel happens only when absolutely necessary.

They eliminate the code needed to create and configure the threads.

They eliminate the code needed to manage and schedule work on threads.

They simplify the code we have to write. We focus on the work to be performed without having to worry about thread creation and management including thread communication.

And so the underlying queueing software handles all of the thread creation and management for us, and the benefit is that it typically manages threads much more efficiently than the corresponding threaded code would.

The tasks inside the queue not only can be persisted, monitored, and visualized but can also be used as an audit log to trace back what happened to the application, basically, a history of all the changes. This opens the gate for another pattern called “Event Sourcing”.

Threads

So how are threads being used?

The most common scenarios:

  • Single task threads. Create a thread to perform a single task and release the thread when the task is done.
  • Worker threads. Create one or more worker threads with specific tasks in mind for each. Dispatch tasks to each thread periodically or as needed.
  • Thread pools. Create a pool of generic threads. When you have a task to perform, grab a thread from the pool and dispatch the task to it.

These techniques are really just variants on the same principle. In each case, a thread is being used to run some task that the application has to perform.

The only difference between them is the code used to manage the threads and the queueing of tasks.

Queues

And for queues …

Queues are first-in, first-out data structures and an easy way to perform tasks synchronously or asynchronously, while either serializing these tasks or running them concurrently.

A task is simply some work that the application needs to perform. For example, we could define a task to perform some calculations, create or modify a data structure, process some data read from a file, or anything else.

Queues allow associating custom context data with tasks. Moreover, one can optionally pass in callbacks that get fired when a task fails or succeeds.

We define tasks by placing the corresponding code inside either a function or an object and adding it to the queue.

function processOrder(args) {
  // ... task code
}

/* await */ Queue.addTask(processOrder, { /* ...some data... */ });

As mentioned, queues can execute tasks either serially or concurrently. So, which one should be used: serial or concurrent?


Serial queues

Serial queues are useful when you want to execute one task at a time in the order in which they are added to the queue.

Serial queues are often used to protect access to a specific shared resource or mutable data structure, acting like a lock, so that no two threads can touch it at the same time. We’ll dive into locks shortly.

Each queue operates concurrently with respect to all other queues. If we created four serial queues, each executes only one task at a time but up to four tasks could still execute concurrently, one from each queue.

Although queues are cheap, avoid creating many serial queues, and opt for concurrent queues whenever possible.

Concurrent queues

Concurrent queues are useful when you have multiple tasks that can run in parallel.

Even though a concurrent queue dequeues tasks in first-in, first-out order, tasks may finish before or after the others around them. And so the execution order of the tasks is not guaranteed.

The actual number of tasks executed by a concurrent queue at any given moment can scale dynamically based on the available resources and other conditions such as the number of available cores, the amount of work being done, and the number and priority of tasks in other serial dispatch queues.
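To make the difference concrete, here is a minimal JavaScript sketch. The Queue API in this article is pseudocode, so SerialQueue below is a hypothetical stand-in that chains promises:

// Minimal sketch of a serial queue: each task is chained onto the
// previous one, so tasks run one at a time, in FIFO order.
class SerialQueue {
  tail = Promise.resolve();
  add(task) {
    // run the task after everything already queued, even if a task failed
    return (this.tail = this.tail.then(task, task));
  }
}

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const serial = new SerialQueue();
serial.add(() => delay(100).then(() => console.log('serial: task 1')));
serial.add(() => delay(10).then(() => console.log('serial: task 2'))); // always second

// A concurrent queue, by contrast, starts tasks as they arrive;
// the finish order is not guaranteed.
Promise.all([
  delay(100).then(() => console.log('concurrent: slow task')),
  delay(10).then(() => console.log('concurrent: fast task')), // likely prints first
]);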

Threads → Queues

With that being said, how could we convert the threaded code to using queues?

  • For a single task thread, encapsulate the task in a function or an object and dispatch it to a queue.
  • For worker threads, you need to decide whether to use a serial queue or a concurrent queue.
      — If you use worker threads to synchronize the execution of tasks, use a serial queue.
      — If you use worker threads to execute tasks with no interdependencies, use a concurrent queue.
  • For thread pools, encapsulate the task in a function or an object and dispatch them to a concurrent queue for execution.

Of course, simple replacements like this may not work in all cases. Tasks might, and will, fight over shared resources, and therefore a lock is required. There are also concerns like waiting to join all child threads, listening to OS events, etc.

Locks

Replacing your lock-based code with queues eliminates many of the penalties associated with locks — in both the contested and uncontested cases — and also simplifies your remaining code.

Queues have an advantage in predictability.

Instead of using a lock to protect a shared resource, we can create a serial queue to serialize only the tasks that access that resource, ensuring that only one task has access to it at any given time.
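A sketch of that lock-free pattern, reusing the minimal SerialQueue idea from the earlier sketch, with counter standing in for any shared resource:

// minimal serial queue from the earlier sketch
class SerialQueue {
  tail = Promise.resolve();
  add(task) { return (this.tail = this.tail.then(task, task)); }
}

let counter = 0;
const counterQueue = new SerialQueue();

function increment() {
  // no mutex: mutual exclusion falls out of the queue being serial
  return counterQueue.add(async () => { counter += 1; });
}

increment();
increment();
counterQueue.add(async () => console.log(counter)); // prints 2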

Queues do not impose the same penalties as with locks.

Queueing a task does not require trapping into the kernel to acquire a mutex because a queue works primarily in the application’s process space and only calls down to the kernel when absolutely necessary.

With locks, on the other hand, one or more threads might be blocked for some amount of time while waiting for a lock to be released, and a deadlock might occur.

It is true that a deadlock can also happen with queues if, for example, a synchronous task is inserted into a serial queue from inside another synchronous task on the same queue. This pattern is guaranteed to deadlock on a serial queue, and it should be avoided on concurrent queues as well.

And as long as you submit your tasks to a serial queue asynchronously, the queue can never deadlock.
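To make the serial-queue deadlock concrete, here is a sketch using the same minimal SerialQueue:

// minimal serial queue from the earlier sketch
class SerialQueue {
  tail = Promise.resolve();
  add(task) { return (this.tail = this.tail.then(task, task)); }
}

const serial = new SerialQueue();

serial.add(async () => {
  console.log('task 1: waiting on task 2...');
  // task 2 is queued behind task 1 on the same serial queue, so it
  // cannot start until task 1 finishes. But task 1 is waiting on it,
  // so neither ever completes: deadlock.
  await serial.add(async () => console.log('task 2: never runs'));
});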

But, should we push tasks synchronously or asynchronously?

  • Submitting a task asynchronously lets the current thread continue to run while the task is performed.
  • Submitting a task synchronously blocks the current thread until the task is completed.

Both options have appropriate uses, although it is certainly advantageous to submit tasks asynchronously whenever you can.
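A sketch of both submission styles with the same minimal SerialQueue; “synchronously” here simply means awaiting the task’s completion:

// minimal serial queue from the earlier sketch
class SerialQueue {
  tail = Promise.resolve();
  add(task) { return (this.tail = this.tail.then(task, task)); }
}
const queue = new SerialQueue();

async function caller() {
  // asynchronous submission: fire and forget, the caller keeps running
  queue.add(async () => console.log('runs whenever the queue gets to it'));
  console.log('caller did not wait');

  // synchronous-style submission: hold up the caller until the task is done
  await queue.add(async () => console.log('the caller waits for this one'));
  console.log('runs only after the task finished');
}

caller();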

In both cases, if the queue itself executes tasks serially, changes to the resource are guaranteed to be made in the order in which they were received.

However, if tasks are submitted asynchronously, the calling thread does not block. The only reason to dispatch synchronously is to prevent the current code from continuing until the task finishes executing.

But doesn’t running tasks in a serial queue hinder performance, as opposed to concurrent task execution?

Yes and no.

Which is better: a queue with concurrency of, say, 10, where each task takes a lock on the shared resource, makes the necessary changes, releases the lock, and continues? Or doing the same thing with a serial queue (one task at a time, no locks)?

It might require some profiling and test cases to find out.

For now, do keep in mind that:

  • Although it is safe to use locks from inside the tasks, when we acquire the lock, we risk blocking other tasks from executing in a concurrent queue.
  • Even in the non-contested case, there is always a performance penalty associated with taking a lock.
  • If two threads contend for a lock at the same time, any concurrency offered by the threads is lost or significantly reduced.
  • Threads take up both kernel and user-space memory. Queues do not pay the same memory penalty for their threads, and the threads they do use are kept busy and not blocked.

Joining

Thread joins allow us to spawn one or more child threads and then have the current thread wait until those threads are finished.

In threads world, one can have a parent thread waiting until the child finishes, at that point the parent can gather the results from the child and continue with its own work.

Queues, on the other hand, offer similar semantics with some additional advantages. We can add tasks to a group, and then wait until all of these tasks have finished executing.

Unlike thread joins, a queue group waits on all of its child tasks simultaneously. And because these groups are using queues to perform the work, they are very efficient.

// Add one or more tasks to the group (async)
queue.addToGroup(tasks, group);
// Do some other work while the tasks are executing
// ....
// When you cannot make any more forward progress,
// block the current thread until all the tasks in the group finish
queue.waitOnGroup(group);
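In plain JavaScript terms, a rough analogy (not the pseudocode API above) is to keep the group as a set of in-flight task promises and wait on them together:

const tasks = [
  async () => console.log('task 1'),
  async () => console.log('task 2'),
];

async function main() {
  // the "group" is simply the set of in-flight task promises
  const group = tasks.map((task) => task());

  // do some other work while the tasks are executing
  // ...

  // block further progress until every task in the group has finished
  await Promise.all(group);
  console.log('group done');
}

main();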

Events

Threads can also be used when interacting with low-level system events. For example, reading and writing files, listening to a network socket, etc.

The underlying system generates notifications in response to specific types of system events. When an event occurs, a task is added asynchronously to the specified queue for processing.

Because the queue automatically executes any tasks added to it, there is no extra code required to manage the queue.

readFile.on('success', function(event) {
  // sendMessage is a task to be added on file read success
  queue.add(sendMessage);
});

We can listen not only to system events but also custom events and trigger them from any part of the code asynchronously.
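For instance, custom events with Node.js’s built-in EventEmitter (one concrete way to do this; the order event and payload are made up):

const { EventEmitter } = require('events');

const events = new EventEmitter();

// listener: when the custom event fires, enqueue a follow-up task
// (console.log stands in for adding a real task to a queue)
events.on('order:paid', (order) => {
  console.log('queueing receipt task for order', order.id);
});

// trigger the event from anywhere in the code, asynchronously
setImmediate(() => events.emit('order:paid', { id: 42 }));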

Loops

If the code has loops, and the work done in each iteration is independent, and the order in which the iterations finish is unimportant (the execution order of loop iterations is not guaranteed), we might consider using a concurrent queue. This way, we can perform multiple iterations of the loop concurrently.

Using a serial queue is permissible and does the right thing for your code, but such a queue has no real performance advantage over leaving the loop in place.

for (let i = 0; i < count; i++) {
  console.log(i);
}
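If each iteration is independent, a sketch of the concurrent version dispatches every iteration as its own task (completion order is not guaranteed, which is fine here):

async function processItem(i) {
  console.log(i); // stand-in for the real per-iteration work
}

const count = 10;
Promise.all(Array.from({ length: count }, (_, i) => processItem(i)))
  .then(() => console.log('all iterations done'));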

Although queues have very low overhead, there are still costs of scheduling each loop iteration on a thread. Therefore, you should make sure your loop code does enough work to warrant the costs.

In general, if a task requires a noticeable amount of processing time or resources, consider dispatching it to an async queue (off to another thread), or off to a totally separate process with its own resources (CPU, memory, etc.).

Suspending and Resuming

We can temporarily prevent a queue from executing tasks by suspending it, and resume it again at some point later.

Suspend and resume calls are asynchronous and take effect only between the execution of tasks. And so, suspending a queue does not cause an already executing task to stop.
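A minimal sketch of a suspendable queue (hypothetical, not a real library API); note that the currently executing task always runs to completion:

class PausableQueue {
  tasks = [];
  suspended = false;
  running = false;

  add(task) {
    this.tasks.push(task);
    this.#drain();
  }
  suspend() { this.suspended = true; }  // takes effect between tasks
  resume() { this.suspended = false; this.#drain(); }

  async #drain() {
    if (this.running) return;
    this.running = true;
    while (this.tasks.length > 0 && !this.suspended) {
      await this.tasks.shift()(); // an already running task is never stopped
    }
    this.running = false;
  }
}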

Queues, like anything else, come with some challenges. These challenges and the workarounds are worth talking about.

And that brings us to an exciting story:

Scaling a single-threaded application to serve 6M requests/sec, backed by an async, concurrent queue.

And that’s what we’ll be discussing during the rest of this article.

A modified version of the Node.js Event Loop

The asynchronous mindset

Most interactions in web applications are built around requests. We send a request, perhaps by pressing a button, and wait until we get the result back. This is even more pronounced in audio and video applications, where latency matters.

The asynchronous communication model, on the other hand, is a bit different. You insert a task and get a notification back when the work is done. Unlike request-response, there is no waiting, no blocking.

We usually think of programming in synchronous terms and are not used to dealing with asynchrony.

Queues, when used asynchronously, might be inappropriate in situations where latency is an issue. The tasks are not guaranteed to finish at specific times. And threads might sound like the only survival plan.

So, it’s reasonable to consider how applicable the queues would be for something acting in an interactive mode.

And yes, it might take a bit of time to adjust to the asynchronous style. Nevertheless, we might end up with a more natural and often easier approach.

Performance

Even though async queues might not guarantee interactivity the way the request-response approach does, there are ways to significantly improve performance with queues, so that we can expect a response as if we were issuing a normal request-response call.

All in-memory

What if we loaded all the application data into memory? There might be a persistent database sitting somewhere, but all work is done against the in-memory storage.

These days, storing data is cheap. More and more applications are quite capable of fitting their whole working set in main memory.

One benefit is speed. No database access, no slow disk I/O operations.

The second benefit is simplicity. The code becomes much simpler: no object mapping to a database.

What happens if everything crashes?

Should it crash, rebuild the whole application state and load it into memory again from the persistent storage.

But doesn’t loading all that data take time?

Take snapshots. Take a snapshot every night during periods of low activity. On a crash, restore from the latest snapshot and replay only today’s events.
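A sketch of the restore path, where apply is a hypothetical reducer over the event log:

// rebuild in-memory state from last night's snapshot plus today's events
function apply(state, event) {
  if (event.type === 'deposit') state.balance += event.amount;
  if (event.type === 'withdraw') state.balance -= event.amount;
}

function restore(snapshot, eventsSince) {
  const state = { ...snapshot }; // start from the snapshot
  for (const event of eventsSince) {
    apply(state, event);         // replay today's events in order
  }
  return state;
}

const state = restore({ balance: 100 }, [
  { type: 'deposit', amount: 50 },
  { type: 'withdraw', amount: 30 },
]);
console.log(state.balance); // 120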

If that is not possible, or even better, for almost no downtime, have another replica sitting beside the main in-memory database that takes over when the current one crashes (M-M). The data has to be kept in sync between the two replicas.

Code

Removing the burden of dealing with lots of complexity allows other blind spots we had never considered to announce themselves.

Simplicity paves the way to concentrate instead on writing well-factored, clean code, on data structure choices, and on performance testing.

Surprisingly, this could push performance even further.

We are now able to optimize the code, and CPUs become more efficient at caching it.

External Systems

When working with external systems, we have no control over them. We’re restricted by their performance and functionality.

An external service call is going to be slow, and with a single thread it will halt the entire order-processing machine.

As a result, we can’t call an external service and wait for the result. Instead, we insert a task that does nothing but send the request and quit. On response, a callback fires, inserting another task to carry on the work.

With no blocking, the application carries on receiving incoming requests.
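A sketch of this fire-and-quit pattern; the URL, task names, and queue are placeholders (fetch is available globally in Node 18+):

// minimal serial queue from the earlier sketch
class SerialQueue {
  tail = Promise.resolve();
  add(task) { return (this.tail = this.tail.then(task, task)); }
}
const queue = new SerialQueue();

function requestQuote(order) {
  // fire the request and return immediately; nothing blocks the queue
  fetch('https://example.com/quote', { method: 'POST', body: JSON.stringify(order) })
    .then((res) => res.json())
    // on response, enqueue another task to carry on the work
    .then((quote) => queue.add(() => continueOrder(order, quote)))
    .catch((err) => queue.add(() => handleFailure(order, err)));
}

function continueOrder(order, quote) { console.log('continuing order', order.id, quote); }
function handleFailure(order, err) { console.log('order failed', order.id, err.message); }

queue.add(() => requestQuote({ id: 42 }));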

Error Handling

The traditional model of database transactions provides a helpful error handling capability. Should anything go wrong, it is easy to throw away everything that happened so far in the interaction. If an error occurs on the database side, you can roll back the transaction.

However, in-memory structures don’t have transactions, and so there are no rollbacks. If there is an error, it is important not to leave that memory in an inconsistent state.

And thus, it is important to ensure the input data are fully valid before doing any mutation of the in-memory state. Testing is a key tool in flushing out these kinds of problems before going into production.
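A sketch of that validate-first discipline, using an account transfer as a stand-in example:

// validate the whole command up front; mutate only once nothing can fail,
// because there is no transaction to roll back
function handleTransfer(state, cmd) {
  const from = state.accounts.get(cmd.from);
  const to = state.accounts.get(cmd.to);
  if (!from || !to) throw new Error('unknown account');   // reject early
  if (cmd.amount <= 0 || from.balance < cmd.amount) {
    throw new Error('invalid amount');                    // state untouched
  }
  // only now touch the in-memory state: it can no longer end up half-updated
  from.balance -= cmd.amount;
  to.balance += cmd.amount;
}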

Thank you for reading!

Feel free to reach out on LinkedIn or Medium.
