Concurrency approaches for faster applications

George Leivadas
Travelex Tech Blog
Feb 22, 2018

What is concurrency?

Concurrency refers to the ability to handle multiple tasks at a time. Note that handling them does not necessarily mean executing them simultaneously; that is the major distinction from parallelism. You can have concurrency without doing things in parallel. A good example is a single-core CPU executing multiple processes one at a time, interleaving them rather than running each to completion in sequence.

Comparison between sequential, concurrent and parallel task completion.

The picture above illustrates the difference between sequential, concurrent and parallel execution. There are 4 different tasks (each one represented by a different color). In sequential execution each task is executed in order, one at a time. In concurrent execution a task might be started before a previously started task has finished; however, progress is made on only one task at any given time. Finally, in parallel execution more than one task makes progress simultaneously.

Why concurrency matters

Concurrency improves:

applications, by reducing completion latency (speed) and increasing responsiveness, which is sometimes (especially in GUIs) highly appreciated! (e.g. the common frustration users feel when they hit the cancel button and the application does not respond)

resource utilisation, by allowing different resources to operate at different speeds without one blocking the other. Traditionally, CPUs have been much faster than any other part of a computer system, meaning that any task that requires data to be fetched outside the ‘CPU ecosystem’ can waste CPU cycles if not handled properly.

The table above shows how much time is actually needed (actual latency) for a typical action to be performed. To better appreciate the huge difference between a CPU cycle and any other action, the second column is scaled to show how long each action would take if one CPU cycle took one second to complete.

Limits of concurrency improvements

Although concurrency usually increases the speed of a process, the improvement is still bounded by the portion of the process that cannot be isolated and executed independently of the rest.

The total maximum improvement, for a given fixed workload, is provided by Amdahl’s Law.

Total Maximum Improvement (TMI) = 1 / [(1 - P) + P / S]

where P is the fraction of the task that can be parallelized and S is the speedup gained for that particular portion.

Assume that you have one task that consists of three different sub-tasks:
T = t1 + t2 + t3, where t1 takes 3 seconds, t2 takes 7 seconds and t3 takes 20 seconds to complete. If you manage to make t1 run in 0.3 seconds, the overall gain is:

P = Time of the task that can be parallelized / Total time = t1 / (t1 + t2 + t3)

P = 3 / (3 + 7 + 20) = 0.1

Since t1 is reduced to 0.3 seconds → S = 3 / 0.3 = 10 times faster

By Amdahl’s Law the total maximum improvement you gain for the overall process is:

TMI = 1 / ((1 - 0.1) + 0.1 / 10) = 1.098 times faster.

If you were to optimize t3 instead, by making it twice as fast, the speedup you get is 1.5. So it is better to make t3 two times faster than to make t1 ten times faster. This is something to keep in mind when you do optimizations.
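As a quick sanity check, here is a minimal Java sketch (class and method names are purely illustrative) that evaluates Amdahl’s Law for both scenarios above:

```java
public class AmdahlDemo {
    // Amdahl's Law: maximum overall speedup when a fraction p of the work
    // is sped up by a factor s.
    static double amdahl(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    public static void main(String[] args) {
        double t1 = 3, t2 = 7, t3 = 20;
        double total = t1 + t2 + t3;

        // Optimising t1: 10x faster on 10% of the work
        System.out.printf("t1 x10 faster: %.3f%n", amdahl(t1 / total, 10)); // ~1.099

        // Optimising t3: 2x faster on ~67% of the work
        System.out.printf("t3 x2 faster:  %.3f%n", amdahl(t3 / total, 2));  // 1.5
    }
}
```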

Always start with those that will have the biggest impact!

Concurrency techniques

Dispatcher worker approach

One of the simplest approaches is the dispatcher worker approach, where the initial task is divided into smaller sub-tasks, each of which can be completed independently and in parallel by a different thread. Below is an example of a web server implementation that uses this approach.

Incoming requests are placed into a queue from which a predefined pool of workers (each worker is a different thread) processes the requests. Each worker grabs a request object from the queue, processes it and places the result object onto another queue (the response queue). The proper number of workers really depends on the nature of the tasks being processed. If the tasks are CPU intensive then a value of 2 threads per core is usually enough. However, if workers perform a lot of I/O then higher values can be assigned, since the CPU will otherwise be idle most of the time waiting for I/O to finish. All in all, the exact number of threads is something that should be measured and is inextricably connected to the specific use case.
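A minimal sketch of this setup in Java could look as follows (the Request and Response types and the processing logic are hypothetical, and real code would need proper shutdown and error handling):

```java
import java.util.concurrent.*;

// Hypothetical request/response types for illustration.
record Request(int id) {}
record Response(int id, String body) {}

public class DispatcherWorkerDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Request> requestQueue = new LinkedBlockingQueue<>();
        BlockingQueue<Response> responseQueue = new LinkedBlockingQueue<>();

        int workers = Runtime.getRuntime().availableProcessors() * 2; // rough CPU-bound default
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Each worker loops: take a request, process it, publish the response.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        Request req = requestQueue.take();   // blocks until a request arrives
                        Response res = new Response(req.id(), "handled " + req.id());
                        responseQueue.put(res);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();      // allow clean shutdown
                }
            });
        }

        // Dispatcher side: enqueue a few requests and read back the responses.
        for (int i = 0; i < 5; i++) requestQueue.put(new Request(i));
        for (int i = 0; i < 5; i++) System.out.println(responseQueue.take());

        pool.shutdownNow();
    }
}
```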

When using this approach, special emphasis should be given to avoiding contention among workers while still preserving thread safety. Since every worker executes much of the business logic, it is highly probable that workers will require access to shared state, and this poses a challenge: overusing synchronized blocks, locks or any other mechanism that restricts access to shared state will degrade the performance of the system, but failing to use them will destroy correctness.

Reactive approach

The reactive approach is inspired by the notable Actor model invented by Carl Hewitt in 1973. The Actor model has been used as a framework to understand concurrency in general and also as a theoretical point of reference for implementing highly concurrent systems. The Actor model describes a universe that consists of actors. An Actor is the smallest computational entity that can send and receive messages.

In addition an Actor must satisfy the following three axioms:

  • Send a finite number of messages to other Actors.
  • Create new Actors.
  • Receive a message and designate the behavior to be used for the next incoming message.

The last property implies that an Actor can potentially have local internal mutable state. However, this state can only be changed as a result of a message that was received; an Actor is not allowed to expose its state to other Actors directly.

Communication between Actors is done directly, using immutable messages sent via the Actor’s mailbox. A mailbox is like an address to which messages are sent in order to be forwarded to an Actor. There is a many-to-many relationship between Actors and mailboxes: a mailbox can be shared across many Actors, and an Actor can have many mailboxes. Messages sent to a mailbox are not guaranteed to be delivered in order, or to be delivered at all, as the model only specifies a ‘best effort’ approach. However, should a message be delivered, that has to happen at most once. In addition, there are no time restrictions on message delivery, and it can be assumed that any message can take an arbitrarily long time to be received.

A very good, yet simple, explanation of the Actor model can be found here.

The equivalent ‘reactive solution’ to the previous dispatcher example.

The reactive approach mimics the Actor model in the sense that it isolates state inside each module and favors message passing (instead of memory sharing) for communication between modules. Although a thread can be assigned to each of the modules, this is not necessary, as you can have N threads serving M workers where N < M.
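To make this concrete, here is a minimal hand-rolled actor sketch in Java (the CounterActor and its string messages are purely illustrative; a real system would more likely use a framework such as Akka). State is private to the actor and only modified by the single thread that drains its mailbox:

```java
import java.util.concurrent.*;

// A tiny hand-rolled actor: state is private and only modified
// by the single thread that drains the mailbox.
class CounterActor {
    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private int count = 0;                       // internal mutable state, never exposed directly

    CounterActor() {
        Thread runner = new Thread(() -> {
            try {
                while (true) {
                    String msg = mailbox.take(); // process one message at a time
                    switch (msg) {
                        case "increment" -> count++;
                        case "print"     -> System.out.println("count = " + count);
                        default          -> { /* ignore unknown messages */ }
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        runner.setDaemon(true);
        runner.start();
    }

    // Other actors communicate only by sending immutable messages.
    void tell(String message) {
        mailbox.offer(message);
    }
}

public class ActorDemo {
    public static void main(String[] args) throws InterruptedException {
        CounterActor counter = new CounterActor();
        for (int i = 0; i < 3; i++) counter.tell("increment");
        counter.tell("print");   // eventually prints: count = 3
        Thread.sleep(200);       // give the demo time to drain the mailbox
    }
}
```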

Single threaded event loop approach

In multi-threaded environments a lot of potential challenges and issues (e.g. deadlocks, priority inversions, race conditions) have to be addressed carefully in order to build an efficient system that still maintains correctness.

Another simple but very powerful approach is to have only a single thread executing all the non-blocking code and to make use of asynchronous APIs for tasks that may block. One example that uses this approach is the Node JS platform.

Despite what the name of this approach suggests (it is only partially true), this is one more approach that utilizes multiple threads to achieve concurrency. The ‘single threaded event loop’ architecture consists of three major components:

  • Event loop
  • Event queue
  • Internal thread pool

The event loop is the most crucial component of this architecture: it is where all non-blocking operations are executed, and this is done by a single thread only (hence the name).

The event queue is simply a FIFO data structure to which tasks are added, to be served later by the event loop component.

Finally, the internal thread pool is used for executing I/O operations.

The queue is used to submit incoming tasks. The event loop constantly checks for new tasks and, if any are found, they are completed on a first-come, first-served basis. In case a blocking operation is needed to complete a task, control is passed to the internal thread pool along with a callback. Once the result is ready, the callback is used to return control back to the event loop, and execution continues until either the next blocking operation or the completion of the request.
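As an illustration (a simplified sketch in Java rather than how Node JS is actually implemented, with hypothetical names such as handleRequest and simulateBlockingIo), the three components can be modelled with a single-threaded executor as the event loop and a separate pool for blocking work, with the callback hopping back onto the event loop thread:

```java
import java.util.concurrent.*;

public class EventLoopDemo {
    // Single thread that runs all non-blocking work (the "event loop").
    private static final ExecutorService eventLoop = Executors.newSingleThreadExecutor();
    // Separate pool for blocking I/O so the event loop is never stalled.
    private static final ExecutorService ioPool = Executors.newFixedThreadPool(4);

    static void handleRequest(int id) {
        // Non-blocking part runs on the event loop thread.
        System.out.println("request " + id + " accepted on " + Thread.currentThread().getName());

        CompletableFuture
            .supplyAsync(() -> simulateBlockingIo(id), ioPool)   // blocking work off-loaded
            .thenAcceptAsync(result ->                           // callback hops back to the loop
                System.out.println("request " + id + " finished: " + result
                        + " on " + Thread.currentThread().getName()),
                eventLoop);
    }

    private static String simulateBlockingIo(int id) {
        try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return "payload-" + id;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 3; i++) {
            int id = i;
            eventLoop.submit(() -> handleRequest(id));  // submit incoming tasks to the loop
        }
        Thread.sleep(500);                              // wait for demo output, then shut down
        eventLoop.shutdown();
        ioPool.shutdown();
    }
}
```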

Having only one thread dealing with the ‘core business logic’ provides inherently thread-safe processing without the need for locks or synchronized blocks. In addition, less memory is consumed and better CPU utilization is achieved due to less context switching. Usually, applications that are not very CPU intensive tend to perform better using this approach. However, one must be very careful when using this model, as even a single blocking operation inside the event loop makes disaster almost certain.

Choosing the right approach

As there is often no silver bullet when trying to find an optimal solution to a problem, this one is no exception. Any of the aforementioned concurrency approaches can be better suited to a particular problem. The system environment, the nature of the problem and the solution requirements (e.g. low latency, high throughput, etc.) are all factors that should be taken into consideration when choosing the right approach. For example, if tasks are independent, then the simplicity of the dispatcher approach offers a good solution where control flow clarity is honored without compromising performance. On the other hand, if the code flow is convoluted, then the actor approach tackles the problem more effectively, since each actor is focused on one thing only and mutable state is isolated. Last but not least, the ‘single threaded event loop’ approach is best when tasks are not CPU intensive and a lot of I/O operations are being performed.
