Work-Stealing Algorithm Distilled

Published in

The Startup

4 min readSep 7, 2020

Before talking about Work Stealing, we need to talk about task management. We could use Tomcat as one example. Tomcat will use multiple threads to handle concurrent connections. The number of threads can go very high when there are many concurrent connections at the same time.

However, a computer has a fixed number of cores. The number of threads that could be run at the same time is equal to the number of cores. All other threads have to be in waiting/ready states. The ready state threads will be scheduled to take over the CPU when the current running thread starts to do blocking IO. Scheduling in and out of threads is done by OS and it takes time. Too many threads will also consume more system resources, like the memory used for the thread data structure. Too many threads will also hurt cache locality.

The ideal number of threads is the number of cores, and each thread is running on its own core without being scheduled out. However, this is the ideal case. Because when one thread is invoking a blocking IO API, the thread will be put into waiting state by OS. Another ready thread will be scheduled in to use the CPU. Currently, many DB drivers are being rewritten to support nonblocking API. These DB drivers include mongo, redis, cassandra, etc. But MySQL DB driver does not support nonblocking IO.

If we could not control third party blocking API, we could control our own code. When we are writing our own tasks, if there are no blocking IOs involved, we don’t have to block one thread on the output of another thread.

JDK and third party projects are also doing a lot of work on how to smartly manage task dependencies, trying to introduce as less as threads as possible to squeeze the CPU usage for one thread. This includes ForkJoinPool, Reactor Project, Netty, Project Loom etc.

Since JDK 7, ForkJoinPool is introduced. The important thing about ForkJoinPool is that it’s created with the following ruling points.

Each thread has its own task queue. Each task queue is a cicular array.
Each thread uses push and pop to add or remove tasks to its own queue.
ForkJoinPool uses the work-stealing algorithm to balance the workload on different threads. The pool maintains a global work queue that stores externally submitted tasks. Each worker thread will pop tasks from its own task queue. If there are no tasks in its own queue, it will try to randomly steal tasks from the shared work queues or other workers. If it fails to find tasks from both shared queues or other threads, it will go to sleep.

push: used by worker thread to push task to the top of its own work queuepop: used by worker thread to pop task from the top of its own workpoll: used by other thread to steal task from the bottom of the work queue of a different thread

Since this work-stealing algorithm is so general, it could be used in all other languages to manage task scheduling. For example, in Kotlin and Python, there are coroutines. A coroutine can be considered as user-space light thread. It does not involve scheduling of OS. It’s completely user-space task scheduling. The benefits of it are that it could use very few threads to manage the tasks. Project Loom is also introducing Fiber in Java. Fiber is similar to Coroutine.

The general trend for user application development is tending to use lighter threads to manage user tasks. It achieves better performance with a lower number of threads, non-blocking IO, and user-space task management.

CompletableFuture

CompletableFuture internally uses ForkJoinPool to manage the scheduling of the tasks. The following is one simple example.

CompletableFuture.supplyAsync(() -> {
    try {
        Thread.sleep(30000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    System.out.println("current thread is "+Thread.currentThread().getId());
    return Thread.currentThread().getId();
}).thenCombineAsync(CompletableFuture.supplyAsync(() -> {
    try {
        Thread.sleep(70000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    System.out.println("current thread is "+Thread.currentThread().getId());
    return Thread.currentThread().getId();
}), (l, r) -> l + r)
        .thenAcceptAsync(x ->
        {
            System.out.println("current thread is "+Thread.currentThread().getId());
        }).thenRunAsync(() -> {
            System.out.println("current thread is "+Thread.currentThread().getId());
        });

supplyAsync by its name is saying its forking one Async Task. This task will be submitted to ForkJoinPool. ForkJoinPool will use one worker thread to run it.

combineAsync says that it should combine the result of the previous two submitted tasks. Since comebineAsync depends on the previous task results, so CompletableFuture will not start a new thread for this task. It uses the existing thread to run the task when the dependent tasks are finished.

runAsync is just submitting another independent task to ForkJoinPool.

There are many APIs in CompletableFuture. The way to construct CompletableFuture tasks is very similar to Promise in frontend. When many callbacks are chained together, it could make the code not so readable. Some third-party libraries introduced language constructs async and await in Java similar to Javascript promise.

Work-Stealing algorithm is also used in Reactor Project. And it might also be in many other lower-level task scheduling frameworks, libraries, kernel task management, etc.

If you want to know the details of Work-Stealing algorithm, you could read the paper and the source code of the ForkJoinPool.

https://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java#l1348

https://www.dre.vanderbilt.edu/~schmidt/PDF/work-stealing-dequeue.pdf

Please leave a comment if you find anything wrong.

Work-Stealing Algorithm Distilled

Written by Ryan Zheng