A simple benchmark for JDK Project Loom’s virtual threads.

Alexander Zakusylo
4 min read · Jul 4, 2020

Performance always matters. Photo: Sonja Langford

Project Loom from Oracle's JDK team is a highly anticipated feature aiming to make concurrency simpler and even faster. Project Loom introduces virtual threads: lightweight threads that are so cheap to create that it's fine to spawn thousands or even millions of them in a single JVM process.

When you plan to run some tasks concurrently, the optimal thread configuration depends on how CPU-intensive your tasks are. The classic rules are:

  • A. For math calculations and other pure CPU tasks, use one thread per physical CPU core: say, a pool of 4 threads would be optimal for a quad-core machine.
  • B. For tasks that do not involve CPU at all (network and other slow IOs, waiting for events, etc), create a thread per task.
  • AB. For tasks that involve some CPU load and some waiting, balance the thread count between the numbers from cases A and B according to the expected CPU utilization percentage. A popular formula from Brian Goetz's book Java Concurrency in Practice recommends: threads = cores * (1 + waiting time / CPU time)

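Goetz's heuristic is easy to sketch in plain Java; the method and class names below are my own illustration, not from any library.

```java
public class PoolSizer {

    // Goetz's sizing heuristic: threads = cores * (1 + waitTime / computeTime).
    // The two times may use any unit, as long as it is the same for both.
    static int optimalThreads(int cores, double waitTime, double computeTime) {
        return (int) (cores * (1 + waitTime / computeTime));
    }

    public static void main(String[] args) {
        // Case A: pure CPU work, no waiting -- one thread per core.
        System.out.println(optimalThreads(4, 0, 100));  // 4
        // Case AB: tasks wait 50 ms for every 50 ms of CPU time.
        System.out.println(optimalThreads(4, 50, 50));  // 8
    }
}
```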
For case B, the idea of dedicated threads works well as long as the OS is still able to provide threads as required. When the number of tasks grows into the thousands, the overhead of creating native OS threads becomes too heavy. This is where virtual threads should come to help.

Imagine that we concurrently start 1000 tasks that just sleep for 100 ms and then reply to the main calling thread. Once the main thread has collected all the responses, it considers the job done. How long would the whole thing take? In a perfect world, we'd expect it to take only a little longer than 100 ms: the tasks just sleep, so there's no CPU usage, and it should be possible to parallelize them perfectly.
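
The shape of the experiment can be sketched with nothing but the JDK: a plain-Java version (my own simplification, not the actual benchmark) where a CountDownLatch stands in for the actor replies, and the task count is reduced to keep it quick.

```java
import java.util.concurrent.CountDownLatch;

public class SleepAll {

    // Starts `tasks` concurrent sleepers, waits for all of them to "reply",
    // and returns the elapsed wall-clock time in milliseconds.
    static long run(int tasks, long sleepMs) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(tasks);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            new Thread(() -> {
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException ignored) {
                }
                done.countDown();  // the "reply" to the main thread
            }).start();
        }
        done.await();              // main thread collects all responses
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // With one thread per task, total time stays close to a single sleep.
        System.out.println(run(100, 100) + " ms");
    }
}
```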

I implemented this with actr, a minimalistic type-safe Java actor model implementation that allows configuring the thread pool that executes tasks; actr also supports Loom's virtual threads (given an appropriate unreleased JDK build, of course).

We’ll run our benchmark with the following actr task scheduling configurations:

  • Fixed thread pool with the number of threads equal to the number of CPU cores (normally optimal for case A);
  • Work-stealing ForkJoinPool-based scheduler; it's normally useful for varying load, avoiding overloading some threads while others are idle;
  • Thread per task scheduler (normally optimal for case B);
  • Virtual thread per task scheduler (expecting this to perform better);
  • Single thread for all scheduler (no concurrency in fact) — just for fun.
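
These configurations have familiar JDK analogues; the sketch below shows those analogues (the mapping is my own illustration, not actr's API).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

public class Schedulers {

    // Parallelism of the shared work-stealing pool
    // (defaults to availableProcessors() - 1, with a minimum of 1).
    static int workStealingParallelism() {
        return ForkJoinPool.commonPool().getParallelism();
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // Fixed pool, one thread per core (case A).
        ExecutorService fixed = Executors.newFixedThreadPool(cores);

        // Thread-per-task-like behavior: an unbounded pool that creates
        // a thread for every submission when none are idle (case B).
        ExecutorService perTask = Executors.newCachedThreadPool();

        // Single thread for everything: no concurrency at all.
        ExecutorService single = Executors.newSingleThreadExecutor();

        System.out.println("work-stealing parallelism: " + workStealingParallelism());

        fixed.shutdown();
        perTask.shutdown();
        single.shutdown();
    }
}
```

On Loom-enabled JDKs (the API shipped later in Java 21), the virtual-thread-per-task variant corresponds to Executors.newVirtualThreadPerTaskExecutor().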

The code for the benchmark can be found at https://github.com/zakgof/akka-actr-benchmark/blob/master/src/main/java/com/zakgof/aab/ActrParallelSleepingTell.java (see https://github.com/zakgof/akka-actr-benchmark/ for more actr- and akka-related benchmarks):

public static void run(int actorcount, IActorScheduler scheduler) throws InterruptedException {
    final IActorSystem system = Actr.newSystem("actr-massive", scheduler);
    IActorRef<Master> master = system.actorOf(() -> new Master(actorcount));
    master.tell(m -> m.start());
    system.shutdownCompletable().join();
}

private static class Master {
    private final BitSet bitset;
    private final int actorcount;

    public Master(int actorcount) {
        this.actorcount = actorcount;
        this.bitset = new BitSet(actorcount);
        bitset.set(0, actorcount);          // one bit per runner still pending
    }

    public void start() {
        for (int a = 0; a < actorcount; a++) {
            IActorRef<Runner> runner = Actr.system().actorOf(Runner::new);
            int actoridx = a;
            runner.tell(r -> r.run(actoridx));
        }
    }

    public void runnerReplied(int actorIdx) {
        bitset.clear(actorIdx);
        if (bitset.isEmpty()) {             // all runners have replied
            Actr.system().shutdown();
        }
    }
}

private static class Runner {
    private static final int SLEEP_AMOUNT = 100;

    private void run(int actorIdx) {
        try {
            Thread.sleep(SLEEP_AMOUNT);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // restore the interrupt flag
        }
        Actr.<Master> caller().tell(m -> m.runnerReplied(actorIdx));
    }
}

The results were obtained on the following HW/SW configuration:

Intel Core i5-6500
OpenJDK 16-loom+2-14 (Early access build from 2020/6/27)
Windows 10

What would we expect in a perfect world?

one-thread-for-all     100.000 s/op   (1000 tasks x 100 ms)
fixed-thread-pool       25.000 s/op   (1000 tasks / 4 threads x 100 ms)
fork-join-pool          33.333 s/op   (1000 tasks / 3 threads x 100 ms)
thread-per-task          0.100 s/op   (simultaneous wait of 100 ms)
virt-thread-per-task     0.100 s/op   (simultaneous wait of 100 ms)
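
These ideal-case numbers follow from simple arithmetic: total time = tasks / threads × sleep time. A small sketch (names are illustrative):

```java
public class IdealTime {

    // Ideal wall-clock seconds for `tasks` sleepers of `sleepMs` each,
    // spread over `threads` workers, ignoring all scheduling overhead.
    static double idealSeconds(int tasks, int threads, int sleepMs) {
        return (double) tasks / threads * sleepMs / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(idealSeconds(1000, 1, 100));     // 100.0  one thread for all
        System.out.println(idealSeconds(1000, 4, 100));     // 25.0   fixed pool of 4
        System.out.println(idealSeconds(1000, 3, 100));     // ~33.3  fork-join, 3 workers
        System.out.println(idealSeconds(1000, 1000, 100));  // 0.1    thread per task
    }
}
```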

Finally, the actual results:

one-thread-for-all     109.435 ± 0.023 s/op
fixed-pool              25.363 ± 1.721 s/op
fork-join-pool          34.711 ± 0.755 s/op
thread-per-task          0.216 ± 0.092 s/op
virt-thread-per-task     0.117 ± 0.010 s/op

The first three look more or less as expected; nothing too interesting there.

Running NOOP tasks in 1000 native Windows threads seems to be a bit heavy for the system: the overhead constitutes more than 100% of the consumed time (216 ms against the ideal-case expectation of 100 ms of running time).

And here, ladies and gentlemen, virtual threads come into action: almost 2x faster than the native threads. Even on an early access build (with Loom not yet assigned to any JDK release), virtual threads show much better performance than native threads. The overhead over the theoretical expectation is 17% (far better than the 116% overhead for native threads).

Well done, Oracle, now we can’t wait for getting Loom released!
