Java Virtual Threads

Revisited April 2024

Borislav Stoilov
CodeX
16 min read · Jun 10, 2022


Virtual threads are now an official part of the Java language, released with Java 21, which is a Long-Term Support (LTS) version. This means that we can use them without enabling experimental features or persuading management to switch to a non-LTS Java version.

In the early versions of Java, when the multithreading API was designed, Sun Microsystems faced a dilemma: should they use user-mode threads or map Java threads one-to-one with OS threads? At the time, all benchmarks indicated that user-mode threads were significantly less efficient, leading to increased memory consumption without offering substantial benefits. However, these benchmarks were conducted over 20 years ago, when circumstances were quite different. The demand for high-load processing was lower, and the Java language was still relatively immature. Now the landscape has changed, and there have been a few attempts to introduce user-mode threads into Java, such as the Fiber project. Unfortunately, because Fibers were implemented as a separate class, it was quite challenging to migrate an entire codebase to use them. As a result, they eventually fell out of favor and were never integrated into the core Java language.

This all changes with Project Loom.

Project Loom

There are several Java projects with very specific tasks to achieve. These include Valhalla, Panama, Amber, and of course Loom. Loom’s goal is to overhaul the concurrency model of the language. It aims to bring virtual threads, structured concurrency, and a few other improvements to Java concurrency, like ScopedValues.

A few words on the Java concurrency model

The way threads are implemented in the JVM is considered one of the best, even by non-Java developers. We have excellent thread debugging; you get thread dumps, breakpoints, memory inspection, and much, much more. You can even use the JFR API to define custom metrics for your threads.

The Thread class is how Java provides access to the OS Thread API. Most of the operations performed in this class make system calls. In production, we rarely use Thread directly; we use the Java concurrency package with Thread Pools, Locks, and other useful features. Java has excellent built-in tooling for multithreading.
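
As a minimal sketch of that difference in practice (pool size and messages here are arbitrary):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Raw Thread API: each start() maps to an OS thread via a system call
Thread raw = new Thread(() -> System.out.println("running on " + Thread.currentThread()));
raw.start();

// Production style: reuse a small pool of threads instead of creating them ad hoc
ExecutorService pool = Executors.newFixedThreadPool(4);
pool.execute(() -> System.out.println("running on " + Thread.currentThread()));
pool.shutdown();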

Concurrency and parallelism

Before we dive into the interesting topics, we need to clarify an important point. Concurrency and parallelism are two concepts that are often confused, leading to misunderstandings.

Parallel means that two or more tasks are executed at the same time. This is possible only if the CPU supports it, requiring multiple cores to achieve parallelism. However, modern CPUs are typically multi-core, and single-core CPUs are largely outdated and no longer widely used, as they are significantly outperformed by multi-core ones. This shift is because modern applications are designed to utilize multiple cores, often needing to perform several tasks simultaneously.

Concurrency means that tasks are managed at the same time. For example, JavaScript is a single-threaded language where all concurrent tasks are managed by a single thread. JavaScript uses async/await to facilitate concurrency.

From the OS perspective, the scheduler has to juggle the threads of multiple processes. The number of threads is always higher than the number of cores, which means the OS must perform context switches. Briefly explained, every thread has a priority and can either be idle, working, or waiting for CPU cycles. The scheduler has to go through all threads that are not idle and distribute the limited CPU resources based on priority. It also must ensure that all threads with the same priority get a fair amount of CPU time; otherwise, some applications might freeze. Every time a core is assigned to a different thread, the currently running thread has to be paused and its register state preserved. Additionally, the scheduler has to track whether any idle threads have become active. As you can see, this is a complex and resource-intensive operation, and as developers, we should aim to minimize the number of threads we use. Ideally, the thread count stays close to the number of CPU cores, which keeps context-switching overhead down.

Modern Java server concurrency problems

The cloud space is growing rapidly, leading to increased load and resource requirements. Most enterprise servers — those that handle the heaviest workloads — are built with Java, placing the burden on Java to address the load problem. So far, Java has performed well, given that it’s still the most popular language for servers, but that doesn’t mean it’s flawless.

The common way we handle requests is by dedicating a platform thread to each one. This is known as the “thread-per-request model.” When a client makes a request, a thread is assigned to handle it, and it remains occupied until the task is complete. Servers usually start with a predefined number of threads (like 200 in Tomcat), which are placed in a thread pool to await requests. Their initial state is “Parked,” which means they don’t consume CPU resources until they’re needed.
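
For example, if you're running Spring Boot's embedded Tomcat, that pool size is a single property (shown here with its default):

server.tomcat.threads.max=200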

This approach is simple to write, understand, and debug, but it has limitations. For example, what if the client makes a request that involves a blocking call? Blocking calls are operations that wait for third-party actions to complete, such as a SQL query, a request to another service, or an I/O operation with the OS. When a blocking call occurs, the thread is unusable while waiting, but it still requires CPU management because it’s not idle. This increases context switching, impacting performance.

Servers impose limits on the number of threads; while higher thread counts might boost throughput, they can significantly slow down request processing due to excessive context switching. It’s a fine balance that must be maintained. People sometimes ask, “Why not just spawn 10,000 threads to handle 10,000 requests at once?” Although technically possible — even spawning 1 million threads with proper configuration — benchmarks show that popular CPUs experience about 80% CPU utilization due to context switching with just 3,000–4,000 threads. Moreover, the OS still needs CPU resources to manage other processes.

To address scalability issues, a common solution is to scale horizontally by adding multiple server nodes. This works, allowing you to handle as many requests as needed if you can afford the cost. However, in cloud technologies, one of the main goals is to reduce operational costs. Scaling horizontally might not be feasible for all budgets, potentially leading to slow or barely usable systems.

Concurrency Models

Concurrency to the rescue! Let’s talk about a few concurrency models adopted in other languages.

Callbacks

Callbacks are a simple yet powerful concept. They are objects passed as parameters to other functions or procedures. The parent function provides the callback to the child function, which can use it to notify the parent function of certain events — for example, “I have completed my task.” This approach allows concurrency on a single thread. Callbacks create a stack trace that can simplify debugging. They work well when the nesting is one or two levels deep but can quickly become unmanageable if the callback chain becomes too complex. Today, callbacks are mainly used as building blocks for other concurrency models and are generally considered a legacy practice.
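
To make this concrete, here is a minimal callback sketch in Java; the Callback interface and fetchData method are illustrative, not a real API:

interface Callback {
    void onComplete(String result);
    void onError(Exception e);
}

static void fetchData(Callback callback) {
    try {
        String data = "payload";   // pretend this came from a slow source
        callback.onComplete(data); // notify the parent: "I have completed my task"
    } catch (Exception e) {
        callback.onError(e);
    }
}

// The parent provides the callback when invoking the child:
fetchData(new Callback() {
    public void onComplete(String result) { System.out.println("got " + result); }
    public void onError(Exception e) { e.printStackTrace(); }
});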

Async/Await and Promises

Promises represent a future computation or potential failure. A function can return a promise, like the result of an HTTP request, and caller functions can chain their logic to it. This is the method of achieving concurrency in many popular languages. Java also has a promise-like structure called Futures, but only the CompletableFuture has the complete feature set of a typical promise. However, many operations in Java are blocking, which can negate some benefits of using Futures.
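
As a quick sketch of what promise-style chaining looks like with CompletableFuture (the values here are arbitrary):

import java.util.concurrent.CompletableFuture;

CompletableFuture
        .supplyAsync(() -> "41")              // the future computation
        .thenApply(Integer::parseInt)         // chain logic onto the result
        .thenApply(n -> n + 1)
        .exceptionally(ex -> -1)              // the potential failure path
        .thenAccept(System.out::println);     // prints 42 when the chain completes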

Async/await is syntactic sugar over promises, reducing the boilerplate code needed for chaining, subscribing, and managing promises. Generally, you can mark a function as async, and its result is internally wrapped in a promise.

One significant issue with async/await is the so-called “colored function” problem. Marking a function async makes it non-blocking, but a regular synchronous function (one without the async keyword) cannot await an async one; to consume its result, the caller has to become async itself, and the color spreads up the call chain. You might think, “I’ll just make everything async and avoid blocking functions,” but if you use even one third-party library that has blocking functions, this approach can quickly fall apart. Additionally, certain language features or operations might be inherently blocking, forcing you to deal with function coloring at some point.

Coroutines (Continuation + routine)

When we talk about coroutines, we don’t mean Kotlin’s coroutines, even though they use the same term.

Continuations are a specific kind of function call. If function A calls function B, and that’s the last thing A does, then B is considered a continuation of A.

Routines (also known as subroutines) are reusable blocks of code typically called multiple times during execution. You can think of them as a fixed set of instructions with defined input and output that can be invoked as needed.

By combining these concepts, we get coroutines. Coroutines are essentially suspendable tasks managed by the runtime, forming a tree-like structure of chain calls.

Coroutines have a few key properties (sketched as a minimal interface after this list):

  • They can be suspended and resumed at any time.
  • They are a data structure that can retain state and stack trace.
  • They can yield control to other coroutines (or subroutines).
  • They typically have methods like isDone(), yield(), and run().
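
A conceptual sketch of that shape; this is not a real JDK type, just the interface implied by the properties above:

interface Coroutine<T> {
    boolean isDone(); // has the task run to completion?
    void yield();     // suspend and hand control back to the scheduler
    void run();       // resume execution from the last suspension point
    T result();       // the value produced once isDone() is true
}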

Virtual Threads

The developers of Project Loom had many considerations and multiple options for implementing virtual threads. I’m glad they chose the coroutine approach. The Thread class remains unchanged and uses the same API, ensuring a seamless migration; switching to green threads is just a flag. However, this comes at a significant cost. They had to modify every API in the JDK, such as Sockets and I/O, to ensure they don't block when running on a virtual thread. This is a massive change affecting the core APIs of the JDK, and it has to be backward-compatible to avoid breaking existing code. It's no wonder this process took more than five years to complete.

To switch to virtual threads, you don’t need to learn anything new; you just need to unlearn a few things:

  • Never pool virtual threads; they are cheap, and pooling doesn’t make sense.
  • Avoid using ThreadLocal. It works, but if you spawn millions of threads, it can lead to memory problems. As Ron Pressler noted, “Thread locals should never have been exposed to end-users and should have stayed as an internal implementation detail.” See the ScopedValue sketch below for the replacement the JDK is introducing.
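
A minimal sketch of ScopedValue, a preview API (JEP 446) at the time of writing: it binds a value for the extent of a call rather than for the lifetime of a thread.

// ScopedValue lives in java.lang (preview in Java 21; requires --enable-preview)
static final ScopedValue<String> REQUEST_ID = ScopedValue.newInstance();

// The binding exists only for the extent of the run(...) call:
ScopedValue.where(REQUEST_ID, "req-42").run(() -> {
    System.out.println(REQUEST_ID.get()); // prints "req-42" anywhere down this call chain
});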

Here’s an almost exhaustive list of advantages virtual threads have over platform threads:

  • Context switching is effectively free. Virtual threads are managed by the JVM, which handles context switching.
  • Tail-call optimization. The JEP mentions that tail-call optimization is applied to virtual threads, which can save stack memory, although this is still a work in progress.
  • Cheap start/stop. Starting or stopping an OS thread involves system calls to create or terminate it and to manage its memory. Starting a virtual thread is just allocating an object, and stopping one is just dropping the reference and letting garbage collection handle it.
  • High upper limits. The OS can only manage a finite number of threads, even as hardware improves. Currently, you can spawn tens of millions of virtual threads, which should be more than enough for most cases.
  • A thread performing a transaction behaves differently from one processing video. This distinction can be easily overlooked. The OS and CPU must be optimized for general purposes, handling a variety of tasks requested by applications, but the JVM can optimize its threads for specific Java tasks.
  • Resizable stack. Virtual threads live in RAM, including their stack and metadata. Platform threads require a fixed stack size (with Java, it’s 1 MB), which can’t be resized. This can lead to stack overflows if you exceed the limit and wasted memory if you don’t use it all. The minimum required memory to bootstrap a virtual thread is about 200–300 bytes, offering significant memory efficiency.

Working with Virtual Threads

Consider the following:

for (int i = 0; i < 1_000_000; i++) {
    new Thread(() -> {
        try {
            Thread.sleep(1000);
        } catch (Exception e) {
            // deal with e
        }
    }).start();
}

Here, we attempt to create 1 million regular threads. Each thread simply sleeps for 1 second and then terminates. As expected, this code leads to an OutOfMemoryError. On my machine, I was able to spawn 40,000 threads before running out of memory.

Now let’s try spawning virtual ones. To create a new virtual thread, we have to use Thread.ofVirtual().start(runnable):

Thread.ofVirtual().start(() -> {
    try {
        Thread.sleep(1000);
    } catch (Exception e) {
        // deal with e
    }
});

This code works fine; I was able to spawn more than 20 million of these threads on my machine. This is expected because user-mode threads are essentially objects in memory managed by the JVM.

Let’s delve deeper. Typically, when we use threads, we manage them with thread pools. Now let’s define a piece of blocking code that needs to be executed.

static void callService(String taskName) {
    try {
        System.out.println(Thread.currentThread() + " executing " + taskName);

        // Blocking HTTP call; the endpoint sleeps for 2 seconds before responding
        new URL("https://httpstat.us/200?sleep=2000").getContent();

        System.out.println(Thread.currentThread() + " completed " + taskName);
    } catch (Exception e) {
        // deal with e
    }
}

We print the platform (carrier) thread that is running the code, then we make a 2-second HTTP call, and then we print the carrier thread again.

Now let's run this in parallel:

try (ExecutorService executor = Executors.newFixedThreadPool(5)) {
    for (int i = 0; i <= 10; i++) {
        String taskName = "Task" + i;
        executor.execute(() -> callService(taskName));
    }
}

We create a fixed-size thread pool with 5 threads and then submit 11 tasks that call the callService method described earlier. Did you notice something different? The thread pool is inside a try-with-resources block! This is possible because, since Java 19, ExecutorService implements AutoCloseable, and its close() method waits for all submitted tasks to finish, so we no longer have to explicitly call shutdown and awaitTermination when working with thread pools. Anyway, the code above produces the following output:

Thread[#20,pool-1-thread-1,5,main] executing Task0
Thread[#24,pool-1-thread-5,5,main] executing Task4
Thread[#21,pool-1-thread-2,5,main] executing Task1
Thread[#22,pool-1-thread-3,5,main] executing Task2
Thread[#23,pool-1-thread-4,5,main] executing Task3
Thread[#23,pool-1-thread-4,5,main] completed Task3
Thread[#21,pool-1-thread-2,5,main] completed Task1
Thread[#24,pool-1-thread-5,5,main] completed Task4
Thread[#20,pool-1-thread-1,5,main] completed Task0
Thread[#22,pool-1-thread-3,5,main] completed Task2
Thread[#22,pool-1-thread-3,5,main] executing Task9
Thread[#23,pool-1-thread-4,5,main] executing Task5
Thread[#21,pool-1-thread-2,5,main] executing Task6
Thread[#24,pool-1-thread-5,5,main] executing Task7
Thread[#20,pool-1-thread-1,5,main] executing Task8
Thread[#24,pool-1-thread-5,5,main] completed Task7
Thread[#21,pool-1-thread-2,5,main] completed Task6
Thread[#22,pool-1-thread-3,5,main] completed Task9
Thread[#20,pool-1-thread-1,5,main] completed Task8
Thread[#24,pool-1-thread-5,5,main] executing Task10
Thread[#23,pool-1-thread-4,5,main] completed Task5
Thread[#24,pool-1-thread-5,5,main] completed Task10

Note how every task is executed and completed by the same thread (for example, Task4 is both executed and completed by Thread[#24,pool-1-thread-5,5,main]).

This indicates that while the blocking HTTP call was running, the thread simply waited, resuming after 2 seconds.

Now let's convert this to user-mode threads. The code is the same; we just have to use Executors.newVirtualThreadPerTaskExecutor(), which creates a new virtual thread every time a task is submitted to it.

try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (int i = 0; i <= 10; i++) {
        String taskName = "Task" + i;
        executor.execute(() -> callService(taskName));
    }
}

This time we get the following output:

VirtualThread[#29]/runnable@ForkJoinPool-1-worker-8 executing Task7
VirtualThread[#24]/runnable@ForkJoinPool-1-worker-4 executing Task3
VirtualThread[#23]/runnable@ForkJoinPool-1-worker-3 executing Task2
VirtualThread[#22]/runnable@ForkJoinPool-1-worker-2 executing Task1
VirtualThread[#25]/runnable@ForkJoinPool-1-worker-5 executing Task4
VirtualThread[#26]/runnable@ForkJoinPool-1-worker-6 executing Task5
VirtualThread[#30]/runnable@ForkJoinPool-1-worker-9 executing Task8
VirtualThread[#20]/runnable@ForkJoinPool-1-worker-1 executing Task0
VirtualThread[#28]/runnable@ForkJoinPool-1-worker-7 executing Task6
VirtualThread[#32]/runnable@ForkJoinPool-1-worker-7 executing Task10
VirtualThread[#31]/runnable@ForkJoinPool-1-worker-10 executing Task9
VirtualThread[#31]/runnable@ForkJoinPool-1-worker-3 completed Task9
VirtualThread[#24]/runnable@ForkJoinPool-1-worker-4 completed Task3
VirtualThread[#20]/runnable@ForkJoinPool-1-worker-9 completed Task0
VirtualThread[#32]/runnable@ForkJoinPool-1-worker-6 completed Task10
VirtualThread[#25]/runnable@ForkJoinPool-1-worker-8 completed Task4
VirtualThread[#28]/runnable@ForkJoinPool-1-worker-5 completed Task6
VirtualThread[#23]/runnable@ForkJoinPool-1-worker-7 completed Task2
VirtualThread[#29]/runnable@ForkJoinPool-1-worker-1 completed Task7
VirtualThread[#26]/runnable@ForkJoinPool-1-worker-10 completed Task5
VirtualThread[#30]/runnable@ForkJoinPool-1-worker-2 completed Task8
VirtualThread[#22]/runnable@ForkJoinPool-1-worker-3 completed Task1

Notice how a task can now be handled by two different threads: the first one runs the code before the blocking call, and the second one takes over after the blocking call. For example, Task5 is initially executed by ForkJoinPool-1-worker-6 and 2 seconds later is completed by ForkJoinPool-1-worker-10. This demonstrates that the carrier thread isn't blocked. Additionally, we are now running on a fork-join pool whose size defaults to the number of cores and which is managed by the JVM. (It's a dedicated scheduler pool; the same fork-join framework, via the common pool, also powers operations like parallel streams.)

When not to use Virtual Threads

Managing virtual threads comes with a performance cost. For applications with lighter loads, platform threads might outperform virtual threads simply because context switching is minimal when there are few active clients. Moreover, if your application is CPU-intensive, involving a lot of mathematical computations, using green threads doesn’t make much sense since they would occupy the OS thread during calculations anyway. Loom won’t make your applications faster; it will just increase their throughput, so if throughput isn’t an issue, sticking to platform threads might be a better choice. This blog post does a good analysis of how Loom utilizes threads, but be aware that the author criticizes Loom for issues it’s not intended to address (like fair thread scheduling).

Identify your bottlenecks. If you’re using Postgres with a connection pool of 50, spawning more than 50 threads (platform or virtual) won’t make a difference.

For reference, you can check Little’s Law and this excellent article on choosing the optimal number of threads.
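
As a rough guide, Little's Law says L = λ × W: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. For example, at 1,000 requests per second with an average latency of 200 ms, about 1,000 × 0.2 = 200 requests are in flight at any moment, so a pool much larger than 200 threads buys you nothing for that stage.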

Pitfalls

The synchronized keyword pins the carrier thread; your code should rely on locks (like ReentrantLock) instead.

This is a limitation that is actively being worked on. Oracle is not announcing any timelines, but for the time being, we need to avoid using synchronized if we want to get optimal results from virtual threads.
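
A minimal sketch of the replacement (the Counter class is illustrative): a virtual thread that blocks while holding a ReentrantLock can unmount from its carrier, whereas blocking inside a synchronized block pins it.

import java.util.concurrent.locks.ReentrantLock;

class Counter {
    private final ReentrantLock lock = new ReentrantLock();
    private long count;

    void increment() {
        lock.lock();     // parks only the virtual thread, not the carrier
        try {
            count++;     // critical section
        } finally {
            lock.unlock();
        }
    }
}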

Existing thread pools that serve the dual purpose of reusing threads and limiting concurrent access to a given resource

Very commonly, we use fixed thread pools to limit the maximum number of threads that can access a given resource. Since virtual threads should not be pooled, this must be replaced with a semaphore. Yet another example of why we should use things according to their interfaces.
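
A sketch of that replacement, assuming we want at most 50 virtual threads touching the database at once (the queryDatabase runnable is hypothetical):

import java.util.concurrent.Semaphore;

static final Semaphore DB_PERMITS = new Semaphore(50);

static void withDbPermit(Runnable queryDatabase) throws InterruptedException {
    DB_PERMITS.acquire();    // blocks the virtual thread; the carrier stays free
    try {
        queryDatabase.run(); // at most 50 callers are in here at any time
    } finally {
        DB_PERMITS.release();
    }
}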

Spring Boot

Spring Boot 3.2+ supports virtual threads, and enabling them is as simple as setting a flag. I tried this with a few projects — the smaller ones transitioned smoothly. However, when dealing with older codebases with numerous dependencies, it can be quite challenging to make them virtual-thread-friendly. For example, when I switched to virtual threads on one of the higher-load servers I was working on, it led to race conditions that eventually caused a deadlock in the fork-join pool. Debugging was nearly impossible due to the size of the codebase. Sections of the code previously accessed by only 2–3 threads were suddenly accessed by hundreds, requiring much more robust synchronization.

The flag in question is

spring.threads.virtual.enabled=true

And this is the official announcement from Spring regarding virtual thread support.

Structured concurrency

Structured concurrency involves managing the lifecycle of threads. In the current model, there’s no direct way to stop a thread when its result is no longer needed or has become obsolete. The only available method is to send an interrupt signal, which will eventually be consumed by the thread, causing it to stop. This inefficiency can waste both RAM and CPU cycles.

Let’s consider the following situations:

  • All tasks have to succeed; if one fails, there is no point in continuing.
  • At least one task has to succeed; if one succeeds, there is no point in waiting for the rest.
  • Deadlines: if execution takes longer than a certain time, we want to terminate everything.

In each of these situations, certain threads should be stopped immediately once the deciding state is reached. With Loom, we can do this. Currently, aside from using a special thread pool, there is no other way to stop these threads except by manually stopping them. However, this JEP promises to introduce more tools to manage these scenarios.
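
The JEP in question is structured concurrency, a preview API in Java 21 (JEP 453). A minimal sketch of the "all tasks have to succeed" case; findUser and fetchOrder are hypothetical:

import java.util.concurrent.StructuredTaskScope;

try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    var user  = scope.fork(() -> findUser());   // each fork runs in a new virtual thread
    var order = scope.fork(() -> fetchOrder());

    scope.join()            // wait for all subtasks...
         .throwIfFailed();  // ...and cancel the survivors if any subtask failed

    System.out.println(user.get() + " / " + order.get());
}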

These optimizations might seem minor, and indeed they may not be significant for small applications or servers with low load. But when you’re processing millions of requests per day, these optimizations can be game-changers, significantly boosting throughput in certain cases.

Should you consider switching to Loom if you are using reactor frameworks?

Reactor frameworks are effective at addressing application throughput issues. They create abstract tasks (similar to coroutines) and encapsulate everything within them. The reactor runtime then manages these tasks. This approach sounds quite similar to virtual threads, but there are some major drawbacks:

  • The language doesn’t natively support it, resulting in complex code (such as Flux/Mono).
  • The robust Java thread debugging capabilities we discussed earlier are bypassed, replaced by a centralized error handler that provides almost no useful information about what went wrong. As a result, developers have to rely heavily on logging.
  • Once you adopt a reactor-based approach, it’s very difficult to revert to traditional threading; you might end up having to rewrite everything from scratch.
  • Brian Goetz suggests that Loom could render WebFlux obsolete. While you shouldn’t blindly accept his opinion, it’s worth considering his perspective.

Personally, I don’t like reactors (and I’m not a fan of the actor model either, even though it’s easier to understand and performs better). I like sequential code and the thread-per-request model. They are more readable and leverage the full potential of the Java language. Frameworks like Reactor compromise this simplicity, and you would need a compelling reason to use them in a post-Loom world.
