Stories by Sakshem on Medium

The Hidden Cost of Virtual Threads: When Your Performance Gains Break Your Downstream Services

Sakshem — Wed, 24 Dec 2025 07:02:25 GMT

Java 21 brought us virtual threads, and they delivered exactly what they promised. Our services could suddenly handle thousands of concurrent requests with minimal memory overhead. The incoming request queue that used to pile up during peak hours? Practically empty. Response times? Consistently low. It felt like a free lunch.

Until it was not.

What Are Virtual Threads Anyway?

Traditional Java threads (platform threads) are mapped 1:1 to operating system threads. Creating thousands of them is expensive: each one reserves around 1 MB of stack memory by default, and context switching between them is costly. This is why servlet containers like Tomcat default to a maximum of 200 worker threads.

Virtual threads flip this model. They are managed by the JVM, not the OS. They are incredibly lightweight (a few KB each), and the JVM can create millions of them. When a virtual thread blocks on I/O, the JVM parks it and reuses the underlying carrier thread (which runs on actual CPU cores) for other work.
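To make the parking behaviour concrete, here is a minimal sketch (Java 21 or later) that starts one virtual thread per task and lets each one block briefly. The class name, task count, and sleep duration are made up for illustration and are not taken from our services.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; requires Java 21 or later.
final class VirtualThreadDemo {

    // Runs `tasks` blocking tasks, each on its own virtual thread,
    // and returns how many of them finished.
    static int runBlockingTasks(int tasks, Duration blockTime) {
        AtomicInteger completed = new AtomicInteger();
        // Leaving the try block calls close(), which waits for all submitted tasks.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    try {
                        // Blocking here parks the virtual thread and frees its carrier.
                        Thread.sleep(blockTime);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    completed.incrementAndGet();
                });
            }
        }
        return completed.get();
    }

    public static void main(String[] args) {
        int done = runBlockingTasks(10_000, Duration.ofMillis(50));
        System.out.println("completed tasks: " + done);
    }
}
```

Running the same loop against a fixed pool of 200 platform threads would process the tasks in batches of 200, which is the queueing behaviour the rest of this post keeps coming back to.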

In Spring Boot 3.2+, enabling virtual threads is embarrassingly simple:

spring:
  threads:
    virtual:
      enabled: true

That single property transforms your blocking servlet container into something that behaves almost like a reactive framework while keeping your familiar imperative code.

The Success That Became a Problem

After upgrading our entry point service to Java 21 with virtual threads, everything looked great. During peak load, our observability metrics showed the request queue size at zero. Every incoming request was immediately assigned to a virtual thread and processed. No more bottleneck at the thread pool level.

The peak request queue size dropped from 40–80 requests down to zero consistently. We could handle thousands of concurrent requests without breaking a sweat.

Then users started seeing “Something went wrong” errors. Our gateway was tripping circuit breakers. Something was very wrong.

The Investigation

Digging into the logs revealed a troubling pattern. The service under investigation was making HTTP calls to a downstream service for data processing. The response times for these calls were catastrophic:

Downstream API response times:

Before the fix: Some requests taking 60–80 seconds to complete during peak traffic
After the fix: ~2ms average response time
Total calls in 12 hours: 2.94 million requests
Request rate: 68–106 requests per second sustained

What made this particularly interesting is that the downstream service does not perform any blocking I/O operations. It is pure CPU work: data transformation, computation, and processing. Yet it was taking over a minute to respond.

The service under investigation was configured with these HTTP client settings:

webclient.connection.timeout=20000
webclient.read.timeout=30000
webclient.maxConnection=200
webclient.pendingAcquireTimeout=30000
webclient.pendingAcquireMaxCount=500

Many requests were failing with ReadTimeoutException after waiting 30 seconds. The downstream service, still running on traditional platform threads, simply could not keep up.

Here is what was happening:

The current service could now accept 3000+ concurrent requests because virtual threads are cheap. But two bottlenecks emerged:

HTTP Connection Pool: Only 200 outbound connections available. Requests compete for these connections, with many waiting up to 30 seconds (pendingAcquireTimeout).
Downstream Thread Pool: The downstream service running on platform threads could only actively process around 200 requests simultaneously. Everything else queued up inside the servlet container, waiting for a free worker thread.

The combination meant requests would wait for a connection, then wait again for the downstream service to have an available thread, resulting in 60–80 second response times.
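One way to read the 60–80 second figure is as stacked timeouts. The sketch below is plain arithmetic on the client settings listed earlier, assuming the waits happen one after another. It is not a measured breakdown, and the class and method names are hypothetical.

```java
import java.time.Duration;

// Hypothetical helper: adds up the waits a single outbound call can hit in sequence.
final class TimeoutBudget {

    static Duration worstCase(Duration pendingAcquire, Duration connect, Duration read) {
        return pendingAcquire.plus(connect).plus(read);
    }

    public static void main(String[] args) {
        Duration pendingAcquire = Duration.ofMillis(30_000); // webclient.pendingAcquireTimeout
        Duration connect = Duration.ofMillis(20_000);        // webclient.connection.timeout
        Duration read = Duration.ofMillis(30_000);           // webclient.read.timeout

        // Reusing a pooled connection skips the connect step.
        System.out.println(worstCase(pendingAcquire, Duration.ZERO, read).toSeconds() + "s");
        // Opening a fresh connection can add up to the full connect timeout.
        System.out.println(worstCase(pendingAcquire, connect, read).toSeconds() + "s");
    }
}
```

With these settings the two cases come to 60 and 80 seconds, which lines up with the range we observed, although the match may be partly coincidental.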

The Surprising Part: No Blocking I/O Required

The downstream service was doing pure computational work. No database calls, no external API calls, just CPU-bound operations. Yet it still suffered from thread starvation.

This reveals an important insight: virtual threads help most with I/O-bound workloads where threads spend time waiting. But when your downstream service is CPU-bound or simply cannot process requests fast enough, virtual threads in the current service just amplify the problem by removing the natural rate limiting that platform thread pools provided.
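To picture the rate limiting that a fixed thread pool provides implicitly, here is a minimal sketch that puts an explicit cap on in-flight downstream calls using a semaphore. This is not the fix described in this post; the class name, the limit, and the simulated call are assumptions made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical limiter: caps how many downstream calls run at the same time.
final class BoundedDownstreamCalls {
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peakInFlight = new AtomicInteger();

    BoundedDownstreamCalls(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Callers beyond the limit park here instead of hitting the downstream service.
    void call(Runnable downstreamCall) throws InterruptedException {
        permits.acquire();
        try {
            int now = inFlight.incrementAndGet();
            peakInFlight.accumulateAndGet(now, Math::max);
            downstreamCall.run();
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    int peakInFlight() {
        return peakInFlight.get();
    }

    // Fires `requests` virtual threads at the limiter and reports the peak concurrency seen.
    static int simulate(int requests, int maxConcurrent) {
        BoundedDownstreamCalls limiter = new BoundedDownstreamCalls(maxConcurrent);
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < requests; i++) {
                executor.submit(() -> {
                    try {
                        limiter.call(() -> sleepQuietly(5)); // stand-in for the HTTP call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }
        return limiter.peakInFlight();
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("peak in-flight calls: " + simulate(3_000, 200));
    }
}
```

However many virtual threads the caller spawns, the downstream service never sees more than the configured number of simultaneous calls. A platform thread pool of 200 gave us that property for free, and virtual threads took it away.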

The Solution

The fix is straightforward once you understand the problem: upgrade your entire call chain.

We enabled virtual threads on the downstream service:

spring:
  application:
    name: processing-service
  threads:
    virtual:
      enabled: true

server:
  tomcat:
    threads:
      max: 200
    max-connections: 10000
    accept-count: 1000
    connection-timeout: 20000

The critical changes here:

max-connections: 10000 allows the server to accept many more concurrent connections
accept-count: 1000 sets the queue size for connections waiting to be accepted
Virtual threads enabled means the threads.max: 200 limit no longer applies. Each accepted connection gets a lightweight virtual thread immediately, and the JVM manages execution across available CPU cores.

After the upgrade, the results were dramatic:

After both services on virtual threads:

Average response time: 2.035ms
p99 response time: ~5.3ms
Peak request queue size: 0 requests (both services)
Total requests handled in 12 hours: 2.94 million without breaking a sweat
Sustained rate: 68–106 requests per second with room to spare

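The downstream service went from worst-case response times of 60–80 seconds to an average of about 2 milliseconds. Comparing those two figures gives a ratio of roughly 30,000x to 40,000x, although it sets a peak-traffic worst case against an average.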

How to Detect This Before Production Breaks

If you have proper observability, look for these metrics:

Request Queue Size: Track how many requests are waiting for a thread. With virtual threads, this should be near zero. If your downstream service shows high queue sizes while your current service shows zero, you have an imbalance.
Response Time Distribution: Watch for bimodal distributions where some requests are fast but others time out. This indicates intermittent resource exhaustion.
HTTP Client Pending Acquires: Monitor how many requests are waiting for an HTTP connection from the pool. This can reveal connection pool exhaustion even before downstream services show problems.
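As a rough illustration of the response time check, the sketch below computes nearest-rank percentiles over a batch of latency samples. The sample values are invented to mimic a bimodal pattern and are not our production data. In practice your metrics backend would compute these for you.

```java
import java.util.Arrays;

// Hypothetical helper for eyeballing a latency distribution.
final class LatencyPercentiles {

    // Nearest-rank percentile: the smallest sample with at least p percent
    // of all samples at or below it.
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Invented samples: mostly fast, with a few calls hitting a 30 second timeout.
        long[] samples = {2, 2, 3, 2, 30_000, 2, 3, 2, 2, 30_000};
        System.out.println("p50 = " + percentile(samples, 50) + " ms");
        System.out.println("p99 = " + percentile(samples, 99) + " ms");
    }
}
```

A median of a few milliseconds next to a p99 sitting at the timeout value is the kind of gap worth alerting on, because an average alone would hide it.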

These metrics were invaluable. We could see the request queue at zero in the service under investigation but did not have visibility into the downstream service initially. Once we added the same observability, the problem became obvious.

Bonus: Async Logging Makes a Real Difference

While investigating performance, we noticed another bottleneck: synchronous logging. Every log statement blocks the thread until the log is written. With thousands of concurrent virtual threads all trying to log, this becomes a serialization point.

The fix is to use asynchronous logging with LMAX Disruptor, a high performance inter-thread messaging library and the one behind Log4j2's async loggers.

LMAX Disruptor uses a ring buffer and lock-free algorithms to achieve extremely high throughput. Log messages are written to the buffer immediately (non-blocking), and a background thread handles the actual I/O. In our testing, this contributed to keeping response times consistently low during high load scenarios.
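The hand-off idea described above can be sketched with the standard library alone. To be clear, this is not the Disruptor: it swaps in a lock-based ArrayBlockingQueue for the lock-free ring buffer, and the class name and drop-when-full policy are assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical async logger: callers enqueue, one background thread does the I/O.
final class AsyncLogSketch implements AutoCloseable {
    // Sentinel compared by identity, so a real message with the same text is not mistaken for it.
    private static final String STOP = new String("__stop__");

    private final BlockingQueue<String> buffer;
    private final Thread writer;

    AsyncLogSketch(int capacity, Consumer<String> sink) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
        this.writer = new Thread(() -> drain(sink), "log-writer");
        this.writer.setDaemon(true);
        this.writer.start();
    }

    // Never blocks the caller: returns false (message dropped) when the buffer is full.
    boolean log(String message) {
        return buffer.offer(message);
    }

    // Runs on the writer thread and performs the slow write outside the request path.
    private void drain(Consumer<String> sink) {
        try {
            while (true) {
                String message = buffer.take();
                if (message == STOP) {
                    return;
                }
                sink.accept(message);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Queues the stop marker behind pending messages, then waits for the writer to finish.
    @Override
    public void close() {
        try {
            buffer.put(STOP);
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        try (AsyncLogSketch logger = new AsyncLogSketch(1024, System.out::println)) {
            logger.log("request handled");
        }
    }
}
```

The design choice that matters is in log(): request threads only pay for an enqueue, and what happens when the buffer fills up (drop, block, or fall back to synchronous writes) is a policy decision you have to make explicitly.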

Key Takeaways

Virtual threads work exactly as advertised. They remove thread limitations at your service boundary. That is both the benefit and the danger.
Even CPU-bound downstream services suffer. You do not need blocking I/O for virtual threads to create problems downstream. Any service that cannot process requests as fast as they arrive will buckle under the load.
Your service is only as fast as its slowest dependency. If you upgrade one service to handle massive concurrency, ensure the entire call chain can keep up.
Monitor the right metrics. Request queue sizes and response time distributions tell you where the bottleneck actually is. Our production dashboard shows peak queue size consistently at zero once both services had virtual threads.
Roll out strategically. Start with services that have fewer downstream dependencies, then work your way up the call chain. Or go top-down and upgrade all dependencies first.
Do not forget logging. With high concurrency, synchronous logging becomes a surprising bottleneck. Async logging with LMAX Disruptor is a worthwhile optimization.

Virtual threads are a powerful addition to Java. Our production metrics prove they work: zero request queues, sustained high throughput, 2.94 million requests in 12 hours with 2ms response times. Just remember that with great concurrency comes great responsibility for your downstream services.

Building a Reusable WebClient in Spring Boot: The Smarter Way to Handle HTTP Calls

Sakshem — Sat, 20 Dec 2025 14:59:36 GMT

Why WebClient?

If you’re building microservices with Spring Boot, you’ll eventually need to call other services. WebClient is Spring’s non-blocking HTTP client, introduced with WebFlux as the reactive alternative to the older RestTemplate. It plays nicely with reactive programming and handles concurrent requests efficiently without blocking threads. Simply put, it’s built for the way we write applications today.

The Problem with the Traditional Approach

When developers first start using WebClient, they typically create separate client classes for each downstream service they need to call. Your codebase ends up looking something like this:

@Component
public class UserServiceClient {
    private final WebClient webClient;
    
    public UserServiceClient() {
        this.webClient = WebClient.builder()
            .baseUrl("http://user-service")
            .build();
    }
    
    public User getUser(String id) {
        return webClient.get().uri("/users/" + id).retrieve().bodyToMono(User.class).block();
    }
}

@Component  
public class OrderServiceClient {
    private final WebClient webClient;
    
    public OrderServiceClient() {
        this.webClient = WebClient.builder()
            .baseUrl("http://order-service")
            .build();
    }
    
    public Order getOrder(String id) {
        return webClient.get().uri("/orders/" + id).retrieve().bodyToMono(Order.class).block();
    }
}

See the pattern? Every client creates its own WebClient instance with its own configuration. Now imagine you have ten downstream services. That’s ten places where you’ve duplicated timeout settings, SSL configuration, logging filters, and connection pool settings.

When Spring releases a breaking change or you need to add request logging across all clients, you’re updating ten files. This is a maintenance nightmare waiting to happen.

A Unified Approach

The solution is straightforward: create one WebClient bean with all your configurations, and one common client class that handles the actual HTTP operations.

First, you configure WebClient once:

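import io.netty.channel.ChannelOption;
import java.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.ExchangeFilterFunction;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

@Configuration
public class WebClientConfig {
    private static final Logger log = LoggerFactory.getLogger(WebClientConfig.class);

    // One pool configuration shared by every outbound call
    @Bean
    public ConnectionProvider connectionProvider() {
        return ConnectionProvider.builder("http-pool")
                .maxConnections(100)
                .pendingAcquireMaxCount(120)
                .pendingAcquireTimeout(Duration.ofMillis(16000))
                .maxIdleTime(Duration.ofMillis(150000))
                .evictInBackground(Duration.ofMillis(30000))
                .build();
    }

    @Bean
    public WebClient webClient(WebClient.Builder builder,
          ConnectionProvider connectionProvider) {

        HttpClient httpClient = HttpClient.create(connectionProvider)
                .responseTimeout(Duration.ofMillis(15000))
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 5000);

        return builder
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .filter(logRequest())
                .filter(logResponse())
                .build();
    }

    // Minimal logging filters, shown as an assumed implementation; adapt to your needs
    private static ExchangeFilterFunction logRequest() {
        return (request, next) -> {
            log.debug("Request: {} {}", request.method(), request.url());
            return next.exchange(request);
        };
    }

    private static ExchangeFilterFunction logResponse() {
        return (request, next) -> next.exchange(request)
                .doOnNext(response -> log.debug("Response status: {}", response.statusCode()));
    }
}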

Then, one common client handles all HTTP operations:

import org.springframework.http.HttpHeaders;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Component;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

@Component
public class CommonWebClient {

    private final WebClient webClient;

    public CommonWebClient(WebClient webClient) {
        this.webClient = webClient;
    }

    // Blocking POST: waits for the response body and wraps it in a 200 OK entity
    public <T> ResponseEntity<T> postRequest(String baseUrl, String endPoint,
            HttpHeaders headers, Object requestPayload, Class<T> responseType) {
        T response = webClient.post()
                .uri(baseUrl.concat(endPoint))
                .headers(h -> h.addAll(headers))
                .bodyValue(requestPayload)
                .retrieve()
                .bodyToMono(responseType)
                .block();
        return ResponseEntity.ok(response);
    }

    // Blocking GET: same flow as postRequest, without a request body
    public <T> ResponseEntity<T> getRequest(String baseUrl, String endPoint,
            HttpHeaders headers, Class<T> responseType) {
        T response = webClient.get()
                .uri(baseUrl.concat(endPoint))
                .headers(h -> h.addAll(headers))
                .retrieve()
                .bodyToMono(responseType)
                .block();
        return ResponseEntity.ok(response);
    }

    // Non-blocking POST: returns the Mono so reactive callers can compose it
    public <T> Mono<T> postRequestMono(String baseUrl, String endPoint,
          HttpHeaders headers, Object requestPayload, Class<T> responseType) {
        return webClient.post()
                .uri(baseUrl.concat(endPoint))
                .headers(h -> h.addAll(headers))
                .bodyValue(requestPayload)
                .retrieve()
                .bodyToMono(responseType);
    }
}

Now any service can simply inject CommonWebClient and make HTTP calls without worrying about configuration details.

What Happens Behind the Scenes

Connection Pooling

This is where the real magic happens. When you create separate WebClient instances without shared configuration, each one typically maintains its own connection pool. That means if you have ten clients calling ten services, you could end up with ten separate pools, each opening and managing their own TCP connections.

With the unified approach using a shared ConnectionProvider, all your HTTP calls draw from the same pool. The connection provider manages:

maxConnections: The maximum number of connections the pool can hold
pendingAcquireMaxCount: How many requests can wait in line for a connection
maxIdleTime: How long unused connections stay alive before being closed
evictInBackground: Periodic cleanup of stale connections

Under the hood, Reactor Netty (which powers WebClient) uses an event loop model. Instead of one thread per connection, a small number of threads handle many connections through non-blocking I/O. Once a response has been fully read, the connection returns to the pool for reuse rather than being closed and reopened.

Request Flow

When you call postRequest():

WebClient asks the connection provider for an available connection
If one exists in the pool, it’s reused immediately
If not, and the pool isn’t full, a new connection is created
If the pool is full, the request waits (up to pendingAcquireTimeout)
After the response is received, the connection returns to the pool

This reuse eliminates the overhead of TCP handshakes and SSL negotiations for every request.
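The acquire and release steps above can be modelled with a toy pool. This is a simplified sketch of the idea and not how Reactor Netty implements it: the class, its fields, and the exception used for the timeout are all hypothetical, and a "connection" here is just an object with an id.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical toy pool that mirrors the reuse / create / wait decision.
final class ToyConnectionPool {

    static final class Connection {
        final int id;
        Connection(int id) { this.id = id; }
    }

    private final Semaphore slots; // one permit per connection that may be in use
    private final ConcurrentLinkedDeque<Connection> idle = new ConcurrentLinkedDeque<>();
    private final AtomicInteger created = new AtomicInteger();

    ToyConnectionPool(int maxConnections) {
        this.slots = new Semaphore(maxConnections, true);
    }

    Connection acquire(long pendingAcquireTimeoutMillis) {
        try {
            // Pool full: wait in line, but only up to the pending acquire timeout.
            if (!slots.tryAcquire(pendingAcquireTimeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("pending acquire timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a connection", e);
        }
        // Reuse an idle connection if one exists, otherwise open a new one.
        Connection reused = idle.pollFirst();
        return reused != null ? reused : new Connection(created.incrementAndGet());
    }

    // After the response is received, the connection goes back to the pool.
    void release(Connection connection) {
        idle.addFirst(connection);
        slots.release();
    }

    int createdCount() {
        return created.get();
    }

    public static void main(String[] args) {
        ToyConnectionPool pool = new ToyConnectionPool(2);
        Connection first = pool.acquire(100);
        pool.release(first);
        Connection second = pool.acquire(100);
        System.out.println("reused: " + (first == second));
    }
}
```

Because the semaphore permit is taken before a connection is handed out, the number of connections ever created cannot exceed the configured maximum, and a caller that finds the pool full fails after the timeout instead of waiting forever.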

Benefits of This Approach

Single source of truth: Timeouts, SSL settings, logging, and error handling live in one place.

Efficient resource usage: Shared connection pool means fewer open connections and better memory utilization.

Easier debugging: Centralized logging filters capture all outgoing requests and incoming responses.

Simpler upgrades: Library updates or configuration changes happen once.

Consistent behavior: Every HTTP call follows the same patterns for error handling and logging.

The Tradeoffs

Nothing comes free. Here are things to think about:

One configuration for all services: If service A needs a 30-second timeout but service B needs 5 seconds, you need to handle this. You can pass custom timeout values per request, or create multiple WebClient beans for different timeout profiles.

Shared pool exhaustion: If one slow downstream service holds connections, it affects all other calls. Monitor your pool metrics and set sensible limits.

Blocking calls in reactive context: The example uses .block() which waits for the response. This is fine in traditional servlet applications but problematic in fully reactive stacks. Consider returning Mono directly when working with WebFlux.

Mitigating the Drawbacks

You can still use this approach while handling edge cases:

Create a second WebClient bean with different timeouts for specific high-latency services
Add per-request timeout overrides using .timeout() on the Mono
Expose both blocking and non-blocking methods (the example includes postRequestMono for reactive use)
Use circuit breakers (like Resilience4j) to prevent one failing service from exhausting your pool
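To show what the circuit breaker item buys you, here is a deliberately small sketch of the mechanism. It is not Resilience4j and does not reflect its API; the class name, thresholds, and injectable clock are assumptions chosen to keep the example self-contained.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical toy breaker: opens after N consecutive failures, retries after a cool-down.
final class ToyCircuitBreaker {
    private final int failureThreshold;
    private final long coolDownMillis;
    private final LongSupplier clockMillis; // injectable so tests can control time
    private int consecutiveFailures = 0;
    private long openedAt = -1;             // -1 means the breaker is closed

    ToyCircuitBreaker(int failureThreshold, long coolDownMillis, LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
        this.clockMillis = clockMillis;
    }

    // Runs the downstream call unless the breaker is open; returns the fallback on
    // rejection or failure. The call itself runs outside any lock.
    <T> T call(Supplier<T> downstream, T fallback) {
        if (!allowRequest()) {
            return fallback; // fail fast without taking a pooled connection
        }
        try {
            T result = downstream.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback;
        }
    }

    private synchronized boolean allowRequest() {
        return openedAt < 0 || clockMillis.getAsLong() - openedAt >= coolDownMillis;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void recordFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            openedAt = clockMillis.getAsLong(); // open, or re-open after a failed trial
        }
    }

    public static void main(String[] args) {
        ToyCircuitBreaker breaker = new ToyCircuitBreaker(3, 5_000, System::currentTimeMillis);
        String body = breaker.call(() -> "response from downstream", "cached fallback");
        System.out.println(body);
    }
}
```

While the breaker is open, calls to the failing service return immediately instead of holding a pooled connection until the timeout, which is what keeps one slow dependency from starving the others.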

Things to Keep in Mind

When implementing this pattern:

Set realistic pool sizes: Too small and requests queue up. Too large and you waste resources. Start with defaults and tune based on actual load.
Configure idle timeout wisely: Keep it shorter than any proxy or load balancer timeout in your network path. Connections closed by intermediaries cause unexpected failures.
Add proper error handling: The common client should handle WebClientResponseException and other failures gracefully.
Include correlation IDs: Pass trace IDs through context for distributed tracing. This helps debug issues across service boundaries.
Monitor your pool: Reactor Netty exposes metrics. Watch for pending acquisitions and pool exhaustion.

The unified WebClient approach isn’t revolutionary, but it’s a pattern that pays dividends as your microservices architecture grows. You write less code, maintain less configuration, and spend more time on actual features.