Debugging & Tuning a Logback Bottleneck in a High-Throughput Java Application

Sridhar Vinodarao
Sep 29

This post walks through one of our journeys troubleshooting a performance issue in our application, and along the way explains how the Logback library manages its event queue.

A few words to introduce our Service Ecosystem…

Push Marketing is one of the primary marketing channels at Groupon: we send over 350 million emails and 170 million push notifications every day to millions of customers. The User Subscription Service (also known internally as the Global Subscription Service, or GSS) is one of the cogs in the wheel of the Push Marketing Platform; it stores users’ location and product subscription information, and we run a massive daily computation over it to decide when to send what to whom. Today, I am going to walk you through one of the many challenges we faced working on systems of this scale, and I hope you can pick up a clue or two for when you face a similar challenge.

On the technical side, GSS is written in Java using the Akka Actor framework and runs on Play 2.3.9, serving the usual HTTP methods (GET, POST, PUT & DELETE). It uses Cassandra 2.0.14 as its backend to store users’ subscription data, and logback-steno as its logging framework.

The Errors!!

[Graph: daily client-side errors]

What you see here is a daily recurrence of approximately 52k errors experienced by the Campaign Management (CM) Service when it calls our Subscription Service to fetch subscriptions for millions of users. These errors are a mix of timeout and connection reset (aka ECONNRESET) errors.

Now, the troubleshooting begins…

First things first, we looked at our service logs and noticed very high spikes in response times that matched the exact time window in which the client (the CM Service) experienced connection timeouts.

Host-wise p99 response time spiked as high as 25k ms.

[Graph: host-wise p99 response times]

We analysed further, as we still needed an explanation for the connection reset errors.

Load Balancer graphs

[Load Balancer connection graphs]

Here is an explanation for the anomalies you see in the Load Balancer (LB) graphs. When our service response times were high, the existing connections between the LB and our service hosts were held for a very long time (up to ~30 seconds) by those slow-running requests. To keep serving the constant stream of incoming requests from upstream services while facing high latencies from the backend (our service) hosts, the LB spawned a very large number of new connections (~3.5k) in a very short time. The LB also reset some of the client connections whose requests had been waiting on responses for a very long time, which caused the ECONNRESET errors at the upstream service.

Now that we have an explanation for both the timeout and connection reset errors seen on the client side, let’s investigate the root cause of the high latencies in our service.

JVM profiling

When I went over thread dumps taken during the window of high latency spikes, looking for threads in the WAITING state, I observed that most of the Akka Actor threads were in the WAITING (Park) state. But, as you can see in the JVisualVM screenshot below, these threads are in the Park state most of the time even when the application is healthy, with no response time spikes at all.

[JVisualVM screenshot: Akka Actor thread states]

Identify the Thread WAIT use cases

So, I did a deeper analysis of the thread dumps and found that the Akka Actor threads end up in the Park state in two different cases:

  1. When Actor threads are idle, waiting for work.
  2. When an application method somewhere down the stack invokes the LockSupport.park method.

Analysis of Thread dumps

Akka Actor thread waiting for work

[Thread dump excerpt]

Below are some of the application call paths that caused an Actor thread to move into the Park state.

Cassandra call

[Thread dump excerpt]

Logback file appender waiting to write to log

[Thread dump excerpt]

I wrote a small shell script that captures a thread dump every second and stores it in a file named with the current timestamp, so that we can analyse the thread dumps chronologically.

#!/bin/bash
# Capture one thread dump per second, each stored in a file named with the current timestamp.
# Usage: pass the number of dumps to capture as the first argument.
i=1
while [[ $i -le $1 ]]; do
  thread_dump_file=$(date "+%Y_%m_%d_%H_%M_%S")
  echo "$thread_dump_file"
  pid=$(ps -ef | grep service_keyword | grep -v grep | awk '{print $2}')
  /usr/local/bin/jcmd "$pid" Thread.print > "/home/user/thread_dumps/$thread_dump_file" 2>&1
  sleep 1
  i=$((i+1))
done

Grep thread dump files…

to count the number of times various methods caused the WAITING state

[Screenshot: grep counts of parked threads across thread dump files]

As shown in the above screenshot, it is a Logback method that kept all the Application Threads (300, the number of Akka Actor threads configured for our Play application) in the WAITING state for about 47 seconds (from 2020_05_17_18_08_51 to 2020_05_17_18_09_38).

Now, let’s dive into the internal workings of Logback to understand why it puts all the Application Threads into the WAITING (Park) state.

Logback High Level Design

[Diagram: Logback AsyncAppender high-level design]

When Application Threads invoke, for example, the Logger.info() method, the call eventually reaches AsyncAppender’s append() method. This method publishes the incoming log event to an Event Queue, which is a BlockingQueue (java.util.concurrent.BlockingQueue). Logback has a separate Worker Thread that consumes events from this queue and synchronously writes (appends) them to the log file on disk. The queue absorbs bursts in which Application Threads produce log events faster than the actual writes to the log file can happen. Its size can be set using the queueSize parameter.
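To make the moving parts concrete, here is a minimal sketch of how an AsyncAppender is typically wired up in a Logback configuration file (the appender names are illustrative, not taken from our config):

<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
  <!-- size of the in-memory event queue; 256 is also the default when omitted -->
  <queueSize>256</queueSize>
  <appender-ref ref="FILE" />   <!-- "FILE" is a hypothetical file appender defined elsewhere -->
</appender>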

If you are interested, you can read more about each individual property available for AsyncAppender on the official Logback website.

This is our current Logging configuration

[Screenshot: our logger.xml configuration]
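The screenshot of our logger.xml is not reproduced here, but based on the settings discussed in this post (FixedWindowRollingPolicy, a 4 GB maxFileSize, gzip rotation and a default-sized async queue), the relevant part looked roughly like the sketch below. Paths, appender names, the rolling window and the encoder are illustrative; the real configuration uses logback-steno, which is not shown here.

<configuration>
  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/var/log/gss/service.log</file>                          <!-- illustrative path -->
    <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
      <!-- the .gz suffix makes Logback gzip each rolled file -->
      <fileNamePattern>/var/log/gss/service.%i.log.gz</fileNamePattern>
      <minIndex>1</minIndex>
      <maxIndex>5</maxIndex>                                       <!-- illustrative window -->
    </rollingPolicy>
    <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
      <maxFileSize>4GB</maxFileSize>                               <!-- value from this post -->
    </triggeringPolicy>
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
    <!-- queueSize is not set, so the default of 256 applies -->
    <appender-ref ref="FILE" />
  </appender>

  <root level="INFO">
    <appender-ref ref="ASYNC" />
  </root>
</configuration>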

What is SizeBasedTriggeringPolicy?

It ensures that the log file gets rolled over whenever its size reaches the configured limit (4 GB in our case).

[Code screenshot: RollingFileAppender.subAppend()]

The above code snippet shows the implementation of the subAppend method in the ch.qos.logback.core.rolling.RollingFileAppender class. Every time, before appending a message to the log file, it checks whether it is time to roll the file over. If so, it initiates the rolling process and then appends the message to the log file.

How does the Log Rolling Process work?

[Code screenshot: RollingFileAppender.rollover()]

The above code shows the implementation of the rollover method from the same RollingFileAppender class. It first acquires the lock on the OutputStream of the active log file, closes the OutputStream, runs the rolling process, re-opens the file and then releases the lock.

File Compression during Rollover

If your fileNamePattern ends with .gz or .zip, the rollover process compresses the log file during rotation. As of Logback version 1.2.3, FixedWindowRollingPolicy carries out this compression synchronously, whereas TimeBasedRollingPolicy and SizeAndTimeBasedRollingPolicy (which extends it) perform the compression asynchronously in a separate Thread.
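For example, whether a rolled file gets compressed at all is driven purely by the suffix of the pattern (file names here are illustrative):

<fileNamePattern>service.%i.log.gz</fileNamePattern>   <!-- rolled files are gzip-compressed -->
<fileNamePattern>service.%i.log</fileNamePattern>      <!-- rolled files are left uncompressed -->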

With that introduction to Logback in place, let’s talk about our bottleneck…

Bottleneck explained

As shown previously, maxFileSize was set to 4 GB for one of our log files, and no explicit queueSize was configured in our logger.xml, so the default queue size of 256 applied. With these settings, every time the log file reached 4 GB, the rolling process kicked in.

However, due to the large (4 GB) file size, the gzip compression took anywhere between 55 and 69 seconds to complete. Since we used FixedWindowRollingPolicy in our application, this compression took place synchronously in the same Thread that had locked the OutputStream, and that Thread does not release the lock until the compression completes. This blocked the Worker Thread when it tried to write to the same OutputStream. With the Worker Thread blocked and no longer consuming from the Event Queue, the queue soon filled up to its full capacity of 256 events. Since this Event Queue is a BlockingQueue, it parks every Thread that calls its put() method and does not signal those Threads until the Worker Thread consumes at least one element and space becomes available for new events. So, at some point, all of the Application Threads were stuck in the log append call.

This is a known issue with RollingFileAppender, but its effects become far more evident in high-throughput applications, especially when the log file size is huge.

Tuning Logback Settings

We tried to work around this bottleneck using two different approaches:

  1. Keeping the same FixedWindowRollingPolicy, which performs compression synchronously.
  2. Switching to SizeAndTimeBasedRollingPolicy, which performs compression asynchronously in a separate Thread.

I will talk about both approaches below.

FixedWindowRollingPolicy

The first thing I noticed was that we were running on an older version of Logback (1.0.2), and that the latest version (1.2.3) comes with a few additional configuration settings that are not available in the older version. neverBlock is one such new setting that I couldn’t wait to try out.

[Screenshots: configuration with neverBlock enabled and load test results]
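Enabling it is a one-line change on the async appender. Below is a minimal sketch with illustrative appender names, not a copy of our config:

<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
  <!-- skip log events instead of blocking callers when the queue is full -->
  <neverBlock>true</neverBlock>
  <appender-ref ref="FILE" />
</appender>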

Here, the neverBlock setting ensures that Application Threads are not blocked when the event queue is full: instead of blocking the caller, Logback simply skips the offered log event. The test results showed a complete recovery from all timeout errors; however, we did notice the loss of a very small number of log events. Even though that loss is only a tiny fraction, these logs are important to us because they are forwarded down a data pipeline for analytics.

[Code screenshot: AsyncAppenderBase append/put methods]

The above code snippet from the AsyncAppenderBase class shows how the neverBlock flag is used to avoid blocking the Application Threads: it uses the non-blocking offer() method of BlockingQueue instead of the blocking put() method.

[Screenshot: configuration with increased queueSize]
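In sketch form, that change amounts to dropping neverBlock and setting an explicit queueSize on the async appender (names illustrative):

<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
  <!-- larger in-memory buffer for log events; the default is 256 -->
  <queueSize>4096</queueSize>
  <appender-ref ref="FILE" />
</appender>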

So (as shown above), I removed the neverBlock setting and used the queueSize property instead, and load tested the application with queueSize values from 256 up to 4096 (memory wasn’t a concern, since the JVM heap allocated to our application could easily accommodate these queue sizes). Despite the larger queue, the problem wasn’t solved: Application Threads were still getting blocked. This also tells us something about the logging frequency of our application: it prints more than 4096 log events in about 8 seconds (roughly 500+ events per second), which leaves the Application Threads waiting for the remaining 47 seconds of the total ~55 seconds of compression time.

As this setting also did not help, we decided to find the right balance among the logging configuration parameters.

[Screenshot: tuned logger.xml configuration]
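In sketch form (illustrative names and paths, encoder omitted), the tuned configuration combines a smaller maxFileSize with a larger queue:

<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <file>/var/log/gss/service.log</file>
  <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
    <fileNamePattern>/var/log/gss/service.%i.log.gz</fileNamePattern>
    <minIndex>1</minIndex>
    <maxIndex>5</maxIndex>
  </rollingPolicy>
  <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
    <maxFileSize>1GB</maxFileSize>     <!-- reduced from 4 GB -->
  </triggeringPolicy>
  <!-- encoder omitted for brevity (same as before) -->
</appender>

<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
  <queueSize>512</queueSize>           <!-- doubled from the default of 256 -->
  <appender-ref ref="FILE" />
</appender>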

This was close to an acceptable parameterisation of the FixedWindowRollingPolicy configuration. I reduced the log file size limit from 4 GB to 1 GB to shorten the compression time, and at the same time doubled the number of events the queue can hold to 512. We ran multiple load tests, each up to 2 hours long. The results showed that while this setting did not eliminate timeout errors completely, it reduced them to 0.6% while ensuring there were no log event losses.

What is improved?

We solved 99.4% of the Problem. Here is why…

After deploying our Logback tuning fix to Production, the latency spikes that previously reached ~25k ms for about 52k requests a day were now contained within our SLA of 1.5k ms for ~99.4% of the previously failing requests. Because of the trade-off we made to avoid losing important logs, 0.6% of those requests still crossed the SLA, though they stayed under 4k ms. In other words, we reduced the overall timeout errors by roughly 99.4% and eliminated 100% of the connection reset errors.

The graph below shows that daily client timeout errors were reduced to 367. There is no graph for connection reset errors, as they went away completely.

[Graph: daily client timeout errors after tuning]

P99 latency spikes across hosts now reached at most 4k ms, and only for the 0.6% of previously failing requests.

[Graph: host-wise p99 response times after tuning]

Please note that if your application hosts are more powerful, the same tuning with FixedWindowRollingPolicy may eliminate 100% of these response time spikes during the compression process. Although I have not benchmarked the compression process itself, I noticed this difference in behaviour when we load tested our application in environments with different capacities.

SizeAndTimeBasedRollingPolicy

While we were able to partially mitigate the problem with FixedWindowRollingPolicy, we found that we could solve it completely by switching to SizeAndTimeBasedRollingPolicy, which performs compression asynchronously. Since this policy eliminates the problem altogether by compressing in a separate Thread, we did not have to tune any further settings for performance; simply switching to this policy and setting the corresponding properties was enough.

This is how our new configuration looks.

[Screenshot: new logger.xml with SizeAndTimeBasedRollingPolicy]
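As before, the screenshot is not reproduced here; the sketch below shows what a SizeAndTimeBasedRollingPolicy setup generally looks like, with illustrative names, paths, sizes and retention rather than our exact logger.xml:

<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <file>/var/log/gss/service.log</file>
  <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
    <!-- rolls daily (%d) and whenever a file reaches maxFileSize (%i); the .gz compression runs asynchronously -->
    <fileNamePattern>/var/log/gss/service.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
    <maxFileSize>1GB</maxFileSize>     <!-- illustrative size -->
    <maxHistory>7</maxHistory>         <!-- illustrative retention -->
  </rollingPolicy>
  <encoder>
    <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
  </encoder>
</appender>

<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
  <queueSize>512</queueSize>           <!-- illustrative; carried over from the earlier tuning -->
  <appender-ref ref="FILE" />
</appender>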

What is improved?

Now that none of the Application Threads are blocked by the rolling process, response times across all hosts stayed below the normal limit of 250 ms throughout the rollover. Consequently, this eliminated 100% of the timeout errors and the connection reset errors as well.

[Graph: response times during rollover after switching policies]

Key Takeaways & Learnings

Though the final Logback tuning changes are very simple, the main takeaways here come from the strategic troubleshooting steps we followed and from deep diving into the inner workings of some key components of our application. Below are a few items worth mentioning...

  1. If a high-scale application is experiencing periodic latency spikes or other performance issues, the log rolling setup is one component that is definitely worth suspecting.
  2. When Akka Actor threads are in the Park state, it does not always mean they are idle and waiting for work. Your application code can also cause this Park state without you realising it, when it invokes a method from a dependency library that internally calls sun.misc.Unsafe#park for some reason.
  3. It is not always possible to determine the root cause from a single thread dump taken at the time of the incident. In such situations, adding a time dimension to your thread dump analysis by taking snapshots before, during and after the performance incident gives you much better visibility into which thread patterns are normal and which are not.
  4. Since FixedWindowRollingPolicy does not have an async compression option (at least as of Logback version 1.2.3), and if you have no strict need to stick to it, it is worth considering TimeBasedRollingPolicy or SizeAndTimeBasedRollingPolicy, where compression takes place asynchronously.
  5. Last but not least: “prevention is better than cure”. The Akka Actor framework provides highly scalable performance, provided we use it the right way. One of the things we need to take care of to ensure that is to avoid blocking operations inside Actors as much as possible, and for that we need to understand the inner workings of the dependency libraries we use for logging, database operations, HTTP requests and other I/O. When blocking is unavoidable, it is worth exploring options to handle those operations asynchronously. You can refer to the Akka documentation for more information on blocking operations.
