Request Tracing for Fun and Profit

How many times has this happened to you?

You get an email/ticket/instant message saying, “A weird, bad thing happened in production. Please investigate immediately!” You search the production logs, and find no obvious reason for the misbehavior. As software engineers, we often try to deduce what actually happened in production from terse logging. This can be like trying to find a needle in a haystack made of needles. A needlestack!

In lower-volume, non-production environments, we may be able to temporarily configure verbose logging in order to glean more information, but that’s usually a clumsy process. And it’s often not an option in production, where performance is critical and throughput must be high. We typically have to make painful compromises when deciding on what we can or can’t log in production. All of which hampers our ability to quickly fix critical problems with our systems.

What if we could get tons of detailed information only when we needed to? In this article I’ll show you how to do exactly that in Java with Logback or Log4j.

The Problem

The system I support exposes RESTful services critical to HomeAway’s e-commerce experience. It is composed of Java applications built using the Dropwizard framework. It’s part data-driven, part algorithmic, and all complicated. When it misbehaves, finding the cause is a tedious process of fetching data from systems of record, verifying the data, adding it as JSON documents to unit test fixtures, and then attempting to reproduce the problem locally under what I hope are similar circumstances. It may not work.

Instead, I could try to find the next occurrence by adding logging to the relevant transactions in the running system. But this system processes hundreds of thousands of transactions per second in production; if I were to add logging frivolously, I could degrade the system’s performance and overwhelm our logging system.

A Solution

I wanted to replay a request and specify, log this request verbosely, but only this one. This way, I could see why those misbehaving requests ended up the way they did. I also didn’t want to make extensive changes to our codebase to plumb through special request or response types, or invasively pass contexts around. And obviously, I mustn’t affect performance of other requests or create a colossal log file. Ideally, my solution would also leverage our existing frameworks with little additional risk.

The MDC (mapped diagnostic context) is a thread-local map-like object that contains information that is used by logging appenders. We can use it to communicate the only this one constraint on a given request.

The first step is to create a JAX-RS request filter that sets a sentinel value in the MDC for incoming requests bearing a special header. The request filter also removes the sentinel value because many servlet containers re-use threads, and I don’t want to accidentally trace code I didn’t intend to.

In my implementation, when a request has a header of “X-RequestTrace”, the MDC will contain a “requestLogLevelOverride” value of “trace”:

Using the MDC Key

Now that requests with “X-RequestTrace” headers have the special sentinel value in the MDC, we need to tell the logging framework to treat log messages differently when this sentinel value is present. The Logback and Log4J frameworks provide a means to invoke a logging message filter for all log requests. Logback, which we use at HomeAway, calls this mechanism the TurboFilter.

It’s a type of log message filter that is applied to all log messages, not just specific appenders as in the normal Filter class. There are several implementations of TurboFilter included with Logback that cover the most common use cases, and you can extend the TurboFilter class to create your own if necessary.

For the simplest implementation of request tracing, all we need to do is allow logging below the normal level (for instance, TRACE) for requests marked with the MDC key via the filter. Luckily, the MDCFilter implementation can accomplish this out of the box. Simply add this to your Logback config file:

Voila! With that in place, any messages logged during the scope of a request containing the header will be output, even if they were below the configured level of the given logger. This includes events from any logger namespace in the same thread marked with the MDC key. Most HomeAway applications store a unique request marker in the MDC, which correlates log messages between multiple apps and threads, so our super-detailed trace output will be correlated and easily searchable.

It’s not all good news, though. There are a couple of issues with this simple implementation:

  1. Every logger used by the thread is affected, even ones that don’t contain interesting information, which can dilute the valuable information with noise.
  2. The MDC is not automatically passed to child threads created using thread pools. Consequently, log messages in child threads will not log verbosely.
  3. There probably aren’t enough trace-level messages in your code to make a huge difference.

There’s no getting around #3: you’ll need to add lots of trace logging to your application to make this pay off. However, that’s a small price to pay for on-demand, detailed logging. Issue #2 can be fixed with special handling around thread pooling (which I’ll address later in this post). And the first can be avoided with a custom TurboFilter that I’ve implemented to improve the tracing.

What about Child Threads?

The MDC is not automatically inherited by child threads. So to enable operation tracing in child threads, you’ll need to provide the child thread with a copy of the parent thread’s MDC. Simply make a copy of the parent thread’s MDC and then have the child call setContextMap() with that value, at the beginning of the child’s code:

Thus, with only a few lines of code we can ensure that code on background threads outputs the same awesome, verbose diagnostics.

Filtering Out Noise

When troubleshooting production issues, we are primarily interested in code that we control. For instance, running the Spring framework at TRACE levels results in tons of uninteresting messages about the DI container that aren’t related to the business operation we’re trying to track and understand.

However, it’s also possible that some third-party components being leveraged are part of the code flow worth tracing. I’ve created a version of the MDCFilter class that allows for passing in a whitelist of namespace prefixes so that those components can be exposed. We use “com.homeaway” as a default because all our code has that package prefix, then explicitly add the third-party namespaces whose logs are of interest.

This means we can simply use the TurboFilter in our logback.xml with the namespace prefix set:

If necessary, the class could be extended to have more flexibility with namespacing, but I found that this provided an adequate level of control for making the trace have the information important to the trace.

Performance Considerations

We must ensure that the tracing code (tons of log.trace() statements) doesn’t affect the performance of our application. While it’s true that any line of code makes something run slower, Logback and other modern logging frameworks are highly optimized for massive message volumes. And the ability to understand what’s happening is worth a small tradeoff cost in performance. But it is important not to incur extra overhead for messages that may not valuable.

Do not do string manipulation that causes objects to have their toString() method called in an argument; instead, use the format overloads of the logger so that the creation of the messages happens lazily and will not happen unless the trace calls for it.

Don’t:

log.trace(“my request object:” + requestObj);

Do:

log.trace(“my request object: {}”, requestObj);

If there is tracing that is valuable, but can’t be represented via the format methods of the logger, wrapping the log statement in an if block is the best compromise to make:

if (log.isTraceEnabled()) {
/* costly tracing code */
})

Note that we strip the tracing header from all external requests. It generates massive amounts of logging, which could be used as a denial of service attack. We recommend that you strip it out when you receive it from untrusted sources.

Summary

Operation tracing enables you to get verbose logging, just when you need it. Enhance the logging in your complex systems with this approach, and add all the trace level logging statements that you need. Those logging statements will appear the moment you replay a request bearing the magic sentinel header. Your future self will thank you!