Distributed Transaction Tracing with Transaction Ids in JVM Services

Published in

ClassPass Engineering

7 min readFeb 16, 2022

When working in a microservices environment, a problem we frequently face is that it can be difficult to follow logs for a single transaction across services. Application logs are often one of the first places we look when performing triage for a bug or trying to understand behavior our systems are exhibiting. This becomes challenging in scenarios in which we’re dealing with distributed transactions since logs become scattered, and in order to see what’s happening across different services in a single transaction, we are forced to resort to matching logs between the services based on timestamps or other data related to the transaction. Not only is this time consuming and daunting, but it can cost us precious time when debugging critical issues.

Here at ClassPass, we’ve recently undergone a push to adopt use of transaction identifiers for tracing distributed transactions with our logs. The idea is that a unique id can be generated for each transaction and included in all logs written for that transaction, allowing logs to be searched by transaction id.

The general approach we took involves using an established transaction id request header that we include in all service calls, and configuring our services to get the transaction id from this header on incoming requests, or generate a new transaction id if it wasn’t included in the header. Services are then responsible for including the transaction id header in any outgoing service calls. Some guiding principles of our implementation:

We don’t want to require explicitly passing the transaction id everywhere
It should be easy to configure a service to handle and forward transaction ids

Technical Details

We have 2 primary web frameworks that we use at ClassPass for JVM services: Dropwizard and Ktor. Setting up transaction ids for Ktor is fairly straightforward, so we will focus on the approach taken for Dropwizard here.

Using MDC to hold transaction ids

For logging, we want to provide our transaction id to the MDC (Mapped Diagnostic Context), which enables us to add context to log messages without explicitly needing to include it in log statements. The main thing to note is that the MDC is essentially a thread local map that stores context for logs. We can reference MDC keys in our log template, which results in the value of that key being written in all log statements. As an example, suppose we have the following logging template (note that we’re using slf4j/logback):

%d{HH:mm:ss.SSS} transactionId=%X{transaction_id} %-5level %logger{36} - %msg%n

Then %X{transaction_id} would be replaced with the value under the key transaction_id in the MDC. Consider the following code:

import org.slf4j.LoggerFactory
import org.slf4j.MDC

val logger = LoggerFactory.getLogger("main")
fun main() {
    MDC.put("transaction_id", "abc123")
    
    logger.info("hello!")
}

The log statement would show up as:

00:00:00.000 transactionId=abc123 INFO main - hello!

Placing transaction id in MDC

The main goal was to set the transaction_id key in the MDC for all requests. We did this by implementing a servlet filter that gets the transaction id from the transaction id header on an incoming request, or generates it if the transaction id header is not present, and adds it to the MDC.

Note that dropwizard is based on Jersey, and with Jersey apps, requests are handled by a dedicated thread (unless you’re using the async api). Since MDC is thread local, we can expect all logs for a given transaction to include the transaction id as long as the log statements are in code that’s executing on the request handling thread.

Forwarding transaction ids

Beyond capturing or generating transaction ids for incoming requests, we also need to ensure we’re forwarding the transaction id in any further service calls we make, but we don’t want to require all service calls to explicitly pass the transaction id value. The details here depend on the client you use, but most clients have some way to apply global configuration to all requests. The general idea is that we would add a global request modifier to each http client that adds the transaction id header for us.

Making transaction id forwarding work with OkHttp

We make heavy use of OkHttp for http clients. OkHttp has a construct called Interceptors, which can be used to add headers to every request before they’re sent. These can be useful for things such as setting the User-Agent header on all requests, or in our case, setting the transaction id header. Configuring a client with an interceptor for our use case would look something like:

OkHttpClient.Builder()
    .addInterceptor {
        val modifiedRequest = it.request()
            .newBuilder()
            .addHeader(
                "x-cp-transaction-id",
                MDC.get("transaction_id")
            )
            .build()
        
        it.proceed(modifiedRequest)
    }

An issue we faced here is that the interceptor code runs in the OkHttp client’s thread pool, thus our transaction id, which is local to the request handling thread, isn’t in the MDC in the interceptor code. To get around this, we took advantage of the ability to supply a custom executor service for the client to use as a dispatcher, and built a special executor service for capturing the transaction id.

An executor service to capture transaction ids

We ended up building a delegating executor service for this. The idea is that a Runnable instance that’s submitted to an executor service is constructed on the thread submitting the Runnable, and executed on a thread from the executor services pool. In our case, the Runnable would be constructed on the request handling thread and executed on a thread from the client’s thread pool. We built a delegating Runnable that captures the transaction id value and stores it on the instance when it’s constructed, then in the run method, we put the transaction id value in the MDC to make it available on the OkHttp client’s thread. The delegating runnable was implemented as follows:

class TransactionContextCapturingRunnable(private val runnable: Runnable) : Runnable {
    // This is constructed on thread that is submitting work to executor service, in our case this would be the request handling thread
    private val transactionIdValue = MDC.get("transaction_id")

    override fun run() {
        // This will run in an executor service thread, so we save the existing value of the transaction id on this thread so we can reset when we're done, and then put the captured value in the MDC under the transaction_id key
        val valueBefore = MDC.get("transaction_id")
        MDC.set("transaction_id", transactionIdValue)

        try {
            runnable.run()
        } finally {
            MDC.set("transaction_id", valueBefore)
        }
    }
}

From here, we just needed to implement a delegating executor service that wraps all Runnables submitted to it with TransactionContextCapturingRunnable, so we ended up with this:

class TransactionContextCapturingExecutorService(
    private val executor: ExecutorService
) : ExecutorService by executor {

    override fun execute(command: Runnable): Unit = executor.execute(TransactionContextCapturingRunnable(command))
}

We then configured all of our OkHttp clients that call internal services to use the TransactionContextCapturingExecutorService as a dispatcher, which unlocks the ability to use an interceptor to forward the transaction id with all requests.

Logging transaction ids in async code

Some of our endpoints involve asynchronous processing where work may be done on various threads. In cases where work is handed off from the request handling thread, we would lose the transaction id from the MDC. The main scenarios in which we encountered this involved passing work to executor services and coroutines. For executor services, this was easy to solve by using our delegating TransactionContextCapturingExecutorService.

Transaction ids and coroutines

Kotlin coroutines offer concurrency with a high level api and constructs that provide a safe way to manage concurrent operations. Coroutines have an associated CoroutineContext, which is a set of elements that can be used to manage how coroutines run. We were able to take advantage of the kotlinx-coroutines-slf4j package which provides integration utilities for slf4j and coroutines. In particular, the package provides an MDCContext class which brings the MDC in as a CoroutineContext element.

Whenever we pass work to a coroutine, we just need to provide the MDCContext element, and all logs written within that coroutine, and any child coroutines, will include our transaction id. For example, consider the following code:

MDC.put("transaction_id", "abc123")

// Assume we're operating on Jersey's request handling thread here
// The context map will include our transaction id
val mdcContextMap = MDC.getCopyOfContextMap()

// runBlocking acts as a "bridge" into coroutines
// By adding the MDCContext argument, we ensure our MDC context is captured in the coroutine created here.
runBlocking(MDCContext(mdcContextMap)) {
    // We're now in a coroutine
    logger.info("hello!")
}

Assuming the same logging config as above, the log statement within the coroutine created by runBlocking will display as:

00:00:00.000 transactionId=abc123 INFO main - hello!

Making adoption in new services easy

We wanted to make it as easy as possible to configure new services to configure transaction ids, which we solved by bringing in shared tools that handled most of the boilerplate.

One of the tools was a servlet filter that handles capturing the transaction id from the incoming request headers, or generating a new transaction id if it’s absent, and putting the transaction id in the MDC on the request handling thread. With this, we essentially just need to add a single line to our application startup code:

env.servlets().addFilter(
    "TransactionIdMDCFilter",
    TransactionIdMDCFilter()
)

For configuring OkHttp clients to forward transaction ids, we created an extension function useTransactionIdCapturingDispatcher on OkHttpClient.Builder that uses the TransactionContextCapturingExecutorService under the hood. This made it so that clients only needed to add a single line to the OkHttp configuration block

OkHttpClient.Builder()
    ...
    .useTransactionIdCapturingDispatcher() // Adding this line handles all transaction id forwarding concerns

Outcomes

We were able to get the vast majority of our services configured for transaction ids in a short period of time. The push to incorporate transaction id handling in our services was partially motivated by the upcoming high season, which generally falls around the new year when people have a renewed focus on their fitness goals. During this time we expect high traffic through our acquisition flows driven by ad campaigns, and beyond ensuring our systems are in a healthy state and scaled properly to handle increased traffic, we also want to ensure we’re prepared to quickly triage and resolve critical issues. Transaction ids in logs are extremely valuable towards enabling us to quickly triage distributed transactions to identify where a flow is breaking so we can rapidly make repairs.

As the world recovers from the pandemic, ClassPass is roaring back and we’re excited to be in the front seat of the recovery of the fitness/wellness industry. Come work with us!