Request tracing in microservices architecture — A Kleisli tale

Debugging microservices is a tedious task. Many components interact with each other, and when a bug pops up, identifying its root cause is complex. It gets even messier at scale.
It is difficult, but not impossible. Here, we will see how we can leverage a functional programming data type, namely Kleisli, to achieve this.
The problem
Let’s consider the following architecture composed of four services: A, B, C and D.

In this setup, service A calls service B; service B calls services C and D, then aggregates the results to do some business processing and returns the result to A. So far so good.
Now consider the following scenario:

There is a problem in production. You have to debug it, so the first thing you do is check the logs.
Line number 5 indicates there is a problem related to the service B HTTP call. So you decide to check the logs of service B.
Line number 6 indicates there is a problem with the service D HTTP call. So you have to check the logs of service D.
At line number 6, we have finally found the bug. It turns out there was a connection issue with the database. Sure, we are now on our path to fixing the issue, but we took the long and cumbersome road. We had to go through the logs of three different services (that we might not even own) and correlate log traces manually, checking that the timestamps match and concluding that a given line is related to our initial problem.
By the way, did you notice? The timestamps in the logs don’t even match. The time in service B is off by two hours, which makes it even harder to find the bug.
This example is also mercifully simple: we are only managing four microservices here. Imagine what would happen with many more.
We need to avoid this kind of interpretation-heavy, error-prone exploration. But how?
The solution
A simple solution would be to add a correlation id to the logs, one that is propagated to all the subsequent calls. It would look like this:
Right next to the timestamp, a generated request id has been added.
With this technique, it is easy to grep all the logs matching an id and track down the source of the bug. We can easily see that the correlation id 1EgBpH1p matches the one from the service A logs.
Neat, but how do we implement such a solution? Let’s explore the tools we have at our disposal.
JVM Tools
The JVM ecosystem is pretty rich when it comes to tooling. Some tools exist to instrument code and correlate log traces.
Generally speaking, those tools require a lot of specific knowledge. It means you need to invest time (sometimes a lot) in learning something that might not be reusable. They are also very library / framework specific.
Pros
- works out of the box if the framework and the libs fit
Cons
- side effects (it might cause OOM exceptions)
- hard to set up
- framework specific
- specific knowledge
Thread MDCs
Thread MDCs (Mapped Diagnostic Contexts) are thread-local maps that store some context associated with the current thread. As long as a task is running on the same thread, you can access this context.
It is a good solution, but it falters when you have asynchronous computations. There, the correlation id will not get propagated to the next thread (it won’t work with Future, IO, Task…). To make it work you would have to copy the context, which is an expensive and tricky thing to do.
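To make the idea concrete, here is a minimal sketch using SLF4J’s MDC (the correlationId key and the helper method are illustrative):

```scala
import org.slf4j.{LoggerFactory, MDC}

object MdcExample {
  private val logger = LoggerFactory.getLogger(getClass)

  def handleRequest(correlationId: String): Unit = {
    // store the id in the thread-local context; with a logback pattern
    // containing %X{correlationId}, every log line on this thread includes it
    MDC.put("correlationId", correlationId)
    try logger.info("calling service B")   // logged with the correlation id
    finally MDC.remove("correlationId")    // avoid leaking into the next request
  }
}
```

As soon as the work hops to another thread (a Future callback, an IO run on another execution context), the MDC value is gone unless you copy it over explicitly.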
Pros
- simple to set up
- code changes are only located where the correlation id is needed
Cons
- does not work in a multithreaded context or needs tricky adaptation
- not test friendly
Argument passing
A more straightforward solution would be to just pass the correlation id in each method call such as:
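Something along these lines (the method names and the Future-based HTTP helper are illustrative, not the original snippet):

```scala
import scala.concurrent.{ExecutionContext, Future}

final case class CorrelationId(value: String)

class ServiceA(implicit ec: ExecutionContext) {
  // stand-in for a real HTTP client call; the id would go in a header
  private def httpPost(url: String, body: String, cid: CorrelationId): Future[String] =
    Future.successful(s"[${cid.value}] response from $url")

  // the correlation id has to be threaded through every call explicitly
  def createA(payload: String, cid: CorrelationId): Future[String] =
    callB(payload, cid).map { response =>
      println(s"${cid.value} - B answered: $response")
      response
    }

  def callB(payload: String, cid: CorrelationId): Future[String] =
    httpPost("http://service-b", payload, cid)
}
```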
Pros
- straightforward to implement
- test friendly
- works in a multithreaded environment
Cons
- spreads everywhere as all the methods will end up needing it
- adds a lot of noise in the method declaration and calling site
Implicit argument passing
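A sketch of the same idea, with the correlation id carried as an implicit parameter (names are again illustrative):

```scala
import scala.concurrent.{ExecutionContext, Future}

final case class CorrelationId(value: String)

class ServiceA(implicit ec: ExecutionContext) {
  private def httpPost(url: String, body: String)(implicit cid: CorrelationId): Future[String] =
    Future.successful(s"[${cid.value}] response from $url")

  // the id no longer appears at the call sites, but every signature still declares it
  def createA(payload: String)(implicit cid: CorrelationId): Future[String] =
    callB(payload).map { response =>
      println(s"${cid.value} - B answered: $response")
      response
    }

  def callB(payload: String)(implicit cid: CorrelationId): Future[String] =
    httpPost("http://service-b", payload)
}
```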
Pros
- straightforward to implement
- works in a multithreaded environment
- call sites do not require the parameter to be passed explicitly
Cons
- still noisy
- spreads everywhere as all the methods will end up needing it
Kleisli
Heinrich Kleisli was a Swiss mathematician. He has been very influential in the field of category theory, and you can find more about him on Wikipedia. I am not really well versed in category theory, so I am not going to discuss the topic much here.
What matters is that Kleisli’s work has been translated into computer science and we can find the following data structure in the Cats / Scalaz libraries:
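In essence (simplified from the Cats definition):

```scala
// simplified: Cats declares the A parameter as contravariant
final case class Kleisli[F[_], A, B](run: A => F[B])
```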
Let’s break it down:
- F[_] is the Monad / Applicative / Functor of your choice, such as Future, Option or IO.
- A is what I call a “context” (or dependency).
- B is the type you will get within your F.
To keep it simple, a Kleisli is a data structure that encapsulates a function that, given an A, results in an F[B]. To get an F[B] you have to provide an A. So, very naively explained, we introduce a dependency between the computation F[B] and A. Example:
If F[_] is an IO, A is a CorrelationId and B is a String, then we get:
Kleisli[IO, CorrelationId, String](run: CorrelationId => IO[String])
To execute the run function, we have to provide a CorrelationId (the “context”) and we will get an IO[String] in return.
Kleisli properties
The cool thing about Kleisli is that it keeps the properties of your F[_]. It means:
- If F[_] is a Functor then Kleisli[F[_], _, _] is also a Functor
- If F[_] is an Applicative then Kleisli[F[_], _, _] is also an Applicative
- If F[_] is a Monad then Kleisli[F[_], _, _] is also a Monad
Concretely, it means that you will have access to the map, ap (or product / zip) and flatMap functions from your Kleisli[F[_], _, _] construct, depending on the properties of your F[_], just as you would with Future, Option and IO.
Pretty cool, huh?
Kleisli operations
There are three main operations on a Kleisli: creating a Kleisli, running a Kleisli, and asking a Kleisli for its context.
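A minimal sketch of the three operations with Cats, assuming a CorrelationId case class:

```scala
import cats.data.Kleisli
import cats.effect.IO

final case class CorrelationId(value: String)

object KleisliOps {
  // create: wrap a function A => F[B]
  val greet: Kleisli[IO, CorrelationId, String] =
    Kleisli(cid => IO.pure(s"[${cid.value}] hello"))

  // ask: get hold of the context from within the computation
  val traced: Kleisli[IO, CorrelationId, String] =
    Kleisli.ask[IO, CorrelationId].flatMap { cid =>
      Kleisli.liftF[IO, CorrelationId, String](IO(s"running with correlation id ${cid.value}"))
    }

  // run: provide the context to get the underlying F[B] back
  val io: IO[String] = greet.run(CorrelationId("1EgBpH1p"))
}
```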
That is a lot of information, and you might be wondering how it helps us track down our log errors. So let’s get back on track.
Solving the correlation id problem with Kleisli
As we mentioned earlier, we can see Kleisli as a way to introduce a dependency between a value and a computation. Another way to consider Kleisli is to view it as a way to contextualise a computation. For our example, we can consider that CorrelationId is the “context” in which an HTTP request runs. This means that to get the result of a request, a CorrelationId is required. So all our methods will have the form Kleisli[IO, CorrelationId, _], since everything runs “in the context of” a request.
Putting it all together it would look like this:
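Here is a sketch of that shape, including the createA method discussed below; the logging and HTTP helpers are illustrative stand-ins:

```scala
import cats.data.Kleisli
import cats.effect.IO

final case class CorrelationId(value: String)

object Services {
  // stand-ins for a real logger and HTTP client; the correlation id would
  // typically be logged and forwarded as an HTTP header
  def log(cid: CorrelationId, msg: String): IO[Unit] =
    IO(println(s"${java.time.Instant.now} ${cid.value} $msg"))

  def httpPost(cid: CorrelationId, url: String, body: String): IO[String] =
    IO.pure(s"response from $url")

  // every method runs "in the context of" a CorrelationId
  def callB(payload: String): Kleisli[IO, CorrelationId, String] =
    Kleisli(cid => log(cid, "calling service B").flatMap(_ => httpPost(cid, "http://service-b", payload)))

  def createA(payload: String): Kleisli[IO, CorrelationId, String] =
    for {
      response <- callB(payload)
      _        <- Kleisli.liftF[IO, CorrelationId, Unit](IO(println(s"B answered: $response")))
    } yield response
}
```

To actually execute createA, you run it with a correlation id, e.g. Services.createA("payload").run(CorrelationId("1EgBpH1p")), and get a plain IO back.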
Since IO is a Monad, we can still map / flatMap on a Kleisli[IO, _, _]. If you look closely, this version is very close to the one with implicit parameters. The body of the createA method is similar!
Pros
- test friendly (just provide a context and get a regular IO)
- works in a multithreaded environment
Cons
- still verbose
The shorte(r/est) version of Kleisli
We’ve solved most of our problems, but we’re still writing a lot of code. To cut down on repetition, you can make a type alias that removes the noise all around your application.
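For instance, a minimal sketch of such an alias:

```scala
import cats.data.Kleisli
import cats.effect.IO

final case class CorrelationId(value: String)

object tracing {
  // fix the effect (IO) and the context (CorrelationId) once and for all
  type RIO[A] = Kleisli[IO, CorrelationId, A]
}
```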
If you want to be able to use all the Kleisli methods with the shorter name RIO, you have to alias them once, such as:
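A sketch of what such a companion object could look like (the selection of aliased methods is illustrative):

```scala
import cats.data.Kleisli
import cats.effect.IO

// assumes the RIO alias and CorrelationId from the previous snippet
object RIO {
  def apply[A](run: CorrelationId => IO[A]): RIO[A] = Kleisli(run)
  def pure[A](a: A): RIO[A]                         = Kleisli.pure[IO, CorrelationId, A](a)
  def liftF[A](fa: IO[A]): RIO[A]                   = Kleisli.liftF[IO, CorrelationId, A](fa)
  val ask: RIO[CorrelationId]                       = Kleisli.ask[IO, CorrelationId]
}
```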
I personally like the RIO name because it is short and is still meaningful. RIO stands for Rich IO (implying it comes with some context).
You can find some examples here:
- For a Play app
- For a http4s app
Looking at the future
Expressing a dependency between a computation and a context is a very common thing to do. So common that Dotty (eventually Scala 3) has a built-in construct called Context Function Types to address this need. It introduces a new operator, ?=>. Here is a snippet of what you will be able to do in the future:
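A minimal sketch in Dotty / Scala 3 syntax (the Traced alias and the method names are illustrative):

```scala
final case class CorrelationId(value: String)

// a context function type: a CorrelationId must be given to produce an A
type Traced[A] = CorrelationId ?=> A

def log(msg: String)(using cid: CorrelationId): Unit =
  println(s"[${cid.value}] $msg")

// the correlation id is threaded through implicitly by the compiler
def createA(payload: String): Traced[String] =
  log("calling service B")
  s"result for $payload"

@main def demo(): Unit =
  given CorrelationId = CorrelationId("1EgBpH1p")
  val result: String = createA("hello") // the given CorrelationId is supplied automatically
  println(result)
```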
I expect this feature to gain a lot of traction for all the reasons we mentioned in this article. Is RIO dead? Not quite:
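One way RIO could be expressed on top of a context function (a sketch, assuming the same IO effect as before):

```scala
import cats.effect.IO

final case class CorrelationId(value: String)

// the Kleisli wrapper disappears: the context is now carried by the type itself
type RIO[A] = CorrelationId ?=> IO[A]
```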
Find more about it here: https://dotty.epfl.ch/docs/reference/contextual/context-functions.html
Summary
Kleisli is a powerful construct to express a dependency between a context and a computation. Since it preserves the properties of your F, it is the perfect building block for your application. In this article we covered the log correlation issue, but you might extend this example to add more “context” to your computations, such as the current “user” executing the query… In the future, I expect the new Dotty feature “context functions” to gain a lot of traction and to become a replacement for such use cases.