Request tracing in microservices architecture — A Kleisli tale

Louis Forite
Mar 18, 2020 · 7 min read
Image for post
Image for post

Debugging microservices is a tedious task. A lot of components interact with each others and when a bug pops up identifying its root cause is complex. It gets even messier at scale.

It is difficult, but not impossible. Here, we will see how we can leverage a functional programming data type, namely Kleisli, to achieve such result.

The problem

Image for post
Image for post
A basic microservices setup

In this setup, service A calls service B, service B calls service C and D then aggregates the result to do some business processing and returns the result to A. So far so good.

Now consider the following scenario:

Image for post
Image for post

There is a problem in production. You have to debug it, so the first thing you do is checking the logs.

Service A logs

Line number 5 indicates there is a problem related to the service B http call. So you decide to check the logs of service B.

Service B logs

Line number 6 indicates there is a problem to the service D http call. So you have to check the logs of service D.

Service D logs

Line number 6, we have finally found the bug. It turns out there was a connection issue with the database. Sure we are now on our path to fixing the issue, but we took the long and cumbersome road. We had to go through the logs of three different services (that we might not even own) to try and correlate log traces manually (checking that the timestamps match and drawing conclusions that this line is correlated to our initial problem).

By the way: did you notice? The timestamps in the logs don’t even match. The time in the service B is off by 2 hours, which makes it even harder to find the bug.

This example is also mercifully simple, we are only managing four microservices here. Imagine what would happen with a lot more.

We need to avoid this much interpretation-heavy and error-prone exploration, but how?

The solution

Service A logs with correlation id

Right next to the timestamp, a generated request id has been added.

Service B logs with correlation id

With this technique, it is easy to grep all the logs matching an id and track down the source of the bug. We can easily see that the correlation id 1EgBpH1p matches with the one from the service A logs.

Service D logs with correlation id

Neat, but how do we implement such a solution? Let’s explore the tools we have at our disposal.

JVM Tools

Generally speaking, those tools require a lot of specific knowledge. It means you need to invest time (sometimes a lot) in learning something that might not be reusable. It is also very library / framework specific.

Pros

  • works out of the box if the framework and the libs fit

Cons

  • side effects (it might cause OOM exceptions)
  • Hard to setup
  • framework specific
  • specific knowledge

Thread MDCs

Thread MDC usage example

It is a good solution but it falters when you have asynchronous computations. There, the correlation id will not get propagated to the next thread (it won’t work with Future, IO, Task…). To make it work you would have to copy the context which is an expensive and tricky thing to do.

Pros

  • simple to setup
  • code changes are only located where the correlation id is needed

Cons

  • does not work in a multithreaded context or needs tricky adaptation
  • not test friendly

Argument passing

Argument passing example

Pros

  • straightforward to implement
  • test friendly
  • work in a multithreaded environment

Cons

  • spreads everywhere as all the methods will end up needing it
  • adds a lot of noise in the method declaration and calling site

Implicit argument passing

Implicit argument passing example

Pros

  • straightforward to implement
  • work in a multithreaded environment
  • method calling do not require to explicitly pass the parameter

Cons

  • still noisy
  • spreads everywhere as all the methods will end up needing it

Kleisli

What matters is that Kleisli’s work has been translated into computer science and we can find the following data structure in the Cats / Scalaz libraries:

Taken from Cats / ScalaZ

Let’s break it down:

  • F[_] is the Monad / Applicative / Functor of your choice such as Future or Option or IO .
  • A is what I call a “context” (or dependency)
  • B is the type you will get within your F

To keep it simple, a Kleisli is a data structure that encapsulates a function that given an A results in a F[B] . To get an F[B] you have to provide an A. So very naively explained, we introduce a dependency between the computation F[B] and A. Example:

if F[_]is an IO and A is a CorrelationId and and B is a String then we get

Kleisli[Option, CorrelationId, String](run: CorrelationId => IO[String])

To execute the run function, we have to provide a CorrelationId (the “context”) and we will get an IO[String] in return.

Kleisli properties

  • If F[_] is a Functor then Kleisli[F[_], _, _] is also a Functor
  • If F[_] is an Applicative then Kleisli[F[_], _, _] is also an Applicative
  • If F[_] is a Monad then Kleisli[F[_], _, _] is also a Monad

Concretely it means that you will have access to map , ap (or product / zip), flatMap functions from your Kleisli[F[_], _, _] construct depending on the properties of your F[_] just as you would do with Future , Option and IO .

Pretty cool huh ?

Kleisli operations

Creating a Kleisli
Running a Kleisli
Asking the “context” out of a Kleisli

That is a lot of information and you might be wondering how it is helping us track our log errors. So let’s get back on track.

Solving the correlation id problem with Kleisli

Putting it all together it would look like this:

A logger using Kleisli
Services using Kleisli

Since IO is a Monad, we can still map / flatMap on a Kleisli[IO, _, _] . If you look closely, this version is very close from the one with implicit parameters. The body of the createA method is similar !

Retrieving or generating the correlation id
Flowing the correlation id to other services

Pros

  • test friendly (just provide an a context and get a regular IO
  • work in a multithreaded environment

Cons

  • still verbose

The shorte(r/est) version of Kleisli

If you want to be able to use all the Kleisli methods with the shorter name RIO you have to alias them once such as:

Alias Kleisli operations

I personally like the RIO name because it is short and is still meaningful. RIO stands for Rich IO (implying it comes with some context).

Replacing Kleisli[_, _, _] by RIO[_]

You ca find some examples here:

Looking at the future

I expect this feature to gain a lot of traction for all the reasons we mentioned in this article. Is RIO dead ? Not quite:

Leveraging the new Dotty contextual function

Find more about it here: https://dotty.epfl.ch/docs/reference/contextual/context-functions.html

Summary

iAdvize Engineering

Publications from the iAdvize engineering team :)

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store