Distributed Tracing & Latency Analysis for Microservices

Mónika Lombos
6 min readJun 11, 2020

--

In the Spring Boot World

Preface

Have you ever felt the pain of debugging your services without any clue what could have gone wrong when something happens? When you don’t know how your perfectly designed microservices system could have problems.

Or another case is when you have multiple smoothly working services, though somewhere your process slows down and you need to know where you should scale up to handle the load faster.

Life happens to every exceptional service, too, nobody is perfect.

Let’s assume you have a full-fledged microservices structure, including services doing messaging on HTTP and messages queues with customized logging, metrics and alerts. This setup already looks quite sufficient and can help you a lot in taking user experience to a higher level by identifying and fixing your services’ issues more quickly.

BUT… and there is always a but… unfortunately, sometimes these hidden bugs or bottlenecks cannot be found easily. Causes can unintentionally catching a bug without (at least) logging it, timeout exception, a badly configured zuul proxy throughput settings, etc.

There is no doubt that developers can deal with these above-mentioned hidden bugs and flaws as well, still, the question is: How long would it take to identify the issue and solve it?

Distributed tracing comes into the picture and makes your life easier and helps prevent you from tearing your hair out.

Let’s narrow the topic down to Spring Boot Sleuth and Zipkin.

Sleuth provides Spring Boot auto-configuration for distributed tracing.

Zipkin helps gather the timing data needed to troubleshoot latency problems in service architectures.

Before we start let’s define the terminology (borrowing from Dapper’s)

  • span id The span represents a basic unit of work, for example sending an HTTP request.
  • trace id The trace contains a set of spans, forming a tree-like structure. The trace id will remain the same as one microservice calls the next.
  • baggage field/propagation field Baggage is a field that is propagated with the trace, optionally out of the process.
  • tags Like trace IDs, baggage is attached to messages or requests, usually as headers. Tags are key-value pairs sent in a Span to Zipkin.

Please note that Case I refers to only-sleuth project, whileCase II & Case III to sleuth-zipkin-elk.

Case I — only log correlation

If you don’t need latency analysis, you only need to include spring-cloud-sleuth dependency in your project. (Version should be managed via spring cloud BOM.)

A simple example can be found here in theonly-sleuth folder, but I would like to highlight the main points of the usage here.

Steps

  1. include spring-cloud-sleuth dependency
<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-sleuth</artifactId>
</dependency>

This will add trace and span IDs to the Slf4J MDC, so you can extract all the logs for a given trace or span from a log aggregator.

2. define your application name

I like using ‘@name@’ placeholder since it will be automatically resolved by using spring-boot-starter from pom.xml in case of using maven, but you can use your own custom app name as well.

spring:
application:
name: '@name@'

And let’s stop just here.

Without any logging configuration, running the Spring Boot app will provide the span and trace ID for you out-of-the-box.

If you clone the example I mentioned above, you can try it now immediately. Go to the directory called server-one and run the app. Let’s call the only one GETREST endpoint defined in this ultimately simple app by doing a request toward http://localhost:8080 and then you can see the errors in the log.

Errors are occurring because server-one cannot call server-two via Feign client. (Since we haven’t started server-two.)

But notice the enriched log message:

2020-06-11 10:45:14.848 ERROR [server-one,efa635672cc15e49,efa635672cc15e49,false] 45414 --- [nio-8080-exec-1] o.s.c.s.i.web.ExceptionLoggingFilter     : Uncaught exception thrown

In this case, the brackets contain the following data:

[appName, traceId, spanId, exportable]

It’s that simple.

Case II — distribution tracing & latency analysis

As for those who are also interested in latency analysis by Zipkin, this sample can be handy in the sleuth-zipkin-elk folder.

Instead of spring-cloud-sleuth dependency, spring-cloud-zipkin should only be included, it brings sleuth as well.

Again, the sample is the source of truth, but I would like to emphasize the main points of the usage here.

Steps

1.include spring-cloud-zipkin dependency

<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>

2. define your application name in your services

spring:
application:
name: '@name@'

3. define zipkin base url

The default is http://localhost:9411/ .

spring:
zipkin:
base-url: ${ZIPKIN_BASE_URL:http://localhost:9411/}

After running docker-compose.yml in the given way (in the README.md file, the USAGE part shows you how to easily launch the apps), the latency analysis will look like this one:

Latency analysis via Zipkin

Notice that my example contains messaging as well via apache kafka and sleuth handled it smoothly. For HTTP Requests sleuth uses Filters and for Spring Integration, it uses ChannelInterceptors

Case III — custom request id applied via nginx

high-level design of sleuth-zipkin-elk stack

In such a case where you already have a unique request ID applied via nginx (or other web servers), you probably don’t want to use the sleuth provided trace ID instead of yours.

For such cases, you should know that you cannot override the sleuth generated one. What you can do is set up a propagated field config for sleuth, since as I mentioned in the preface, this field will behave in almost the same way as a trace id.

The difference between baggage and propagated fields is that the former needs baggage- (HTTP) or baggage_(Messaging) prefix in your custom keys, while the latter does not.

Steps

1.2.3: Apply the 3 steps from theCase II section!

4. define propagation keys and whitelist keys in your application.yml

spring:
sleuth:
propagation-keys:
- "lombocska-request-id"
log.slf4j.whitelisted-mdc-keys:
- "lombocska-request-id"

Remember that adding entries to MDC can drastically decrease the performance of your application!

mentioned here

In this case if you have a request like this one

curl --location --request GET 'localhost:8080' \--header 'lombocska-request-id: 987'

sleuth will propagate alombocska-request-id as well as its value.

As for the logging config, you can see here that ”requestId”: “%X{lombocska-request-id:-}” was added for the pattern.

Additionally, a containerized ELK stack attached to this project, thus nicely formatted, tagged logs are available when you run this project via docker.

kibana — enriched logs in discovery with the custom propagated field

Finally, I need to highlight that zipkin is able to show the dependency graph of your services. I used openzipkin/zipkin-dependencies docker image for this functionality.

zipkin dependency graph

--

--

Mónika Lombos

Experienced software engineer, founder of strollerina.com and LEGO Duplo (Sorry not sorry) fan. Love coding and practicing yoga from the heart of Budapest.