Distributed Tracing in Polyglot Microservices

Rajkumar Venkatasamy
Feb 28, 2022


Hello Techies!!! In this post, I will share how to set up distributed tracing in polyglot microservices. The source code for this article is available on GitHub: https://github.com/rajkumarvenkatasamy/tracing-polyglot-services

Before we get into the details, here is the list of assumptions and prerequisites:

Assumptions

This post assumes the reader has exposure to:

  1. Microservices
  2. Spring Cloud libraries and how to develop microservices using them
  3. Python

Prerequisites

The following are the prerequisites to build and execute the demo project given in GitHub, or to extend it:

  1. JDK 1.8
  2. Python 3
  3. Docker

Note: The build and execution scripts are written for Windows. However, it should be easy enough to convert them to shell scripts or another scripting format.

Problem Statement

The microservices architecture is a powerful design pattern for breaking down complex monolithic applications into smaller, more manageable pieces. These so-called microservices can then be built and deployed independently of each other. However, as the famous Spider-Man movie quote goes, “Remember, with great power comes great responsibility.” With the power that the microservices design pattern gives us comes the responsibility to manage the complexities that such an architecture introduces.

Let’s consider a sample microservices-based application, designed as shown in Figure 1 below, to illustrate the problem statement:

Figure 1: Sample Distributed Microservices application

To simplify things, only a few components in the entire technological stack are captured in the diagram.

As microservices are by nature deployed in a distributed environment, one service often calls another dependent service to complete a transaction (be it a read or a write transaction).

For instance, as per Figure 1: when a user requests a REST endpoint via the API Gateway, the request gets routed to Microservice A (developed in Java). That service, in turn, produces an event to a Kafka topic, makes a call to an RDBMS, and finally sends the extracted data in an external call to Microservice C (developed in Python).

In a distributed application, getting the job done might involve a transaction spanning more than just two microservices, across multiple machines and data stores (databases, storage queues, etc.). It can be hard to narrow down where the latency is occurring, or which microservice or component in the entire distributed ecosystem is delaying the response or causing a failure.

For instance, as per Figure 1, each of the microservices (A, B, and C) is deployed on two servers. Let’s say users of Microservice A complain that responses are taking more than a minute.

How can the person troubleshooting the issue quickly understand where the latency is?

Even if that person tries to follow the logs on the servers, which server should they start with, and how do they trace the trail?

It is challenging, since we are dealing with multiple components (the API Gateway, Microservice A, Kafka, and Microservice C) that all run on different servers. The complexity increases another fold when the microservices are developed in different languages.

We need a common library or tool that can consolidate this information and pinpoint the problematic component in such a diversified ecosystem. This is the very scenario that we will address in this article, with the help of tracing libraries that are compatible with polyglot microservices and are developed and supported for the most popular programming languages.

Solution

Distributed tracing comes as a solution to the above-stated problem. How does it solve the problem? The answer is

  • By employing a Correlation ID to link together the parts of a transaction that span multiple services (a minimal illustration follows this list)
  • By collecting and sending a minimal but sufficient amount of trace data (small in size, since what is collected is only a kind of metadata) from multiple microservices into a single searchable application
  • By visualising the flow of a user request across multiple microservices and understanding the response time of each part of the entire transaction
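Before diving into the tooling, here is a minimal sketch in Python (using Flask and requests) of the Correlation ID idea. None of this is demo-project code: the endpoint, the downstream URL, and the hand-rolled IDs are purely illustrative, and instrumentation libraries, as we will see, do this propagation for you.

import uuid

import requests
from flask import Flask, request

app = Flask(__name__)


@app.route("/orders")  # hypothetical endpoint, not part of the demo project
def orders():
    # Reuse the caller's trace ID if present; otherwise start a new trace.
    trace_id = request.headers.get("X-B3-TraceId", uuid.uuid4().hex[:16])
    span_id = uuid.uuid4().hex[:16]  # this service's own span within the trace
    print(f"trace={trace_id} span={span_id} handling /orders")
    # Propagate the same trace ID downstream so that both services' logs
    # and spans can later be correlated under one Trace ID.
    response = requests.get(
        "http://inventory-service/stock",  # hypothetical downstream service
        headers={"X-B3-TraceId": trace_id, "X-B3-ParentSpanId": span_id},
    )
    return response.text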

Key terms related to Distributed Tracing

While dealing with distributed tracing, we often come across the following terms:

Trace ID: The Trace ID is the equivalent term for Correlation ID. It helps us correlate all the logs from different microservices generated during the processing of a request. A Trace ID is a unique number that represents an entire transaction. We will use this Trace ID later to trace a particular distributed request in our demo project.

Span ID: A Span ID is a unique ID that represents one part of the distributed transaction. Each service participating in the distributed transaction has its own Span ID.

Figure 2: Sample Trace information of a Distributed request

Any distributed request can be traced just by tracking simple metadata such as the Trace ID, the Span ID, and the timestamps of request start and completion. This is a lightweight solution (it deals only with this kind of metadata), and it differs from the collection and analysis of distributed log messages.¹ Distributed log messages also serve a different use case than distributed tracing.
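To make that concrete, a single span encoded for Zipkin’s v2 JSON API carries little more than these IDs and timestamps. The field names below follow the Zipkin v2 span format; the values are illustrative:

{
  "traceId": "7ab5c0bb52fbce1a",
  "id": "a1b2c3d4e5f60718",
  "parentId": "7ab5c0bb52fbce1a",
  "name": "get /greeting",
  "kind": "SERVER",
  "timestamp": 1645017298405000,
  "duration": 5012000,
  "localEndpoint": { "serviceName": "python-flask-service" }
}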

Demo Project — Technical stack

Figure 3: Technical stack overview

Figure 3 shows the technical stack of the project that we will use to demonstrate distributed tracing in polyglot microservices.

Our goal is to collect and transfer the request tracing data from the Java and Python applications to a Zipkin server. We will then use the Zipkin UI to correlate and look up trace information when needed.

Here is the overview of each component in our demo project:

  • Zipkin server: Used for the collection and lookup of distributed tracing data. The lookup/search is usually done through the Zipkin UI. We will gather timing data in the Zipkin server to troubleshoot latency problems in the service architecture. As part of the demo project, I have not added a persistence/storage layer to store the trace information; all of it is held in the Zipkin server’s memory. In production, you would need to think about a storage strategy; please go through the Zipkin documentation on this topic.²
  • Java-based Spring Cloud Microservice: The following are the significant Spring Cloud libraries used in this demo project:
Netflix Eureka Server: to serve as the service registry
Gateway: to serve as the API gateway and load balancer
Netflix Eureka Client: to register the API gateway and the Java microservice application as Eureka clients
  • Python Flask framework based Microservice: Here, py-eureka-client is added as a dependency to register the Python-based application as a Spring Cloud Netflix Eureka client, and py_zipkin is added as a dependency for gathering and transferring trace data to the Zipkin server.
  • Spring Cloud Sleuth³ with Zipkin : Instrumentation library for generating and sending trace information from Java-based Spring Cloud microservice application to Zipkin server
  • py_zipkin⁴ : Instrumentation library for capturing and sending trace information from Python-based Flask microservice application to Zipkin server

Demo Project — Source Code Highlights

The source code used for this demo is available in the GitHub repository: https://github.com/rajkumarvenkatasamy/tracing-polyglot-services

Project structure

Figure 4: Demo Project Structure

The demo project is named tracing-polyglot-services. We have already seen an overview of certain modules in the Technical stack section.

The build-and-execute directory contains scripts for building the Docker container images of each application (both Java and Python) and for running all the applications of this demo project.

Changes related to Distributed tracing in Java application

To enable the collection and transfer of tracing information from the java-spring-reactive-service application to the Zipkin server, we need to add the following dependencies to the build.gradle file:

implementation 'org.springframework.cloud:spring-cloud-starter-sleuth'
implementation 'org.springframework.cloud:spring-cloud-sleuth-zipkin'

Spring Cloud Sleuth automatically adds the HTTP headers X-B3-SpanId and X-B3-TraceId to all requests and responses. It also automatically adds trace and span information to the log statements we define in our application.

For instance, a regular log statement like this in our application

log.debug("Inside ExternalService.getGreetingFromPythonFlaskService method of java app");

will be logged as shown below, with trace and span information included:

2022-02-16T13:04:58.406531400Z 2022-02-16 18:34:58.405 DEBUG [java-spring-reactive-service,7ab5c0bb52fbce1a,7ab5c0bb52fbce1a] 7 --- [or-http-epoll-2] c.r.j.service.ExternalService : Inside ExternalService.getGreetingFromPythonFlaskService method of java app

Just by adding these dependencies, we get all of these capabilities; developers need not write any code to start registering and collecting the trace information of a request.

The next step is to transfer this generated trace information to the Zipkin server. That is also handled automatically, provided we configure the Zipkin server URL in the Java application.properties as follows:

spring.zipkin.baseUrl=http://zipkin:9411

I have configured the build scripts to register the Zipkin service under the name zipkin, and hence that same name is used in our application properties.

After starting all the services of the demo project (using the start.bat script, which we will see in the latter part of this article), you can access the Zipkin UI in your browser at http://localhost:9411

In a production application, you might have a Spring Cloud Config Server as part of the technical stack to manage common application properties; in that case, this configuration can be managed by the Config Server. To keep this article focussed on distributed tracing, I have restricted the demo project to the minimal required applications and libraries.

Changes related to Distributed tracing in Python Flask application

In the Spring Java application, we integrated distributed tracing by adding dependencies and configuration entries, without writing any custom logic. In the Python Flask application, however, we need to put in extra effort and write some code. This is one of the key reasons for writing this article, so that readers who need such a solution in their architecture can make use of the demo project. So let’s see how this can be achieved.

We need to add the following dependency to the requirements.txt file of our application:

py-zipkin

py-zipkin is an instrumentation library, the equivalent of the Spring Cloud Sleuth library in the Java world. With py_zipkin we have to write a little Python ourselves, but it is compatible with any Python framework, including Flask, with which our demo microservice is developed.

By default, py_zipkin won’t automatically capture and send trace information to the Zipkin server. As a programmer, you need to use py_zipkin’s zipkin_span to start Zipkin traces or to log spans inside an ongoing request trace.

zipkin_span can be used as a context manager or as a decorator in application code that uses py_zipkin. This may be hard to picture in the abstract, so let’s go through the code to understand it better.

We have to define a decorator (a function wrapper) in Python, as shown below:

zipkin_span decorator
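If the embedded snippet above does not render for you, here is a minimal sketch of such a decorator. The wrapper name with_zipkin_span, the port, and the sampling choice are my assumptions for illustration; the authoritative version lives in zipkin_utils.py in the repo. ZipkinUtils.default_handler is the transport handler covered in the next snippet.

from functools import wraps

from flask import request
from py_zipkin.encoding import Encoding
from py_zipkin.zipkin import ZipkinAttrs, zipkin_span


def with_zipkin_span(func):  # hypothetical name for the decorator
    @wraps(func)
    def wrapper(*args, **kwargs):
        trace_id = request.headers.get("X-B3-TraceId")
        attrs = None
        if trace_id:
            # Reuse the B3 trace context that Sleuth propagated from Java.
            attrs = ZipkinAttrs(
                trace_id=trace_id,
                span_id=request.headers.get("X-B3-SpanId"),
                parent_span_id=request.headers.get("X-B3-ParentSpanId"),
                flags=request.headers.get("X-B3-Flags", "0"),
                is_sampled=request.headers.get("X-B3-Sampled", "1") == "1",
            )
        with zipkin_span(
            service_name="python-flask-service",
            span_name=request.endpoint or request.path,
            zipkin_attrs=attrs,
            transport_handler=ZipkinUtils.default_handler,  # shown below
            sample_rate=100 if attrs is None else None,  # sample every request in the demo
            encoding=Encoding.V2_JSON,  # emit spans as Zipkin v2 JSON
            port=5000,  # assumed Flask port
        ):
            return func(*args, **kwargs)

    return wrapper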

Here, zipkin_span is used as a context manager, and we have initialised some of its key attributes, such as the service name (application name), trace ID, span ID, span name, and transport handler.

Notice that the trace ID and span ID are retrieved from the Flask framework’s request object. The request is populated with this information by Spring Cloud Sleuth in the calling application, which we went through in the earlier section. As a recap: in our application stack, the Python Flask service acts as a downstream microservice that gets called from java-spring-reactive-service.

Another important attribute to define is the transport_handler. It is the transport_handler that tells py_zipkin how and where to transfer the trace information. In our demo project, we have defined a separate static method (ZipkinUtils.default_handler) as the transport_handler, as shown below.

The default_handler method makes a POST request to the Zipkin server endpoint to transfer the trace information in JSON format. The print statement is included for convenience, so that you can see the trace information as JSON in the Docker container logs.
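Again, in case the embedded snippet does not render, here is roughly what it boils down to. The requests dependency and the exact collector path are assumptions on my part; only the class and method names come from the demo project:

import requests  # assumed HTTP client; the repo may use another


class ZipkinUtils:
    # Assumed Zipkin v2 collector endpoint; "zipkin" is the service name
    # configured in the build scripts, as mentioned earlier.
    ZIPKIN_SPAN_API = "http://zipkin:9411/api/v2/spans"

    @staticmethod
    def default_handler(encoded_span):
        # Print the payload so the trace JSON is visible in the docker logs.
        print(encoded_span)
        # POST the encoded span batch to the Zipkin collector.
        return requests.post(
            ZipkinUtils.ZIPKIN_SPAN_API,
            data=encoded_span,
            headers={"Content-Type": "application/json"},
        )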

There is one more aspect to handle in our application: hooking this tracing logic into Flask’s hook methods, before_request and teardown_request. With that in place, for every request to this application service, the trace information is automatically logged and sent to the Zipkin server.
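A sketch of that wiring, under the same caveat that the helper name build_span_for_request is hypothetical (the real logic sits in zipkin_utils.py):

from flask import Flask, g, request

app = Flask(__name__)


@app.before_request
def start_zipkin_trace():
    # build_span_for_request is a hypothetical helper that constructs the
    # zipkin_span shown earlier (service name, B3 attributes from the
    # request headers, transport handler, and so on).
    g.zipkin_span = ZipkinUtils.build_span_for_request(request)
    g.zipkin_span.start()


@app.teardown_request
def stop_zipkin_trace(exception):
    # Stopping the span flushes the trace data to the Zipkin server
    # through the transport handler.
    span = g.pop("zipkin_span", None)
    if span is not None:
        span.stop()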

With this, we have completed the changes to be carried out on the Python application front to enable distributed tracing.

All of the code explained above is available in the python-flask-service module of this demo project. A Python class named ZipkinUtils (in zipkin_utils.py) holds this entire logic.

Demo Project — Build and Execute

I have detailed the instructions for setting up your local machine, building the project, and running it in the README.md file, available in the demo project’s GitHub repo.

Given below is the request flow diagram, which explains the various endpoints available for testing the distributed tracing feature.

Figure 5: Request Flow Diagram

After starting the services, open the Zipkin URL (http://localhost:9411) in your local machine’s browser.

Trigger the First endpoint

Trigger the first endpoint given in the README.md file by pasting the given URL into your browser or the REST client tool/library of your choice. This endpoint makes a call to the Java microservice app and returns the response to the user without calling any other service.

Once you have the response back, switch over to the Zipkin UI and click the Run Query button. You will see the trace information as shown below:

Figure 6: Zipkin trace for the First endpoint

You can see that this request involves only one microservice. On clicking the SHOW button, you will see the trace ID and span ID registered for this request. They will match the trace information logged in your application log.

Trigger the Second endpoint

Now, trigger the second endpoint given in the README.md file. It makes a GET request to the Java microservice app, which in turn calls the Python Flask service to produce the response for the user.

Once you have the response back, switch over to the Zipkin UI and click the Run Query button. You will see the trace information as shown below:

Figure 7: Zipkin trace for the Second endpoint

The first entry in the list, at the top, is the trace registered for the second endpoint. Here you can see that this request involves two microservice applications, along with the total duration of the entire request. To see more details on this particular request, click the corresponding SHOW button in the Zipkin UI.

A page like the one below will appear, where you can see the time spent in each microservice involved in this request.

Figure 8: Zipkin trace — Detailed view for the Second endpoint

Clicking the Show All Annotations button on the right-hand side displays which server IP received the request and the corresponding start and finish times. In this way, we get a clear picture of which node in a production multi-node cluster received the request and how much time was spent in that particular application.

Trigger the Third endpoint

Now, trigger the third endpoint given in the README.md file, which makes a GET request to the Java microservice app, which in turn calls the Python Flask service to produce the response for the user. This is almost identical to the second endpoint, except that I have introduced artificial latency in both the Java and Python applications, so that in the Zipkin UI you can clearly see which microservice took a long time to respond.
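For reference, the delay on the Python side can be as simple as a sleep inside the handler. The route name and the five-second figure below are illustrative, not the demo project’s exact code (the real endpoints are listed in README.md):

import time

from flask import Flask

app = Flask(__name__)


@app.route("/delayed-greeting")  # hypothetical route name
def delayed_greeting():
    time.sleep(5)  # artificial latency; shows up as a ~5 s span in Zipkin
    return "Hello from python-flask-service"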

Now fire the request, and once you have the response back, switch over to the Zipkin UI and click the Run Query button. You will see the trace information as shown below:

Figure 9: Zipkin trace for the Third endpoint

You can see that, with the third endpoint call, the same two microservices are involved, but this time the total response time is just above six seconds. On clicking the SHOW button, you will see that most of the latency is in the python-flask-service: out of the six seconds, five were taken by the python-flask-service for its processing.

Figure 10: Zipkin trace — Detailed view for the Third endpoint

Summary

That’s it!!! We have come to the end of this article. In it, we have explored:

  • The problem statement and the need for distributed tracing
  • The demo project
  • Triggering distributed REST API calls and using the Zipkin UI to trace which services a request passed through
  • Identifying which server, among the many, actually received the request
  • Seeing how much time a particular service on that server took to respond

Thus, we have seen a demonstration of how to set up distributed tracing, collect the tracing information in a centralised location, and perform analysis in a distributed polyglot microservices project.

Citations

[1] : Distributed Logging vs Tracing

[2] : Zipkin Documentation : See Storage section

[3] : Spring Cloud Sleuth Documentation

[4] : py_zipkin
