Distributed Tracing on Hybrid Cloud using Apache Kafka

Venkata Surya Lolla
Sep 15 · 6 min read
Blog illustration

The title sounds cool, right? I know, but what is this distributed tracing? I had the same question when I was asked to set one up for a client.

Let’s get some background

Jaeger is an open-source distributed tracing technology and a graduated project of the Cloud Native Computing Foundation (CNCF). It is used to monitor and troubleshoot microservice-based distributed systems: performance optimization, root cause analysis, service dependency analysis, and many more use cases. It comprises five components:

  • Client: a language-specific implementation of the OpenTracing API* used to instrument the applications
  • Agent: a network daemon that listens for spans** and sends them over to the collectors
  • Collector: validates, indexes, and stores the traces received from the agents
  • Ingester: a service that reads spans from a Kafka topic and writes them to the storage backend
  • Query: a service that retrieves the traces*** from a storage backend and hosts a UI to display them

* The OpenTracing API provides a standard, vendor-neutral framework for instrumentation. A developer can introduce a different distributed tracing system by simply changing the configuration of the Tracer in the code.

** A span represents a single unit of work and includes the operation name, start time, and duration

*** A trace is made up of one or more spans
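To make the two definitions above concrete, here is a minimal sketch of a span and a trace as plain Python data classes. This is only an illustration of the concepts, not Jaeger's actual data model; the field names, the microsecond unit, and the sample values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """A single unit of work: operation name, start time, and duration."""
    operation_name: str
    start_time_us: int  # microseconds; the unit is an assumption for this sketch
    duration_us: int

    @property
    def end_time_us(self) -> int:
        return self.start_time_us + self.duration_us


@dataclass
class Trace:
    """A trace is made up of one or more spans."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    def total_duration_us(self) -> int:
        """Time from the earliest span start to the latest span end."""
        if not self.spans:
            raise ValueError("a trace needs at least one span")
        start = min(s.start_time_us for s in self.spans)
        end = max(s.end_time_us for s in self.spans)
        return end - start


# Hypothetical trace: an HTTP request that contains a database query.
trace = Trace(
    trace_id="abc123",
    spans=[Span("http-request", 0, 500), Span("db-query", 100, 250)],
)
print(trace.total_duration_us())
```

Real Jaeger spans carry more than this (trace and span IDs, parent references, tags, logs), but the three fields above are the ones named in the definition.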


What’s in the Cloud today?

What’s in Cloud today? (Figure 1)
  • Applications (AWS Lambda functions) are instrumented with the Jaeger clients and forward their spans directly to the Jaeger collector
  • The Jaeger collector is deployed on an EC2 instance; it validates and indexes the spans and stores them in an Amazon Managed Streaming for Apache Kafka (MSK) cluster
  • The Jaeger ingester runs on an EC2 instance, reads the spans from the MSK cluster, and writes them to Elasticsearch so that they can be viewed on the Jaeger UI

The Requirement

Given the requirements, the initial plan was to send the traces directly from agents (on OpenShift cluster) to the collector (on AWS).

Sounded pretty straightforward at first glance, but a wrench was thrown in when I realized that the data transfer between the On-Prem applications and AWS had to be secure. I also found that there wasn’t enough bandwidth to send real-time span data from the Jaeger agent (on the OpenShift cluster) to the Jaeger collector (on AWS). If the spans back up, the agents drop them and the whole purpose is defeated. Jaeger supports gRPC with TLS between the agent and the collector, which takes care of security, but bandwidth remained the primary concern.
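The bandwidth problem can be illustrated with a toy model: a bounded in-memory queue that receives spans faster than the link can drain them. This is a sketch of the general effect only, not how the Jaeger agent is implemented; the function name, the rates, and the capacity are made-up numbers.

```python
def simulate_backlog(spans_in_per_tick, spans_out_per_tick, queue_capacity, ticks):
    """Toy model of a bounded span queue in front of a slow link.

    Returns (delivered, dropped, still_queued) after `ticks` time steps.
    """
    queued = delivered = dropped = 0
    for _ in range(ticks):
        # New spans arrive; whatever does not fit in the queue is dropped.
        accepted = min(spans_in_per_tick, queue_capacity - queued)
        dropped += spans_in_per_tick - accepted
        queued += accepted
        # The link drains only as much as the bandwidth allows.
        sent = min(spans_out_per_tick, queued)
        queued -= sent
        delivered += sent
    return delivered, dropped, queued


# Hypothetical numbers: 100 spans/tick arriving, 60 spans/tick of bandwidth.
print(simulate_backlog(100, 60, queue_capacity=200, ticks=10))
# With enough bandwidth, nothing is dropped.
print(simulate_backlog(50, 60, queue_capacity=200, ticks=10))
```

Once the queue is full, every tick loses the difference between the arrival rate and the drain rate, which is why a durable buffer close to the producers matters more here than encryption on the wire.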


Whiteboard Session

  • On-Prem Data retention in case of connectivity issues or data queuing due to bandwidth limitations
  • Network bandwidth limitation between the On-Prem OpenShift cluster and AWS

After hours of brainstorming, I came up with the following plan:

Whiteboard Sketch (Figure 2)
  • Ensure version compatibility between Jaeger, Kafka, and MirrorMaker
  • Install the Jaeger components (only the collector and agent) using its OpenShift Operator
  • Leverage the self-provision option in the Jaeger OpenShift Operator to auto-install the Kafka cluster (ZooKeeper, Kafka, and MirrorMaker) by using the Strimzi Kafka Kubernetes Operator
  • Use the MirrorMaker Kubernetes object provided by the Strimzi Kafka Kubernetes Operator to replicate the OpenShift Kafka cluster events to the AWS MSK cluster

Note: The Jaeger collector and agent are not designed to hold the load when spans back up, but a Kafka cluster can be used as a streaming service between the Jaeger collector and the backend storage (DB) to offload the span data.

Note: To leverage the self-provisioning Kafka cluster option in Jaeger, the Strimzi Operator must be deployed in the OpenShift cluster before the Jaeger OpenShift Operator.


R & D

Trial & Error

  • Four components of Jaeger (agent, collector, ingester, and query)
  • A Kafka cluster using the Strimzi Operator
  • Backend storage (Elasticsearch)
Trial & Error (Figure 3)

As per the design (Figure 2), I only needed the Jaeger agent, the collector, and a Kafka cluster, but I realized that the Jaeger OpenShift Operator has no option to enable or disable the backend storage (such as Cassandra or Elasticsearch), the ingester, or the query component (Figure 3).

Alteration

Altered Sketch (Figure 4)
  • The collector has Kubernetes Deployment, ConfigMap, and Service YAML files
  • The agent has a Kubernetes DaemonSet YAML file, as it needs to run on every node in the OpenShift cluster, and a Service YAML file

For the Kafka cluster, I used the Strimzi Kafka Kubernetes Operator to deploy a simple Kafka cluster and a Kafka topic.

Before deploying the Jaeger agent and collector to the OpenShift cluster using the raw Kubernetes YAML files, I set the backend storage type to Kafka and added the Kafka broker and Kafka topic information in the Jaeger collector’s Kubernetes Deployment YAML file.
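As a rough sketch of what that collector configuration can look like, the snippet below builds a Deployment manifest as a Python dictionary and prints it as JSON, which `oc apply -f` accepts just like YAML. It assumes the settings are passed as the environment variables `SPAN_STORAGE_TYPE`, `KAFKA_PRODUCER_BROKERS`, and `KAFKA_PRODUCER_TOPIC` (the environment form of the collector's `--kafka.producer.*` flags); the original files may have used command-line arguments instead. The broker address, topic name, labels, and image tag are hypothetical.

```python
import json


def collector_deployment(brokers, topic, image="jaegertracing/jaeger-collector"):
    """Build a minimal Jaeger collector Deployment that writes spans to Kafka."""
    env = [
        # Tell the collector to use Kafka as its span storage.
        {"name": "SPAN_STORAGE_TYPE", "value": "kafka"},
        {"name": "KAFKA_PRODUCER_BROKERS", "value": ",".join(brokers)},
        {"name": "KAFKA_PRODUCER_TOPIC", "value": topic},
    ]
    labels = {"app": "jaeger-collector"}  # hypothetical label
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "jaeger-collector"},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": "jaeger-collector", "image": image, "env": env}
                    ]
                },
            },
        },
    }


# Hypothetical broker service and topic names.
manifest = collector_deployment(["my-cluster-kafka-bootstrap:9092"], "jaeger-spans")
print(json.dumps(manifest, indent=2))
```

A real manifest would also pin the image version and declare the collector's ports; those are left out to keep the Kafka-related settings visible.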

Hello World Spans

Hello World!! (Figure 5)

On the Jaeger side of the house, the agent was able to listen for and batch the spans, and the collector was receiving the spans from the agent. On the Kafka side of the house, the spans were actively streamed into the Kafka topic.

At this point, I confirmed that Jaeger successfully communicated with Kafka and forwarded the spans to it.

Replication

I configured the MirrorMaker Kubernetes YAML file with the consumer (OpenShift Kafka) and producer (AWS MSK) cluster information and deployed it to the OpenShift cluster (Figure 5).
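For illustration, a MirrorMaker object of that kind could be generated as shown below. This is a sketch that assumes the Strimzi `KafkaMirrorMaker` resource with `consumer`, `producer`, and `whitelist` fields; the `apiVersion` and the name of the topic filter field (`whitelist` versus `include`) depend on the Strimzi version, so check them against the installed operator. The bootstrap addresses, group ID, and topic name are hypothetical, and the TLS settings needed for a secure connection to MSK are omitted.

```python
import json


def mirror_maker_manifest(source_bootstrap, target_bootstrap, topic_pattern,
                          group_id="jaeger-mirror-maker"):
    """Build a minimal KafkaMirrorMaker object: consume on-prem, produce to AWS."""
    return {
        "apiVersion": "kafka.strimzi.io/v1beta1",  # assumption: version-dependent
        "kind": "KafkaMirrorMaker",
        "metadata": {"name": "jaeger-mirror-maker"},
        "spec": {
            "replicas": 1,
            # Source: the Kafka cluster running on OpenShift.
            "consumer": {"bootstrapServers": source_bootstrap, "groupId": group_id},
            # Target: the AWS MSK cluster.
            "producer": {"bootstrapServers": target_bootstrap},
            # Topics to replicate (newer Strimzi versions call this field "include").
            "whitelist": topic_pattern,
        },
    }


# Hypothetical endpoints; the MSK address is a placeholder, not a real broker.
mm = mirror_maker_manifest(
    source_bootstrap="my-cluster-kafka-bootstrap:9092",
    target_bootstrap="msk-broker.example.com:9094",
    topic_pattern="jaeger-spans",
)
print(json.dumps(mm, indent=2))
```

Because MirrorMaker is a consumer on the on-prem side, spans stay in the OpenShift Kafka topic until they have been replicated, which is what covers the connectivity and bandwidth concerns.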

Voilà, it worked! I was able to read the sample Python application’s spans in the events of the AWS MSK topic.

The Ultimate Sketch

Conclusive Sketch (Figure 6)

The End

With the ultimate sketch, I wrap up this blog on distributed tracing in a hybrid cloud using Apache Kafka: a design that does not compromise on the On-Prem data retention and network bandwidth concerns.

I’m always up for a discussion; leave a comment below!!

Well, that’s it for now. See you again!

Happy Tracing 🚀🚀
