Distributed Tracing using CNCF Jaeger and OpenTelemetry for Airtel Billing Layer Microservices

Abhinav Korpal
Published in Airtel Digital
6 min read · Jul 25, 2024

Introduction

At Airtel, we used the CNCF observability integration of Jaeger and OpenTelemetry to achieve observability in the Airtel Billing Layer for B2C, with hundreds of microservices spread across on-premises DC regions for high availability.

We use OpenTelemetry with a Jaeger backend for tracing and the OpenTelemetry Collector for centralized data collection. With 200+ microservices and 1,000 Pods in OpenShift, this instrumented monitoring with both the OpenTelemetry Collector and Jaeger lets us identify performance bottlenecks and errors.

The goal is to capture the most relevant data for troubleshooting billing issues while minimizing overhead. The OpenTelemetry and Jaeger integration effectively handles the high volume and distributed nature of our billing microservices while maintaining high availability for critical tracing data across multiple on-premises DC zones.

Note: screenshots are masked for privacy.

Background

The main agenda here is to implement observability for multiple microservice APIs. By following these steps, you can leverage OpenTelemetry for vendor-agnostic observability in your microservices and gain valuable insights into their performance with Jaeger. This approach allows for a more robust and scalable monitoring solution with OpenTelemetry and Jaeger.

Who should read this document?

Monitoring / TechOps teams. By implementing this, you’ll leverage the OpenTelemetry Collector to efficiently collect telemetry data from your microservices and export it to Jaeger for centralized tracing and monitoring.

Architecture and Deployment Modes

To implement this, we run the OpenTelemetry (OTel) agent inside each microservice; the agent exports all traces/metrics to the OpenTelemetry Collector, which in turn forwards them to the CNCF Jaeger backend.

The OpenTelemetry Collector has three components: a receiver, a processor, and an exporter. Here the receiver is the OTel HTTP (OTLP) receiver, the processor is the span processor, and the exporter is the Jaeger exporter.

Finally, as soon as Jaeger has the results, you can view and analyze the traces and error/success rates for the microservice APIs.

Developers can leverage a custom OpenTelemetry (OTel) agent within their microservices to create custom metrics and traces for application monitoring. This allows them to capture specific data points beyond the default metrics provided by OpenTelemetry libraries.
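
For example, here is a minimal sketch of creating a custom span and a custom counter metric with the OpenTelemetry Java API; the class name, tracer/meter names, and attribute keys are illustrative rather than taken from our actual codebase:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class InvoiceService {

    // Tracer and meter scoped to this (hypothetical) billing component
    private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("billing-invoice-service");
    private static final LongCounter INVOICES_PROCESSED = GlobalOpenTelemetry
            .getMeter("billing-invoice-service")
            .counterBuilder("invoices.processed")
            .setDescription("Number of invoices processed")
            .setUnit("1")
            .build();

    public void processInvoice(String invoiceId) {
        // Custom span around a business operation, beyond what auto-instrumentation records
        Span span = TRACER.spanBuilder("process-invoice").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("billing.invoice.id", invoiceId);   // custom tag visible in Jaeger
            // ... business logic ...
            INVOICES_PROCESSED.add(1, Attributes.of(AttributeKey.stringKey("outcome"), "success"));
        } finally {
            span.end();
        }
    }
}
```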

Instrumentation enables tracing requests across multiple data centers. You can visualize the complete flow of a request, pinpointing where latency is introduced. This helps identify performance bottlenecks and latency issues specific to each DC.

Traces from our billing microservices empower us to analyze performance, identify errors, and gain valuable insights into the system’s health. You can calculate error rates for individual microservices or API endpoints from the successful and failed spans within traces. High error rates for a specific API might indicate a need for code review or infrastructure adjustments.
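
As an illustrative sketch (the class and operation names are hypothetical), a span can be marked with an error status when a call fails, so Jaeger counts it toward the error rate for that endpoint:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class PaymentClient {

    private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("billing-payment-client");

    public void charge(String accountId) {
        Span span = TRACER.spanBuilder("charge-account").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            callPaymentGateway(accountId);              // hypothetical downstream call
            span.setStatus(StatusCode.OK);              // counted as a successful span
        } catch (RuntimeException e) {
            span.recordException(e);                    // attaches the exception details to the span
            span.setStatus(StatusCode.ERROR, "payment gateway call failed"); // counted toward the error rate
            throw e;
        } finally {
            span.end();
        }
    }

    private void callPaymentGateway(String accountId) { /* ... */ }
}
```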

Receiver: This allows your microservices with OpenTelemetry instrumentation to easily send traces and metrics to the collector.

Processor: Processors are typically used for data manipulation or transformation before exporting.

Exporter: It sends the collected telemetry data (traces and metrics) to your Jaeger backend for storage and visualization.
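
For context, this is roughly how a service can be wired to send spans to the Collector’s OTLP receiver using the OpenTelemetry Java SDK; in practice the auto-instrumentation agent handles this through configuration, and the endpoint and service name below are placeholders:

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class TelemetryBootstrap {

    public static OpenTelemetrySdk init() {
        // Identifies this service in Jaeger; the name is a placeholder
        Resource resource = Resource.getDefault().merge(
                Resource.create(Attributes.of(AttributeKey.stringKey("service.name"), "billing-api")));

        // Exporter pointing at the OpenTelemetry Collector's OTLP receiver (placeholder endpoint)
        OtlpGrpcSpanExporter otlpExporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://otel-collector.observability.svc:4317")
                .build();

        // Batch span processor buffers spans before handing them to the exporter
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .setResource(resource)
                .addSpanProcessor(BatchSpanProcessor.builder(otlpExporter).build())
                .build();

        return OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .buildAndRegisterGlobal();
    }
}
```

The Collector then applies its processors and uses its Jaeger exporter to forward the data, as described above.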

Data flow

Jaeger offers dashboards that provide an overall view of your microservices’ health. You can see latency distributions, error rates, and throughput for each service.

In the Jaeger view, we can see the number of spans in a single API call, the traces, the success/error rate, and the latency. With this request-level view, tracing helps us analyze the different behaviors of the system.

You can see that the total duration of the trace was 36.52 ms, and you can also drill into the spans of the longest request.

Tracing Instrumentation and Interactions

Span

The trace timeline view combines services and the calls between microservices into a single picture: a Gantt chart showing spans, the units of work within individual services, on a horizontal timeline. The top-level span is called the root span, and the timeline view shows a trace as a time sequence of nested spans. A span’s objective is to describe the operation it represents.
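
A minimal sketch of how such nesting arises in code, assuming the OpenTelemetry Java API and illustrative operation names: the child span started while the root span is current appears as a nested span in the timeline view.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class BillRunExample {

    private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("billing-bill-run");

    public void generateBill(String accountId) {
        // Root span: the top-level unit of work shown at the top of the Gantt chart
        Span root = TRACER.spanBuilder("generate-bill").startSpan();
        try (Scope ignored = root.makeCurrent()) {
            rateUsage(accountId);
        } finally {
            root.end();
        }
    }

    private void rateUsage(String accountId) {
        // Child span: picks up the current (root) span as its parent and nests under it
        Span child = TRACER.spanBuilder("rate-usage").startSpan();
        try (Scope ignored = child.makeCurrent()) {
            // ... rating logic ...
        } finally {
            child.end();
        }
    }
}
```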

Jaeger provides a visual interface to explore individual traces. You can see the flow of requests across your microservices, identify bottlenecks, and pinpoint potential issues.

Logs are recorded by the tracing system, and the log statements explain the nature of the error. All of these details proved very helpful and provided very useful information during the debugging process with the Dev Team.
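
As a hedged example (the event name and attributes are illustrative), such log statements can be attached to the active span as span events, which then appear in the span’s Logs section in the Jaeger UI:

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;

public class RetryLogger {

    public void onRetry(int attempt, String reason) {
        // Attach a structured log (span event) to whatever span is currently active
        Span.current().addEvent("retrying-downstream-call",
                Attributes.of(
                        AttributeKey.longKey("retry.attempt"), (long) attempt,
                        AttributeKey.stringKey("retry.reason"), reason));
    }
}
```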

If we expand the spans for a service, we can inspect every span in the trace and see the data passed, with multiple parameters visible. Tags on the microservice spans are captured by auto-instrumentation from the OpenTracing Java Spring library.
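
A short sketch of enriching the auto-instrumented span with additional business tags; the attribute keys used here are hypothetical:

```java
import io.opentelemetry.api.trace.Span;

public class BillingAttributes {

    public void tagCurrentSpan(String circle, String planType) {
        // Enrich the span created by auto-instrumentation with business-specific tags;
        // they show up alongside the auto-captured parameters when the span is expanded
        Span current = Span.current();
        current.setAttribute("billing.circle", circle);
        current.setAttribute("billing.plan.type", planType);
    }
}
```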

Latency

Requests from the microservices are made in parallel, more than four at a time, and one request ends in between the start of another; such patterns indicate some form of contention. Use correlation features to see whether concurrent requests are causing contention: you can identify if multiple traces compete for the same resource simultaneously. Pay close attention to span durations within your traces; bottlenecks or elongated spans in specific services can be signs of contention.

Monitor

We can observe the network latency between 2 consecutive API calls of a single application span.

We can monitor applications under the various span kinds, Client, Server, Internal, Producer, and Consumer, with graphs and charts for P95 latency, request rate, and error rate, and combine these metrics with trace data for a holistic view of application performance.
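
For reference, the span kind can also be set explicitly when creating spans manually; this sketch (with illustrative names) marks a message-publishing operation as a Producer span so that it lands in the right span-kind bucket:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;

public class NotificationPublisher {

    private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("billing-notifications");

    public void publishBillReadyEvent(String accountId) {
        // PRODUCER span kind: this service publishes a message consumed elsewhere,
        // so it is grouped under "Producer" in the span-kind breakdown
        Span span = TRACER.spanBuilder("publish-bill-ready")
                .setSpanKind(SpanKind.PRODUCER)
                .startSpan();
        try {
            // ... publish to the message broker (hypothetical) ...
        } finally {
            span.end();
        }
    }
}
```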

Kibana

We can observe the process and operation details in the Kibana UI. Integrate Jaeger with your logging and monitoring systems for a holistic view of system health, and analyze traces alongside logs and metrics to gain deeper insights into issues.

The Team Building Great Things Together

We focused on expanding our use case footprint. This wouldn’t be possible without the hard work and great contributions of the team building impactful systems that move our business forward, delivering DevOps Engineering solutions.

Special Thanks to the members of the teams: Ribhu Shadwal ( VP ), and Gaurav Bhatnagar ( EM ).

Special Thanks to stunning Dev colleagues for direct collaboration: Gaurav Walecha and Arzaw.
