Behind the Scenes: A Deep Dive into Distributed Tracing with Grafana, Tempo, and Jaeger

Rasheedat Atinuke Jamiu
HostSpace Cloud Solutions
May 15, 2024

Introduction

Modern applications are increasingly built using microservices architectures. Unlike monolithic applications, which are developed as a single, self-contained unit, microservices break the application down into smaller, independent services. Each service has a specific functionality and often operates under the control of a dedicated team. This modular approach offers many benefits, such as faster development cycles, easier deployments, and improved fault tolerance. However, that same modularity presents a significant challenge: debugging.

Imagine a complex web of interconnected components, each potentially playing a role in a user’s request. When something goes wrong, pinpointing the root cause becomes a challenge. The traditional approach of examining a single codebase no longer applies. Instead, developers must navigate a distributed backend, where requests often involve sequences of calls across multiple services. This distributed nature makes it difficult to track the flow of a request and identify which service is causing the issue.

To tackle this challenge, developers are turning to distributed tracing. Distributed tracing acts like a detective for your application. It tracks each user request as it journeys from the user’s device, through various backend services, and ultimately to the databases that store information. This detailed map of the request’s path empowers developers to pinpoint problems that cause slowdowns (high latency) or errors within the application.

Core Components of Distributed Tracing

The components of a distributed tracing system may vary by implementation, but a typical system is made up of the following:
1. Trace

Imagine a single user request traveling through a complex network of services in a distributed system, like a containerized application or a microservices architecture. A trace captures this entire journey, providing a detailed picture of where the request goes. This detailed record empowers developers to analyze system performance, pinpoint bottlenecks, and troubleshoot any problems that arise when services interact.

2. Span

A span represents a specific task or action completed by a single service. Each span captures how long the task took and includes details about the operation itself. By analyzing these individual spans, developers can gain insights into the performance and behavior of each service within the entire system.
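
For intuition, here is a minimal sketch of how one trace for a single request breaks down into parent and child spans (all IDs, service names, and operation names below are hypothetical):

trace_id: abc123                 # one user request, end to end
spans:
  - span_id: a1                  # root span, created by the entry-point service
    service: frontend
    operation: GET /dispatch
    duration_ms: 640
  - span_id: b2
    parent_span_id: a1           # child span: work done on behalf of the root span
    service: customer
    operation: SELECT customer
    duration_ms: 310
  - span_id: c3
    parent_span_id: a1
    service: driver
    operation: FindNearestDrivers
    duration_ms: 220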

3. Instrumentation libraries

To collect the data that makes tracing possible, developers rely on special software components called instrumentation libraries. These libraries are embedded within the applications themselves. They're specifically designed to generate, collect, and manage the trace data, which includes details about individual operations (spans) and the overall request journey (the trace).

Think of them as tiny data collectors. Instrumentation libraries automatically capture the start and end times of each operation, along with any relevant details (metadata). They also play a crucial role in ensuring that the tracing context, which acts like a unique fingerprint for the request, travels seamlessly across different services in the distributed system. This allows developers to follow the entire request path and pinpoint any issues that might arise.

Popular examples of these instrumentation libraries include OpenTelemetry, Jaeger, Zipkin, New Relic, and Datadog.

4. Data collectors

Once the services (applications) are instrumented, these collectors receive and store the trace data. Some popular trace data collectors you may come across are the OpenTelemetry Collector, Jaeger, Zipkin, and Grafana Agent. These tools ensure the collected data is readily available for analysis and visualization.
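
As a sketch, an OpenTelemetry Collector pipeline that receives spans over OTLP and forwards them to a tracing backend could look like the block below (the Tempo endpoint shown is only a placeholder for your own backend address):

receivers:
  otlp:
    protocols:
      grpc:                                        # accept spans over OTLP/gRPC (default port 4317)
exporters:
  otlp:
    endpoint: tempo.tempo.svc.cluster.local:4317   # placeholder backend address
    tls:
      insecure: true                               # plaintext inside the cluster
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]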

Tempo: A Scalable Tracing Backend

Grafana Tempo is a user-friendly, open-source, highly scalable distributed tracing backend that allows you not only to search for specific traces but also to generate metrics from individual spans within those traces. This provides a deeper level of analysis, allowing you to correlate tracing data with logs and metrics for a holistic view of your system's performance.

Tempo integrates seamlessly with other popular open-source tools like Grafana (for visualization), Mimir (for long-term metrics storage), Prometheus (for monitoring), and Loki (for log aggregation). Tempo also accepts the major open-source tracing protocols, including Jaeger, Zipkin, and OTLP, offering flexibility for your existing infrastructure.
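
As an illustration, the protocols Tempo listens for are enabled through its receiver configuration. A sketch of this as Helm values for the tempo chart might look like the block below (exact keys and defaults vary by chart version), covering the Jaeger, Zipkin, and OTLP receivers:

tempo:
  receivers:
    jaeger:
      protocols:
        thrift_compact:
          endpoint: 0.0.0.0:6831   # UDP port used by the Jaeger client settings later in this post
    zipkin:
      endpoint: 0.0.0.0:9411
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318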

Set up Grafana Tempo on Kubernetes

Grafana Tempo has two deployment modes: monolithic and microservices. Depending on your tracing needs, you can use either of them. For this demo, we will use the monolithic mode; check out the Tempo documentation to decide which mode suits your needs.

We will install Tempo using the Tempo Helm chart available in the grafana/helm-charts repository.

Add the following repo to use the chart:

helm repo add grafana https://grafana.github.io/helm-charts

Run the command below to install the chart with the release name tempo into a dedicated tempo namespace (the rest of this walkthrough assumes that namespace):

helm install tempo grafana/tempo --namespace tempo --create-namespace
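
You can then confirm that the Tempo pod is running (assuming the tempo namespace used above):

kubectl get pods -n tempo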

You can also deploy Tempo using the ArgoCD Application manifest below:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tempo
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  destination:
    namespace: tempo
    server: https://kubernetes.default.svc
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: tempo
    targetRevision: 1.7.2
    helm:
      values: ""
  project: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
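
After applying the manifest (the filename below is just an example) or pushing it through your GitOps flow, you can confirm that the Application synced and that Tempo is up:

kubectl apply -f tempo-application.yaml
kubectl get application tempo -n argocd
kubectl get pods -n tempo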

Enabling and visualizing traces with Grafana Tempo

We will be using Jaeger’s example Hot R.O.D application for this demo; the application has already been instrumented to send traces. You can check how to enable OpenTracing by navigating through the repo.

We will use the hotrod Docker image to create our app manifest:


apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: hotrod-app
  namespace: argocd
spec:
  destination:
    namespace: hotrod-app
    server: https://kubernetes.default.svc
  project: default
  sources:
    - repoURL: https://github.com/your-repo/staging-gitops.git
      targetRevision: main
      ref: values
    - repoURL: https://github.com/your-repo/my-Helm-Chart.git
      targetRevision: main
      path: ./
      helm:
        valueFiles:
          - $values/manifests/values.yaml
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

We set the following environment variables in our HotROD app to enable Jaeger tracing:

image:
  repository: jaegertracing/example-hotrod
  tag: 1.41.0
  pullPolicy: Always

ConfigMap:
  enabled: true
  Data:
    JAEGER_AGENT_HOST: tempo.tempo.svc.cluster.local
    JAEGER_AGENT_PORT: '6831'
    JAEGER_SAMPLER_TYPE: const
    JAEGER_SAMPLER_PARAM: '1'

This section explains how to configure tracing for your application using Jaeger. Here are the key settings:

  • JAEGER_AGENT_HOST: The host that trace data is sent to. Here it points at the Tempo service, which accepts the Jaeger protocol.
  • JAEGER_AGENT_PORT: This defines the port used to communicate with the Jaeger agent (6831 by default).
  • JAEGER_SAMPLER_TYPE: This setting determines how often tracing data is collected. You can choose from four options:
    a. remote: sampling decisions are made by a remote sampler.
    b. const: every trace gets the same decision (with a parameter of 1, everything is sampled, which is useful for debugging).
    c. probabilistic: traces are sampled with a specific probability (between 0 and 1).
    d. rate-limiting: samples a certain number of traces per second.
  • JAEGER_SAMPLER_PARAM: This parameter works together with the sampler type (see the example after this list).
    For const, 0 samples no traces and 1 samples every trace.
    For probabilistic, it sets the sampling probability (0 for none, 1 for all).
    For rate-limiting, it sets the maximum number of traces sampled per second.
  • JAEGER_TAGS: This allows you to define custom key-value pairs (tags) that will be attached to all tracing data.
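
For instance, to keep roughly 10% of traces instead of all of them, you could swap the two sampler values shown earlier for something like this (the numbers are illustrative):

JAEGER_SAMPLER_TYPE: probabilistic
JAEGER_SAMPLER_PARAM: '0.1'   # sample about 1 in 10 requests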

Note: Applying these configurations requires a running Kubernetes cluster.

With this configuration, you can now deploy your application and start sending some requests to generate traces.
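
If you prefer the command line to clicking around the HotROD UI, you can port-forward the frontend and hit its dispatch endpoint directly (the service name and namespace below are assumptions based on the manifest above):

kubectl port-forward svc/hotrod-app -n hotrod-app 8080:8080
curl "http://localhost:8080/dispatch?customer=123"

Each request produces a new trace that is shipped to Tempo.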

Visualizing Traces

With your application instrumented to send tracing data, the next step is to visualize those traces for analysis. To achieve this, ensure you have both Grafana and Loki installed in your Kubernetes cluster. We will install Loki and Grafana using Helm (make sure you have Helm installed).

Add the Helm repo (if you haven't already) using the command below:

helm repo add grafana https://grafana.github.io/helm-charts

Install the charts

helm upgrade --install grafana grafana/grafana
helm upgrade --install loki grafana/loki-stack
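
The Grafana chart generates an admin password on install; assuming the release name grafana in the current namespace, you can retrieve it with:

kubectl get secret grafana -o jsonpath="{.data.admin-password}" | base64 --decode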

Next, we need to add the Tempo and Loki data sources to Grafana. You can do this manually in the UI or declaratively in code, i.e., in our Grafana values.yaml:

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: loki
        type: loki
        access: proxy
        url: http://loki-gateway.monitoring.svc.cluster.local
        jsonData:
          derivedFields:
            - datasourceUid: tempo
              matcherRegex: '"trace_id": "(\w+)"'
              name: TraceID
              url: $${__value.raw}
      - name: Tempo
        uid: tempo
        type: tempo
        access: proxy
        url: http://tempo.tempo.svc.cluster.local:3100

Upgrade the Grafana helm chart with the new values to reflect the changes.

helm upgrade --install grafana grafana/grafana --values=values.yaml
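
To open the Grafana UI locally, port-forward the service (the chart exposes it on port 80 by default) and log in at http://localhost:3000 with the admin password retrieved earlier:

kubectl port-forward svc/grafana 3000:80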

Switching Between Logs and Traces in Grafana

Trigger a few requests from the HotROD UI, then explore the corresponding data in Grafana by following the steps below.

  • Navigate to Explore: Within Grafana, head to the "Explore" section.
  • Select the Loki data source: Choose "Loki" as your data source for log exploration.
  • Filter by application logs: Apply the filter {app="hotrod"} to focus on logs related to your HotROD application.

Once you’ve applied the filter, you’ll see two key features:

  • Derived Field — TraceID: A derived field named “TraceID” will appear. This field automatically generates internal links. Clicking this link will seamlessly take you to the corresponding trace data within Tempo.
  • TraceID Label: You’ll also find an additional label named “traceID.” Clicking this label directly opens the associated trace data in Tempo, displaying all traces related to that specific ID.

Conclusion

This functionality simplifies switching between log messages and their corresponding traces within Grafana. You can easily navigate between the two data sources for a more comprehensive understanding of your application’s behavior. Additionally, the automatically generated links and labels eliminate the need for manual parsing of log messages to identify relevant trace IDs.
