Jaeger Tracing: A Friendly Guide for Beginners

Team Aspecto
Published in JaegerTracing · 9 min read · Feb 27, 2022

Written by @thetomzach @ Aspecto.

In this guide, you’ll learn what Jaeger tracing is, what distributed tracing is, and how to set it up in your system. We’ll go over Jaeger’s UI and touch on advanced concepts such as sampling and deploying in production.

You’ll leave this guide knowing how to create spans with OpenTelemetry and send them to Jaeger tracing for visualization. All that, from scratch.

What is Distributed Tracing? Introduction

Before we dive into explaining everything you need to know from 0 to 100 about Jaeger tracing, it’s important to understand the umbrella term that Jaeger is part of — distributed tracing.

In the world of microservices, most issues stem from networking and from the relationships between the different microservices. A distributed architecture (as opposed to a monolith) makes it a lot harder to get to the root of an issue. To resolve these issues, we need to see which service sent what parameters to another service or component (a DB, queue, etc.).

Distributed tracing helps us achieve just that by collecting data from the different parts of our system, which gives us the observability we need. You can think of it as ‘call-stacks’ for distributed services. In addition, traces are a visual tool: they let us visualize our system to better understand the relationships between services, making it easier to investigate and pinpoint issues.

What is Jaeger Tracing?

Now that you know what distributed tracing is, we can safely talk about Jaeger. Jaeger is an open-source distributed tracing platform created by Uber back in 2015. It consists of instrumentation SDKs, a backend for data collection and storage, a UI for visualizing the data, and a Spark/Flink framework for aggregate trace analysis.

The Jaeger data model is compatible with OpenTracing — which is a specification that defines how the collected tracing data would look, as well as libraries of implementations in different languages (more on OpenTracing and OpenTelemetry later).

Like most other distributed tracing systems, Jaeger works with spans and traces, as defined in the OpenTracing specification.

A span represents a unit of work in an application (an HTTP request, a call to a DB, etc.) and is Jaeger’s most basic unit of work. A span must have an operation name, a start time, and a duration.

A trace is a collection/list of spans connected in a parent/child relationship (and can also be thought of as a directed acyclic graph of spans). Traces specify how requests are propagated through our services and other components.
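
To make the span/trace relationship concrete, here is a minimal sketch in plain Python. It is not Jaeger's or OpenTelemetry's data model: the `Span` class, the `span_id`/`parent_id` fields, and the sample operation names are assumptions made up for illustration. It only shows how spans with the three mandatory fields link into a parent/child tree.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """A toy span: operation name, start time, duration, plus tree links."""
    operation_name: str
    start_time_ms: int
    duration_ms: int
    span_id: str                     # hypothetical identifier for this span
    parent_id: Optional[str] = None  # None marks the root span of the trace


def render(spans: List[Span]) -> List[str]:
    """Return one indented line per span, parents before their children."""
    # Group spans by parent id so the trace can be walked top-down.
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        siblings = sorted(children.get(parent_id, []), key=lambda s: s.start_time_ms)
        for span in siblings:
            lines.append(f"{'  ' * depth}{span.operation_name} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# A three-span trace: an API call that triggers a lookup and a DB query.
trace = [
    Span("GET /users/42", 0, 120, span_id="a"),
    Span("user-service.lookup", 10, 90, span_id="b", parent_id="a"),
    Span("SELECT users-db", 20, 60, span_id="c", parent_id="b"),
]
print("\n".join(render(trace)))
```

The indented output is roughly what a trace timeline shows: each child sits under the span that caused it.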

Jaeger Tracing Architecture

Here’s what Jaeger architecture looks like

It consists of a few parts, all of which I explain below:

  • Instrumentation SDKs: libraries that are integrated into applications and frameworks to capture tracing data. Historically, the Jaeger project supported its own client libraries written in various programming languages. They are now being deprecated in favor of OpenTelemetry (again, more on that later).
  • Jaeger Agent: Jaeger agent is a network daemon that listens for spans received from the Jaeger client over UDP. It gathers batches of them and then sends them together to the collector. The agent is not required if the SDKs are configured to send the spans directly to the collector.
  • Jaeger Collector: The Jaeger collector is responsible for receiving traces from the Jaeger agent, performing validations and transformations, and saving them to the selected storage backends.
  • Storage Backends: Jaeger supports various storage backends to store the spans. Supported storage backends are In-Memory, Cassandra, Elasticsearch, and Badger (for single-instance collector deployments).
  • Jaeger Query: This is a service responsible for retrieving traces from the Jaeger storage backend and making them accessible for the Jaeger UI.
  • Jaeger UI: a React application that lets you visualize the traces and analyze them. Useful for debugging system issues.
  • Ingester: The ingester is relevant only if we use Kafka as a buffer between the collector and the storage backend. It is responsible for receiving data from Kafka and ingesting it into the storage backend. More info can be found in the official Jaeger Tracing docs.
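
The flow between these components can be sketched as a toy, in-memory pipeline. This is only an illustration of the roles described above (agent batches, collector validates and stores, query reads back); every class, field name, and the batch size are hypothetical and none of it reflects Jaeger's actual code or wire protocol.

```python
from typing import Dict, List


class InMemoryStorage:
    """Stand-in for a storage backend: spans grouped by trace id."""

    def __init__(self) -> None:
        self.traces: Dict[str, List[dict]] = {}

    def save(self, span: dict) -> None:
        self.traces.setdefault(span["trace_id"], []).append(span)


class Collector:
    """Validates incoming spans and writes the valid ones to storage."""

    REQUIRED = ("trace_id", "operation_name", "start_time", "duration")

    def __init__(self, storage: InMemoryStorage) -> None:
        self.storage = storage

    def receive(self, batch: List[dict]) -> int:
        accepted = 0
        for span in batch:
            if all(key in span for key in self.REQUIRED):
                self.storage.save(span)
                accepted += 1
        return accepted


class Agent:
    """Buffers spans and forwards them to the collector in batches."""

    def __init__(self, collector: Collector, batch_size: int = 2) -> None:
        self.collector = collector
        self.batch_size = batch_size
        self.buffer: List[dict] = []

    def report(self, span: dict) -> None:
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.collector.receive(self.buffer)
            self.buffer = []


def query(storage: InMemoryStorage, trace_id: str) -> List[dict]:
    """Stand-in for the query service: read a trace back for the UI."""
    return list(storage.traces.get(trace_id, []))


storage = InMemoryStorage()
agent = Agent(Collector(storage))
agent.report({"trace_id": "t1", "operation_name": "GET /checkout", "start_time": 0, "duration": 50})
agent.report({"trace_id": "t1", "operation_name": "charge-card", "start_time": 5, "duration": 30})
agent.report({"trace_id": "t1", "operation_name": "no-duration", "start_time": 9})  # fails validation
agent.flush()
print([span["operation_name"] for span in query(storage, "t1")])
```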

Running Jaeger locally using Docker

Jaeger comes with a ready-to-use all-in-one Docker image that contains all the components necessary for Jaeger to run.

It’s really simple to get it up and running on your local machine:

docker run -d --name jaeger \
-e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
-p 5775:5775/udp \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 14250:14250 \
-p 14268:14268 \
-p 14269:14269 \
-p 9411:9411 \
jaegertracing/all-in-one:1.30

Then you can simply open the jaeger UI on http://localhost:16686.
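
If the page does not load, a quick way to tell whether the container is listening is to try a TCP connection to the UI port. The helper below is a small sketch using only the standard library; `is_port_open` is a name made up for this example, and it assumes the port mapping from the command above (16686 for the UI). It only covers TCP ports, so it says nothing about the UDP ports.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


# 16686 is the UI port published by the docker command above.
print("Jaeger UI reachable:", is_port_open("localhost", 16686))
```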

Jaeger Tracing and OpenTelemetry

Yes, you’re right. I did mention before that Jaeger’s data model is compatible with the OpenTracing specification. You may already know that OpenTracing and OpenCensus have merged to form OpenTelemetry, and you may be wondering why Jaeger uses OpenTracing and whether you can use OpenTelemetry to report to Jaeger instead.

As to why Jaeger uses OpenTracing: Jaeger existed before the above-mentioned merger.

To get a full understanding of OpenTelemetry, what it is, its components, and how you can use it, read this guide.

Deprecation of Jaeger Client in favor of OpenTelemetry Distro:

I also mentioned that the Jaeger clients are now being deprecated.

You can find more info about this deprecation here, but essentially the idea is that you should now use the OpenTelemetry SDK in the programming language of your choice, alongside a Jaeger exporter.

This way, the spans you create are converted to a format Jaeger knows how to work with, and pass all the way through to the Jaeger collector and then to the storage backend.

At the time of writing this, the OpenTelemetry collector is not considered a replacement for the Jaeger collector [1]. In the future, the Jaeger collector will be able to receive OTLP, the OpenTelemetry native data format.

If you can’t wait and want to try using the OpenTelemetry collector with Jaeger now — see this guide.

Jaeger Tracing Python Example

Here is a Python example of creating spans and sending them to Jaeger. Note that you could also use automatic instrumentations and still use the Jaeger exporter (assuming you’re running Jaeger locally like shown above):

# jaeger_tracing.py
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Register a tracer provider that tags every span with the service name.
trace.set_tracer_provider(
    TracerProvider(
        resource=Resource.create({SERVICE_NAME: "my-hello-service"})
    )
)

# Send spans over UDP to the Jaeger agent in the all-in-one container.
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

# Batch finished spans before handing them to the exporter.
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

# Create a root span with one child span nested inside it.
with tracer.start_as_current_span("rootSpan"):
    with tracer.start_as_current_span("childSpan"):
        print("Hello world!")

This is what they would look like in the Jaeger UI:

To learn hands-on how to use OpenTelemetry in Python from scratch, read this guide.

Jaeger Tracing UI Review

The Jaeger UI is a powerful tool for us to debug and understand our distributed services better.

Here’s what you need to know about it:

The search pane:

You can use the search pane to search for traces with specific properties: which service they come from, what operation was made, specific tags included in the trace (for example, the HTTP status code), how far back in time to look, and a limit on the number of results.

When you’re done defining your search in this pane, click on Find Traces.

The search results section:

In this example, I chose to query the jaeger-query service. I can see my traces on a timeline or as a list. Click on the desired trace to drill down into it.

The specific trace view:

When you find a specific trace where you think there might be an issue and click on it, you’d see a screen that looks like this:

Here you can find specific information about execution times, which calls were made, their durations, specific properties like the HTTP status code and route path (in the case of an HTTP call), and more.

Feel free to play around and investigate for yourself with your actual data.

Advanced Concepts: Sampling

Sampling is a complex topic by itself. In Jaeger, the sampling decisions are always made in the SDK, via head-based sampling.

Sampling Strategies in the SDK

The (now deprecated) Jaeger SDKs have four sampling modes:

  • Remote: the default. It tells the Jaeger SDK that the sampling strategy is controlled by the Jaeger backend.
  • Constant: either take all traces or take none, with nothing in between. Receives 1 for all and 0 for none.
  • Rate limiting: choose how many traces are sampled per second.
  • Probabilistic: choose the fraction of traces to sample. For example, choose 0.1 to sample 1 out of every 10 traces.
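
The three local modes can be sketched in a few lines of plain Python. These classes are illustrative only and are not the Jaeger SDK's implementation: the class names, the 64-bit trace id assumption, and the token-bucket approach for rate limiting are choices made for this example. Remote mode is left out because it simply fetches one of these strategies from the backend.

```python
import time
from typing import Callable


class ConstSampler:
    """Constant: sample everything (True) or nothing (False)."""

    def __init__(self, decision: bool) -> None:
        self.decision = decision

    def is_sampled(self, trace_id: int) -> bool:
        return self.decision


class ProbabilisticSampler:
    """Probabilistic: sample a fixed fraction, decided from the trace id."""

    MAX_ID = 2 ** 64  # assumes 64-bit trace ids for this sketch

    def __init__(self, rate: float) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.boundary = int(rate * self.MAX_ID)

    def is_sampled(self, trace_id: int) -> bool:
        # The same trace id always gets the same decision.
        return (trace_id % self.MAX_ID) < self.boundary


class RateLimitingSampler:
    """Rate limiting: a token bucket refilled at max_per_second."""

    def __init__(self, max_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = max_per_second
        self.capacity = max(max_per_second, 1.0)
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def is_sampled(self, trace_id: int) -> bool:
        now = self.clock()
        # Refill tokens for the time elapsed, up to the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


sampler = ProbabilisticSampler(0.1)
print(sampler.is_sampled(12345))
```

Because all of these decide at the start of a trace, they are head-based: the decision is made before anyone knows whether the trace will turn out to be interesting.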

Remote sampling

If we choose to enable remote sampling, the Jaeger collector becomes responsible for figuring out which sampling strategy an SDK in each service should be using. The operators have two ways to configure the collector: with a sampling strategies configuration file, or with adaptive sampling.

Configuration file: you give the collector a path to a file that contains the per-service and per-operation sampling configuration.
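
As a rough illustration, the sketch below embeds a small strategies document and resolves which strategy applies to a given service and operation. The key names (`service_strategies`, `operation_strategies`, `default_strategy`, `type`, `param`) follow my recollection of the format in the Jaeger sampling docs, so check them against the docs linked below. The service names are hypothetical, and `resolve` is a simplified lookup written for this example, not the collector's logic.

```python
import json

STRATEGIES_JSON = """
{
  "service_strategies": [
    {
      "service": "user-service",
      "type": "probabilistic",
      "param": 0.5,
      "operation_strategies": [
        {"operation": "GET /health", "type": "probabilistic", "param": 0.0}
      ]
    },
    {"service": "billing", "type": "ratelimiting", "param": 5}
  ],
  "default_strategy": {"type": "probabilistic", "param": 0.1}
}
"""


def resolve(config: dict, service: str, operation: str) -> dict:
    """Pick the most specific strategy: operation, then service, then default."""
    for svc in config.get("service_strategies", []):
        if svc["service"] != service:
            continue
        for op in svc.get("operation_strategies", []):
            if op["operation"] == operation:
                return {"type": op["type"], "param": op["param"]}
        return {"type": svc["type"], "param": svc["param"]}
    default = config["default_strategy"]
    return {"type": default["type"], "param": default["param"]}


config = json.loads(STRATEGIES_JSON)
print(resolve(config, "user-service", "GET /health"))
print(resolve(config, "user-service", "GET /users"))
print(resolve(config, "unknown-service", "anything"))
```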

Adaptive sampling: Jaeger learns the amount of traffic each endpoint receives and calculates the most appropriate sampling rate for that endpoint. Note that at the time of writing only the Memory and Cassandra backends support this.
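
One simple way to picture the idea (not Jaeger's actual algorithm) is a feedback rule: scale the current probability by the ratio between the number of sampled traces you want per second and the number you are currently getting. The function name, its parameters, and the clamping bounds below are all assumptions made for this sketch.

```python
def adjust_probability(current: float,
                       observed_per_second: float,
                       target_per_second: float,
                       min_probability: float = 1e-5) -> float:
    """Scale the sampling probability toward a target rate of sampled traces.

    observed_per_second is the rate of *sampled* traces seen for an endpoint
    while sampling with probability `current`.
    """
    if observed_per_second <= 0:
        # No sampled traffic seen yet: sample everything until data arrives.
        return 1.0
    proposed = current * target_per_second / observed_per_second
    # Keep the result a valid, non-zero probability.
    return max(min_probability, min(1.0, proposed))


# A busy endpoint: 10 sampled traces/s at p=0.5, but we only want 1/s.
print(adjust_probability(0.5, observed_per_second=10, target_per_second=1))
# A quiet endpoint: 0.5 sampled traces/s at p=0.01, so the rate goes up.
print(adjust_probability(0.01, observed_per_second=0.5, target_per_second=1))
```

Busy endpoints end up with a low probability and quiet ones with a high probability, which is the effect adaptive sampling is after.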

More info on Jaeger sampling can be found here: https://www.jaegertracing.io/docs/latest/sampling/

Jaeger Tracing Production Deployment

All-in-one or separate containers?

Jaeger all-in-one is a pre-built Docker image containing all the Jaeger components needed to get up and running quickly with Jaeger tracing by launching a single command.

A lot of people (including my past self) ask themselves what the correct way to launch Jaeger in production is, and whether it’s safe to use Jaeger all-in-one there. While at the time of writing I could not find any official answer on whether or not to use it, I think the right answer is: you could, but you probably shouldn’t. Using it in production means you have a single point of failure that is not distributed. Theoretically, an issue even with the Jaeger UI might crash the entire container, and you wouldn’t be able to receive critical spans created by your system.

The best way to go about this would be to run each Jaeger component separately, without the all-in-one.

Mastering OpenTelemetry and Distributed Tracing

Jaeger is a distributed tracing beast and the leading open source project for tracing visualization. OpenTelemetry is becoming the industry standard for tracing instrumentation, making it a good place to start learning and implementing traces. To get started with OpenTelemetry, check out this free, vendor-neutral OpenTelemetry Bootcamp.

The Bootcamp includes

  • Episode 1: OpenTelemetry Fundamentals
  • Episode 2: Integrate Your Code (logs, metrics, and traces)
  • Episode 3: Deploy to Production + Collector
  • Episode 4: Sampling and Dealing with High Volumes
  • Episode 5: Custom Instrumentation
  • Episode 6: Testing with OpenTelemetry

If you have any questions, feel free to reach out to me on Twitter @thetomzach and to join our #OpenTelemetry-Bootcamp slack channel to be on top of what’s happening in observability.

Jaeger Tracing Glossary

Span: a representation of a unit of work (action/operation) that occurs in our system, such as an HTTP request or a database operation, that spans over time (starts at X and has a duration of Y milliseconds). Usually, it will be the parent and/or child of another span.

Trace: a tree/list of spans representing the progression of a request as it is handled by the different services and components in our system. For example, sending an API call to user-service resulted in a DB query to users-db. Traces are ‘call-stacks’ for distributed services.

Observability: a measure of how well we can understand the internal states of a system based on its external outputs. When you have logs, metrics, and traces you have the “3 pillars of observability”.

OpenTelemetry: an open-source project at the CNCF (Cloud Native Computing Foundation) that provides a collection of tools, APIs, and SDKs. OpenTelemetry enables the automated collection and generation of traces, logs, and metrics with a single specification.

OpenTracing: an open-source project for distributed tracing. It was deprecated and “merged” into OpenTelemetry. OpenTelemetry offers backward compatibility for OpenTracing.
