Data analytics with Jaeger aka traces tell us more!

Pavol Loffay
Published in JaegerTracing
Feb 2, 2020 · 6 min read

I will get straight to the point: at the moment Jaeger only visualizes data collected from instrumented applications. It does not perform any post-processing (except for the service dependency diagram) or any calculations to derive other interesting metrics or features from the traces it collects. This is a pity, because traces contain the richest information of all telemetry signals combined!

In Jaeger we want to address this problem and provide a platform for data scientists and operators to verify a hypothesis and ultimately answer what caused an outage or why the system behaves in a certain way. Apart from on-demand incident investigation, the goal is also to derive insights from all traces Jaeger collects as part of a standard Jaeger deployment. So let’s have a look at some use cases and then at the technical details.

Metrics

Before we dive deep into the platform overview, I would like to talk about what metrics can be derived from traces. Different post-processing jobs will certainly produce results of different structures, but for now let’s look only at metrics, as these can be consumed directly by metric systems, so the solution does not require any custom storage or presentation layer.

Network latency

Network latency between Service A and Service B.

A trace contains end-to-end information about a request/transaction. With some minimal calculations we are able to derive the network latency between client and server calls. The results can be exported as a histogram partitioned by client and server service labels. Example results as Prometheus metrics:

network_latency_seconds_bucket{client="frontend",server="driver",le="0.005",} 8.0
network_latency_seconds_bucket{client="frontend",server="driver",le="0.01",} 9.0
network_latency_seconds_bucket{client="frontend",server="driver",le="0.025",} 12.0
network_latency_seconds_bucket{client="frontend",server="driver",le="0.05",} 15.0
network_latency_seconds_bucket{client="frontend",server="driver",le="0.075",} 20.0
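For illustration, here is a minimal sketch of the calculation for one matched client/server span pair. The SimpleSpan type and its field names are placeholders, not Jaeger’s actual model classes; the two results would then be observed into the histogram above, labeled with the client and server service names.

// Minimal sketch of the latency calculation, assuming a simplified span model;
// SimpleSpan and its fields are illustrative, not Jaeger's data model classes.
class SimpleSpan {
    final long startTimeMicros;
    final long durationMicros;

    SimpleSpan(long startTimeMicros, long durationMicros) {
        this.startTimeMicros = startTimeMicros;
        this.durationMicros = durationMicros;
    }

    long endTimeMicros() {
        return startTimeMicros + durationMicros;
    }
}

class NetworkLatency {
    // time on the wire before the server started processing the request
    static long requestLatencyMicros(SimpleSpan client, SimpleSpan server) {
        return server.startTimeMicros - client.startTimeMicros;
    }

    // time on the wire after the server responded and before the client received the response
    static long responseLatencyMicros(SimpleSpan client, SimpleSpan server) {
        return client.endTimeMicros() - server.endTimeMicros();
    }
}

Note that in practice such a calculation also has to account for clock skew between the hosts producing the client and server spans.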

A further improvement would be to use the host as a label, as a service may be load-balanced across multiple hosts with different network characteristics.

Another variation of this metric might be the duration between a consumer and a producer in messaging systems.

Trace and service depth

Call graph with service depth of 3 — the maximum number of hops between root service and leaf services.

Sometimes it is important to validate the structure of the call graph in our microservice architecture. For instance, we might be interested in the maximal depth of the call graph, which can be used to find outlier traces with unusually deep structures.

Service depth is the maximum number of network hops between the root span and the leaf spans.
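As a rough sketch, the depth can be computed with a plain Gremlin traversal over the trace graph; treating the root span as the only vertex without an incoming edge is an assumption about how the trace is modeled.

import java.util.List;

import org.apache.tinkerpop.gremlin.process.traversal.Scope;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.structure.Graph;

class ServiceDepth {
    // sketch: maximum number of hops from the root span to any other span in the trace
    static long maxDepth(Graph traceGraph) {
        List<Long> pathLengths = traceGraph.traversal().V()
                .not(__.inE())                 // assumed: the root span has no incoming edge
                .repeat(__.out()).emit()       // walk down to every descendant span
                .path().count(Scope.local)     // number of vertices on the path from the root
                .toList();
        // hops = vertices on the longest path minus one; 0 for a single-span trace
        return pathLengths.stream().mapToLong(Long::longValue).max().orElse(1L) - 1;
    }
}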

Service dependencies

Other metrics that fall into the trace-structure category are:

  • the number of dependencies of a service (a sketch of this follows the list),
  • the number of services that depend on a given service.
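A minimal Gremlin sketch for the first of these might look as follows; the serviceName property key and the parent-to-child edge direction are assumptions about how spans are stored as vertices.

import org.apache.tinkerpop.gremlin.structure.Graph;

class ServiceDependencies {
    // sketch: count the distinct services that the given service calls directly in one trace
    static long dependencyCount(Graph traceGraph, String serviceName) {
        return traceGraph.traversal().V()
                .has("serviceName", serviceName)   // spans emitted by the given service
                .out()                             // their direct child spans
                .values("serviceName")             // services of those children
                .dedup()
                .count()
                .next();
    }
}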

Trace quality

There is no doubt that proper tracing instrumentation is the most difficult part of rolling out a tracing infrastructure in an organization. Therefore it is fundamental to measure how well applications are instrumented in order to evaluate tracing adoption. The following metrics can be used:

  • jaeger_client_version — an appropriate Jaeger client version is used in an application.
  • server_span and client_span — a trace contains the right combination of server and client spans. For instance, if there is a client span there should be a corresponding server span. The same metric could be used for messaging spans — producer and consumer — to find out whether both ends of the channel are instrumented.
  • unique_span_id — spans within a trace contain unique span IDs.

trace_quality_minimum_client_version_total{pass="false",service="route",version="Go-2.21.1",} 100.0
trace_quality_minimum_client_version_total{pass="false",service="redis",version="Go-2.21.1",} 119.0
trace_quality_minimum_client_version_total{pass="false",service="customer",version="Go-2.21.1",} 10.0
trace_quality_minimum_client_version_total{pass="false",service="driver",version="Go-2.21.1",} 10.0
trace_quality_server_tag_total{pass="false",service="mysql",} 10.0
trace_quality_server_tag_total{pass="true",service="customer",} 8.0
trace_quality_client_tag_total{pass="true",service="frontend",} 67.0
trace_quality_client_tag_total{pass="false",service="driver-client",} 2.0

Trace quality dashboard in Grafana, currently showing quality results for the frontend service.

These quality metrics were initially open-sourced by Uber in Jaeger’s Flink repository. That solution calculates the metrics and stores the results in a Cassandra table. The results are ultimately just counters, therefore we can export them to any metric system. However, the original solution also provides links to the traces that fail a certain quality indicator, which has proven to be very useful. To fully support this feature we have to wait until OSS metric systems support trace exemplars. And even trace exemplars might not be sufficient if hundreds of traces have to be linked to a single metric data point.
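Since the results are just counters, one quality indicator could be exported with the plain Prometheus Java client, as sketched below; the check and method name here are illustrative, not the actual job code.

import io.prometheus.client.Counter;

class TraceQualityMetrics {
    // sketch: one quality indicator exported as a counter, matching the
    // trace_quality_minimum_client_version_total series shown above
    private static final Counter MINIMUM_CLIENT_VERSION = Counter.build()
            .name("trace_quality_minimum_client_version_total")
            .help("Spans passing or failing the minimum Jaeger client version check")
            .labelNames("pass", "service", "version")
            .register();

    // illustrative: called once per inspected span with the outcome of the version check
    static void recordClientVersionCheck(String service, String version, boolean pass) {
        MINIMUM_CLIENT_VERSION.labels(Boolean.toString(pass), service, version).inc();
    }
}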

Trace DSL

Okay, we have talked about some use cases and we have defined our goal: to provide a platform where these use cases can be easily implemented and executed as part of a standard Jaeger deployment. To make it dead easy to write aggregation jobs, filter and navigate within a trace, and extract features, we should also provide an API and a library to process a trace or a set of traces.

A trace is a directed acyclic graph, therefore it makes a lot of sense to represent it as a graph. We have decided to reuse an existing graph API and query/traversal language, Gremlin, from the Apache TinkerPop project. The project also provides an in-memory database, TinkerGraph, which we use once traces are loaded from storage (Kafka, Jaeger-query).

Let’s have a look at some examples of our Trace DSL. The first example answers the question: “Are there client spans with a duration longer than 120 microseconds?”

TraceTraversal<Vertex, Vertex> traversal =
    graph.traversal(TraceTraversalSource.class)
        .hasTag(Tags.SPAN_KIND.getKey(), Tags.SPAN_KIND_CLIENT)
        .duration(P.gt(120));

As you might have noticed, the query uses two methods from the Trace DSL: hasTag and duration. These methods were added on top of the core Gremlin API via TraceTraversalSource.class. The result is a list of vertices/spans that satisfy the query, and from these vertices/spans we can navigate to other parts of the trace.
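Continuing the snippet above, iterating the result and stepping to child spans could look roughly like this; the operationName property key is an assumption about the vertex schema.

import org.apache.tinkerpop.gremlin.structure.Direction;
import org.apache.tinkerpop.gremlin.structure.Vertex;

// sketch: print the matched client spans and their direct children
for (Vertex span : traversal.toList()) {
    System.out.println(span.<String>value("operationName"));
    span.vertices(Direction.OUT).forEachRemaining(child ->
            System.out.println("  child: " + child.<String>value("operationName")));
}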

The second example answers the question: “Is a span with the name child2 a descendant of the root span?”

TraceTraversal<Vertex, Vertex> traversal =
    graph.traversal(TraceTraversalSource.class)
        .hasName("root")
        .repeat(__.out())
        .until(__.hasName("child2"));

This query is more complicated: it uses core Gremlin API calls like repeat(__.out()) to traverse outgoing edges. It would also make sense to provide this query as an extension to the Gremlin API if it becomes commonly used. I acknowledge that writing Gremlin queries is not trivial, hence a feature-full Trace DSL should simplify the work.

Architecture

The following diagram depicts Jaeger streaming architecture with data analytics integration.

Jaeger streaming architecture diagram with data analytics integration.

The analytics platform has two parts: Spark streaming for all incoming data and an on-demand Jupyter notebook.

The Spark streaming job connects to the same Kafka topic that is used by the Jaeger collection pipeline. It consumes and analyzes the data, exposing the results as Prometheus metrics or writing them to storage.
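A bare-bones sketch of that subscription, using Spark structured streaming as one possible flavor, is shown below; the broker address and the topic name are assumptions about a concrete deployment.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SpanStreamJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jaeger-trace-analytics")
                .getOrCreate();

        // subscribe to the span topic; broker address and topic name are assumptions
        Dataset<Row> spans = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "kafka:9092")
                .option("subscribe", "jaeger-spans")
                .load();

        // the "value" column carries the serialized span: deserialize it, group spans
        // into traces, run the analysis, and export results to Prometheus or storage
        spans.printSchema();
    }
}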

The second integration path goes through Jupyter notebooks. A notebook can connect to Kafka to get the stream of data, or fetch historical data from Jaeger query, then perform the analysis and show the results in the notebook or publish them to Prometheus or to storage.

Conclusion

We have talked about the reasons and use cases for a data analytics platform for Jaeger. Our goal is to build a community around this effort to help develop the models, but also to validate them on real deployments. Ultimately, analytics features should give us more insight into our applications’ behavior and a power-user interface for incident analysis. The project is still in early development and we would like to hear your feedback! Do not hesitate to reach out to us directly or create a feature request in the repository.

