Trace my mesh (part 2/3)
A Distributed Tracing walk-through with Jaeger, Istio and Kiali
This is a second of a three-parts series. Previously, we’ve seen how Istio and Envoy help on tracing, how to propagate traces and create spans.
Tracing in Kiali
Kiali is in a sweet spot to leverage tracing. Thanks to Istio and Envoy, there is a nice consistency between traces and metrics. The source and destination workloads identified in the traffic metrics can be clearly correlated with trace spans. You may wonder, is that useful? Isn’t it redundant information?
It’s actually super useful. Metrics offer a large, aggregated view of the data, whereas traces offer sharp insight with traceable causality. Metrics are the wide-angle lens, tracing is the 600mm zoom lens. Both are useful, and correlating the two allows not only to jump from one to the other, but also to map a single trace over a graph topology, to evaluate the performance of spans in a trace based on the wider picture that metrics offer, and probably more that we haven’t yet thought about.
First of all, to ensure you have the best experience with tracing, make sure that Kiali is well configured; both in_cluster_url and url should be set in the external_services.tracing section of the CR / ConfigMap. The first is used by Kiali to connect to Jaeger internally, and the second adds useful links in the UI to jump from Kiali to Jaeger in different ways. If you have any troubles with the configuration, please make sure to read the related FAQ.
Quality of traces
There are two indicators in Kiali that will help evaluate the “quality” of a trace, visible on the Tracing page. The first and most obvious are some explicit representations of traces showing errors, as shown in the pictures below:
As per OpenTracing conventions, Kiali will check if there are some spans in a trace that are tagged with the error key. Note that it is up to the creator of the span to flag it as having errors or not. It can often lead to debatable arguments. For instance, Envoy considers 5xx responses as errors, but not 4xx, even from the client side span. Kiali strictly relies on the error tag to decide whether there’s an error or not.
The second indicator of quality in Kiali is the trace or span performance, based on comparisons of their durations using different approaches. You can see several matrices, as heatmaps, that show how a trace or span performed. We’re going to explore them more in detail next.
The first approach is a matrix showing how the trace performed in comparison with aggregated metrics on several intervals (last 10 minutes, 60 minutes and 1 hour) crossed with different aggregation statistics (average, 50th, 90th and 99th percentile).
As we’ve seen, a trace is a collection of spans. So this view is actually an average of comparisons of each span duration versus the metrics for the same source / destination services. You can view the per-span detail in the other tab, where you can spot which span individually performed well or not.
The heatmaps are presented only for Envoy-generated spans, because they’re the ones where we can safely assume a consistency with the metrics.
You can filter the spans list to show only Envoy spans: select “Filter by Component” > “Proxy”. Clicking on a heatmap expands it. Note also that the spans contain pod information. It’s only in traces that you can have this level of detail of the traffic, since metrics are more aggregated and don’t (always) carry the pod name.
There is a caveat in the trace-to-metrics comparison you should be aware of: since metrics don’t have the level of granularity that traces offer, we cannot restrict the comparison with metrics to the requests that hit precisely the same endpoint / path as the spans did. So for instance, if a service “Foo” exposes the endpoints “GET /foo” and “POST /bar”, and the two have quite different processing times, and also depending on their call frequencies, this will introduce a bias in trace-to-metrics comparisons.
Similar traces comparisons
For that reason, the second type of heatmap displayed is interesting, it’s a trace-to-traces comparison.
For a given trace, Kiali tries to detect which ones, among the traces displayed on screen, are similar enough to be relevant for comparison. Internally, a similarity score is computed using the number of spans they contain and the occurrences of operation name in the spans. Similar traces are compared based on their full duration (from start time of the root span, to the end time of the latest span) and also based on the average span duration. These two values can differ a lot, for example when spans run in parallel, or when there’s dead time between spans.
Interactions with Jaeger UI
Since Kiali works with Jaeger and assumes the Jaeger UI should be available in the nominal case, we have made the choice to have complementary views of tracing, with different approaches. In any given situation you may find that the Jaeger UI is more relevant to get some particular information. For instance, unlike Jaeger, Kiali doesn’t show the full list of tags per span. Instead, it presents a concise summary, mostly based on known tags created by Envoy.
So, to ease the interactions between the two tools, Kiali shows a couple of external links, contextual to their scope:
- for the whole traces query per service
- for a specific trace
- for a specific span
- to the trace comparison view
That “Compare with similar traces” link is an interesting one, it allows to compare two traces more deeply. You can read this article to learn about it.
When coming from Kiali, the comparison view will be pre-populated with the selected trace, plus the ones detected as similar (using the algorithm previously mentioned).
It is also possible to have links the other way around, from Jaeger to Kiali, as described in this article (some information there reflects old versions of Istio and must be adapted, though the configuration on Jaeger UI side is still valid).
And to finish with the overview of Tracing in Kiali, there is of course the trace overlay on top of the graph topology. You can either see it while coming from a trace selected in the Tracing page (link “View on graph”), or by clicking a node directly from the graph, opening the Traces tab from the side panel, then choosing one from a list. When a trace is selected, it offers the possibility to navigate back to the Tracing page when clicking on its name.
The trace panel in this view offers a quick summary of trace and span details, plus some links to the related spans (parent and children) in order to navigate in the trace and quickly locate their nodes in the graph.
In the next and final part, we will discuss about tracing in other, non-HTTP scenarios. To wait, grab some tea/coffee and have a chat with us, on our Slack channel, mailing list, Twitter (me) or the eternal IRC.