Kiali with production-scale Prometheus

Joel Takvorian · Published in Kiali · Apr 30, 2020

What is “production-scale Prometheus”?

Of course, a definition of “production-scale Prometheus” can be as varied as the ways Istio and Prometheus are used in production. So, in the context of this article, we have to make some assumptions.

First of all, this article focuses on Istio using Telemetry v2, which is enabled by default starting from Istio 1.5. It was also available in previous Istio releases as an experimental feature, disabled by default.

Secondly, this post is written in reaction to the Istio guidelines that were written precisely to describe how to set up Prometheus at production scale. You can refer to these guidelines, or read the article that inspired them, for the details of that setup. But let me summarize the key points:

  • Unlike in Telemetry v1, the Envoy sidecars report (expose) the Istio metrics directly to Prometheus, instead of going through Mixer. The main motivation is to remove the telemetry bottleneck that Mixer used to be. But a side effect is that it increases the metrics cardinality, because the metrics are now per-pod instead of per-workload (Mixer used to perform that aggregation across pods). The production-scale setup addresses this issue.
  • The general idea of production-scale Prometheus is to use federation and recording rules, and to reduce the metrics retention time to a minimum in Istio’s Prometheus. Federation means that a higher-level Prometheus instance collects data from subordinate Prometheus instance(s), each of them configured independently. Here, a “main” Prometheus instance grabs pre-aggregated metrics from the Istio one and is configured with a longer retention time (see graph below; a sketch of the federation scrape job follows the illustration). There are several ways to implement this in detail; the Istio guidelines present two scenarios.
Illustration of this Prometheus federation setup, reducing metrics cardinality
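
To make this more concrete, here is a minimal sketch of what the federation scrape job on the main Prometheus could look like. The target address, the match[] expression and the “workload:” metric prefix are assumptions for illustration; the Istio guidelines and the working examples linked below contain the exact configuration.

scrape_configs:
- job_name: 'istio-federation'
  # Scrape the /federate endpoint of Istio's Prometheus instead of individual pods
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
    # Only pull the pre-aggregated series produced by the recording rules
    - '{__name__=~"workload:.*"}'
  static_configs:
  # Assumed in-cluster address of Istio's Prometheus
  - targets: ['prometheus.istio-system.svc:9090']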

First scenario

The first scenario consists of writing recording rules that sum the pod-based metrics per workload. Besides this aggregation, the resulting metrics are structurally almost identical to the initial pod-based ones, so compatibility is preserved for consumers such as Kiali or existing Grafana dashboards. But it turns out this approach has a drawback related to summing metrics before computing rates: as described here, false spikes that don’t reflect reality may appear in the computed rates under certain circumstances.
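
To illustrate, a recording rule along these lines sums away the pod-identifying labels while keeping every other Istio label intact, which is what preserves compatibility. The rule name and the exact set of dropped labels are assumptions for this sketch; the ConfigMap linked in the working examples below contains the real rules.

groups:
- name: istio.workload.istio_requests_total
  interval: 10s
  rules:
  - record: workload:istio_requests_total
    # Drop the pod-identifying labels and sum; all other Istio labels
    # (source/destination workload, response_code, ...) are preserved,
    # so the recorded metric stays structurally close to the original.
    expr: |
      sum without (instance, kubernetes_namespace, kubernetes_pod_name) (istio_requests_total)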

Second scenario

In the second scenario, the recording rules do more: rates are computed per workload and percentile-based distributions are computed on histograms (the kind of work that Kiali does on its own), so the metrics are fundamentally different and break compatibility with consumers such as Kiali or existing Grafana dashboards. The produced metrics also leave less flexibility at query time, because a number of assumptions are baked into the recording rules, such as the rate interval or the percentiles computed on histograms. On the flip side, the result is more correct: it won’t show the false spikes triggered in the first scenario.
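
As a rough sketch of what this scenario’s recording rules look like, rates and quantiles are computed directly in the rules. The metric names, the 1m rate interval, the chosen percentile and the label list below are illustrative assumptions, not the exact rules from the guidelines.

groups:
- name: istio.workload.rates
  interval: 30s
  rules:
  # Request rate per workload: the rate interval (1m) is now fixed in the rule
  - record: workload:istio_requests:rate1m
    expr: |
      sum by (destination_workload, destination_workload_namespace, app, version, response_code) (
        rate(istio_requests_total[1m])
      )
  # 99th percentile latency per workload: the percentile is also fixed here
  - record: workload:istio_request_duration_milliseconds:p99
    expr: |
      histogram_quantile(0.99,
        sum by (le, destination_workload, destination_workload_namespace, app, version) (
          rate(istio_request_duration_milliseconds_bucket[1m])
        )
      )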

So, what to do with Kiali?

Well, this is largely up to you, but we have explored a couple of options here.

Option 1: do not change anything

Should you set up the first or the second scenario (or even both) from the Istio guidelines, you can leave the Kiali configuration unchanged, keep it watching Istio’s Prometheus, and everything will work perfectly fine. The only drawback is that, given the short metrics retention time, you will not be able to retrieve data from far back in the past.

So, perhaps it’s fine if you use Kiali only to check the live status of your mesh. But you would be more limited in the troubleshooting capabilities that Kiali offers, such as the graph replay feature, or metrics and traces correlation, when you need to look back in the past.

Option 2: with Istio’s first scenario

As we’ve seen, the first scenario described in the guidelines has an issue that can result in false spikes in the computed metrics, but perhaps this is something you can live with. It depends on what you expect from the Istio metrics: are they a critical part of your setup that requires no imprecision (e.g. with Prometheus alerts or autoscalers built on them), or are they more informational?

If this scenario works for you, then once your Prometheus setup is complete, the Kiali configuration is easy: just point the Prometheus URL to your main Prometheus instance instead of the Istio one.

Option 3: with Istio’s second scenario

We said previously that this scenario would break Kiali, and indeed, if you point the Prometheus URL in the Kiali config to your main Prometheus instance with this setup, Kiali won’t show much: no graph, no service health, no Istio metrics…

But we’ve got an alternate solution. Kiali can already be configured with a second URL for Prometheus that is used not for fetching Istio metrics, but for other kinds of metrics, such as Envoy ones or your application-specific ones. This is referred to as Custom Dashboards on Kiali’s site and can be reused in our situation.

The idea is to keep the Prometheus URL configured on Istio’s Prometheus (so that graph, health and the like continue to work as usual, but with short retention time, as in Option 1); and configure this “custom_metrics_url” (update: cf footnote [1]) to point to your main Prometheus.

Then, we can build a dashboard that shows the longer-term, pre-aggregated metrics. Building a custom dashboard for Kiali is already documented, but we provide below a working example for the current case.
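
For reference, and assuming the MonitoringDashboard format that Kiali uses for custom dashboards at the time of writing, such a dashboard is roughly shaped as follows. The metric name and the aggregation label are illustrative; the YAML files linked in the working examples below are the ones to actually use.

apiVersion: monitoring.kiali.io/v1alpha1
kind: MonitoringDashboard
metadata:
  name: federated-http
spec:
  title: HTTP (long-term)
  items:
  - chart:
      name: Request volume
      unit: ops
      spans: 6
      # A pre-aggregated metric produced by the recording rules (illustrative name)
      metricName: workload:istio_requests:rate1m
      # "raw" because the rate has already been computed by the recording rule
      dataType: raw
      aggregations:
      - label: response_code
        displayName: Response code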

With this dashboard, it is possible to have the traces and metrics correlation, see metrics back in time, etc. Unfortunately, the graph replay feature is still something that is bound to the retention time of Istio’s Prometheus.

Working examples

You can find working examples in this Git repository. They are just a couple of YAML files, so don’t hesitate to look at them and adapt them to your needs. But I think they are already quite good to use as they are.

Istio’s first scenario

This ConfigMap for Istio’s Prometheus shows a full set of recording rules that preserve compatibility with Kiali or Grafana. Here is a ConfigMap example for the main Prometheus instance. Both of them follow the Istio guidelines. You also need to shorten the retention time on Istio’s Prometheus command line, in the pod template.
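
The retention time is set via a flag on the Prometheus container. Here is a sketch of the relevant fragment of the Deployment, with an arbitrary 6h value:

# Fragment of Istio's Prometheus Deployment pod template (istio-system namespace)
containers:
- name: prometheus
  args:
  - '--storage.tsdb.retention.time=6h'   # short retention; 6h is an arbitrary example
  # ... keep the other existing args unchanged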

To modify the Kiali configuration, set external_services.prometheus.url to your main Prometheus in the Kiali CR (if you use the Kiali operator) or in the ConfigMap.
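
For instance, with the operator, the Kiali CR fragment could look like this (the Prometheus service URL below is an assumption, adapt it to your cluster):

apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
spec:
  external_services:
    prometheus:
      # Point Kiali at the main (long-retention) Prometheus instead of Istio's one
      url: "http://prometheus-main.monitoring.svc:9090"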

Istio’s second scenario

Here is the ConfigMap for Istio’s Prometheus in that scenario. It has been slightly modified to be more exhaustive than the example given in the Istio guidelines: it includes TCP metrics, more percentiles and more labels. In the recording rules, it is especially important to keep the app and version labels so that Kiali can figure out which workload the metrics relate to.

The ConfigMap for the main Prometheus is also changed to perform some relabelling on the app and version labels, which is currently necessary for Kiali.
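
As an illustration only, such relabelling can be done with metric_relabel_configs on the federation scrape job; the source label names below are hypothetical, check the linked ConfigMap for the actual configuration.

scrape_configs:
- job_name: 'istio-federation'
  metrics_path: '/federate'
  honor_labels: true
  # ... params and targets as in the federation job shown earlier
  metric_relabel_configs:
  # Rename incoming labels so that Kiali finds the "app" and "version" labels it expects
  - source_labels: [workload_app]       # hypothetical source label
    target_label: app
  - source_labels: [workload_version]   # hypothetical source label
    target_label: version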

Here too, the retention time on Istio’s Prometheus command line must be reduced to a short duration.

To modify the Kiali configuration, set external_services.prometheus.custom_metrics_url (update: cf footnote [1]) to your main Prometheus in the Kiali CR (if you use the Kiali operator) or in the ConfigMap.
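
Again with the operator, here is an illustrative Kiali CR fragment (both URLs are assumptions):

apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
spec:
  external_services:
    prometheus:
      # Istio's Prometheus (short retention): keeps the graph, health, etc. working
      url: "http://prometheus.istio-system.svc:9090"
      # Main Prometheus (long retention): used by the custom dashboards
      # (since Kiali 1.23 this setting has moved, see footnote [1])
      custom_metrics_url: "http://prometheus-main.monitoring.svc:9090"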

And finally, there are two Kiali dashboards for longer-term data, one for HTTP and one for TCP. Just install them in the istio-system namespace:

kubectl apply -f ./kiali-master-2-dashboard-http.yml -n istio-system
kubectl apply -f ./kiali-master-2-dashboard-tcp.yml -n istio-system
# Note: kiali-master-2-dashboard-http.yml uses a feature that is new at the time of writing this post. It will be in Kiali 1.18 and above. For older versions you can use ./kiali-master-2-dashboard-http-old.yml instead.

This is what you can see once everything is set up and running:

HTTP metrics from our main Prometheus instance

[1] Update: the setting external_services.prometheus.custom_metrics_url has been changed to external_services.custom_dashboards.prometheus.url since Kiali 1.23, see the configuration documentation.
