Recently I came across this excellent post from Cindy Sridharan, which offers a bunch of good ideas on how to improve the troubleshooting experience with microservices. I think Kiali already has a good approach in that regard, which doesn’t mean there’s no room for improvement.
Kiali is meant to be the Istio console. This has been a constant focus of the development team since the beginning of the project. We try to make the most of Istio telemetry, and to interact in various ways with the available Istio resources.
However, Kiali also plays a role in troubleshooting, not just within the boundaries of the Istio mesh metrics, but beyond. We want to help developers and operators detect malfunctioning services and debug them. If we can help correlate signals coming from Istio with other signals coming from pods and workloads, we want to do it.
So this is why, from its early days, Kiali embeds a Tracing view provided by Jaeger:
This is also why, a couple of months ago, we added a Logs tab in the Workloads detail view…
… and added quick navigation links between Graph nodes and these views via a right-click context menu.
We’ve also introduced some runtime monitoring capabilities: Kiali is able to show not only Istio metrics, but any kind of metrics produced by your pods.
It comes with a default set of dashboards for several popular runtimes or frameworks, namely Go, Node.js, Quarkus, Spring Boot, Thorntail and Vert.x, that are automatically recognized when metrics are found in Prometheus. People can also bring their own dashboards. This is of course all documented.
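As an illustration of how pod metrics end up in Prometheus in the first place, here is a minimal sketch of an application exposing a counter in the Prometheus text exposition format over plain HTTP. The metric name and port are made up, and a real application would typically use a client library (Micrometer, the official Prometheus Java client, …) rather than hand-rolling the format:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

public class MetricsEndpoint {
    static final AtomicLong requests = new AtomicLong();

    // Renders one counter in the Prometheus text format:
    // a "# TYPE" hint line followed by "<name> <value>".
    static String render(String name, long value) {
        return "# TYPE " + name + " counter\n" + name + " " + value + "\n";
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // Prometheus scrapes this endpoint; once the series exists in
        // Prometheus, tools like Kiali can query and chart it.
        server.createContext("/metrics", exchange -> {
            byte[] body = render("my_app_requests_total", requests.get())
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```

For the built-in dashboards mentioned above, Kiali matches metric names it knows (Go, JVM, Vert.x, …) against what it finds in Prometheus, so a runtime instrumented with a standard exporter is picked up without extra configuration.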
A troubleshooting scenario
For this scenario, I’ll be using the Mesh-Arena demo: four microservices talking to each other (AI, Stadium, Ball and UI). The AI has two different versions, named “locals” and “visitors”. You don’t need a full understanding of this demo to follow the steps, so I’ll stop the description here, but you can find everything here.
Now, imagine something unexpected happened. This is how any good troubleshooting story starts, right?
We’ve just delivered new services, but we’re spotting a small number of 404 errors on one connection.
The Graph tells us many things, among which:
- most connections are fine, but 9% of requests between AI (locals) and Stadium return a 404
- there’s no problem with the other version of AI, visitors
So that makes me think the problem is likely on the AI (locals) side, not Stadium. We will start from there.
When right-clicking the “ai-locals” node, a menu appears with quick navigation links. My first reflex is to check the logs; isn’t yours? If not, the other options are up to you.
Note that you could also get a quick link to Traces when you right-click on a Service node (as opposed to Workload as I’m doing here).
I’m feeling lucky: we’ve got a stack trace showing exactly what we were looking for! Now I just have to open my IDE, go to line 115 of AI.java, and fix what seems to be a little typo.
Note that we came from a workload node (ai-locals) in the Graph page. A single workload may have several pods, and a pod several containers, so if we’re less lucky than today, we may have to navigate a bit with the pod and container selectors to show the logs of each of them.
Now let’s branch our scenario to the unlucky case.
OMG, no logs, what a pain! So we’ll have a look at the metrics.
Here, we won’t learn a lot from the Istio outbound metrics. We can see that there’s a rather low proportion of 404s, but we already knew that. What we can learn, however, is that the request size is quite low for these 404s compared to the others. If we have a good knowledge of the kind of queries our workload runs, maybe we can make some sense of this.
But let’s continue our troubleshooting session. There are some JVM metrics around that Kiali has detected.
Well, I didn’t expect to see much interesting info here while chasing down a 404. In other situations however, that would definitely help. Switching to the next tab: Vert.x client metrics (here, all the microservices are using Vert.x with metrics enabled).
Here we’ve got some metrics quite similar to the Istio ones, but the response counter has a Path label, which will be helpful. In the metrics settings, we can check the “Code” and “Path” labels.
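As a side note, per-path labels are something the application opts into. For a Vert.x service like those in Mesh-Arena, a minimal metrics setup might look like the following sketch. This is not the demo’s actual code; the options and Label names come from the vertx-micrometer-metrics module, and enabling the path label on highly dynamic routes can blow up metric cardinality, so use it with care:

```java
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.micrometer.Label;
import io.vertx.micrometer.MicrometerMetricsOptions;
import io.vertx.micrometer.VertxPrometheusOptions;
import java.util.EnumSet;

public class MetricsSetup {
    public static void main(String[] args) {
        // Enable Vert.x HTTP client/server metrics, exported to Prometheus,
        // keeping method, path and status code as labels. The path and code
        // labels are what make the per-path 404 breakdown possible.
        MicrometerMetricsOptions metrics = new MicrometerMetricsOptions()
                .setPrometheusOptions(new VertxPrometheusOptions().setEnabled(true))
                .setLabels(EnumSet.of(Label.HTTP_METHOD, Label.HTTP_PATH, Label.HTTP_CODE))
                .setEnabled(true);
        Vertx vertx = Vertx.vertx(new VertxOptions().setMetricsOptions(metrics));
    }
}
```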
Here we go. In the response counter chart, we can see that all the 404s were triggered by queries to the path /infox, which is certainly a typo in the code.
By playing with the metrics options we can also detect some behavioural oddity. For instance, here I’m filtering the metrics to show only the paths /info and /infox, and I’m seeing some weird symmetry.
This makes me think that the /infox requests are performed in place of the correct ones. (Yes, it was intentionally coded that way.)
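To make that concrete, a bug with this shape is presumably something like a misspelled path constant shared by every call, rather than a one-off mistake. A hypothetical sketch (the real AI.java certainly differs):

```java
// Hypothetical illustration of the kind of bug behind these 404s:
// the client builds its requests from a misspelled path constant,
// so every call that should hit /info hits /infox instead.
public class StadiumClient {
    static final String INFO_PATH = "/infox"; // typo: should be "/info"

    static String infoUrl(String host) {
        return "http://" + host + INFO_PATH;
    }
}
```

Since every intended /info request goes through the same constant, each one becomes an /infox request, which would produce exactly the mirrored curves seen when filtering on both paths.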
Kiali can definitely help in troubleshooting scenarios. I’ve tried to emphasize what it’s good at: troubleshooting without losing context, jumping from one area to another while keeping focus on the same entity. From the detail pages, there’s also a link back to the Graph, which triggers a pretty animation to quickly locate where we came from within the Graph.
Of course, other tools will help, too. Grafana is probably the most popular one (and I also love what they’re doing with Loki). For troubleshooting, the approach between the two is different and I believe they are very complementary: Kiali offers a good overview of the whole Service Mesh to identify high level issues, navigate between services and drill-down to the root cause while keeping context. In the best case, it will be possible to identify the root cause directly in Kiali — potentially, using the embedded Tracing view, too.
Grafana lets you build fancy dashboards with a high level of customization, and it offers more possibilities for manipulating metrics. For these reasons, Kiali acknowledges the presence of Grafana and builds bridges to its Istio dashboards when they are available. In doing so, we keep the focus on the workload or service we were looking at, offering a smoother experience for a cross-tool drill-down process.
And what do YOU think? Tell us how you think we could improve the troubleshooting experience.
In that regard, we’ve created a couple of feature requests that you can upvote or downvote. Don’t hesitate to create new ones, too! We would add them to this post.
- Add more links to Grafana dashboards: from Runtimes Metrics pages, Istio metrics pages and/or as Graph quick link?
- Provide ways to reduce the search space within the user session? E.g. global label filters.
- Show triggered alerts for a given workload, managed by another system such as Prometheus alert manager?
- More metrics manipulation in dashboards? (custom time range, custom queries, editing dashboards …)
- Drill down from workload to individual pods on the graph?
- Show Istio configurations changes during a specific time range?
- What else?