Increasing observability on Istio: The new Kiali health configuration
Kiali 1.24 comes with a new health configuration. Kiali shows health indicators on several pages, such as the overview, the graph, and the lists. It bases these indicators on several signals, one of which is the aggregation of inbound and outbound response codes.
Since health is based on response codes, responses with error codes such as 401 and 404 are counted as errors, which makes the health show a degraded state.
But what happens if we consider the 401/404 a legitimate response code?
Let’s think of a client that is making requests to a service. For example, in the next image we have 3 clients (b, c & d) making requests to the y-server, and Kiali shows a Degraded status for this service because it is returning 4xx responses. However, the real problem is that the clients have not been updated and are trying to use an endpoint that is no longer available.
What does that mean? That the clients are consuming an endpoint that doesn’t exist anymore in y-server?
Isn’t that a good case for showing that it is the client that is not healthy?
Perhaps by saying that these clients aren’t authorized to consume y-server, or something like that?
Another use case is a service that handles authentication, where 401 (Unauthorized) or 403 (Forbidden) responses are “normal” when the credentials are wrong. Here you can use the health configuration to tell Kiali that 401/403 responses are acceptable, so that it does not flag the service as “unhealthy”.
For these cases we can take advantage of the health configuration feature to show the correct status according to the needs of our environment.
Custom health configuration is specified in the Kiali CR.
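Each health configuration entry selects the resources it applies to by namespace, resource type (kind), and resource name, and defines tolerances for them: the response code, protocol, and direction to match, plus the percentage of requests at which the health becomes degraded and at which it becomes a failure.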
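As a rough sketch of where these options live, the fragment below shows a single entry for the authentication use case. The field names (`health_config`, `rate`, `tolerance`, and so on) follow the Kiali CR schema as we understand it. The namespace `auth` and the service name `login` are hypothetical, and the exact schema should be checked against the documentation for your Kiali version.

```yaml
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
spec:
  health_config:
    rate:
      # Which resources this entry applies to (each value is a regular expression).
      - namespace: "auth"      # hypothetical namespace
        kind: "service"        # resource type
        name: "login"          # hypothetical resource name
        tolerance:
          # 4xx codes other than 401 and 403 still count as errors.
          - code: "^4(0[024-9]|[1-9]\\d)$"
            protocol: "http"
            direction: ".*"    # both inbound and outbound
            degraded: 10       # % of requests at which health becomes degraded
            failure: 20        # % of requests at which health becomes a failure
          # 5xx codes keep the stricter thresholds.
          - code: "^5\\d\\d$"
            protocol: "http"
            direction: ".*"
            degraded: 0
            failure: 10
```

This sketch assumes that response codes not matched by any tolerance of the matching entry (here 401 and 403) are not counted against the health of the service.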
If Kiali cannot find a configuration that matches the resource whose health it is checking, it applies a default configuration.
In other words, Kiali first looks for a configuration that matches the resource's namespace, then its resource type (kind), and finally its name, and it falls back to the default when nothing matches.
The default configuration states that, for all resources with any name in any namespace:
- 5xx error codes for the http protocol, in both the inbound and outbound directions, show a degraded state if the percentage of requests is greater than 0% and a failure state if it is greater than or equal to 10%
- 4xx error codes for the http protocol, in both the inbound and outbound directions, show a degraded state if the percentage of requests is greater than 10% and a failure state if it is greater than or equal to 20%
- Error codes 1–16 for the grpc protocol (any status other than 0, OK), in both the inbound and outbound directions, show a degraded state if the percentage of requests is greater than 0% and a failure state if it is greater than or equal to 10%
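The defaults listed above can be sketched as the following `health_config` fragment. This is a reconstruction from the description, not a copy of Kiali's shipped defaults: the regular expressions are our assumption about how "5xx", "4xx", and the gRPC error codes would be written, and `".*"` is assumed to mean "match everything".

```yaml
health_config:
  rate:
    - namespace: ".*"   # all namespaces
      kind: ".*"        # all resource types
      name: ".*"        # all resource names
      tolerance:
        - code: "^5\\d\\d$"
          protocol: "http"
          direction: ".*"   # inbound and outbound
          degraded: 0
          failure: 10
        - code: "^4\\d\\d$"
          protocol: "http"
          direction: ".*"
          degraded: 10
          failure: 20
        # gRPC status codes 1 to 16 (0 is OK and is not matched).
        - code: "^[1-9]$|^1[0-6]$"
          protocol: "grpc"
          direction: ".*"
          degraded: 0
          failure: 10
```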
For example, consider a configuration that defines the following health check for all resources in the namespace alpha:
- For the 404 code with the http protocol, we’ll have a failure status when the percentage of requests is greater than or equal to 10%
- For any 4xx or 5xx code other than 404 with the http protocol, we’ll have a failure status when the percentage of requests is greater than 0%
For the namespace beta, the same configuration defines:
- For 4xx codes with the http protocol, we’ll have a degraded status when the percentage of requests is greater than or equal to 30% and a failure status when it is greater than or equal to 40%
- For 5xx codes with the http protocol, we’ll have a failure status when the percentage of requests is greater than 0%
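A configuration matching the alpha and beta rules above might look like the sketch below. The regular expressions are our own and avoid lookaheads so that they stay portable across regex engines. We also assume that a threshold left out of a tolerance defaults to 0, which is what makes any matching request a failure. Both assumptions are worth verifying against your Kiali version.

```yaml
health_config:
  rate:
    - namespace: "alpha"
      tolerance:
        # 404 only: failure from 10% of requests.
        - code: "404"
          protocol: "http"
          failure: 10
        # Any 4xx except 404, or any 5xx.
        # Thresholds omitted: assumed to default to 0.
        - code: "^4(0[0-35-9]|[1-9]\\d)$|^5\\d\\d$"
          protocol: "http"
    - namespace: "beta"
      tolerance:
        - code: "^4\\d\\d$"
          protocol: "http"
          degraded: 30
          failure: 40
        # Thresholds omitted: assumed to default to 0.
        - code: "^5\\d\\d$"
          protocol: "http"
```

With entries like these, a resource in alpha that returns a few 404s stays below the failure threshold, while a single 500 in either namespace is enough to mark it as failing.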
With this new feature we can customize the health of our services to be aligned with our needs, focusing specifically on the errors that matter for our environment.
The community has reported that, in the current implementation, Kiali must be reloaded every time the health configuration changes. The team is researching a different approach based on annotations.
Possible next steps for this feature include:
- Improvements in the configuration
- Health based on the duration of requests
To suggest more ideas on how to improve this feature, please open an issue at https://github.com/kiali/kiali/issues.