Kiali: Observability in Action for Istio Service Mesh
In this article we show how to use Kiali to observe and manage an Istio service mesh. Using a reference demo application, we demonstrate how Kiali can compare different service versions and how its actions can configure traffic routing through Istio config resources.
Travel Agency Demo
Travel Agency is a demo microservices application that simulates a travel portal scenario. It creates two namespaces:
- The first namespace, travel-portal, deploys several services representing the portals where users search for and book trips. Different portals represent a heterogeneous scenario, such as personalization per city or different target users (normal vs. vip users). All portals consume a travel quote service hosted in the second namespace.
- The second namespace, travel-agency, hosts the services that calculate travel quotes. A main travels service is the business entry point for the travel agency. It receives a city and a user as parameters and calculates all the elements that make up a travel budget: airfares, lodging, car reservation and travel insurance. Several services calculate the separate prices, and the travels service is responsible for aggregating them into a single response. Additionally, some users, such as vip users, can have access to special discounts, managed by an external service.
The interaction between all the services in the example is shown in the following picture:
In the next steps we are going to deploy a new version of the travel agency application that will run in parallel with the first version. Let’s imagine that the new version adds features that we want to test with live users, and that we want to compare the results. Obviously, in the real world this could be complex and highly dependent on the domain, but for our example we will focus on the response time that the portals get, assuming that a slower portal will cause our users to lose interest.
One of the first steps we can take in Kiali is to enable Response time labels on the Graph:
The graph helps us to identify services that could have problems. In our example everything is green and healthy, but the response times suggest that the new version 2 has some slower features than version 1.
Our next stop will be to take a closer look into the travels application metrics:
The Inbound Metrics tab shows data about the portal calls, and Kiali can split these metrics by several criteria. Grouping by app shows that the response time of all portals has increased since version 2 was deployed.
If we group Inbound metrics by app and version, we spot an interesting difference: response time has increased in general, but the portals that handle vip users behave worse.
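To make the grouping idea concrete, here is a minimal sketch of how a similar response-time percentile could be queried directly from Prometheus. It assumes the standard Istio histogram `istio_request_duration_milliseconds_bucket` and its `source_app`, `source_version` and `destination_app` labels are being scraped; the helper name `response_time_query` is hypothetical, and the queries Kiali actually issues may differ.

```python
def response_time_query(destination_app, group_by, quantile=0.95, window="1m"):
    """Build a PromQL string for a response-time percentile.

    Assumes Istio standard metrics; `group_by` lists the labels to split by,
    for example ["source_app"] or ["source_app", "source_version"].
    """
    selector = ", ".join([
        'reporter="destination"',                 # metrics reported by the receiving proxy
        f'destination_app="{destination_app}"',   # the app whose inbound traffic we inspect
    ])
    by_labels = ", ".join(["le"] + list(group_by))  # `le` is required by histogram_quantile
    return (
        f"histogram_quantile({quantile}, "
        f"sum(rate(istio_request_duration_milliseconds_bucket{{{selector}}}[{window}])) "
        f"by ({by_labels}))"
    )


# Grouped by calling app only, then by calling app and version.
print(response_time_query("travels", ["source_app"]))
print(response_time_query("travels", ["source_app", "source_version"]))
```

Adding a label to `group_by` is the query-level equivalent of choosing an extra grouping criterion in the metrics tab: each additional label splits the aggregated series further.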
Also, we can continue using Kiali to investigate and correlate these results with traces:
And, if more information is needed, with logs from the workloads:
From our investigation phase we have spotted a slower response time from version 2 and even slower for vip user requests.
There are multiple strategies from here, such as undeploying version 2 entirely, deploying version 2 partially service by service, limiting which users can access the new version, or a combination of these.
In our case, we will show how Kiali Actions can add Istio traffic routing to our example, which helps to implement some of these strategies.
A first action we can perform is to add Istio resources to route traffic coming from vip users to version 2 and the rest of the requests to version 1.
Kiali lets us create Istio configurations from a high-level user interface. From the actions located in the service details we can open the Matching Routing Wizard and discriminate requests using headers, as shown in the picture:
Kiali will create the specific VirtualService and DestinationRule resources under the service. As part of our strategy we will add similar rules for the suspected services: travels, flights, hotels, cars and insurances.
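As an illustration of what such header-based routing looks like at the Istio level, the following sketch builds a VirtualService and a DestinationRule as plain Python dictionaries and prints them as JSON, which `kubectl apply -f` also accepts. Several details are assumptions, not taken from the demo: the header name `user` and value `vip`, the subset names `v1` and `v2`, the `version` workload label, and the default `svc.cluster.local` domain. The resources generated by the wizard may be named and structured differently.

```python
import json

API_VERSION = "networking.istio.io/v1beta1"


def destination_rule(service, namespace, versions):
    """DestinationRule defining one subset per version label (assumed `version`)."""
    return {
        "apiVersion": API_VERSION,
        "kind": "DestinationRule",
        "metadata": {"name": service, "namespace": namespace},
        "spec": {
            "host": f"{service}.{namespace}.svc.cluster.local",
            "subsets": [{"name": v, "labels": {"version": v}} for v in versions],
        },
    }


def header_matching_virtual_service(service, namespace, header, value,
                                    matched_subset, default_subset):
    """Route requests carrying `header: value` to one subset, the rest to another."""
    host = f"{service}.{namespace}.svc.cluster.local"
    return {
        "apiVersion": API_VERSION,
        "kind": "VirtualService",
        "metadata": {"name": service, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "http": [
                {   # First rule: exact header match (header name is hypothetical).
                    "match": [{"headers": {header: {"exact": value}}}],
                    "route": [{"destination": {"host": host, "subset": matched_subset}}],
                },
                {   # Fallback rule: everything that did not match above.
                    "route": [{"destination": {"host": host, "subset": default_subset}}],
                },
            ],
        },
    }


dr = destination_rule("travels", "travel-agency", ["v1", "v2"])
vs = header_matching_virtual_service("travels", "travel-agency", "user", "vip", "v2", "v1")
print(json.dumps([dr, vs], indent=2))
```

Rule order matters in this sketch: Istio evaluates HTTP routes top to bottom, so the header match has to come before the catch-all route.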
When we have finished creating Matching Routing for our version 2 services we can check that Kiali has created the correct Istio configuration using the “Istio Config” section:
Once this routing configuration is applied we can see the results in the Response time edges of the Graph:
Now in our example, all traffic coming from vip portals is routed to version 2, while the rest of the traffic uses the previous version 1, which has returned to its normal response time. The graph also shows that vip user requests carry extra load, as they need to access the discounts service.
If we examine the discounts service, we can see big differences between response time from version 1 versus version 2:
Once we have spotted a clear cause for the slower response, we can decide to move most of the traffic to version 1 but keep some of it on version 2 to gather more data and observe the differences. This limits the impact on the overall performance of the app.
We can use the Weighted Routing Wizard to send 90% of the traffic to version 1 and keep only 10% on version 2:
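A 90/10 split like this maps onto the `weight` field of an Istio HTTP route. The sketch below builds such a VirtualService as a Python dictionary. The subset names `v1` and `v2`, the resource name and the `svc.cluster.local` host are assumptions, and the sketch presumes a DestinationRule defining those subsets already exists; the wizard's actual output may differ.

```python
import json


def weighted_virtual_service(service, namespace, weights):
    """Split traffic across subsets; `weights` maps subset name to a percentage."""
    if sum(weights.values()) != 100:
        raise ValueError("route weights must add up to 100")
    host = f"{service}.{namespace}.svc.cluster.local"
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": service, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "http": [{
                # One destination per subset, each with its share of the traffic.
                "route": [
                    {"destination": {"host": host, "subset": subset}, "weight": weight}
                    for subset, weight in weights.items()
                ],
            }],
        },
    }


vs = weighted_virtual_service("discounts", "travel-agency", {"v1": 90, "v2": 10})
print(json.dumps(vs, indent=2))
```

Validating that the weights add up to 100 before building the resource catches a configuration mistake early, rather than leaving it to be rejected when the resource is applied.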
Once the Istio configuration is created we can enable Request percentage in the graph and examine the discounts service:
Kiali also lets us suspend traffic partially or totally for a specific destination using the Suspend Traffic Wizard:
This action stops traffic to a specific workload, or to the whole service, implementing a strategy to “fail sooner” and recover rather than letting slow requests flood the overall application.
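One way to express a total suspension in plain Istio terms is a fault-injection abort that answers every request with an error before it reaches the workload. The sketch below builds such a VirtualService; it illustrates the “fail sooner” idea under the assumption of a 503 status and a `svc.cluster.local` host, and it is not confirmed to be the exact resource the wizard generates. A partial suspension could instead set the weight of one subset to 0 in a weighted route.

```python
import json


def suspend_all_traffic(service, namespace, http_status=503):
    """Abort 100% of requests to the service using Istio HTTP fault injection."""
    host = f"{service}.{namespace}.svc.cluster.local"
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": service, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "http": [{
                # The sidecar answers with `http_status` instead of forwarding.
                "fault": {
                    "abort": {
                        "httpStatus": http_status,
                        "percentage": {"value": 100},
                    },
                },
                # A route is still declared, even though no request reaches it.
                "route": [{"destination": {"host": host}}],
            }],
        },
    }


print(json.dumps(suspend_all_traffic("discounts", "travel-agency"), indent=2))
```

Callers get an immediate error instead of waiting on a slow dependency, which keeps slow requests from piling up across the rest of the application.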
Microservices scenarios demand good observability tooling and practices. In this article we have shown how to combine Kiali capabilities to observe, define strategies and perform actions on an Istio-based microservices application.