Chaos Testing with Istio

James Mak
Published in Airwalk Reply
Jan 24, 2022
Photo by Soheb Zaidi on Unsplash

Written by James Mak, Senior Consultant at Airwalk Reply.

Kubernetes (K8s) combined with Istio has spared many engineers an emergency phone call at 2am, because together they provide a practical way to handle failures at both the infrastructure level and the application level. The Kubernetes Cluster Autoscaler scales the number of nodes in the cluster up or down based on usage. K8s also continuously monitors the health of application pods and restarts them if necessary, while Istio adds granular control over ingress and egress traffic at the cluster level and the pod level. Together they improve the user experience by reducing interruptions and increasing service availability.

However, even though K8s and Istio offer many useful features, it is always better to prepare for the worst before it arrives. This is where Chaos testing comes in: we explore what happens when different components in our system break. This is carried out in a controlled environment, where we deliberately devise ways to break the system, for example by reducing infrastructure capacity, putting compute resources under high load, or causing network outages and application failures. Any outage scenario you can think of, common or uncommon, can be included in your Destroyer plan.

On the other hand, we also need a Savior repair strategy to restore things once Doomsday occurs. We need to exercise this plan and assess whether it returns our configuration to the stable state we want. This builds confidence that the service mesh can tolerate failing nodes and can prevent localised failures from cascading to other nodes.

It’s becoming popular for enterprise IT teams to hold a Game Day to rehearse their response to such situations.

Technically speaking, Envoy, an open source lightweight proxy, is the building block of Istio. Envoy runs as a sidecar alongside the application container in the Kubernetes workload pod and acts as a gateway between the workload and the rest of the service mesh. It intercepts all inbound and outbound traffic to and from the app workload, so we can use its versatile routing features to manipulate that traffic.

In the following, I will focus on using Istio to carry out Chaos testing by introducing network delays and HTTP error responses to emulate network issues in microservice-based applications.

Prerequisites

  1. Basic knowledge of Kubernetes
  2. Basic knowledge of Istio
Istio Ingress traffic diagram

A client request first reaches the Istio Ingress Gateway, which matches it against the Virtual Service and the Destination Rule (if any). Based on the routing configuration, the request is then dispatched to the Backend.
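
The Virtual Service examples in this article bind to a gateway called ingress-gateway, whose definition is not shown. As a rough sketch only, a matching Gateway resource might look like the following. The selector label, port 80 and the wildcard host are assumptions based on a default Istio installation and are not taken from the original setup.

```yaml
# Hypothetical Gateway matching the "ingress-gateway" name used by the
# Virtual Service examples. Selector, port and hosts are assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: ingress-gateway
spec:
  selector:
    istio: ingressgateway   # label of the default Istio ingress gateway pods
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"                   # accept any host; narrow this in a real cluster
```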

Istio provides two kinds of HTTP fault injection at the Virtual Service level:

  1. HTTP delay fault
  2. HTTP abort fault

We can use an HTTP delay fault to introduce network latency when the request reaches the Ingress Gateway. The Envoy proxy response flag will be set to DI, indicating that request processing was delayed for the period specified via fault injection. For more granular control, you can specify the percentage of traffic you want to delay. The following example YAML file creates a virtual service that injects a five-second delay into ALL matched virtual service traffic.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-vs
spec:
  hosts:
  - backend
  http:
  - fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 5s
    route:
    - destination:
        host: backend
  gateways:
  - ingress-gateway
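
To delay only a share of requests rather than all of them, lower the percentage value. The following is a minimal sketch that reuses the names from the example above (test-vs, backend, ingress-gateway). The 10 percent figure is an arbitrary illustration, not a recommendation.

```yaml
# Sketch: delay roughly 10% of matched requests by five seconds.
# The 10% value is an illustrative assumption.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-vs
spec:
  hosts:
  - backend
  http:
  - fault:
      delay:
        percentage:
          value: 10        # only about 1 in 10 requests is delayed
        fixedDelay: 5s
    route:
    - destination:
        host: backend
  gateways:
  - ingress-gateway
```

Starting with a small percentage and raising it gradually keeps the experiment controlled while still surfacing how clients cope with slow responses.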

Next comes the HTTP abort fault. In the following example YAML, the HTTP response code “500 Internal Server Error” is returned to the client for matched traffic. The Envoy proxy response flag will be set to FI, indicating that the request was aborted with the specified response code.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-vs
spec:
  hosts:
  - backend
  http:
  - fault:
      abort:
        httpStatus: 500
        percentage:
          value: 100
    route:
    - destination:
        host: backend
  gateways:
  - ingress-gateway
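
To keep an experiment away from regular users, one option is to apply the fault only to requests that carry a marker header and to route everything else normally. The sketch below assumes a hypothetical header named x-chaos-test. The header name, the 503 status and the 50 percent value are illustrative choices and do not come from the original setup.

```yaml
# Sketch: abort half of the requests that carry a test header,
# and pass all other traffic through untouched.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-vs
spec:
  hosts:
  - backend
  http:
  - match:
    - headers:
        x-chaos-test:      # hypothetical header set by the test client
          exact: "true"
    fault:
      abort:
        httpStatus: 503
        percentage:
          value: 50
    route:
    - destination:
        host: backend
  - route:                 # default route for requests without the header
    - destination:
        host: backend
  gateways:
  - ingress-gateway
```

Routes are evaluated in order, so the rule with the header match has to come before the default route.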

You can use an Istio Virtual Service to do Chaos testing at the application layer transparently, by injecting delays or HTTP errors into your services without actually updating your app code. Testing the system in distress to ensure its resilience is extremely important for modern microservice applications with little tolerance for downtime.

For a more fully orchestrated Chaos Engineering platform, Chaos Mesh is an option. It covers not only Network Chaos but also Pod Chaos, DNS Chaos, IO Chaos and more, and it visualises the operations.
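
For comparison, a network delay in Chaos Mesh is declared as its own custom resource rather than as part of a routing rule. The following is a rough sketch of a NetworkChaos resource, assuming Chaos Mesh is installed in the cluster and that the target pods carry an app: backend label in the default namespace. The resource name, namespace, label and durations are all assumptions, and the field names should be checked against the Chaos Mesh version in use.

```yaml
# Sketch: add five seconds of latency to pods labelled app=backend for 60s.
# Name, namespace, label and durations are illustrative assumptions.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: backend-delay
  namespace: default
spec:
  action: delay
  mode: all                # apply to every pod the selector matches
  selector:
    namespaces:
    - default
    labelSelectors:
      app: backend
  delay:
    latency: "5s"
  duration: "60s"
```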
