Published in Airwalk Reply

Chaos Testing with Istio

Photo by Soheb Zaidi on Unsplash

Written by James Mak, Senior Consultant at Airwalk Reply.

Kubernetes (K8s) and Istio have cut down on the emergency phone calls at 2am by providing a feasible way to handle failures at both the infrastructure level and the application level. The Kubernetes Cluster Autoscaler can scale the number of nodes in the cluster up or down based on usage. K8s also continuously monitors the health of application pods and restarts them if necessary, while Istio adds granular control over ingress and egress traffic at the cluster level and the pod level. Together they improve the user experience by reducing interruptions and increasing service availability.

However, even though K8s and Istio have lots of useful features, it is always better to prepare for the worst before the worst comes. This is where Chaos testing comes in. We explore what happens when different components in our system break. Of course, this is carried out in a controlled environment, where we devise ways to break the system: for example, reducing infrastructure capacity, creating high load on compute resources, or causing network outages and application failures. Any outage scenario you can think of, common or uncommon, can be included in your Destroyer plan.

On the other hand, we also need a Savior repair strategy to get things restored once Doomsday occurs. We need to experiment with this plan and assess whether it returns our configuration to the stable state we want. In this way we build confidence that the service mesh can tolerate failing nodes and can prevent localised failures from cascading to other nodes.

It’s becoming popular for enterprise IT teams to hold a Game Day so that their staff can rehearse such situations.

Technically speaking, Envoy, an open source lightweight proxy, is the building block of Istio. Envoy runs as a sidecar alongside the application container in each Kubernetes workload pod and acts as a gateway between the workload and the rest of the service mesh. It intercepts all inbound and outbound traffic to and from the app workload, so we can use Envoy's versatile routing features to manipulate that traffic.
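
As a rough sketch of how workloads end up with an Envoy proxy next to them, a namespace can be labelled for automatic sidecar injection. This assumes automatic injection is enabled in your Istio installation, and the namespace name chaos-demo is purely hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: chaos-demo            # hypothetical namespace name
  labels:
    istio-injection: enabled  # asks Istio to inject the Envoy sidecar into new pods
```

Pods created in this namespace after the label is applied should start with an extra proxy container; pods that were already running need to be recreated to pick it up.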

In the following, I will focus on using Istio to carry out Chaos testing, where network delays and HTTP error responses are introduced to emulate network issues in microservice-based applications.

Prerequisites

  1. Basic knowledge of Kubernetes
  2. Basic knowledge of Istio

Istio ingress traffic diagram

A client request first reaches the Istio Ingress Gateway, which matches it against the Virtual Service and the Destination Rule (if any). Based on that routing configuration, the request is dispatched to the backend.
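
The virtual service examples in this article bind to a gateway called ingress-gateway. A minimal sketch of what such a Gateway resource might look like is shown here. The selector label, port and wildcard host are assumptions based on a default Istio installation, not details taken from the original setup.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: ingress-gateway
spec:
  selector:
    istio: ingressgateway   # assumed label of the default ingress gateway pods
  servers:
  - port:
      number: 80            # assumed plain HTTP listener
      name: http
      protocol: HTTP
    hosts:
    - "*"                   # assumed: accept any host; narrow this in practice
```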

Istio provides two kinds of HTTP fault injection at the Virtual Service level:

  1. HTTP delay fault
  2. HTTP abort fault

We can use an HTTP delay fault to introduce network latency when the request reaches the Ingress Gateway. The Envoy proxy response flag will be set to DI, indicating that request processing was delayed for the period specified via fault injection. For more granular control, you can specify the percentage of traffic you want to delay. The following example YAML creates a virtual service that injects a five-second delay into ALL matched virtual service traffic.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-vs
spec:
  hosts:
  - backend
  http:
  - fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 5s
    route:
    - destination:
        host: backend
  gateways:
  - ingress-gateway
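
To illustrate the percentage control mentioned above, here is a sketch of the same virtual service delaying only part of the traffic. The 10 percent figure is an arbitrary assumption chosen for illustration; everything else mirrors the example above.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-vs
spec:
  hosts:
  - backend
  gateways:
  - ingress-gateway
  http:
  - fault:
      delay:
        percentage:
          value: 10        # assumed figure: delay roughly 1 in 10 requests
        fixedDelay: 5s
    route:
    - destination:
        host: backend
```

Starting with a small percentage and raising it gradually is one way to limit the blast radius while you observe how clients cope with the added latency.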

Next comes the HTTP abort fault. The following example YAML returns the HTTP response code “500 Internal Server Error” to the client for matched traffic. The Envoy proxy response flag will be set to FI, indicating that the request was aborted with the response code specified via fault injection.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-vs
spec:
  hosts:
  - backend
  http:
  - fault:
      abort:
        httpStatus: 500
        percentage:
          value: 100
    route:
    - destination:
        host: backend
  gateways:
  - ingress-gateway
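
If you would rather not fail every request, one possible approach is to scope the abort to requests that carry a dedicated test header and let all other traffic through untouched. In the sketch below, the header name x-chaos-test, the 503 status and the 50 percent figure are all hypothetical choices made for illustration.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: test-vs
spec:
  hosts:
  - backend
  gateways:
  - ingress-gateway
  http:
  # Rule 1: only requests carrying the hypothetical test header can be aborted.
  - match:
    - headers:
        x-chaos-test:
          exact: "true"
    fault:
      abort:
        httpStatus: 503    # assumed status code for this experiment
        percentage:
          value: 50        # assumed figure: abort about half of the matched requests
    route:
    - destination:
        host: backend
  # Rule 2: all remaining traffic is routed normally, with no fault.
  - route:
    - destination:
        host: backend
```

Rules are evaluated in order, so the catch-all route has to come last.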

You can use an Istio Virtual Service to do Chaos testing at the application layer transparently, by injecting delays or HTTP errors into your services without actually updating your app code. Testing the system in distress to ensure its resilience is extremely important for modern microservice applications with little tolerance for downtime.

For a more orchestrated Chaos Engineering platform, Chaos Mesh is an option. It covers not only Network Chaos but also Pod Chaos, DNS Chaos, IO Chaos and more, and it visualises the operations.
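
For comparison, a network delay experiment in Chaos Mesh might look roughly like the sketch below. It assumes Chaos Mesh is installed in the cluster and that the target pods carry the label app: backend in the default namespace. The resource name and all values are hypothetical, and the field names should be checked against the Chaos Mesh version you run.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: backend-delay      # hypothetical experiment name
spec:
  action: delay            # inject latency rather than loss or partition
  mode: all                # apply to every pod that matches the selector
  selector:
    namespaces:
    - default              # assumed namespace
    labelSelectors:
      app: backend         # assumed pod label
  delay:
    latency: "5s"
  duration: "60s"          # assumed: stop the experiment after one minute
```

Unlike the Istio examples, this delay is applied at the pod's network level rather than by the Envoy proxy, so it is not limited to HTTP traffic.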
