Chaos Engineering: A Journey to Resilience and Reliability

Syed Imam
Nine Publishing’s Product & Technology Team
6 min read · Jul 10, 2019
[Image Source: https://bit.ly/2FhLZbP]

Whether we like it or not, we live in a world of chaos. We are also constantly trying to manage that chaos as efficiently as we can, and the result of that effort is what we call order. But what about the people who manage it so well that things appear downright orderly? They have achieved a relative command over the conditions they operate in, compared to the rest of us. In practical terms, this shows up as a capacity to respond to unpredictable events and scenarios, a capacity that usually improves with experience.

The “what” and “why”

What if we could simulate the chaos itself in order to learn constantly, mature and achieve that desired mastery? It sounds great as an idea, though we don’t understand human dynamics well enough to know how it would play out for people. Large-scale distributed software systems, however, are something we understand well enough to run that experiment on, with the goal of making those systems more resilient and reliable. This idea gave birth to the discipline of Chaos Engineering, which follows a set of principles and is defined as:

… the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

You might be wondering: why bother? Realistically speaking, resilience and reliability correlate directly with a business’s revenue and reputation. But what does it mean for developers and platform engineers? It means proactively identifying application failures before they find themselves reacting to an unpredictable incident that causes unwanted downtime and grief. A few key benefits of the practice have been summarised in this article.

Are you ready?

Just because something brings excellent benefits doesn’t necessarily mean a business is ready to embrace it unconditionally. Before kicking off a Chaos Engineering journey, some readiness checks are needed. Without delving into too much detail, they consist of the following steps:

  • Review the application architecture to identify failure points, dependencies, impacts, and the procedures to recover from those failures.
  • Identify exceptions (e.g. stateful apps) that may not be ideal targets for chaos experiments.
  • Have the chaos-testing readiness of the targeted components approved by the relevant stakeholders (e.g. the application/platform owner).

I’m ready, what’s next?

Bringing chaos into practice is something that evolves over time. A relevant article, published on the Capital One tech blog, emphasises the same point.

The practice should be adopted incrementally from lower environments all the way to live production systems. The practice should mature over time and eventually become part of standard development with developers improving systems until they’re not even aware of chaos injection schedule, instead relying on resilient systems to handle it all the time.

The actual chaos experiments can then be run following one of two approaches, depending partly on which stage of the chaos journey one is in and which services or components are being targeted: a GameDay, or “chaos as you go”.

A GameDay refers to a specific day or set of hours reserved for running a chaos experiment, whereas “chaos as you go” is more of a scheduled (and possibly randomised) approach to causing chaos. In both cases, the relevant parties must be involved and the associated experiments have to be planned, scoped and agreed upon.

As a starting point, some common chaos experiments are listed below (a minimal sketch of the second one follows the list):

  1. Resource exhaustion (e.g. CPU, memory), either by allocating fewer resources or by increasing the load on existing ones.
  2. Terminating hosts randomly. Here, a host can be a virtual machine in the cloud or an on-prem data centre, or a pod or container in a microservices environment.
  3. Database failover to secondary.
  4. DNS unavailability.
  5. Blocking access to storage.
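
To give a concrete feel for how small such an experiment can start, here is a minimal sketch of the second item, random pod termination, using the official Kubernetes Python client. The namespace and function name are illustrative, and a real run should of course be scoped and agreed upon as described above.

```python
# Minimal sketch of a random pod termination experiment.
# The namespace is illustrative, not a real target.
import random

from kubernetes import client, config


def delete_random_pod(namespace: str = "development") -> None:
    """Pick one running pod in the namespace at random and delete it."""
    config.load_kube_config()  # or load_incluster_config() when run as a pod
    v1 = client.CoreV1Api()

    pods = v1.list_namespaced_pod(namespace).items
    running = [p for p in pods if p.status.phase == "Running"]
    if not running:
        return

    victim = random.choice(running)
    print(f"Deleting pod {victim.metadata.name} in {namespace}")
    v1.delete_namespaced_pod(victim.metadata.name, namespace)


if __name__ == "__main__":
    delete_random_pod()
```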

There are many tools available these days for running chaos experiments, and some of the most common ones are listed here.

Chaos Engineering at Nine Publishing

Since migrating almost all of our services to Kubernetes running in AWS, the platform has been relatively stable. It’s fair to say that we have had fewer incidents and little to no downtime on our customer-facing sites, in spite of a substantial increase in both the number of services and the reader and subscriber base. This level of confidence makes the platform an even better candidate for the chaos engineering discipline.

Humble Beginning!

One of the many benefits of deploying applications in Kubernetes is that it is self-healing: a potential pod failure is taken care of by the ReplicaSet. Hence, we started with a very basic experiment of deleting a random pod in our pre-prod environments (i.e. development and test). To achieve that within our microservices ecosystem, we looked at deploying a tool called kube-monkey, an implementation of Netflix’s Chaos Monkey designed for Kubernetes clusters. During the initial exploration, we discovered a bug in the relevant Helm chart. Our Kubernetes Admission Control enforces that resources are set for anything deployed in the clusters, and the kube-monkey Helm chart was failing to deploy due to an indentation issue in its resources block. A pull request was raised upstream to fix the problem. However, soon after deploying kube-monkey successfully, we discovered a similar tool called Chaoskube, which fortunately already had a chart in the stable Helm chart repo, and that is what we eventually implemented.

Consideration and Contribution

A few considerations we’ve made to stay on the safer side of the experiment:

  • We introduced an annotation to label critical pods so that they are excluded from the chaos attack (a sketch of this filtering follows the list).
  • The kube-system namespace has been deliberately left out of the experiment, as it would require more maturity in the journey to play with something as sensitive as dns-controller or kube-proxy.
  • We started by ruling out weekends, public holidays and hours outside business hours, before we felt confident enough to run the experiment 24x7 (thanks to automation!).
  • The random pod deletion interval was initially 15 minutes; we later settled on 60 minutes.
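
To make the first three points concrete, the sketch below shows the kind of victim-selection filtering they describe. This is not Chaoskube’s actual implementation; the annotation key, business hours and helper function are purely illustrative.

```python
# Sketch (not Chaoskube's code) of the safety rules described above:
# skip annotated critical pods, skip kube-system, and only act inside
# an allowed time window.
import datetime

from kubernetes.client import V1Pod

EXEMPT_ANNOTATION = "example.com/chaos-exempt"  # hypothetical annotation key
PROTECTED_NAMESPACES = {"kube-system"}


def is_eligible(pod: V1Pod, now: datetime.datetime) -> bool:
    """Return True if the pod may be targeted by the chaos experiment."""
    if pod.metadata.namespace in PROTECTED_NAMESPACES:
        return False

    annotations = pod.metadata.annotations or {}
    if annotations.get(EXEMPT_ANNOTATION) == "true":
        return False

    # Business-hours guard of the kind we started with before going 24x7
    # (hours are illustrative; public holidays omitted for brevity).
    if now.weekday() >= 5 or not (9 <= now.hour < 17):
        return False

    return True
```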

Fast-forward a few weeks, and we made a small contribution to the Chaoskube project and its Helm chart by adding a JSON-formatted log feature, which helps us run better queries with our Prometheus-Grafana monitoring duo. On that note, Chaoskube logs an event whenever a pod gets deleted from a particular namespace.
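
Chaoskube itself is written in Go, so the snippet below is only a rough Python illustration of why a structured JSON event is easier to query downstream than a free-text log line; the field names are illustrative rather than Chaoskube’s exact schema.

```python
# Illustration only: emit a structured JSON event for a chaos pod deletion,
# which log tooling can filter by field instead of parsing free text.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("chaos")


def log_pod_termination(namespace: str, pod: str) -> None:
    """Emit a JSON-formatted event for a pod deleted by the experiment."""
    event = {
        "msg": "terminating pod",
        "namespace": namespace,
        "pod": pod,
    }
    logger.info(json.dumps(event))


log_pod_termination("development", "example-deployment-5f7c9d-abcde")
```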

Learning and the Path Ahead

We are still at an early stage of our chaos experiments in pre-production environments. The most important lesson so far has been a reinforced sense of confidence that the Kubernetes ecosystem can survive reasonable disruption. More importantly, no stateful applications (e.g. Elasticsearch, Kafka) have been excluded from the experiment, and they have still maintained service continuity over the last six months or so in this baseline chaotic environment.

Without being too optimistic, have we already realised the full potential of the chaos engineering discipline? The answer is no. The truth is, we still have a long way to travel. But in the short term, we can take some more concrete steps to reap more benefits from it.

  • We can run one or more GameDays using the existing Chaoskube deployments in the test and development clusters, for example by simply reducing the attack interval from 60 minutes to 15 minutes to 5 minutes.
  • The Chaoskube deployment can easily be extended to the production environment, starting with a much safer attack interval and tuning it later based on what we learn.
  • We can introduce chaos at the Kubernetes node (i.e. EC2 instance) level by using something similar to Chaos Monkey (a rough sketch of the idea follows this list).
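
As a rough sketch of what that node-level chaos could look like on AWS, the snippet below terminates one random EC2 instance matching a cluster tag using boto3. Nothing like this is running in our environment yet, and the tag name is illustrative.

```python
# Rough sketch of node-level chaos on AWS: terminate one random running
# EC2 instance belonging to the targeted cluster. The tag filter is
# illustrative, not our actual tagging scheme.
import random

import boto3


def terminate_random_node(cluster_tag_value: str) -> None:
    """Terminate one running instance tagged as part of the target cluster."""
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:KubernetesCluster", "Values": [cluster_tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instances = [i for r in reservations for i in r["Instances"]]
    if not instances:
        return

    victim = random.choice(instances)["InstanceId"]
    print(f"Terminating node {victim}")
    ec2.terminate_instances(InstanceIds=[victim])
```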

Once the primary phase of chaos experiments is over, enough has been learned, and that learning has been fed back into our systems and processes, more critical components of the platform (e.g. DNS) can be experimented with. In reality, controlled chaos should never stop, since we would much rather not find ourselves in an unexpected one.
