Chaos Engineering for Kubernetes

Introduction To Chaos Engineering & Its Principles for Kubernetes

Why Chaos Engineering For Kubernetes?

Kubernetes is widely seen as the future of containerized application and software development. It doesn’t just read the description of the application that a developer writes; it also collects information about the current state of the infrastructure and uses both to make changes to that infrastructure. Once the infrastructure matches the developer’s specification, it is tempting to assume the job is done.

But many kinds of failure can occur in a Kubernetes cluster, especially one running stateful workloads. Have you ever thought about how your system will fare if it isn’t resilient? Resilience is how well a system withstands faults. A highly resilient system, for example one built from loosely coupled microservices that can be restarted and scaled easily, overcomes such faults without impacting users. Even if all the individual services in a Kubernetes architecture are functioning properly, the interactions between those services can cause outcomes that are hard to predict.

No system becomes reliable without experiencing failures. To counter failures and outages in both the system and its infrastructure, Chaos Engineering comes into play as a means to gauge system behavior, blast radius, and recovery times. In simple words, Chaos Engineering is the practice of injecting faults into a system before they occur naturally. It involves experimenting on a distributed system in order to build confidence in the system’s ability to withstand chaotic conditions. With today’s frequently changing and highly complex systems, chaos engineering needs to be accepted as an important approach to achieving resilient infrastructure. Through chaos engineering, unanticipated failure scenarios can be discovered and corrected before they cause problems for users.
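
To make the idea of fault injection concrete, here is a minimal sketch in Python. It runs against a toy, in-memory "service" rather than a real Kubernetes cluster, and every name in it (`Replica`, `Service`, `inject_replica_failure`, the pod names) is hypothetical and made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A hypothetical stand-in for one pod of a service."""
    name: str
    healthy: bool = True


@dataclass
class Service:
    """A toy service that can answer as long as one replica is healthy."""
    replicas: list = field(default_factory=list)

    def handle_request(self) -> bool:
        # The request succeeds if any replica is able to serve it.
        return any(r.healthy for r in self.replicas)


def inject_replica_failure(service: Service, count: int) -> list:
    """Mark up to `count` healthy replicas as failed and return their names."""
    failed = []
    for replica in service.replicas:
        if len(failed) == count:
            break
        if replica.healthy:
            replica.healthy = False
            failed.append(replica.name)
    return failed


# Inject a single fault and check whether users would notice.
service = Service([Replica(f"pod-{i}") for i in range(3)])
failed = inject_replica_failure(service, count=1)
print("failed replicas:", failed)
print("still serving:", service.handle_request())
```

In this sketch the service keeps answering after one replica is lost, which is the kind of behavior a chaos experiment tries to confirm before a real outage tests it for you.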

Who Should Use Chaos Engineering?

“Anyone wanting a resilient Kubernetes architecture must practice Chaos Testing.”

All businesses, developers, and SREs who want to reduce system failures, outages, and the revenue loss that comes with them should use chaos engineering tools to test their Kubernetes architecture. Netflix started applying Chaos Engineering through Chaos Monkey to strengthen its cloud infrastructure, and the practice has since become a de facto standard in the world of automated and resilient systems. From e-commerce to finance, companies across industries are adopting Chaos Testing and Chaos Engineering tools.

Principles Of Chaos Engineering

A few essential principles describe the ideal way to perform Chaos Engineering. Experiments are conducted to identify the weaknesses of a system. Once a weakness is discovered, it can be fixed before it shows up in the system in a more damaging way.
The following principles are adhered to when conducting chaos experiments:
I. Formulate a hypothesis: Start by stating how the system is expected to react when something goes wrong. This gives a hypothesis about a measurable output of the system, which the experiment can then confirm or refute.
II. Identify the variables: Chaos variables are a reflection of real-life events, so candidate variables for an experiment can be found by comparing the chaos environment with events that actually occur. Identify and prioritize the variables according to the probability and estimated impact of each event.
III. Automate the experiments: It is extremely important to tie the act of chaos itself to automated orchestration (steady-state condition checks) and analysis (data availability and integrity, health of application and storage deployments, and so on). Running experiments manually is labor-intensive and ultimately unsustainable, so Chaos Engineering builds automation into the system and runs the experiments continuously.
IV. Reduce the blast radius: To avoid drastic results, control the blast radius. Conduct experiments at a reduced blast radius so that any negative impact is minimal or short-lived.
V. Scale the blast radius: If a fault is identified at the reduced blast radius, the experiment has succeeded. If not, the blast radius can be scaled up step by step until a meaningful result is obtained. This helps improve the system’s real-life behavior.
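
The principles above can be sketched as a small Python simulation. This is only an illustration under stated assumptions: the fault names, the probability and impact estimates, the `success_rate` model, and the 0.99 threshold are all invented for the example, and nothing here calls a real chaos tool or a real cluster.

```python
from dataclasses import dataclass


@dataclass
class FaultVariable:
    """A candidate fault with hypothetical probability and impact estimates."""
    name: str
    probability: float  # estimated chance of the real-life event (0..1)
    impact: int         # estimated impact, 1 (minor) to 10 (severe)


def prioritize(variables):
    """Principle II: rank faults by probability x impact, highest first."""
    return sorted(variables, key=lambda v: v.probability * v.impact, reverse=True)


def success_rate(total_replicas: int, failed_replicas: int, capacity_needed: int) -> float:
    """Toy model: fraction of traffic the surviving replicas can serve."""
    surviving = total_replicas - failed_replicas
    return min(1.0, surviving / capacity_needed)


def run_experiment(total_replicas, capacity_needed, threshold, max_blast_radius):
    """Principles I, III, IV, V: automated run with a growing blast radius.

    Hypothesis: the success rate stays at or above `threshold`.
    Returns the first blast radius that breaks it, or None if none does.
    """
    for blast_radius in range(1, max_blast_radius + 1):  # start small, then scale
        observed = success_rate(total_replicas, blast_radius, capacity_needed)
        print(f"blast radius {blast_radius}: success rate {observed:.2f}")
        if observed < threshold:
            return blast_radius  # weakness found; stop widening the experiment
    return None


candidates = [
    FaultVariable("node drain", probability=0.30, impact=6),
    FaultVariable("pod kill", probability=0.60, impact=4),
    FaultVariable("disk full", probability=0.10, impact=9),
]
ranked = prioritize(candidates)
print("run first:", ranked[0].name)

weak_point = run_experiment(total_replicas=5, capacity_needed=3,
                            threshold=0.99, max_blast_radius=4)
print("hypothesis broke at blast radius:", weak_point)
```

With these made-up numbers, the hypothesis holds when one or two replicas fail and breaks at a blast radius of three, which is the point where the toy service no longer has enough capacity. In a real setup the simulated `success_rate` would be replaced by actual steady-state checks against the system under test.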


With microservices architecture replacing monolithic architecture, chaos engineering has become an essential and dynamic practice for making distributed systems robust and stable. By identifying possible faults and inconsistencies early, it results in a more resilient infrastructure.

Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you?
Join the #litmus channel on Kubernetes Slack for detailed discussion, feedback, and regular updates on Chaos Engineering for Kubernetes.
Check out the LitmusChaos GitHub repository.
