Building Resilient Systems with Chaos Engineering

Rahul Ranganathan
Google Cloud - Community
6 min read · Jul 27, 2023

Introduction

Let’s say your customer is in the middle of an online transaction. The customer is either transferring funds through internet banking, executing a stock market transaction, or withdrawing cash at the ATM. Suddenly, the website crashes, there is an unexpected power outage at the bank’s data center, or the system breaks. The money does get deducted, but the transaction fails. The consequence? Lost revenue, a frustrated customer, negative user sentiment, and damage to the brand’s reputation.

Does this sound familiar?

While testing is standard practice in software development, it’s not always easy to foresee issues that can happen in production, especially as systems become increasingly complex to deliver maximum customer value.

The adoption of microservices enables faster release times and more possibilities than we’ve ever seen before, but it also introduces challenges. Now that systems are hosted on globally distributed infrastructure, it’s hard to predict which failures might occur. It can be challenging for IT professionals to think beyond their specific focus area when planning for outages.

Many organizations invest in high availability and disaster recovery for their key applications. But when an issue arises, can they triage it, identify the root cause, and restore the application to a steady state in a minimum amount of time?

Incident response management, alerting, metrics and logging, and disaster recovery are all valuable, but they are all reactive. They aren’t bad, but they aren’t sufficient: they focus on time-to-detect and time-to-remediate. We need proactive methods.

Can we introduce controlled chaos into our systems and observe the failures in order to build highly resilient applications? Yes, we can, with Chaos Engineering.

History of Chaos Engineering

Chaos engineering was pioneered at Netflix in 2010, where they developed a service called Chaos Monkey, which would randomly terminate VM instances or containers in the production environment. The disruptions caused by Chaos Monkey are designed to emulate real-world events, such as server or data center failure. With this tool, the company aimed to ensure the termination of an Amazon Elastic Compute Cloud (EC2) instance wouldn’t affect the overall service experience. The learnings from Chaos Monkey forced engineering teams to build more fault-tolerant solutions.

What is Chaos Engineering?

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Chaos engineering is a new branch of testing and tuning an application with a focus on disruptive environmental issues not typically considered by standard Performance Testing. It involves doing thoughtful experiments designed to replicate turbulence in a system.

The name “Chaos Engineering” comes from chaos theory, which analyzes seemingly “chaotic” or random phenomena and finds the systematic patterns underlying them. Chaos engineering seeks to reproduce what would normally be considered “unforeseen” events, such as a server outage, in a predictable and systematic way.

The aim of this seemingly destructive practice is to help development teams identify vulnerabilities in their architecture that come to the surface during these generated system failures. The practice brings together a cross-functional team to help companies understand the blast radius of applications within their production environment and how they can use automation to fail over seamlessly when outages happen.

We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. We must address the most significant weaknesses proactively, before they affect our customers in production. We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.
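Two of the weaknesses above, retry storms and missing fallbacks, can be illustrated with a small client-side sketch. This is a minimal example under stated assumptions, not a recommended production implementation: `call_with_retries`, `FlakyService`, and `cached_balance` are hypothetical names invented for illustration, and the delay values are arbitrary. It bounds the number of attempts, spreads retries out with capped exponential backoff and random jitter, and falls back to a degraded answer when the retry budget runs out.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `operation` a bounded number of times, then use `fallback`."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                break  # retry budget exhausted, stop hammering the dependency
            # Jitter: wait a random time up to a capped exponential bound,
            # so that many clients do not retry in lockstep.
            bound = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, bound))
    return fallback()


class FlakyService:
    """Hypothetical dependency that fails a fixed number of times."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def get_balance(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("dependency unavailable")
        return "live balance"


def cached_balance() -> str:
    """Hypothetical degraded response used when the dependency stays down."""
    return "cached balance"


if __name__ == "__main__":
    service = FlakyService(failures=2)
    print(call_with_retries(service.get_balance, cached_balance))
```

A chaos experiment that makes the dependency unavailable is one way to check that settings like these behave as intended under real failure conditions.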

Principles of Chaos Engineering

  1. Define your ‘steady state’ (a measurable output of a system that indicates normal behavior).
  2. Make a hypothesis. What do you think will happen?
  3. Introduce variables that reflect real-world events: crashed servers, severed network connections, latency, unavailability of nodes etc.
  4. Execute chaos experiments to disprove the hypothesis.
  5. Analyze results and retest.
  6. Analyze and refactor or re-architect applications.
  7. Automate the entire process.
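The first steps of this loop can be sketched against a toy, in-memory service. Everything here is a simulation written for illustration: `Replica`, `Service`, `success_rate`, and `run_experiment` are hypothetical names, the 99% threshold is an arbitrary assumption, and no real infrastructure is touched. The steady state is the request success rate, the hypothesis is that it stays above the threshold when one replica is down, and the injected fault marks one replica unhealthy.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    healthy: bool = True


@dataclass
class Service:
    """Toy service: a request succeeds if it lands on a healthy replica,
    or if a single retry against a different replica succeeds."""

    replicas: list
    retry_on_failure: bool = True

    def handle_request(self, rng: random.Random) -> bool:
        first = rng.choice(self.replicas)
        if first.healthy:
            return True
        if self.retry_on_failure:
            others = [r for r in self.replicas if r is not first]
            return bool(others) and rng.choice(others).healthy
        return False


def success_rate(service: Service, rng: random.Random, requests: int = 1000) -> float:
    ok = sum(service.handle_request(rng) for _ in range(requests))
    return ok / requests


def run_experiment(service: Service, threshold: float = 0.99, seed: int = 42) -> dict:
    rng = random.Random(seed)
    # Step 1: measure the steady state before injecting anything.
    baseline = success_rate(service, rng)
    # Step 2: hypothesis: success rate stays >= threshold with one replica down.
    # Step 3: introduce a real-world event by taking one replica out.
    victim = rng.choice(service.replicas)
    victim.healthy = False
    try:
        # Step 4: run the experiment and measure again.
        during = success_rate(service, rng)
    finally:
        victim.healthy = True  # always roll the fault back
    # Step 5: compare the result with the hypothesis.
    return {
        "baseline": baseline,
        "during": during,
        "hypothesis_held": during >= threshold,
    }


if __name__ == "__main__":
    replicas = [Replica("a"), Replica("b"), Replica("c")]
    print(run_experiment(Service(replicas, retry_on_failure=True)))
    print(run_experiment(Service(replicas, retry_on_failure=False)))
```

In this toy model the hypothesis holds when the client retries against another replica and is disproved when it does not, which is the kind of finding that feeds the refactoring step.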

Benefits of Chaos Engineering

  1. Boosts resilience and reliability: Chaos testing lessons foster a resilient culture, guaranteeing consistent performance and readiness for unforeseen difficulties.
  2. Fuels innovation: Observations gathered by deliberately introducing controlled disturbances into software systems enable engineers to make design adjustments that improve robustness and raise production quality.
  3. Improved collaboration among technical teams: Chaos testing helps foster collaboration among software engineers, architects, DevOps, system administrators, and security and networking teams.
  4. Reduced costs: Chaos testing can help to reduce the costs associated with system failures, security incidents, and performance bottlenecks.
  5. Increased customer satisfaction: With fewer outages and less downtime, businesses are able to respond to customer requirements rapidly and increase revenue.

Chaos Testing Tools: Chaos Mesh

Chaos Mesh is an open-source Chaos Engineering platform for Kubernetes. Chaos Mesh was accepted to CNCF on July 14, 2020 and is at the Incubating project maturity level. It offers various types of fault simulation and extensive capabilities for orchestrating fault scenarios.

Using Chaos Mesh, you can conveniently simulate various abnormalities that might occur in reality in development, testing, and production environments, and find potential problems in the system. To lower the threshold for a Chaos Engineering project, Chaos Mesh provides a visual web interface. You can easily design your chaos scenarios in the Web UI and monitor the status of chaos experiments.

Installing Chaos Mesh on the GKE Cluster:

A simple one-liner can install Chaos Mesh on your Kubernetes cluster. For more advanced production scenarios, it can be installed with Helm.

curl -sSL https://mirrors.chaos-mesh.org/v2.6.1/install.sh | bash

After running this script, Chaos Mesh automatically installs the CustomResourceDefinitions (CRDs) that match the version, all required components, and the related Service Account configurations.

Chaos Mesh Dashboard


Next, we create an experiment to test our hypothesis about system behavior. We will run an experiment that kills pods in our application.

Multiple fault injection experiment options

Under Kubernetes, we select Pod Fault and choose the Pod Kill option for the specified namespace.

Let’s choose to target a specific pod

Let’s observe the experiment results in the Chaos Mesh Dashboard and verify them in the console. Chaos Mesh deleted the pod, and Kubernetes started a replacement. By increasing the duration of the test, you can observe how the fault affects your application’s user experience.
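The same pod-kill experiment can also be expressed declaratively as a Chaos Mesh PodChaos resource instead of being built in the dashboard. The sketch below generates such a manifest as JSON, which kubectl accepts alongside YAML. The helper `pod_kill_manifest` and the names `pod-kill-demo`, `demo`, and `web-0` are hypothetical placeholders rather than values from the walkthrough, and the sketch assumes the experiment object lives in the same namespace as the target pod.

```python
import json


def pod_kill_manifest(name: str, namespace: str, pod_name: str) -> dict:
    """Build a PodChaos manifest that kills one specific pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",
            # Target a specific pod: map of namespace -> list of pod names.
            "selector": {"pods": {namespace: [pod_name]}},
        },
    }


if __name__ == "__main__":
    # Placeholder names; replace them with values from your own cluster.
    manifest = pod_kill_manifest("pod-kill-demo", "demo", "web-0")
    print(json.dumps(manifest, indent=2))
```

Saving the output to a file and applying it with `kubectl apply -f` should create the same kind of experiment that the dashboard creates, which makes it easier to keep experiments in version control and automate them.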

Summary

Chaos Engineering is a powerful practice that is already changing how software is designed and engineered. The overarching goal of Chaos Engineering is to improve the reliability of our applications and systems by testing how they handle failure.
Where other practices address velocity and flexibility, Chaos specifically tackles systemic uncertainty in these distributed systems.
The Principles of Chaos provide confidence to innovate quickly at massive scales and give customers the high quality experiences they deserve.
