Chaos Engineering and its evolution

Gaurav Kumar Srivastava
Published in stackspacearena · Nov 20, 2020

History: Netflix and its Chaos Monkey

The concept of chaos engineering was born when Netflix was moving to cloud services and AWS was still quite new. Netflix faced major availability issues in its streaming service, as cloud services back then were in their infancy; one common issue was that instances would occasionally blink out of existence with no warning.

Although there were many methods of building systems resilient to this form of failure, Netflix wanted to proactively put a streamlined process in place to make its systems ready for such turbulent scenarios. It set up a team, and Casey Rosenthal was given the responsibility of conceptualizing chaos engineering.

Chaos Monkey

Source: https://www.gremlin.com/chaos-monkey/

As a first step towards dealing with cloud instances going off without warning, Netflix came up with an application that would go through a list of clusters, pick one instance at random from each cluster, and at some point during business hours turn it off without warning. It would do this every workday. This provided a way to test system resiliency during business hours. Chaos Monkey still exists as part of the Netflix Simian Army, which we will discuss shortly.
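The core loop is simple enough to sketch. Below is a minimal, illustrative Python sketch of the idea using boto3; the "cluster" tag convention is an assumption for this example, not Netflix's actual implementation (which is open source and built around Spinnaker).

```python
# Illustrative sketch of the Chaos Monkey idea -- NOT Netflix's implementation.
# Assumes instances are grouped into "clusters" via an EC2 tag named "cluster"
# (a hypothetical convention for this example).
import random
from collections import defaultdict

import boto3

ec2 = boto3.client("ec2")

def instances_by_cluster():
    """Group running instance IDs by their (hypothetical) 'cluster' tag."""
    clusters = defaultdict(list)
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                if "cluster" in tags:
                    clusters[tags["cluster"]].append(instance["InstanceId"])
    return clusters

def unleash_monkey():
    """Terminate one random instance per cluster, without warning."""
    for cluster, ids in instances_by_cluster().items():
        victim = random.choice(ids)
        print(f"Terminating {victim} from cluster {cluster}")
        ec2.terminate_instances(InstanceIds=[victim])

if __name__ == "__main__":
    unleash_monkey()  # in practice, scheduled to run on workdays during business hours
```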

Chaos Engineering

The super-formal definition, as settled upon by Casey Rosenthal and his team at Netflix, was: “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

The definition also mentions “turbulent conditions in production” to highlight that this isn’t about creating chaos. Chaos Engineering is about making the chaos inherent in the system visible.

It is important to understand that the main aim is to make the system more resilient. Resiliency is the ability of a system to withstand failures of components, networks, or databases, and to continue delivering service under such circumstances, or even rarer ones.

Principles of Chaos Engineering

The standard set of principles upon which chaos engineering is built is largely inspired by Karl Popper’s principle of falsifiability. The falsification principle, proposed by Popper, is a way of demarcating science from non-science: for a theory to be considered scientific, it must be testable and capable of being proven false.

Following are the five advanced practices that set the gold standard for a Chaos Engineering practice:

  • Build a hypothesis around steady-state behavior: This focuses on the output of the system over a short period of time under normal circumstances to define a proxy for the steady state of the system. The overall system’s throughput, error rates, latency percentiles, etc. could all be metrics of interest representing steady-state behavior (see the sketch after this list).
  • Vary real-world events: Identify potential events capable of disrupting the steady state of the system. Events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event can all be classified as real-world events.
  • Run experiments in production: It is strongly advised to experiment directly on production traffic. Since the behavior of utilization can change at any time, sampling real traffic is the only way to reliably capture the request path.
  • Automate experiments to run continuously: Automate disruption-causing experiments and run them continuously, both for orchestrating and for analyzing system behavior.
  • Minimize blast radius: Since chaos engineering experiments are run in the production environment with real-life traffic, there is always a risk of impact on customer experience. While there must be an allowance for some short-term negative impact, it is the responsibility and obligation of the chaos engineer to ensure that the fallout from experiments is minimized and contained.
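To make the first practice concrete, here is a minimal sketch of a steady-state hypothesis check in Python. The metrics endpoint, metric names, and thresholds are all hypothetical placeholders; in a real practice these would come from your monitoring system.

```python
# Minimal sketch of a steady-state hypothesis check.
# The metrics endpoint and thresholds below are hypothetical placeholders;
# real values would come from your monitoring system (Prometheus, Atlas, etc.).
import requests

METRICS_URL = "https://metrics.example.com/api/summary"  # hypothetical endpoint
MAX_ERROR_RATE = 0.01      # hypothesis: error rate stays below 1%
MAX_P99_LATENCY_MS = 500   # hypothesis: p99 latency stays below 500 ms

def steady_state_ok() -> bool:
    """Return True if the system still matches our steady-state hypothesis."""
    metrics = requests.get(METRICS_URL, timeout=5).json()
    return (
        metrics["error_rate"] <= MAX_ERROR_RATE
        and metrics["latency_p99_ms"] <= MAX_P99_LATENCY_MS
    )

# Typical experiment flow: verify steady state, inject the fault,
# then verify the hypothesis still holds (and abort/roll back if not).
assert steady_state_ok(), "System not in steady state; aborting experiment"
```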

Netflix Simian Army: Then and Now

Source: https://en.wikipedia.org/wiki/Chaos_engineering

Inspired by the success of Chaos Monkey, Netflix started creating new simians that induce various other kinds of failures and test the system’s ability to survive them, so as to ensure a resilient, secure, and highly fault-tolerant system. Most of Netflix’s simians act in the context of the AWS environment. A few significant virtual simians are:

Chaos Kong (Private):

A whole AWS region going down is rare, but Chaos Kong is designed to take down an entire AWS region and simulate the system’s recovery from such an event. In September 2015, AWS DynamoDB faced availability issues in the US-EAST-1 region. Netflix was able to avoid any significant impact because it was better prepared, with Chaos Kong already in practice.

Chaos Gorilla (Private): Takes down one whole AWS Availability Zone.

Chaos Monkey (Public): Randomly disables/kills production instances in a carefully monitored production environment. Currently, the active version, Chaos Monkey 2.0, can only be used to terminate instances within an application managed by Spinnaker. Spinnaker is an open-source, multi-cloud continuous delivery platform that helps you release software changes with high velocity and confidence.

Latency Monkey (Private): Induces artificial delays in the RESTful client-server communication layer to simulate service degradation and network outages, and measures whether upstream services respond appropriately. It was never released publicly by Netflix and was eventually absorbed into Failure Injection Testing (FIT).

Doctor Monkey (Private, publicly adapted): Goes through health checks and monitors other external signs of health (e.g. CPU load, memory usage, etc.) to detect unhealthy instances. Though Doctor Monkey is not open source, it has been adapted into tools like Spinnaker.

Janitor Monkey (now Swabbie): Identifies and disposes of unused resources to avoid waste and clutter. It checks any given resource against a set of configurable rules to determine whether it is an eligible candidate for cleanup. It has now been replaced by Swabbie, a Spinnaker service.

New Age Chaos Engineering Tools

Gremlin: A “failure-as-a-service” platform built to make software systems more resilient. It offers a fully hosted solution for safely experimenting on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss. Gremlin offers resiliency testing in the following three flavors (a sketch of launching an attack via its API follows the list):

Resource Gremlins: Throttle CPU, memory, I/O

State Gremlins: Reboot hosts, kill process, travel in time

Network Gremlins: Introduce latency, drop packets, fail DNS, etc.
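Gremlin attacks are typically launched from its web UI or CLI, but it also exposes a REST API. The sketch below shows roughly what launching a CPU attack via that API could look like; treat the endpoint and payload shape as assumptions and consult Gremlin’s API reference for the authoritative format.

```python
# Rough sketch of launching a Gremlin CPU attack via its REST API.
# The endpoint and payload shape are assumptions based on Gremlin's public
# docs at the time of writing -- verify against the current API reference.
import requests

GREMLIN_API = "https://api.gremlin.com/v1/attacks/new"
API_KEY = "YOUR_TEAM_API_KEY"  # placeholder

attack = {
    # Resource gremlin: consume one CPU core for 60 seconds.
    "command": {"type": "cpu", "args": ["-l", "60", "-c", "1"]},
    # Pick one random target host to keep the blast radius small.
    "target": {"type": "Random"},
}

resp = requests.post(
    GREMLIN_API,
    headers={"Authorization": f"Key {API_KEY}"},
    json=attack,
    timeout=10,
)
resp.raise_for_status()
print("Attack created:", resp.text)
```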

ChaosBlade:

Source: https://www.alibabacloud.com/

Released in 2019 by Alibaba, ChaosBlade is a versatile tool covering a wide range of experiments and target platforms such as Docker and Kubernetes. It provides dozens of attacks, including process killing, resource consumption, and packet loss. It also provides application-level fault injection for Java, C++, and Node.js applications. For beginners this is a great tool; however, it lacks features such as experiment scheduling, target randomization, centralized reporting, and health checks.
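ChaosBlade is driven through its `blade` CLI. As a flavor of how an experiment runs, the sketch below wraps two commands from the standard CLI (`blade create` and `blade destroy`); the flags are believed correct for recent versions, but double-check them against the ChaosBlade docs for your version.

```python
# Sketch of driving ChaosBlade's `blade` CLI from Python.
# Command names and flags are believed correct for recent versions,
# but verify against the ChaosBlade documentation before use.
import json
import subprocess

def blade(args):
    out = subprocess.run(["blade"] + args, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)  # blade replies with JSON, e.g. {"code": 200, "result": "<uid>"}

# Load the CPU to 80% for 60 seconds.
result = blade(["create", "cpu", "fullload", "--cpu-percent", "80", "--timeout", "60"])
uid = result["result"]
print("Experiment started, uid:", uid)

# Experiments can also be stopped early by uid.
blade(["destroy", uid])
```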

Chaos Mesh: A Kubernetes-native tool, released in 2020, that supports 17 unique attacks including network latency, resource consumption, packet loss, disk I/O latency, system time manipulation, and bandwidth restriction. Being a Kubernetes tool, it lets you adjust the blast radius using Kubernetes labels and selectors. It also provides a fully featured dashboard.
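Chaos Mesh experiments are Kubernetes custom resources. The sketch below creates a NetworkChaos object through the official Kubernetes Python client; the namespace and label selector are placeholders, and the spec fields follow the v1alpha1 CRD as documented by Chaos Mesh, so verify them for your version.

```python
# Sketch: create a Chaos Mesh NetworkChaos experiment via the Kubernetes API.
# Namespace and label selector are placeholders; spec fields follow the
# v1alpha1 CRD as documented by Chaos Mesh (verify for your version).
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod

network_delay = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "NetworkChaos",
    "metadata": {"name": "delay-demo", "namespace": "default"},
    "spec": {
        "action": "delay",
        "mode": "one",  # blast radius: a single randomly chosen pod
        "selector": {"labelSelectors": {"app": "my-service"}},  # placeholder label
        "delay": {"latency": "100ms", "jitter": "10ms"},
        "duration": "60s",
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="chaos-mesh.org",
    version="v1alpha1",
    namespace="default",
    plural="networkchaos",
    body=network_delay,
)
```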

Failure Injection Testing: Netflix created Failure Injection Testing (FIT) to get finer control and a broader scope than its Simian Army. FIT works by first pushing failure simulation metadata to Zuul, an edge service developed by Netflix. Zuul handles all requests from devices and applications that utilize the back end of Netflix’s streaming service, and can handle dynamic routing, monitoring, security, resiliency, load balancing, connection pooling, and more. Being an independent service, FIT allowed failure to be injected by a variety of teams, who could then perform proactive chaos experiments with greater precision.

Litmus: Similar to Chaos Mesh, Litmus is a Kubernetes-native tool. It provides exhaustive experiment support for testing containers, pods, and nodes. It has good documentation for each of its experiments and also provides a GitHub repository of experiments, open to public contribution.

Chaos Toolkit: Chaos Toolkit is a Python-based tool that supports chaos experiments on Docker, Kubernetes, bare metal, and cloud platforms. Unlike other tools that ship pre-defined experiments, Chaos Toolkit lets you define your own. Each experiment consists of actions and probes: actions execute commands on the target system, and probes compare observations against expected values. While Chaos Toolkit supports a number of different platforms, it runs entirely through the CLI. This makes it difficult to run experiments across multiple systems unless you’re using a cloud platform like AWS or an orchestration platform like Kubernetes. Chaos Toolkit also lacks a native scheduling feature, a GUI, and a REST API.
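A Chaos Toolkit experiment is a declarative JSON (or YAML) file built from a steady-state hypothesis plus actions and probes. Below is a small illustrative experiment written out from Python; the URL and process command are placeholders.

```python
# Sketch of a Chaos Toolkit experiment, written out as JSON and run with
# `chaos run experiment.json`. The URL and process command are placeholders.
import json

experiment = {
    "title": "Service survives an instance being killed",
    "description": "Steady state: the health endpoint keeps returning 200.",
    "steady-state-hypothesis": {
        "title": "Service is healthy",
        "probes": [{
            "type": "probe",
            "name": "health-endpoint-responds",
            "tolerance": 200,  # expected HTTP status code
            "provider": {
                "type": "http",
                "url": "https://service.example.com/health",  # placeholder
            },
        }],
    },
    "method": [{
        "type": "action",
        "name": "kill-service-process",
        "provider": {
            "type": "process",
            "path": "pkill",                 # placeholder fault injection
            "arguments": ["-f", "my-service"],
        },
    }],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)
# Then run: chaos run experiment.json
```

The toolkit verifies the steady-state hypothesis before and after the method runs, which maps directly onto the principles described earlier.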

Kafka Trogdor: Trogdor is a test framework for Apache Kafka and acts as the primary fault injection tool explicitly for Kafka. Trogdor executes fault injection through a single-coordinator, multi-agent process. It has two built-in fault types (a sample task spec follows the list):

  • ProcessStopFault: Stops the specified process by sending a SIGSTOP signal.
  • NetworkPartitionFault: Creates an artificial network partition between nodes using iptables.
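Trogdor tasks are described by JSON specs submitted to the coordinator. The sketch below shows roughly what a ProcessStopFault spec could look like; the field names are based on Kafka’s Trogdor sources and may differ by version, so verify them against the Trogdor README in the Kafka repository.

```python
# Rough sketch of a Trogdor ProcessStopFault task spec. Field names are
# based on Kafka's Trogdor README/sources and may differ by version --
# verify before use. Specs are submitted to the coordinator, e.g.:
#   ./bin/trogdor.sh client createTask -t localhost:8889 \
#       -i stop-broker-1 --spec stop-fault.json
import json

spec = {
    "class": "org.apache.kafka.trogdor.fault.ProcessStopFaultSpec",
    "startMs": 0,                 # 0 = start immediately
    "durationMs": 60000,          # keep the process stopped for 60 s
    "nodeNames": ["node1"],       # placeholder agent node name
    "javaProcessName": "Kafka",   # match the broker's Java process name
}

with open("stop-fault.json", "w") as f:
    json.dump(spec, f, indent=2)
```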

Pumba for Docker: Pumba is a chaos testing tool for performing chaos experiments in Docker. It can kill, pause, stop, and remove Docker containers with highly configurable selection rules. It can also perform network emulation through delays, packet loss, rate limiting, and more.
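Pumba is a standalone CLI, so experiments are usually shell one-liners. The sketch below drives two common commands from Python; the flags are believed correct for recent Pumba releases, but confirm with `pumba --help` before running.

```python
# Sketch of driving the Pumba CLI from Python. Flags are believed correct
# for recent Pumba releases -- confirm with `pumba --help` before running.
import subprocess

# Kill a random container whose name matches the regex "^test".
subprocess.run(
    ["pumba", "--random", "kill", "--signal", "SIGKILL", "re2:^test"],
    check=True,
)

# Add 100 ms of latency to a container's network for 1 minute.
subprocess.run(
    ["pumba", "netem", "--duration", "1m",
     "delay", "--time", "100", "my-container"],
    check=True,
)
```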

Hadoop Killer: An open-source tool written in Ruby that provides process-level fault injection by killing user-specified Java processes with a user-specified probability. It can be installed using RubyGems and is configured via a simple YAML syntax.

References:

https://netflixtechblog.com/
https://www.gremlin.com/community/tutorials/
https://www.gremlin.com/chaos-monkey/the-origin-of-chaos-monkey/#chaos-monkey-and-spinnaker
https://en.wikipedia.org/wiki/Chaos_engineering
Chaos Engineering: System Resiliency in Practice, by Casey Rosenthal and Nora Jones
https://principlesofchaos.org/
