A Primer on Resiliency and Chaos Engineering

How to unleash chaos on your software to ensure resiliency

Published in Slalom Technology, Jun 27, 2023

Recently, I worked with a financial services client to design and run chaos engineering experiments that tested and improved the resiliency of their application. You may have heard about chaos engineering through popular tech culture or literature. Notably, some big tech companies practice it openly; Netflix, for example, built Chaos Monkey for exactly this purpose.

But what is chaos engineering?

Chaos engineering is the process of testing a distributed computing system to ensure that it can withstand unexpected disruptions. It draws on ideas from chaos theory, which focuses on random and unpredictable behavior. The goal of chaos engineering is to identify weaknesses in a system through controlled experiments that introduce random and unpredictable behavior.

Modern architectures commonly use microservices and emphasize loosely coupled applications. This approach offers many benefits, yet ensuring reliability can be a challenge, especially for applications with complex designs and multiple teams working in tandem.

Today, many businesses face the challenge of balancing high availability and consistency. To accomplish this successfully, it is critical to test your applications under various scenarios and conditions. That is where resiliency and chaos tests come into the picture. Through a series of blog posts, I’ll help you explore the world of chaos and resiliency engineering so you can understand what constitutes a well-architected and resilient system.

Chaos ≠ resiliency; chaos ≅ resiliency

On the surface, a resiliency test might look like a chaos test and vice versa. However, in reality they are two different things.

Resiliency tests deal with the known unknowns (i.e., risks you are aware of that a software application can face, and their implications).

Resiliency testing is like a medical checkup for your software — you are checking to make sure the application is healthy, strong, and able to bounce back from any anticipated malady it may face. For example, what happens when the AWS EC2 instance running the application terminates unexpectedly (a known phenomenon)? Is the system resilient enough to withstand that and recover? How long does it take the system to recover from an event like that? Is there any data loss? (An unknown result.)
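The "how long does it take to recover?" question can be turned into a measurement. The sketch below is one minimal way to do that in Python, assuming you already have some health probe for your application; `measure_recovery_time` and its parameters are hypothetical names chosen for illustration, not part of any AWS or chaos tooling API.

```python
import time
from typing import Callable, Optional


def measure_recovery_time(
    is_healthy: Callable[[], bool],
    timeout_s: float = 300.0,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll a health probe after a fault has been injected.

    Returns the seconds elapsed until the probe first reports healthy,
    or None if the system has not recovered once the timeout has passed.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)
```

In practice `is_healthy` might wrap an HTTP health-check request or a load balancer status lookup (both assumptions here). The `clock` and `sleep` parameters exist only so the function can be exercised without real waiting.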

Chaos tests deal with the unknown unknowns (i.e., risks we are not aware of and whose effects on the application we do not understand).

Chaos testing is like throwing a surprise party for your software. Instead of balloons and cake, you give your application a series of unexpected challenges and watch how it reacts. For example, you know your EC2 instance can recover from an unexpected failure, but what happens when inbound and outbound traffic to the port are disabled, as might happen in a network black-hole kind of event? (An unknown phenomenon.) What safeguards are in place to make sure that the application can self-heal and there is no data loss? (An unknown result.)

Figure: The cycle of chaos testing (various stages of developing and running a chaos experiment)

So you might think of a resiliency test as the final, evolved form of a chaos test.

Designing a chaos experiment

Now that you know the differences between chaos and resiliency testing, I’ll show you how to design a chaos experiment for your application.

With chaos testing, the more experiments you run, the more unknown unknowns you discover. These then become known unknowns, which can be analyzed and logged as defects in the system and patched. Then the application can be updated.

Initial considerations for chaos testing include:

1. Clearly defining your how, what, and why

Figure: Defining a chaos test (basic building blocks of a chaos experiment)

2. Selecting your tools carefully

Selecting your tools is just as important as defining the experiment; there are two major considerations here:

The chaos engineering tools you will use (e.g., AWS Fault Injection Simulator, Gremlin, or Chaos Monkey)
The experiment execution methodology (e.g., running it from the console or through a continuous integration/continuous delivery platform such as GitLab or Harness)

Sample experiment template using AWS

Below is an example AWS Fault Injection Simulator (FIS) experiment template, written in JSON, that stops EC2 instances and restarts them after two minutes. FIS uses the template as the definition of the chaos test to run against the example application's instances.

{ 
    "tags": { 
        "Name": "StopEC2InstancesByCount" 
    }, 
    "description": "Stop and restart three instances with the specified tag", 
    "targets": { 
        "myInstances": { 
            "resourceType": "aws:ec2:instance", 
            "resourceTags": { 
                "env": "prod" 
            }, 
            "selectionMode": "COUNT(3)" 
        } 
    }, 
    "actions": { 
        "StopInstances": { 
            "actionId": "aws:ec2:stop-instances", 
            "description": "stop the instances", 
            "parameters": { 
                "startInstancesAfterDuration": "PT2M" 
            }, 
            "targets": { 
                "Instances": "myInstances" 
            } 
        } 
    }, 
    "stopConditions": [ 
        { 
            "source": "aws:cloudwatch:alarm", 
            "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:alarm-name" 
        } 
    ], 
    "roleArn": "arn:aws:iam::111122223333:role/role-name" 
}

The main sections of the example template are:

targets: specifies which EC2 instances need to be targeted based on a tag
actions: describes what action(s) should be taken (in this example we are stopping the targeted instances)
stopConditions: specifies when to stop experiment execution (i.e., once a certain condition is met); in this example the experiment is ended when a specific CloudWatch alarm is triggered
roleArn: specifies the Identity and Access Management (IAM) role that grants FIS permission to execute the actions
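Before handing a template like this to FIS, it can help to sanity-check how its sections fit together. The Python sketch below parses a trimmed copy of the template and checks that every action points at a target that is actually defined. The `check_template` helper and the `REQUIRED_KEYS` list are assumptions made for illustration, based only on the sections used in this example; they are not the official FIS schema.

```python
import json
from typing import List

# Assumption: these are the sections this article's example relies on,
# not an authoritative list of what FIS requires.
REQUIRED_KEYS = ("description", "targets", "actions", "stopConditions", "roleArn")

TEMPLATE_JSON = """
{
  "description": "Stop and restart three instances with the specified tag",
  "targets": {"myInstances": {"resourceType": "aws:ec2:instance",
                              "resourceTags": {"env": "prod"},
                              "selectionMode": "COUNT(3)"}},
  "actions": {"StopInstances": {"actionId": "aws:ec2:stop-instances",
                                "parameters": {"startInstancesAfterDuration": "PT2M"},
                                "targets": {"Instances": "myInstances"}}},
  "stopConditions": [{"source": "aws:cloudwatch:alarm",
                      "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:alarm-name"}],
  "roleArn": "arn:aws:iam::111122223333:role/role-name"
}
"""


def check_template(template: dict) -> List[str]:
    """Return a list of human-readable problems found in the template."""
    problems = [f"missing top-level key: {key}"
                for key in REQUIRED_KEYS if key not in template]
    defined_targets = set(template.get("targets", {}))
    # Each action maps a role name (e.g. "Instances") to a target name.
    for action_name, action in template.get("actions", {}).items():
        for target_name in action.get("targets", {}).values():
            if target_name not in defined_targets:
                problems.append(
                    f"action {action_name!r} references undefined target {target_name!r}")
    if "stopConditions" in template and not template["stopConditions"]:
        problems.append("stopConditions is empty")
    return problems


template = json.loads(TEMPLATE_JSON)
print(check_template(template))  # an empty list means no problems were found
```

A check like this is cheap to run in a CI/CD pipeline before an experiment is created, so a typo in a target name is caught before anything touches real instances.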

Leading practices in chaos engineering

Know your architecture: be familiar with the ins and outs of the system (or at least the part of the system you are testing) and what impact can be expected.
Understand steady state: know what steady state means for your application; defining what normal looks like for your system lets the experiment produce tangible results about any anomalies.
Establish observability and monitoring: observability and monitoring tools play a crucial role in chaos testing, as they do in any system. Real-time monitoring gives you live updates on the status and health of your application throughout your experiments.
Define your SLOs and SLAs: a core principle of site reliability engineering (SRE) is establishing service level agreements (SLAs) and service level objectives (SLOs). These targets help you judge the impact of an experiment and whether it was successful.
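The steady-state and SLO practices above can be combined into a simple pass/fail check for an experiment. The following Python sketch shows one possible shape for it; the `Slo` class, the metric names, and the threshold values are all illustrative assumptions, and in a real setup the observed values would come from your monitoring tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Slo:
    """One service level objective: a metric and the bound it must stay within."""
    metric: str
    threshold: float
    higher_is_better: bool

    def is_met(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold


def violated_slos(slos: List[Slo], observed: Dict[str, float]) -> List[str]:
    """Return the metrics whose SLO was breached or that were not reported at all."""
    return [
        slo.metric
        for slo in slos
        if slo.metric not in observed or not slo.is_met(observed[slo.metric])
    ]


# Hypothetical objectives describing what "normal" looks like.
slos = [
    Slo("availability_pct", 99.9, higher_is_better=True),
    Slo("p99_latency_ms", 500.0, higher_is_better=False),
]

# Hypothetical values observed while the fault was being injected.
during_experiment = {"availability_pct": 99.95, "p99_latency_ms": 730.0}
print(violated_slos(slos, during_experiment))  # prints ['p99_latency_ms']
```

Treating a missing metric as a violation is a deliberate choice in this sketch: if monitoring goes dark during an experiment, you cannot claim the system stayed in steady state.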

Conclusion

Resiliency and chaos are two sides of the same coin. Chaos tests can provide real, meaningful data about your overall application and identify shortcomings, and as you keep experimenting, the unknown unknowns will slowly change into known unknowns. From these newly understood risks, you can design a full suite of resiliency tests. Chaos testing can be a useful tool for achieving reliability and stability of your application. Remember, the more resilient your system is, the better it will perform under chaotic conditions, and the fewer headaches you’ll have in production!

May the chaos be with you!
