EXPEDIA GROUP TECH — ENGINEERING

Chaos Engineering for AWS Resources: A Custom Approach

Building Resilience Through Controlled Chaos and Automated Solutions

Asmita Bharti
Expedia Group Technology

--

Photo by Héctor Martínez on Unsplash

Introduction
In the dynamic world of digital resources, ensuring the resilience of our systems is paramount. This post outlines our comprehensive chaos strategy, which aims to validate, fortify, and test the stability of AWS resources such as EC2, RDS, and EKS. We leverage
AWS Systems Manager (SSM) documents and custom scripts to enhance our systems’ ability to withstand disruptions and ensure uninterrupted data services. The strategy shows how a system responds when critical components of its production service infrastructure are taken down: deliberately inducing such failures reveals vulnerabilities within our systems, directing us toward automated solutions adept at gracefully managing similar failures in the future.

Why a custom chaos solution? 💡

  • Flexible Testing: Off-the-shelf tools like Chaos Mesh and Chaos Toolkit were restrictive: they ship with specific test cases and suit only particular setups, such as Kubernetes clusters.
  • Customize Chaos: Custom scripts allow us to design chaos testing solutions tailored specifically to our resource needs.
  • Adaptability: We can customize and control chaos tests to match our unique needs, making our testing more effective and adaptable to our different resources.
An image of tangled threads that are being untangled by a group of people
Image Source: Dev Community

High-level diagram 🌐:

An image of the high-level diagram of chaos, starting with Jenkins, to AWS Systems Manager, to the Run Command that triggers chaos.
High-level diagram for chaos

Our chaos creation strategy operates as follows:

  1. Spinnaker: Our workflow starts from Spinnaker, where we specify the type of chaos and other relevant parameters.
  2. Jenkins Job: A Jenkins job specifically triggers chaos experiments on our resources. This job utilizes parameters to customize the chaos tests, such as the type of chaos (e.g., Disk Stress, CPU Stress, Index Deletion), instance-filtering criteria (based on tags, instance IDs, etc.), the number of instances to target, and other relevant settings.
  3. AWS Systems Manager (SSM) Document: SSM documents define the chaos experiments for each resource (e.g., chaos for Elasticsearch, MongoDB, etc.). They encapsulate the specific chaos scenarios, ensuring consistency and repeatability.
  4. Chaos Execution: The Jenkins job triggers the relevant SSM document using AWS Systems Manager’s Run Command, initiating chaos on the targeted resource (a minimal CLI sketch follows this list).
  5. Monitoring with Datadog: As the chaos unfolds, we can monitor the system’s performance and behavior using Datadog. This real-time monitoring allows us to gather crucial metrics and insights into how our resources and applications respond to chaotic conditions.
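
For illustration, a stripped-down version of step 4 is a single AWS CLI call. This is a minimal sketch rather than our full job: the tag value es-prod is a placeholder, while the document name Chaos_ES and the parameter names match the Elasticsearch document shown later in this post.

# Minimal sketch: trigger a chaos SSM document on tagged instances via Run Command.
# The tag value (es-prod) is a placeholder; Chaos_ES is the document defined below.
aws ssm send-command \
  --document-name "Chaos_ES" \
  --targets "Key=tag:ClusterName,Values=es-prod" \
  --parameters '{"action":["CPUStress"],"Percent":["80"],"Duration":["120"]}' \
  --region us-west-2 \
  --query "Command.CommandId" \
  --output text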
A meme related to having the first chaos experiment in production
Image Source: Gremlin

Elasticsearch database chaos ⚡:

An image of elastic bands representing an Elasticsearch database
Image Source: Skedler

We have implemented our custom solution to create chaos on EC2-based databases. One such database is Elasticsearch.

  • SSM Document: In Systems Manager, an SSM document for creating chaos is crafted for Elasticsearch. This custom script executes chaos scenarios based on the specified action, using aws:runShellScript to run shell commands that generate the chaos. It empowers administrators to apply controlled disruptions and test the resilience of their Elasticsearch environment, ensuring its robustness under adverse conditions.
# This code is licensed under the Apache License, Version 2.0.
# Copyright 2023 Expedia®, Inc.
# SPDX-License-Identifier: Apache-2.0
{
  "schemaVersion": "2.2",
  "parameters": {
    "action": {
      "type": "String",
      "allowedValues": [
        "DiskStress",
        "CPUStress",
        "IndexDeletion",
        "NodeTermination",
        "ESserviceStart",
        "ESserviceStop",
        "Latency",
        "NetworkBlock"
      ]
    },
    "Index": {
      "type": "String",
      "default": "123",
      "description": "(Optional)"
    },
    "InstanceIP": {
      "type": "String",
      "default": "IP",
      "description": "(Optional)"
    },
    "Percent": {
      "type": "String",
      "default": "80",
      "description": "(Optional)"
    },
    "Latency": {
      "type": "String",
      "default": "1000",
      "description": "(Optional)"
    },
    "Duration": {
      "type": "String",
      "default": "120",
      "description": "(Optional)"
    }
  },
  "mainSteps": [
    {
      "action": "aws:runShellScript",
      "name": "DiskStress",
      "precondition": {
        "StringEquals": ["{{ action }}", "DiskStress"]
      },
      "inputs": {
        "runCommand": [
          "sudo amazon-linux-extras install epel -y &>> /dev/null",
          "sudo yum install stress-ng -y &>> /dev/null",
          "sudo stress-ng --hdd 8 --hdd-bytes {{Percent}}% --timeout {{Duration}}s",
          "InstanceID=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)",
          "echo DiskStress executed successfully on Instance-ID $InstanceID for {{Duration}}s"
        ]
      }
    },
    {
      "action": "aws:runShellScript",
      "name": "CPUStress",
      "precondition": {
        "StringEquals": ["{{ action }}", "CPUStress"]
      },
      "inputs": {
        "runCommand": [
          "sudo amazon-linux-extras install epel -y &>> /dev/null",
          "sudo yum install stress-ng -y &>> /dev/null",
          "sudo stress-ng -c 0 -l {{Percent}} --timeout {{Duration}}s",
          "InstanceID=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)",
          "echo CPUStress executed successfully on Instance-ID $InstanceID for {{Duration}}s"
        ]
      }
    },
    {
      "action": "aws:runShellScript",
      "name": "IndexDeletion",
      "precondition": {
        "StringEquals": ["{{ action }}", "IndexDeletion"]
      },
      "inputs": {
        "runCommand": [
          "sudo curl -k --cacert /elasticsearch/config/ssl/{{InstanceIP}}.crt -XDELETE -u $user:$password https://{{InstanceIP}}:9200/{{Index}}",
          "output=$(sudo curl -k --cacert /elasticsearch/config/ssl/{{InstanceIP}}.crt -u $user:$password https://{{InstanceIP}}:9200/_cat/indices/ | grep -i '{{Index}}')",
          "if [[ -n $output ]]; then echo '{{Index}} index exists'; else echo '{{Index}} index deleted'; fi"
        ]
      }
    },
    {
      "action": "aws:runShellScript",
      "name": "NodeTermination",
      "precondition": {
        "StringEquals": ["{{ action }}", "NodeTermination"]
      },
      "inputs": {
        "runCommand": [
          "InstanceID=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)",
          "echo initiating shutdown of Instance-ID $InstanceID",
          "sudo shutdown -h now"
        ]
      }
    },
    {
      "action": "aws:runShellScript",
      "name": "ESserviceStart",
      "precondition": {
        "StringEquals": ["{{ action }}", "ESserviceStart"]
      },
      "inputs": {
        "runCommand": [
          "sudo service elasticsearch start",
          "systemctl status elasticsearch | grep Active",
          "InstanceID=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)",
          "echo ESserviceStart operation performed on Instance-ID $InstanceID"
        ]
      }
    },
    {
      "action": "aws:runShellScript",
      "name": "ESserviceStop",
      "precondition": {
        "StringEquals": ["{{ action }}", "ESserviceStop"]
      },
      "inputs": {
        "runCommand": [
          "sudo service elasticsearch stop",
          "systemctl status elasticsearch | grep Active",
          "InstanceID=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)",
          "echo ESserviceStop operation performed on Instance-ID $InstanceID"
        ]
      }
    },
    {
      "action": "aws:runShellScript",
      "name": "Latency",
      "precondition": {
        "StringEquals": ["{{ action }}", "Latency"]
      },
      "inputs": {
        "runCommand": [
          "InstanceID=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)",
          "InstanceIp=$(wget -q -O - http://169.254.169.254/latest/meta-data/local-ipv4)",
          "sudo yum install iproute-tc -y &>> /dev/null",
          "sudo tc qdisc add dev eth0 root netem delay {{Latency}}ms",
          "echo Latency of {{Latency}}ms executed on Instance-ID $InstanceID",
          "tc -s qdisc | grep netem | cut -d ' ' -f 11,12",
          "sudo curl -k --cacert /elasticsearch/config/ssl/$InstanceIp.crt -u $user:$password \"https://$InstanceIp:9200/_cat/indices/*?v=true&s=index\"",
          "sleep {{Duration}}",
          "sudo tc qdisc del dev eth0 root netem"
        ]
      }
    },
    {
      "action": "aws:runShellScript",
      "name": "NetworkBlock",
      "precondition": {
        "StringEquals": ["{{ action }}", "NetworkBlock"]
      },
      "inputs": {
        "runCommand": [
          "InstanceID=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)",
          "InstanceIp=$(wget -q -O - http://169.254.169.254/latest/meta-data/local-ipv4)",
          "echo $InstanceIp",
          "sudo iptables -A INPUT -s $InstanceIp -j DROP",
          "sudo iptables -L INPUT",
          "echo NetworkBlock executed on Instance-ID $InstanceID for {{Duration}}s",
          "sleep {{Duration}}",
          "sudo iptables -D INPUT -s $InstanceIp -j DROP"
        ]
      }
    }
  ]
}
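
Assuming the JSON above is saved locally (chaos_es.json is a file name we have chosen for illustration), the document can be registered once per account and region with the AWS CLI; afterwards, Run Command can invoke it by name:

# Register the chaos document (one-time, per region) so Run Command can invoke it.
# The file name chaos_es.json is illustrative; Chaos_ES matches the name the
# Jenkins job below sends commands to.
aws ssm create-document \
  --name "Chaos_ES" \
  --document-type "Command" \
  --document-format JSON \
  --content file://chaos_es.json \
  --region us-west-2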
An image explaining the workflow of chaos
Workflow of Chaos in Elasticsearch
  • Jenkins Job: We have created a Jenkins job to trigger chaos on EC2 instances.

Jenkins job (elasticsearch-chaos-trigger) parameters:

  1. Action: Disk Stress, CPU Stress, Index Deletion, Node Termination, Elasticsearch Start/Stop, Latency, and Network Block.
  2. Tag Key, Tag Value: Filter instances based on cluster name, name, or instance ID.
  3. Count: The number of instances to target.
  4. Index: The name of the Elasticsearch index to delete.
  5. Percent: The percentage of stress to apply for CPU Stress and Disk Stress.
  6. Latency: Latency in milliseconds for requests.
  7. Duration: How long the action should run, in seconds.
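
For reference, a parameterized job like this can also be kicked off over Jenkins’ REST API. This is a minimal sketch: the job name matches the one above, but the Jenkins URL, credentials, and parameter values are placeholders.

# Trigger the chaos job remotely via Jenkins' buildWithParameters endpoint.
# JENKINS_URL, the user/API token, and all parameter values are placeholders.
curl -X POST "https://JENKINS_URL/job/elasticsearch-chaos-trigger/buildWithParameters" \
  --user "user:api-token" \
  --data-urlencode "action=CPUStress" \
  --data-urlencode "TAG_KEY=ClusterName" \
  --data-urlencode "TAG_VALUE=es-cluster" \
  --data-urlencode "Count=2" \
  --data-urlencode "Percent=80" \
  --data-urlencode "Duration=120"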

Jenkins job build step (Execute Shell):

# This code is licensed under the Apache License, Version 2.0.
# Copyright 2023 Expedia®, Inc.
# SPDX-License-Identifier: Apache-2.0

#!/bin/bash
# Normalize tag keys: ClusterName/Name filters need the tag: prefix.
if [ "$TAG_KEY" == "ClusterName" ] || [ "$TAG_KEY" == "Name" ]; then
  TAG_KEY=tag:$TAG_KEY
fi

# Use the AWS CLI to describe running instances with the specified tag,
# limited to the requested number of instances.
instance_ids=$(aws ec2 describe-instances \
  --region "us-west-2" \
  --filters "Name=instance-state-name,Values=running" "Name=$TAG_KEY,Values=$TAG_VALUE" \
  --query "Reservations[].Instances[].[InstanceId, Tags[?Key=='Name'].Value | [0]]" \
  --output text | head -n "$Count")

# Check if the instance_ids variable is empty or contains only whitespace
if [[ -z "$instance_ids" ]]; then
  echo "No instances found with the specified tag."
  exit 1
fi

# Split the instance IDs into an array
IFS=$'\n' read -rd '' -a instance_id_array <<< "$instance_ids"

# Loop through the instance IDs
for instance_id in "${instance_id_array[@]}"; do
  echo "instance ID: $instance_id"
done

# Keep only the instance ID (drop the Name column after the first space)
for element in "${instance_id_array[@]}"; do
  modifiedElement=$(echo "$element" | awk '{print $1}')
  modifiedArray+=("$modifiedElement")
done
modifiedElementsCSV=$(IFS=','; echo "${modifiedArray[*]}")

# Display the targeted instance IDs separated by commas
echo "######################### Performing $action action for InstanceIDs: $modifiedElementsCSV #########################"
echo "$action"

# Private IP of the first targeted instance (used by index-related actions)
InstanceIP=$(aws ec2 describe-instances \
  --region "us-west-2" \
  --instance-ids "${modifiedArray[0]}" \
  --query 'Reservations[0].Instances[0].PrivateIpAddress' \
  --output text)

# Trigger the Chaos_ES SSM document on the targeted instances
sh_command_id=$(aws ssm send-command \
  --targets "Key=instanceids,Values=$modifiedElementsCSV" \
  --document-name "Chaos_ES" \
  --region "us-west-2" \
  --comment "run shell script on Linux Instances" \
  --parameters "{\"action\":[\"$action\"],\"InstanceIP\":[\"$InstanceIP\"],\"Index\":[\"$Index\"],\"Percent\":[\"$Percent\"],\"Latency\":[\"$Latency\"],\"Duration\":[\"$Duration\"]}" \
  --output text \
  --query "Command.CommandId")

echo "CommandId: $sh_command_id"

# Define the desired status to check for (e.g., "Success")
desired_status="Success"

# Initialize a variable to hold the current status
current_status="InProgress"

# Poll the Run Command status until it succeeds or fails
while [ "$current_status" != "$desired_status" ]; do
  # Get the status of the SSM Run Command
  status_output=$(aws ssm list-command-invocations \
    --command-id "$sh_command_id" \
    --region us-west-2 \
    --query "CommandInvocations[0].Status" \
    --output text)

  # Check if the current status matches the desired status
  if [ "$status_output" = "$desired_status" ]; then
    echo "Current status: $status_output"
    echo "############################### CommandID $sh_command_id for action $action is Success ###############################"
    break
  elif [ "$status_output" = "Failed" ]; then
    echo "Current status: $status_output"
    echo "############################### CommandID $sh_command_id for action $action is Failed ###############################"
    break
  else
    echo "############################### CommandID $sh_command_id for action $action is InProgress ###############################"
    sleep 30
  fi
done

# Print the per-plugin status and output of the invocation
aws ssm list-command-invocations \
  --command-id "$sh_command_id" \
  --region us-west-2 \
  --details \
  --query 'CommandInvocations[].CommandPlugins[].{Status:Status,Output:Output}' | sed 's/----------ERROR-------/ /g'

This Jenkins job triggers the chaos SSM document, which then orchestrates chaos on the specified EC2 instances running Elasticsearch.

  • Spinnaker Pipeline: We have created a Spinnaker pipeline to trigger chaos in the Elasticsearch database, which in turn runs the Jenkins build. The pipeline has the following parameters: Key, Value, Count, IndexName, Percent, Latency, Duration.
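
As a sketch of what the Jenkins stage inside such a pipeline looks like, the stage maps the pipeline parameters onto the job’s parameters. The master name here is illustrative, and we assume the chaos action is also passed through as a pipeline parameter:

{
  "type": "jenkins",
  "name": "Trigger Elasticsearch Chaos",
  "master": "our-jenkins-controller",
  "job": "elasticsearch-chaos-trigger",
  "parameters": {
    "action": "${parameters.action}",
    "TAG_KEY": "${parameters.Key}",
    "TAG_VALUE": "${parameters.Value}",
    "Count": "${parameters.Count}",
    "Index": "${parameters.IndexName}",
    "Percent": "${parameters.Percent}",
    "Latency": "${parameters.Latency}",
    "Duration": "${parameters.Duration}"
  }
}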
An image depicting the Spinnaker Pipeline for chaos
Spinnaker Pipeline for chaos
An image depicting all the values that can be added as parameters of the pipeline
Values that can be added as parameters of the pipeline

CPU stress output:

We implemented CPU stress chaos on instances, and the outcomes can be verified on the Jenkins console as well as by examining Datadog metrics.
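
Beyond the Jenkins console, the raw per-instance output can also be pulled straight from SSM. A minimal sketch, with placeholder command and instance IDs:

# Fetch the stdout of one chaos invocation directly from SSM.
# Both IDs below are placeholders from an earlier send-command call.
aws ssm get-command-invocation \
  --command-id "11111111-2222-3333-4444-555555555555" \
  --instance-id "i-0123456789abcdef0" \
  --region us-west-2 \
  --query "StandardOutputContent" \
  --output text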

An image depicting the Jenkins console where we can verify the CPU stress chaos
Verifying CPU stress chaos on the Jenkins console

IndexDeletion output:

We implemented IndexDeletion chaos on instances, and the outcomes can be verified either on the Jenkins console or by checking Datadog metrics.

An image depicting the Jenkins console where we can verify the Index Deletion
Verifying Index Deletion on the Jenkins Console

Monitoring with Datadog: Real-time insights during chaos experiments 🔍

We leverage Datadog dashboards as a vital element of our chaos engineering strategy for ensuring the resilience of our EC2 databases. Datadog provides real-time monitoring and actionable insights during intentional chaos inductions: it offers immediate visibility into disturbances, allowing us to assess their impact instantly. Critical performance metrics such as CPU usage, memory consumption, and query response times are closely monitored. Datadog’s proactive alerting and anomaly detection enable swift responses to irregularities, customized dashboards surface real-time data for specific chaos experiments, and historical data analysis helps identify trends and patterns in database behavior.
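
As an illustration of the alerting side, a simple threshold monitor like the one below, created through Datadog’s monitor API, pages us if CPU stays saturated past the configured stress window. The tag cluster_name:es-chaos and the Slack handle are placeholders:

# Hypothetical example: create a Datadog monitor that fires when average CPU
# on the Elasticsearch hosts stays above 80% for five minutes.
curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d '{
        "name": "ES chaos: sustained CPU saturation",
        "type": "metric alert",
        "query": "avg(last_5m):avg:system.cpu.user{cluster_name:es-chaos} by {host} > 80",
        "message": "CPU still saturated after the chaos window @slack-chaos-alerts"
      }'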

An image of how a chaos-related Datadog dashboard looks
Chaos Datadog Dashboard

Datadog’s integration empowers data-driven decision-making, efficient incident response, continuous refinement, and validation of system resilience. It is a cornerstone in our journey to create resilient, uninterrupted data services in our EC2 databases through controlled chaos.

Advantages 🌟

Our chaos strategy provides several key advantages:

  • Validation of Alerts: By subjecting systems to controlled chaos, we validate the effectiveness of existing monitoring and alerting mechanisms.
  • Enhanced Backup and Restore Procedures: Chaos testing reveals gaps in backup and restore processes, allowing us to refine and fortify these critical operations.
  • Streamlined Alerting and Restoration: We can ensure that alerting and restoration workflows operate smoothly and efficiently, minimizing downtime and disruption.
  • Reduced RTO and RPO: Through proactive chaos testing, we can reduce both Recovery Time Objective (RTO) and Recovery Point Objective (RPO) metrics, enabling quicker recovery and minimal data loss in case of failures.

Conclusion

A chaos strategy is good practice for maintaining robust and resilient database systems. Similar workflows can be implemented for other standalone EC2 instances, and custom scripts can be written for other resources. This strengthens our ability to provide uninterrupted data services and to test the effectiveness and stability of our resources.
