Creating your own Chaos Monkey with AWS Systems Manager Automation

Chaos Engineering on AWS

Adrian Hornsby
Jun 15, 2020 · 12 min read

I’d like to express my gratitude to my colleagues and friends Jason Byrne and Matt Fitzgerald for their valuable feedback.


In a recent post, I explained how to use AWS SSM Run Command to inject failures on EC2 instances. SSM Run Command is well-suited to executing custom scripts on EC2 instances, especially to inject latency or blackouts on the network interface, or to exhaust resources such as CPU, memory, and I/O.

However, we need more than that. Failure injection should targetand, and also the .

We also need to have a broad set of controls and capabilities to perform chaos experiments safely. We might want to:

  • directly into EC2 instances.
  • Invoke Lambda functions to
  • several failure injections to form chaos scenarios.
  • them for execution at specific times.
  • Have if errors are detected.
  • Have in places with approvals.
  • Apply velocity controls to of experiments.

That is where AWS Systems Manager Automation (SSM** Automation) comes in. So, let’s take a look!

** Note: AWS Systems Manager was formerly known as Amazon Simple Systems Manager (SSM). The original abbreviated name of the service, SSM, is still used and reflected in various AWS resources.

What is SSM Automation?

SSM Automation was launched to simplify frequent maintenance and deployment tasks of AWS resources and, especially, codify them.

SSM Automation in a nutshell

SSM Automation uses documents (defined in YAML or JSON) to enable resource management across multiple accounts and AWS regions. You can execute AWS API calls as part of a document in combination with other SSM Automation actions such as running commands on your EC2 instances, invoking Lambda functions, and executing custom Python or PowerShell scripts.

SSM Automation document

While these documents can be executed directly via the console, the CLI, and SDKs, you can also schedule and trigger them through CloudWatch Events. This scheduling capability makes the integration with CI/CD pipelines trivial.

SSM Automation Action types

Action types let you automate a wide variety of operations. For example, the aws:executeAwsApi action type used above enables you to run any API operation on any AWS service, including creating or deleting AWS resources, starting processes, triggering notifications, etc.

While SSM Automation supports a wide variety of actions, the most notable ones for chaos engineering are the following:

SSM Automation also includes safety and velocity features that help you control the execution and the roll-out of these documents across large groups of instances by using tags, limits, and error thresholds you define.

As you can probably guess by now, SSM Automation is also well-suited to execute chaos engineering experiments safely.

“Hello, World!”

Let’s take a look at the “Hello, World!” of chaos engineering experiments: randomly stopping an EC2 instance.

This experiment is famously known as Chaos Monkey, and was created by Netflix to enforce strong architectural guidelines: applications launched on the AWS cloud must be stateless, auto-scaled microservices. That means that applications running at Netflix should tolerate random EC2 instance failures.

Following is an SSM Automation document (described in YAML) randomly failing an EC2 instance in a particular AWS availability zone.

To open that SSM Automation document in your favorite IDE, click here.


Okay — so what do we have here?

Note: For readability purposes, I will now collapse irrelevant sections of the SSM Automation document.

The top section of this document is simple. It starts with a description, the schemaVersion (currently at 0.3), and assumeRole, which is the IAM role that SSM Automation needs to assume to run the actions defined below in the document.


The parameters section (AvailabilityZone, TagName, TagValue, and AutomationAssumeRole) lists the parameters operators need to input for each experiment’s execution. The first three parameters are used in the first step to filter EC2 instances, while the last one is the IAM role required to execute actions described in the document.

These parameters are the inputs of the experiment execution, as shown in the AWS CLI command below:

> aws ssm start-automation-execution --document-name "StopRandomInstances-API" --document-version "\$DEFAULT" --parameters '{"AvailabilityZone":["eu-west-1c"],"TagName":["SSMTag"],"TagValue":["chaos-ready"],"AutomationAssumeRole":["arn:aws:iam::01234567890:role/SSMAutomationChaosRole"]}' --region eu-west-1
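
The same execution can also be started from an SDK. The following is a minimal Python sketch, assuming a boto3 SSM client (for example, boto3.client("ssm")) is passed in by the caller. The helper names build_parameters and start_experiment are hypothetical and only serve this illustration.

```python
def build_parameters(availability_zone, tag_name, tag_value, role_arn):
    """Build the Parameters map; SSM expects every value as a list of strings."""
    return {
        "AvailabilityZone": [availability_zone],
        "TagName": [tag_name],
        "TagValue": [tag_value],
        "AutomationAssumeRole": [role_arn],
    }


def start_experiment(ssm_client, parameters,
                     document_name="StopRandomInstances-API"):
    """Start the automation and return its execution ID.

    ssm_client is assumed to be a boto3 SSM client; it is injected so the
    function can be exercised with a stub instead of a real AWS account.
    """
    response = ssm_client.start_automation_execution(
        DocumentName=document_name,
        DocumentVersion="$DEFAULT",
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]
```

Passing the client in, rather than creating it inside the function, keeps the region and credentials under the caller’s control.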

mainSteps

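The mainSteps section defines the actions that SSM performs on AWS resources. In this document, there are six steps that run in order: listInstances, SeletRandomInstance, verifyInstanceStateRunning, stopInstances, forceStopInstances, and verifyInstanceStateStopped.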

Each of these steps defines a single action. The output from one step can be used as input in the following step.

mainSteps (collapsed)

First step — listInstances

Let’s take a look at the first step, listInstances. This first step uses an action type aws:executeAwsApi to query the EC2 service for a list of instances filtered by availability-zone, the state of the EC2 instance, and its tags.
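
To make the filtering concrete, here is a rough Python equivalent of that DescribeInstances query. It is a sketch, not the content of the SSM document: the function names are hypothetical, the EC2 client is assumed to be a boto3 client supplied by the caller, and pagination is ignored for brevity.

```python
def build_instance_filters(availability_zone, tag_name, tag_value,
                           state="running"):
    """Return DescribeInstances filters for zone, instance state, and tag."""
    return [
        {"Name": "availability-zone", "Values": [availability_zone]},
        {"Name": "instance-state-name", "Values": [state]},
        {"Name": "tag:" + tag_name, "Values": [tag_value]},
    ]


def list_instances(ec2_client, filters):
    """Return the IDs of all instances matching the filters.

    ec2_client is assumed to be a boto3 EC2 client. Only the first page
    of results is read in this sketch.
    """
    response = ec2_client.describe_instances(Filters=filters)
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
```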


Outputs

As explained earlier, the output from one step can be used as input in the following step. SSM Automation uses a JSONPath expression in the Selector field of the step’s outputs to help select the proper output.


A JSONPath expression is a string beginning with “$.” used to select one or more components within a JSON element (e.g., the output of the DescribeInstances API call). The JSONPath operators that are supported by SSM Automation are:

  • Dot-child (.): This operator selects the value of a specific key from a JSON object.
  • Deep-scan (..): This operator scans a JSON element level by level and selects a list of values with the specific key. The return type of this operator is always a JSON array. The output type for this operator can be either StringList or MapList.
  • Array-index ([ ]): This operator gets the value of a specific index from a JSON array.

In this first step, the output “$.Reservations..Instances..InstanceId” returns a list of InstanceIds filtered by availability-zone, state, and tag.
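
To illustrate what the deep-scan operator does with that expression, here is a small Python approximation. It is only a sketch of the semantics described above (a level-by-level scan that always returns a list), not SSM Automation’s actual implementation, and the sample response is made up.

```python
from collections import deque


def deep_scan(element, key):
    """Scan a JSON element level by level and collect every value stored under key."""
    found = []
    queue = deque([element])
    while queue:
        node = queue.popleft()
        if isinstance(node, dict):
            for name, value in node.items():
                if name == key:
                    found.append(value)
                queue.append(value)  # keep scanning deeper levels
        elif isinstance(node, list):
            queue.extend(node)
    return found


# Made-up, heavily trimmed DescribeInstances response.
response = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-aaa"}, {"InstanceId": "i-bbb"}]},
        {"Instances": [{"InstanceId": "i-ccc"}]},
    ]
}

# Rough equivalent of "$.Reservations..Instances..InstanceId"
instances = deep_scan(response["Reservations"], "Instances")
instance_ids = deep_scan(instances, "InstanceId")
print(instance_ids)  # ['i-aaa', 'i-bbb', 'i-ccc']
```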

Second step — SeletRandomInstance

The second step of the document uses an action type aws:executeScript that executes an inline Python script, which returns a random InstanceId from a list of InstanceIds.

Note: The function defined in the handler must have two parameters, events and context.


The output of the script execution is a Payload object on which you can execute the JSONPath selector, in this example $.Payload.InstanceId.
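
A handler of that shape could look like the sketch below. The function names and the Count key are hypothetical, and the events dictionary is assumed to carry the InstanceIds list produced by the first step. The second function shows one possible way to return several instances instead of one.

```python
import random


def get_random_instance(events, context):
    """Pick one instance at random; the returned dict becomes the step's Payload."""
    instance_ids = events["InstanceIds"]
    return {"InstanceId": random.choice(instance_ids)}


def get_random_instances(events, context):
    """Variant returning several distinct instances (Count is a made-up input)."""
    instance_ids = events["InstanceIds"]
    count = min(int(events.get("Count", 1)), len(instance_ids))
    return {"InstanceIds": random.sample(instance_ids, count)}
```

Both functions take the two required parameters, events and context, even though context is unused here.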

Third step — verifyInstanceStateRunning

The third step of the document uses another type of action, aws:waitForAwsResourceProperty, that asserts the state of the random InstanceId returned from step two.


In that step, the selector checks the state of the instances to make sure they are running. I want to make sure all instances are running before messing with them.

Note: As you may have noticed, the input is a StringList, but with a single item, InstanceId. That allows us to easily modify the random function from the previous step to return several items instead, without having to change anything else in the document.
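
Conceptually, this kind of step polls a resource property until it reaches a desired value or a timeout expires. The Python sketch below shows that idea under stated assumptions: get_states is a hypothetical callable returning the current state of each selected instance, and the default timeout and polling interval are arbitrary values chosen for the example, not SSM defaults.

```python
import time


def wait_for_state(get_states, desired_values, timeout_seconds=600,
                   poll_seconds=15, sleep=time.sleep, clock=time.monotonic):
    """Poll until every instance state is in desired_values.

    Returns True on success and False if the timeout expires first.
    sleep and clock are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout_seconds
    while True:
        states = get_states()
        if states and all(state in desired_values for state in states):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

Because the function works on a list of states, it behaves the same whether one instance or several were selected in the previous step.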

Fourth and Fifth step — stopInstances and forceStopInstances

The fourth and fifth steps of the document use the action type aws:changeInstanceState. As you have probably guessed, these steps change the state of EC2 instances (in this example, to stopped). The input is again the InstanceId from step two.


Why use stopInstances and forceStopInstances steps?

In the stopInstances step, the EC2 control plane attempts to gracefully shut down the selected EC2 instance, allowing it to flush its file system caches or file system metadata. However, sometimes, there may be an issue with the underlying host computer, and the instance might get stuck in the stopping state. That is why the forceStopInstances step sets Force to true, which forces the instances to stop.

Note: The second step, forceStopInstances, is not recommended for EC2 instances running Windows Server.

Note: The default timeout value for the aws:changeInstanceState action is 3600 seconds (one hour). You can limit or extend the timeout by specifying the timeoutSeconds parameter.
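
The graceful-then-forced pattern can be sketched in Python as follows. This is an illustration only: the function name and the is_stopped callback are hypothetical, the EC2 client is assumed to be a boto3 client, and the control flow (force only if the graceful stop did not succeed) is my simplification rather than a description of how the SSM document chains its two steps.

```python
def stop_then_force(ec2_client, instance_ids, is_stopped):
    """Try a graceful stop first and fall back to a forced stop.

    is_stopped is a caller-supplied callable that waits and reports
    whether all the given instances reached the stopped state.
    """
    ec2_client.stop_instances(InstanceIds=instance_ids)
    if is_stopped(instance_ids):
        return "stopped"
    # The instance may be stuck in the stopping state: force it.
    ec2_client.stop_instances(InstanceIds=instance_ids, Force=True)
    return "force-stopped" if is_stopped(instance_ids) else "stuck"
```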

For more information on EC2 stop-instances API, click here. For troubleshooting errors, click here.

Last step — verifyInstanceStateStopped

Finally, the last step of this document is to verify the state of the instances to be stopped or terminated. This step is arguably redundant since aws:changeInstanceState also asserts on the desired value. However, for the sake of this example, I preferred to make that step explicit.


Nuff said — Let’s demo this!

For this example, I will assume that you already have some EC2 instances launched in your AWS account with appropriate tags (I use SSMTag:chaos-ready for the demo).

1- Create an IAM role for SSM Automation

By default, SSM doesn’t have permission to perform actions on your AWS resources. Start by creating a role (e.g., SSMAutomationChaosRole) with the following policy.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction"
      ],
      "Resource": [
        "arn:aws:lambda:*:*:function:ChaosAutomation*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:StartInstances",
        "ec2:RunInstances",
        "ec2:StopInstances",
        "ec2:TerminateInstances",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceStatus"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "sns:Publish"
      ],
      "Resource": [
        "arn:aws:sns:*:*:ChaosAutomation*"
      ]
    }
  ]
}

It should give you enough to get started with actions calling EC2, SSM Run Command, and AWS Lambda. You should, of course, extend or restrict this policy to your own needs.

2- Fault injection documents

To get you started, I created a few ready-to-use SSM Automation documents.

https://github.com/adhorn/chaos-ssm-documents/

Currently, the following chaos experiments are available — feel free to ask or contribute for more!

1- Randomly stopping instances using EC2 API
2- Randomly stopping instances using AWS Lambda
3- Injecting multiple CPU stresses on EC2 instances using AWS Run Command

To use any of them, you need to create an SSM Automation document using the AWS CLI as follows:

> aws ssm create-document --content file://stop_random_instance_api.yml --name "StopRandomInstances-API" --document-type "Automation" --document-format YAML

After uploading the document, you should see it under the Owned by me tab in AWS Systems Manager Documents, filtered by Document type: Automation.
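
If you prefer an SDK over the CLI, the upload can be sketched in Python as below. The function name is hypothetical, and the SSM client is assumed to be a boto3 client created by the caller with the right region and credentials.

```python
from pathlib import Path


def upload_automation_document(ssm_client, path, name):
    """Create an SSM Automation document from a local YAML file.

    Returns the document name reported back by the service.
    """
    content = Path(path).read_text(encoding="utf-8")
    response = ssm_client.create_document(
        Content=content,
        Name=name,
        DocumentType="Automation",
        DocumentFormat="YAML",
    )
    return response["DocumentDescription"]["Name"]
```

For example, upload_automation_document(client, "stop_random_instance_api.yml", "StopRandomInstances-API") mirrors the CLI command above.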


3- Executing the fault injection document

Go to the Automation dashboard in AWS Systems Manager and click .

SSM Automation dashboard

Filter the documents by Owner: Owned by me, and you should see your newly uploaded document(s).


Select the StopRandomInstances-API automation document and click


Note: If you prefer using the AWS CLI, notice that the console outputs the AWS CLI command execution equivalent.

You enter the input parameters defined in the automation document here, namely AvailabilityZone, TagName, and TagValue (I use SSMTag:chaos-ready). Remember to select the correct role created earlier, in this demo SSMAutomationChaosRole, to allow the execution of the experiment.

Before running the experiment, let’s take a look at my instances currently running in eu-west-1.


As you can see, I have four instances in eu-west-1a but only three with the correct tag SSMTag:chaos-ready. I will use that information to verify that my filters are working correctly.

Let’s execute the experiment.


You can follow the execution of each step from the AWS Console. Each step gets a that you can monitor independently. Following is a zoom on


We can now check and verify that our filters work. And indeed, we have three instances with the correct set of tags in eu-west-1a.

A zoom on the second step shows us the randomly selected instance: i-01f069058c584b2bc.


Once all the steps have completed successfully, we can verify that the correct instance stopped: i-01f069058c584b2bc.


As you can see, our EC2 fault injection worked.

4- Cancelling Executions

You might have noticed the in the execution status page.


Yes — that’s our Big Red Button right there!

You can only attempt to cancel an execution, since SSM cannot guarantee that actions can be stopped or reverted. For example, you can’t undo an activity that is already happening, such as stopping or terminating an instance.
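
The same cancellation request can be sent programmatically. Here is a minimal Python sketch, assuming a boto3 SSM client is supplied by the caller; the function name is hypothetical. It requests the cancellation and then reads back the execution status, which may still show the execution as in progress or cancelling.

```python
def cancel_execution(ssm_client, execution_id):
    """Request cancellation of an automation execution and return its status.

    Cancellation is best effort: actions already under way are not undone.
    """
    ssm_client.stop_automation_execution(
        AutomationExecutionId=execution_id,
        Type="Cancel",
    )
    response = ssm_client.get_automation_execution(
        AutomationExecutionId=execution_id,
    )
    return response["AutomationExecution"]["AutomationExecutionStatus"]
```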

As always, with chaos engineering, be extra careful with your experiments — plan carefully!

5- Continuous chaos testing

What made Chaos Monkey so unique was that it ran continuously in Netflix’s environment, shutting down EC2 instances at a regular interval. It wasn’t just a one-off.

Now that you have successfully executed your EC2 failure injection with SSM Automation, you might want to turn that into a continuous chaos test, or continuous verification.

Continuous chaos testing simply means that you regularly execute the failure injection to verify that the application repeatedly withstands failures.

Luckily, it is straightforward to do!

You can execute the above SSM Automation by specifying our SSM document as the target of an Amazon CloudWatch event.

Amazon CloudWatch — Create Rule
  1. Open the CloudWatch console, choose in the left navigation pane, and click .
  2. Choose and specify the recurrence by using the cron format. For demo purposes, I choose to execute the SSM Automation document every 5 minutes, which is represented by the Cron expression 0/5 * * * ? * .
  3. Then click and choose from the Select target type list. Choose the Automation document created above as your target- .
  4. Expand , and enter each of the required values — AvailabilityZone, TagName, TagValue and AutomationAssumeRole.
  5. In the permissions section, to call SSM Automation Execution, or select an existing one.
  6. Click , add a name and a description. Select Enabled state and click . Make sure you add a distinct name with an accurate description; you want to make it apparent that it is a chaos engineering rule!

You can verify, change, or disable the rule from the CloudWatch console afterward.
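
The console steps above can also be scripted. The Python sketch below is a rough outline under several assumptions: events_client is a boto3 CloudWatch Events client, document_arn is the ARN of the automation document to target and is supplied by the caller, events_role_arn is a role that allows the rule to start the automation, and the target accepts the document parameters as a JSON string in Input. The function name, rule description, and target Id are made up for the example.

```python
import json


def schedule_experiment(events_client, rule_name, document_arn,
                        events_role_arn, parameters,
                        schedule="cron(0/5 * * * ? *)"):
    """Create a scheduled rule that triggers the automation document."""
    events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
        Description="Chaos engineering: randomly stops EC2 instances",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": "chaos-automation",       # arbitrary target identifier
            "Arn": document_arn,            # supplied by the caller
            "RoleArn": events_role_arn,
            "Input": json.dumps(parameters),
        }],
    )
```

As with the console, pick a rule name and description that make it obvious this is a chaos engineering rule.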


After a while, you should start seeing executions of the SSM Automation document every 5 minutes.


As you can see, the last four executions differ: they show the IAM role assumed by the CloudWatch event that calls SSM Automation.

That’s it! We have successfully built our custom Chaos Monkey using SSM Automation. Hopefully, this blog post will inspire you to start your journey with chaos engineering. Feel free to comment, share your ideas, or submit pull requests if you want to add new functionalities to this collection of SSM documents.

If you are interested in doing the same experiment but with actions using AWS Lambda, use this document with this lambda function.

-Adrian

The Cloud Architect

Resilient, scalable, and highly available cloud architectures.

Adrian Hornsby

Written by

Principal Developer Advocate, Architecture @awscloud ☁️ I break stuff .. mostly. Opinions here are my own.
