Testing Spot Instance interruptions with AWS Fault Injection Simulator

Chaos engineering on AWS

“Without data, you’re just another person with an opinion.”
― W. Edwards Deming

A couple of years ago, I wrote about Operational Excellence (OE) and discussed the three interconnecting elements needed to successfully operate the technology we build. First, you need great tools. Second, you need complete processes. Third, and arguably most important, you need the right culture.

I mentioned that OE resembles a habit, a philosophy, a mindset — one that embraces problem-solving, one that values continuous improvement, and one that aims to exceed goals consistently. It’s a way to anticipate, address, and effectively respond to issues. And, for Amazon, it also means doing all of that at a significant scale, where significant can mean thousands of people and millions of servers across the globe.

Above anything, a culture focused on Operational Excellence means that you don’t speculate. You don’t speculate about the security, the performance, the resilience, and the health of your service — or anything else for that matter. You use data. Data alone lets you understand, and verify, what happens to your application when the environment in which it operates suddenly changes.

That’s why we launched AWS Fault Injection Simulator (FIS) in March 2021.

EC2 Spot Instance interruptions with FIS

With FIS, you can run experiments in which you inject faults into underlying resources like EC2 instances, an EKS cluster, or a VPC, to test how your application responds. It helps you understand the resilience of your application under various conditions. You get data that you can use to evaluate, understand, and measure changes to your architecture.

A few weeks ago, we made testing EC2 Spot Instance interruptions with FIS easier by bringing FIS experiments right into the Spot console. Spot Instances are spare EC2 compute capacity, available to you with no commitment and at steep discounts of up to 90% compared to On-Demand prices.

The flip side of the big discount is that Spot Instances can be interrupted with a two-minute notification (the EC2 Spot Instance Interruption Warning) and terminated whenever spare capacity is no longer available. The two-minute interruption notification is delivered via the EC2 instance metadata service as well as Amazon EventBridge.

Customers build resilient applications leveraging Spot Instances by automating a response to this two-minute notification. Examples include draining containers, draining ELB connections, or post-processing.
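To make the idea of an automated response more concrete, here is a minimal Python sketch of the polling side, run on the instance itself. It assumes IMDSv2 and the spot/instance-action metadata path, which returns a small JSON document once an interruption is scheduled and a 404 otherwise. The function names are mine, and the actual draining logic is left to you.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

# Link-local address of the EC2 instance metadata service.
IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body):
    """Return (action, deadline) from the spot/instance-action JSON document."""
    doc = json.loads(body)
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return doc["action"], deadline.replace(tzinfo=timezone.utc)


def seconds_remaining(deadline, now=None):
    """Seconds left before the interruption deadline, never negative."""
    now = now or datetime.now(timezone.utc)
    return max(0.0, (deadline - now).total_seconds())


def fetch_instance_action(timeout=2):
    """Return the raw notice body, or None when no interruption is scheduled."""
    token_request = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_request, timeout=timeout).read().decode()
    notice_request = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        return urllib.request.urlopen(notice_request, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 means no interruption notice yet
            return None
        raise
```

A small loop could call fetch_instance_action() every few seconds and, once it returns a body, pass it to parse_instance_action() and start draining with the time reported by seconds_remaining().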

Customers can also rely on EC2 Auto Scaling to add and remove Spot Instances as needed. Spot Instances are spare capacity, so their supply and demand vary. By being flexible with your Spot allocation strategy, you increase your chances of getting the desired capacity and decrease the potential number of interrupted instances if EC2 needs to reclaim Spot Instances. After an interruption, or when scaling out, Auto Scaling launches a replacement Spot Instance based on the spare capacity available at that time.

If your application can handle a Spot Instance interruption seamlessly, you get the benefit of the Spot pricing without disruption to your customer experience, even at scale for applications that require 500K concurrent cores.

But instead of guessing and speculating on whether your application can handle that interruption, FIS lets you verify it. In other words, the Spot support in FIS gives you a data-driven answer to the question:

“Is my application’s customer experience resilient to Spot Instance interruptions?”

In an OE-focused culture, data drives every decision, whether the decision is business, operational, or engineering. With FIS, customers get data from experiments to prove out assumptions about their cloud architecture.

Spot Instance interruption from the EC2 console

To trigger a Spot Instance interruption from the EC2 console, navigate to the Spot Requests section in the EC2 console, select a Spot Instance request, open Actions, and select Initiate interruption.

EC2 Spot Request section — Select Actions and click Initiate interruption

FIS requires permission to conduct the Spot interruption on your behalf. If you select the Default role option, FIS creates a default role for you. Then click Initiate interruption.

Initiating a Spot Instance interruption first sends a 2-minute notice, and then interrupts the Spot Instance.

That is it! It is that simple. In just a couple of clicks, you can trigger a Spot Instance interruption and verify how your application’s automated response behaves.

Behind the scenes, this feature uses FIS to inject the interruption. You can find the details of the FIS experiment by navigating to the FIS console.

If you have never tried Spot Instances or FIS, you can use the tutorial below. It will guide you through the process of launching Spot Instances and using FIS to initiate a Spot interruption action.

So, buckle up!

Tutorial: Using FIS to test Spot interruptions

Launching Spot Instances

To launch a Spot Instance with the AWS CLI, you have two choices: (a) the RequestSpotInstances EC2 API or (b) the RunInstances EC2 API. You can also submit Spot Instance requests through the AWS console, but this tutorial focuses on the CLI experience.

Note: the instance type may not be available to you when you follow this tutorial, in which case you will get an InsufficientInstanceCapacity error. Adjust the instance type accordingly.

(a) Using the EC2 RequestSpotInstances API:

First, create a file (e.g., launchSpec.json) with the following example specifications:

{"InstanceType": "c1.medium", "ImageId": "ami-0c4e4b4eb2e11d1d4"}

Then call the EC2 API as follows:

aws ec2 request-spot-instances \
--spot-price "0.5" \
--instance-count 1 \
--type "one-time" \
--instance-interruption-behavior "terminate" \
--launch-specification file://launchSpec.json

To tag the EC2 instance launched from the Spot Instance request, first retrieve its instance ID:

aws ec2 describe-spot-instance-requests \
--query "SpotInstanceRequests[*].{ID:InstanceId}"

When you get a response back, save the Instance ID and use it to create the instance tag. We use the tag below to scope down the permissions and to target resources in the FIS experiment.

aws ec2 create-tags \
--resources <instance-id> \
--tags Key=FIS-Ready,Value=spot

(b) Using the EC2 RunInstances API:

The advantage of this API is that you can tag the Spot Instance at launch, without any extra calls.

aws ec2 run-instances \
--image-id ami-0c4e4b4eb2e11d1d4 \
--instance-type c1.medium \
--placement '{"AvailabilityZone": "us-east-1a"}' \
--count 1 \
--instance-market-options '{"MarketType":"spot", "SpotOptions": {"MaxPrice": "2.00", "SpotInstanceType": "one-time", "InstanceInterruptionBehavior": "terminate"}}' \
--tag-specifications 'ResourceType=instance,Tags=[{Key=FIS-Ready,Value=spot}]'

Once your Spot request is out, using either method (a) or (b), you can describe the request using the following:

aws ec2 describe-spot-instance-requests
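If you prefer to script this step, the JSON that the command returns can be filtered in a few lines of Python. The sketch below assumes the usual response shape (a SpotInstanceRequests list whose entries carry Status.Code and, once fulfilled, an InstanceId). The function name and the sample IDs are made up for illustration.

```python
import json


def fulfilled_instance_ids(describe_output):
    """Return the instance IDs of Spot requests that are currently fulfilled."""
    doc = json.loads(describe_output)
    return [
        request["InstanceId"]
        for request in doc.get("SpotInstanceRequests", [])
        if request.get("Status", {}).get("Code") == "fulfilled"
        and "InstanceId" in request
    ]


# Made-up sample, trimmed to the fields the function reads.
SAMPLE_OUTPUT = json.dumps({
    "SpotInstanceRequests": [
        {
            "SpotInstanceRequestId": "sir-example1",
            "InstanceId": "i-0123456789abcdef0",
            "Status": {"Code": "fulfilled"},
        },
        {
            "SpotInstanceRequestId": "sir-example2",
            "Status": {"Code": "pending-evaluation"},
        },
    ]
})

print(fulfilled_instance_ids(SAMPLE_OUTPUT))
```

In practice you would feed it the output of the describe command (with --output json) instead of the sample, and pass the resulting IDs to create-tags.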

IAM role for the FIS Experiment

When creating an Experiment template in FIS, you must provide a roleArn. The provided role must have the required permissions to perform the actions in the Experiment template. An example permissions policy for a role able to perform the Spot interruption action looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "InterruptSpot",
      "Effect": "Allow",
      "Action": "ec2:SendSpotInstanceInterruptions",
      "Resource": "arn:aws:ec2:*:<your-account-id>:instance/*"
    }
  ]
}

The role must also have a trust policy like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "fis.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

This is the simplest kind of role that would work. It explicitly authorizes the new Spot interruption action and scopes it to instances launched in this account only. As with all IAM-based authorization, you can scope it down further with condition keys as follows:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "InterruptSpot",
      "Effect": "Allow",
      "Action": "ec2:SendSpotInstanceInterruptions",
      "Resource": "arn:aws:ec2:*:<your-account-id>:instance/*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/FIS-Ready": "spot"
        }
      }
    }
  ]
}

This further restricts the role so that it can only target instances tagged with FIS-Ready = spot.
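If you manage several accounts or tags, it can be handy to generate this policy rather than edit JSON by hand. The following Python sketch builds the same document as above. The function name and its parameters are my own, and the account ID in the usage line is a placeholder.

```python
import json


def spot_interruption_policy(account_id, tag_key=None, tag_value=None):
    """Build the permissions policy for the Spot interruption action.

    When tag_key is given, the policy is scoped to instances carrying
    that tag, mirroring the condition-key example in this post.
    """
    statement = {
        "Sid": "InterruptSpot",
        "Effect": "Allow",
        "Action": "ec2:SendSpotInstanceInterruptions",
        "Resource": f"arn:aws:ec2:*:{account_id}:instance/*",
    }
    if tag_key is not None:
        statement["Condition"] = {
            "StringEquals": {f"aws:ResourceTag/{tag_key}": tag_value}
        }
    return {"Version": "2012-10-17", "Statement": [statement]}


# Placeholder account ID, for illustration only.
print(json.dumps(spot_interruption_policy("123456789012", "FIS-Ready", "spot"), indent=2))
```

The printed JSON can be saved to a file and attached to the role with whichever IAM tooling you already use.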

The FIS Experiment

You first need to create an FIS Experiment template containing the Spot interruption action. In my opinion, writing templates in plain JSON (as in the Experiment template below) is pretty handy, since you can pass the file to the FIS API as an argument.

Create a file (e.g., spot.json) with the following template:

{
  "description": "EC2 Spot",
  "targets": {
    "EC2InstancesToInterrupt": {
      "resourceType": "aws:ec2:spot-instance",
      "resourceTags": { "FIS-Ready": "spot" },
      "filters": [
        {
          "path": "State.Name",
          "values": ["running"]
        }
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "InterruptSpotInstance": {
      "actionId": "aws:ec2:send-spot-instance-interruptions",
      "description": "Spot interruption",
      "parameters": {
        "durationBeforeInterruption": "PT2M"
      },
      "targets": {
        "SpotInstances": "EC2InstancesToInterrupt"
      }
    }
  },
  "stopConditions": [
    {
      "source": "none"
    }
  ],
  "roleArn": "arn:aws:iam::<your-account-id>:role/<your-new-role>"
}

The interruption time is determined by the specified durationBeforeInterruption parameter. Two minutes after the interruption time, the Spot Instances are terminated or stopped, depending on their interruption behavior (terminated in our case). A Spot Instance that was stopped by FIS remains stopped until you restart it.
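If you want to experiment with different tags or durations, here is a rough Python sketch that renders the same template with those values as parameters. The function name is hypothetical. The 2 to 15 minute bound is my reading of the FIS documentation for durationBeforeInterruption, so treat it as an assumption and check the docs before relying on it.

```python
import json


def build_spot_template(role_arn, tag_key="FIS-Ready", tag_value="spot", minutes=2):
    """Render the Spot interruption Experiment template as a dict."""
    # Assumption: FIS accepts between 2 and 15 minutes for this parameter.
    if not 2 <= minutes <= 15:
        raise ValueError("minutes must be between 2 and 15")
    target_name = "EC2InstancesToInterrupt"
    return {
        "description": "EC2 Spot",
        "targets": {
            target_name: {
                "resourceType": "aws:ec2:spot-instance",
                "resourceTags": {tag_key: tag_value},
                "filters": [{"path": "State.Name", "values": ["running"]}],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "InterruptSpotInstance": {
                "actionId": "aws:ec2:send-spot-instance-interruptions",
                "description": "Spot interruption",
                # ISO 8601 duration, e.g. PT2M for two minutes.
                "parameters": {"durationBeforeInterruption": f"PT{minutes}M"},
                "targets": {"SpotInstances": target_name},
            }
        },
        "stopConditions": [{"source": "none"}],
        "roleArn": role_arn,
    }


def write_template(path, template):
    """Write the template to disk so it can be passed with file://."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(template, handle, indent=2)
```

Calling write_template("spot.json", build_spot_template("<your-role-arn>")) produces a file equivalent to the template above.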

When you define the targets in the Experiment template, you can choose specific AWS resources (of a specific resource type) to target in your account. Or, you can let FIS identify a group of resources based on the criteria that you provide, for example:

Resource IDs — The resource IDs of specific AWS resources.

Resource tags — The tags applied to specific AWS resources.

Resource filters — The path and values that represent resources with specific attributes.

Resource parameters — The parameters that represent resources that meet specific criteria.

Note: More general info about targets is available in the official docs.

In our Experiment template, we use resourceTags to identify the Spot Instance created above with the tag FIS-Ready = spot.

You now have everything ready to go. You can create the Experiment Template and run some Experiments.

First and foremost, ensure you create the IAM role as specified above. You will also need at least one Spot Instance running in the Region you’re using.

Then, start by creating the Experiment template:

aws fis create-experiment-template --cli-input-json file://spot.json

When you get a response back, save the Experiment Template ID.

Then, all that’s left is to start the Experiment itself:

aws fis start-experiment --experiment-template-id <saved-template-id>

Immediately after the experiment is executed, the target instance receives an EC2 instance rebalance recommendation. If you specified durationBeforeInterruption, there could be a delay between the rebalance recommendation and the interruption notice.

Again, ensure you save the ID in the response as this is the Experiment ID you’ll need to check the status of the Experiment after it’s begun. You can do that like so:

aws fis --region us-east-1 get-experiment --id <saved-experiment-id>
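Rather than re-running that command by hand, you could poll until the experiment settles. The Python sketch below shells out to the same CLI call and reads experiment.state.status from the response. The set of terminal states is my assumption based on the statuses I have seen (completed, stopped, failed), and the function names are mine.

```python
import json
import subprocess
import time

# Assumption: these statuses mean the experiment will not change any further.
TERMINAL_STATES = {"completed", "stopped", "failed"}


def get_experiment_status(experiment_id):
    """Ask the AWS CLI for the current status of an FIS experiment."""
    result = subprocess.run(
        ["aws", "fis", "get-experiment", "--id", experiment_id, "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)["experiment"]["state"]["status"]


def wait_for_experiment(experiment_id, fetch=get_experiment_status,
                        poll_seconds=15, max_polls=40, sleep=time.sleep):
    """Poll until the experiment reaches a terminal state, then return it."""
    for _ in range(max_polls):
        status = fetch(experiment_id)
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"experiment {experiment_id} did not finish in time")
```

The fetch and sleep parameters exist only so the loop can be exercised without calling AWS. In normal use, wait_for_experiment("<saved-experiment-id>") is enough.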

Note that all of these steps can also be performed in the FIS console, which will guide you through the Experiment Template creation process, and allow you to start the Experiment and monitor its progress.

Verify that the Spot Instance was interrupted by the FIS experiment

Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/

From the navigation pane, open Spot Requests and Instances in separate browser tabs or windows.

For Spot Requests, select the Spot Instance request. While the initial status is fulfilled, after the experiment completes, the status changes to instance-terminated-by-experiment.

For Instances, select the Spot Instance. While the initial status is Running, two minutes after you receive the Spot Instance interruption notice, the status changes to Shutting-down and then Terminated.

That’s all, folks. I hope you enjoyed this post. Please don’t hesitate to share your feedback and opinions.

Adrian

Adrian Hornsby

Principal System Dev Engineer @ AWS ☁️ I break stuff .. mostly. Opinions here are my own.