Let’s Rebuild AWS EC2 (Part 1)

Looking deeper into some of the nuances of ec2 (idempotency, retries, conditional writes, time to live, etc) and extrapolating them to other system designs

Connor Butch
10 min read · Apr 18, 2022

Let’s pretend we are building a replica of ec2 (or at least part of it). We are focusing on the run instances functionality. Specifically, we will consider how this command would be used from within an iac orchestrator (such as terraform or cloudformation).

NOTE: this article is meant to demonstrate how ec2 might be designed and to show concepts we can apply when designing our own apis. While it is close to how ec2 actually works, it may differ slightly (for example, it uses a simpler request so we can focus on the system design) rather than function exactly the same way ec2 does.

Special Considerations

Before we even get to launching the instance, there are a few special considerations I’d like to draw your attention to. Please keep these in mind while reading, as they will help explain why we design the system the way we do:

  • Remember that everything fails all the time, so our requests must be retryable. This allows clients to blindly retry whenever they do not receive a 2xx or 4xx response.
  • We must launch exactly one instance for a given request and its associated retries. This helps manage cost and ensures we don’t launch extra instances that aren’t tracked via iac.
  • We must return the same response, including the same reservation id, for both an original request and its associated retries. This lets the iac tool track the instance it has launched for updates and deletions (for example, keeping track of the instance by reservation id in the tfstate file), which is not possible if retries return different reservation ids.

System Overview

Client-Side Flow

The client populates the request with information about the instance(s) to launch. For simplicity’s sake, our example only takes in the instance type and a client token (see an example request below), but the actual run instances api request has much more information.

{
"instanceType": "t2.large",
"clientToken": "27d49fc2-996c-41b7-b945-5da9f6ddb2a5"
}

When making a request, the client generates a unique id for use in the clientToken field. The client then makes a launch instance request, passing the clientToken and instance type. If the client does not get a 2xx or 4xx response, it retries the same request (using exponential backoff and jitter) with the same clientToken until it receives either a 2xx or 4xx response or exceeds a maximum number of retries.
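A minimal sketch of what such a retry loop might look like in TypeScript is shown below. The url, retry limit, and backoff constants are purely illustrative assumptions (and the global fetch assumes Node 18+); in practice, the aws sdks implement this kind of retry logic for you.

// A rough sketch of a client retry loop with exponential backoff and full jitter.
// The url, headers, retry limit, and backoff constants are illustrative assumptions.
import { randomUUID } from "crypto";

async function launchInstance(instanceType: string): Promise<{ reservationId: string }> {
  const clientToken = randomUUID(); // generated once and reused for every retry
  const maxRetries = 5;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch("https://api.example.com/instances", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ instanceType, clientToken }),
    });

    if (response.ok) {
      // 2xx: the server returns the reservation id (the same one on every retry)
      return (await response.json()) as { reservationId: string };
    }
    if (response.status >= 400 && response.status < 500) {
      // 4xx: a validation error; retrying the same request will not help
      throw new Error(`Request rejected with status ${response.status}`);
    }

    // Anything else (e.g. a 5xx): back off with full jitter and retry
    const backoffMs = Math.random() * Math.min(10_000, 100 * 2 ** attempt);
    await new Promise((resolve) => setTimeout(resolve, backoffMs));
  }
  throw new Error("Exceeded maximum number of retries");
}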

Server-Side Architecture

[Diagram: the server-side architecture, featuring an api gateway, a lambda, and a dynamodb table]

On the server side, we have a REST (post) endpoint exposed via the api gateway, which accepts a request to launch a new instance. The api gateway routes traffic to a lambda backend, which performs basic validation that cannot be done by the api gateway based on the open api specification. If the request is invalid, the lambda returns a 400. Otherwise, it generates a unique reservation id for the instance to be provisioned, and then attempts to save the request and the generated reservation id to dynamodb using a conditional write (a put item request with a condition that no item already exists for that clientToken). If the conditional write succeeds, then we know this is the first time we are handling this request, and we know the reservation id is the one we just generated.

If the write fails with a ConditionalCheckFailedException, then we know this is a retry of a request for which we have already initiated the instance provisioning process. We query the table for the item with the same clientToken to get the reservation id we generated for the original request, and include that id in the response.

{
"reservationId": "1620c4cd-1b34-4af3-a89b-815a9d5a650f"
}

NOTE: the actual ec2 implementation also checks that you pass the same request body for the same clientToken. If you pass the same clientToken with different request bodies (within a given period), it will return an error; however, I omitted this here for simplicity’s sake 😜
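To make this concrete, here is a rough sketch of what the lambda handler might look like using the aws sdk v3 for javascript. The table name, attribute names, and response shape are assumptions for illustration and will not match the repo exactly; the important parts are the ConditionExpression and catching only ConditionalCheckFailedException.

// Sketch of the launch-instance handler: a conditional write, falling back to a
// read of the existing item only when the condition check fails.
// Table and attribute names are assumptions for illustration.
import { randomUUID } from "crypto";
import {
  DynamoDBClient,
  PutItemCommand,
  GetItemCommand,
  ConditionalCheckFailedException,
} from "@aws-sdk/client-dynamodb";

const dynamo = new DynamoDBClient({});
const TABLE_NAME = "LaunchRequests"; // assumed table name

export async function handler(event: { body: string }) {
  const { instanceType, clientToken } = JSON.parse(event.body);
  if (!instanceType || !clientToken) {
    return { statusCode: 400, body: JSON.stringify({ message: "invalid request" }) };
  }

  const reservationId = randomUUID();
  const ttl = Math.floor(Date.now() / 1000) + 300; // expire this item roughly five minutes from now

  try {
    // Conditional write: succeeds only if no item exists yet for this clientToken
    await dynamo.send(new PutItemCommand({
      TableName: TABLE_NAME,
      Item: {
        clientToken: { S: clientToken },
        reservationId: { S: reservationId },
        instanceType: { S: instanceType },
        ttl: { N: ttl.toString() },
      },
      ConditionExpression: "attribute_not_exists(clientToken)",
    }));
    // First time we have seen this clientToken, so return the id we just generated
    return { statusCode: 200, body: JSON.stringify({ reservationId }) };
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) {
      // A retry: fetch the reservation id generated for the original request
      const existing = await dynamo.send(new GetItemCommand({
        TableName: TABLE_NAME,
        Key: { clientToken: { S: clientToken } },
      }));
      return {
        statusCode: 200,
        body: JSON.stringify({ reservationId: existing.Item?.reservationId?.S }),
      };
    }
    throw err; // anything else surfaces as a 5xx, which the client will retry
  }
}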

After this, we would likely have a lambda that subscribes to the table’s change stream for inserts (using event source filtering so it is not triggered for deletes, such as the ttl expirations discussed below), and actually provisions the instance (not discussed in this article, but maybe I’ll write about it in a later one 😉).
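For reference, that event source filtering could look roughly like this in the sam template; the function and table logical ids are assumptions for illustration.

# Sketch of a filtered dynamodb stream event source (logical ids are assumed)
ProvisionInstanceFunction:
  Type: AWS::Serverless::Function
  Properties:
    # handler, runtime, and policies omitted from this sketch
    Events:
      NewLaunchRequests:
        Type: DynamoDB
        Properties:
          Stream: !GetAtt LaunchRequestsTable.StreamArn
          StartingPosition: LATEST
          FilterCriteria:
            Filters:
              - Pattern: '{"eventName": ["INSERT"]}'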

Optimization

We’ve configured our dynamodb table with a ttl attribute (named ttl). When we write an item to the table, we set its ttl to five minutes later than the current time, expressed as an epoch timestamp in seconds. Dynamodb then automatically deletes items (approximately) five minutes after we write them, which saves us money on storage and keeps the table smaller (and easier to debug). You can see this done in our sam template here.
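The table-side configuration might look roughly like the snippet below (the logical id and the other table properties are assumptions for illustration); the handler sketch above shows the matching write, where ttl is set to the current epoch time in seconds plus 300.

# Sketch of enabling ttl on the table (logical id and key schema are assumed)
LaunchRequestsTable:
  Type: AWS::DynamoDB::Table
  Properties:
    BillingMode: PAY_PER_REQUEST
    AttributeDefinitions:
      - AttributeName: clientToken
        AttributeType: S
    KeySchema:
      - AttributeName: clientToken
        KeyType: HASH
    StreamSpecification:
      StreamViewType: NEW_IMAGE
    TimeToLiveSpecification:
      AttributeName: ttl
      Enabled: true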

Nuanced implementation details

When I first started, I thought I would use a combination of two features (conditional updates and return values) to execute this in a single call, but I had to change plans. This is because when a condition is violated for a conditional write in dynamodb, the client throws a ConditionalCheckFailedException and does not return the existing values. Hence, we have to attempt the conditional write to dynamo and then perform different actions based on whether or not the write attempt throws a ConditionalCheckFailedException.

If the write succeeds, this means it is our first time handling this request (i.e. it is not a retry of an original request with the same clientToken that already reached us), and we can return the generated reservation id. If it fails with a ConditionalCheckFailedException (catch only this specific exception, not all exceptions), then we know this is a retry of a request we have already handled, so we must make a get item request to retrieve the previously generated reservation id. However, this design can lead to latent bugs, as it is a form of fallback, which we should avoid whenever possible. To minimize the drawbacks associated with fallbacks, we must do the following:

  • emit individual metrics on both retries and original requests, and set separate alarms on both statistics
  • routinely exercise (test) all code branches to avoid the latent bugs associated with rarely executed fallbacks

Metrics

In order to properly operate our system, we should create the following metrics using emf. These metrics should live in the replicaOfEc2 namespace, have no dimensions, and carry a single property, clientToken (the clientToken passed in the request). That property can then be used with cloudwatch logs insights to gather information on a specific request when needed while keeping costs low (there are tons of benefits to this approach, and I highly recommend reading this article for more details). You can use the count aggregator to see how many requests of each type (originals versus retries) are being sent to your service, and the p99 aggregator to look at the 99th-percentile latency of a given action.

  • insertSuccessLatency — time taken to insert on a non-retry request
  • insertDuplicateLatency — time taken to attempt an insert on a retry (and fail the condition check)
  • retrieveExistingLatency — time taken to retrieve the existing reservation id
  • nonRetryTotalLatency — total time taken for a non-retry request
  • retryTotalLatency — total time taken for a retry request

You can (and should) use these metrics to drive separate alarms and display them on your dashboard.
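Emitting one of these metrics could look roughly like the sketch below, which simply logs the documented emf envelope; the helper function is an assumption for illustration (you could equally use a library such as aws-embedded-metrics).

// Sketch of emitting an emf metric by logging the embedded metric format envelope.
// The helper name is an assumption; the namespace, property, and metric names
// follow the conventions described above.
function emitLatencyMetric(metricName: string, latencyMs: number, clientToken: string): void {
  console.log(JSON.stringify({
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [{
        Namespace: "replicaOfEc2",
        Dimensions: [[]], // a single empty dimension set, i.e. no dimensions
        Metrics: [{ Name: metricName, Unit: "Milliseconds" }],
      }],
    },
    [metricName]: latencyMs,
    clientToken, // a property (not a dimension), queryable via logs insights
  }));
}

// e.g. emitLatencyMetric("retrieveExistingLatency", elapsedMs, clientToken);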

Exercising all code branches

It is well documented that having fallbacks (or branching in general) in distributed systems can have serious consequences. However, in cases where we must have branching, we can minimize the chance of latent errors by continually exercising all branches within our code using cloudwatch synthetics (canaries). I’ll write another article on this, but the point you should take away is to ensure you are regularly exercising any code branches you have, so you discover bugs before they lie latent and eventually bring down your system.

Handling race conditions

In addition to using metrics/canaries to mitigate the fallback, we also need to take special care to avoid race conditions, which can result in two instances being launched. This could occur if we made a get item request first and then, if no item existed, made a put item request. To get an idea of how this could happen, consider the following situation:

A request goes out over the network and hits considerable latency (say, an extra second of delay). It eventually reaches the server, which checks and finds that no item exists for the token. At this point the request is still being processed, but the client has timed out, so it sends a retry. The retry doesn’t experience much latency, so it reaches the server before the original request has written to the database, and it writes an item with a randomly generated reservation id. At around the same time, the original request writes an item with a different id. This results in two records being written to the change stream for the same clientToken, which in turn results in two instances being launched for the same request, a clear violation of our requirements.

In order to avoid this double write, we follow the architecture described above: perform the conditional write first, and only if it fails the condition check, perform a fetch.
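The key point is that the existence check and the write happen atomically inside dynamodb rather than as two separate calls. The snippet below (reusing the assumed names from the earlier handler sketch) is the line that provides that guarantee.

// The condition is evaluated and the put is applied as a single atomic operation,
// so two concurrent requests with the same clientToken cannot both succeed.
await dynamo.send(new PutItemCommand({
  TableName: TABLE_NAME,
  Item: { clientToken: { S: clientToken }, reservationId: { S: reservationId } },
  ConditionExpression: "attribute_not_exists(clientToken)",
}));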

See it in action

In order to run the application:

  • clone the repo
git clone git@gitlab.com:connorbutch/ec2-clone.git
  • build and deploy the provided sam template to provision the infrastructure, and copy the output api gateway url
sam build && sam deploy --guided
[Screenshot: deploying your own copy of this stack is easy!]
  • run the cucumber tests that don’t involve retries
cd launch-instance/ && npm run cucumberNonRetry
  • look at the associated report, which shows the cucumber test results
[Screenshot: an example of the report that is generated and automatically opened after running the tests (notice the @NonRetry scenario tag)]
  • look at the cloudwatch dashboard — notice that it only shows metrics for original requests (the metrics for retries are not present yet)
  • now run the cucumber tests associated with retries and view the associated test results
cd launch-instance/ && npm run cucumberWithRetry
[Screenshot: an example of the report that is generated and automatically opened after running the tests (notice the @Retry scenario tag)]
  • open the cloudwatch dashboard again — notice that it now shows metrics for retries
[Video: running the retry tests — the dashboard now has stats for retries (bottom row)]

Other applications

As mentioned above, the ideas in this article can (and should) be applied in other contexts as well. Consider, for example, a crm solution similar to salesforce, in which an individual can be created/registered using a variety of different identifying information (say ssn, drivers license number, etc.). In this case, the api can accept a post request containing one pertinent identifier (i.e. a specific field such as ssn) and return an id for the party inside of the crm. This pairs nicely with single table design, where the partition (primary) key is a prefix, an underscore, and the identifying value concatenated together (i.e. an ssn of 123456789 would be stored as ssn_123456789, and a drivers license of b111111111 would be stored as license_b111111111)… but that’s another topic for another article 😉

{
"ssn": "123456789"
}

creating a party based on ssn would have a request like this

{
"driversLicenseNumber": "b111111111"
}

creating a party based on drivers license would have a request like this

These items could both be stored in the same table as shown below

[Table screenshot: an example using single table design, showing two different parties being created — one identified by drivers license and one identified by ssn]
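Since the screenshot isn’t reproduced here, the stored items might look roughly like this (the pk and partyId attribute names, and the placeholder ids, are assumptions for illustration):

pk                    partyId
ssn_123456789         <generated uuid for the ssn-based party>
license_b111111111    <generated uuid for the license-based party>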

This could also be used for making cqrs and event sourcing idempotent (since, by default, event sourcing is not idempotent).

Conclusion

In this article, we “recreated” a portion of ec2’s launch instance capabilities, and in doing so we saw that there are many subtle implementation details that matter when creating a highly usable service. We were reminded that all apis must be idempotent, and that client sdks should automatically retry non-success responses that do not contain validation errors, using exponential backoff and jitter. We also saw that in order to provide our clients (specifically those using iac) with the best experience possible, we must go beyond idempotency and also return the same response for any given request and its associated retries. This allows clients to track the ec2 reservation id for comparison when determining what actions to take on the instance. We implemented this by having the client pass a unique clientToken for a given request (and its associated retries). We then generated a unique reservation id and attempted to save it using a dynamodb conditional write with the clientToken as the partition key. If the write succeeded, we returned the newly generated id to the client; if it failed the condition check, we retrieved the existing item and returned that instead. While this implementation may seem specific to ec2, we can extrapolate these design patterns and use them when designing a wide variety of systems.

Further Reading

Be on the lookout for the next article in the series, where I discuss the next part of this ec2 replica.
