Deployment rollback in a containers world (AWS ECS edition)

Dec 1, 2018

Image by Florian Timm https://www.flickr.com/photos/floriantimm/15295809966

In this article we are going to see how to get the most out of containers on AWS ECS by gaining the ability to roll back a whole cluster within seconds to an arbitrary state (usually the previous one).

Ancient deployment tools (please forgive my poetic license), such as Fabric scripts for Django or capifony for Symfony, have some way to roll back a deploy. This usually relies on filesystem links: you issue a command that connects to one or more hosts and updates a symbolic link so that it points to a folder containing a new revision of your code.
If you are using containers, you probably (and hopefully) don’t have SSH access to the instances they run on, let alone sshd installed inside the containers themselves (I’m going to explain why this is a bad idea, stay with me). This is a good thing for at least two reasons. First, shell access to servers is really dangerous for your infrastructure: it exposes you to potentially fatal errors, because the possible actions are limited only by human imagination, which appears to be quite large. Second, and worse, these actions are not tracked, so when problems arise and an analysis must be performed, it’s not possible to be sure what happened to your infrastructure and when.

Immutable infrastructure

Containers are meant to be used as immutable deployment units: once an image is built, it should not change during its lifetime. A very important best practice is to track the image definition (the Dockerfile) in your source control system, as close as possible to the code that lives inside the image. When the image needs a modification, open a pull request against the code repository and follow all the best practices of software development (e.g. tests must run and a code review should be performed). That way, your git history also records when, for example, a certain system library was upgraded.
Here you will find further readings on this topic.

A tagging strategy for your images

A fundamental step towards our final goal is the ability to uniquely identify different image versions. If you keep image definitions in the same git repository as your code, you get this almost for free: just add the short version of the commit hash to the image tag.
If multiple images live within your code, it could be a good idea to use a single image repository to avoid the burden of managing several. In this case you could use a tag composed of the name of the image and its version, e.g.:

my-repo.tld/<my-service>:<image-name>-<image-version>
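
As a minimal sketch of this convention, the shell function below composes such a reference from a repository, an image name and a version, falling back to the short hash of the current commit when no version is given. The function name `image_ref` and the sample values are illustrative assumptions, not part of any existing tool.

```shell
# Compose "<repository>:<image-name>-<image-version>".
# When no version is passed, use the short hash of the current git commit.
image_ref() {
  local repository="$1"
  local image_name="$2"
  local version="${3:-$(git rev-parse --short HEAD)}"
  printf '%s:%s-%s\n' "$repository" "$image_name" "$version"
}

# Example with an explicit version (all values are placeholders):
image_ref my-repo.tld/my-service web abc1234
```

Called without the third argument inside a git checkout, the same function tags the image with whatever commit is currently checked out.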

It’s also important not to delete from the repository any image that you might want to roll back to.

Deploying versioned images on ECS

Ideally, updates to task definitions should be performed automatically, triggered by new commits on the master branch. An example would be a deploy pipeline with a build stage and a deploy stage: the build stage builds the images and pushes them to the image repository, and the deploy stage updates the ECS task definitions with the new image versions.
For example, this could be the script to execute in the build stage:

Build stage example script.
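
A build stage along these lines could look like the following sketch. It assumes the CI runner is already authenticated against the image registry and that the Dockerfile sits in the current directory; `IMAGE_REPOSITORY`, `IMAGE_NAME` and the function name `build_and_push` are placeholders chosen for illustration.

```shell
# Placeholders: override them from the CI environment.
IMAGE_REPOSITORY="${IMAGE_REPOSITORY:-my-repo.tld/my-service}"
IMAGE_NAME="${IMAGE_NAME:-web}"

# Build the image from the current directory and push it,
# tagged as "<image-name>-<short commit hash>".
build_and_push() {
  local version="${1:-$(git rev-parse --short HEAD)}"
  local image="${IMAGE_REPOSITORY}:${IMAGE_NAME}-${version}"
  docker build --tag "$image" . &&
    docker push "$image"
}
```

In the pipeline the stage would simply call `build_and_push` with no arguments, so that the tag follows the commit being built.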

In the deploy stage you could use the AWS CLI to update a previously created CloudFormation stack that holds the ECS resources:

Deploy stage example script.
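
One possible shape for this stage, as a sketch only: it assumes a stack that exposes two parameters, `ClusterName` and `ImageTag`, and reuses the template already stored in the stack. The stack name, the parameter names and the function name `deploy_image_tag` are assumptions made for this example.

```shell
# Placeholder stack name: override it from the CI environment.
STACK_NAME="${STACK_NAME:-my-service-ecs}"

# Update the existing stack so that it runs the given image tag.
# ClusterName keeps its current value; only ImageTag changes.
deploy_image_tag() {
  local image_tag="$1"
  aws cloudformation update-stack \
    --stack-name "$STACK_NAME" \
    --use-previous-template \
    --parameters \
      "ParameterKey=ClusterName,UsePreviousValue=true" \
      "ParameterKey=ImageTag,ParameterValue=${image_tag}" &&
    aws cloudformation wait stack-update-complete --stack-name "$STACK_NAME"
}
```

The `wait` call blocks until CloudFormation reports the update as complete, which lets the pipeline fail visibly when the update does not go through.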

CloudFormation template to continuously deploy your images.
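
A template of this kind could be as small as the sketch below, written here as a shell snippet that saves it to a file. It declares one task definition and one service, with the image tag exposed as a stack parameter. The file name, the `ClusterName` and `ImageTag` parameters, the `my-service` family and the container settings are all assumptions made for illustration.

```shell
# Placeholder file name for the template.
TEMPLATE_FILE="${TEMPLATE_FILE:-ecs-stack.yml}"

# The quoted 'EOF' keeps ${ImageTag} literal, so CloudFormation resolves it.
cat > "$TEMPLATE_FILE" <<'EOF'
AWSTemplateFormatVersion: '2010-09-09'
Description: One ECS service whose image tag is a stack parameter
Parameters:
  ClusterName:
    Type: String
    Description: Name of an ECS cluster created elsewhere
  ImageTag:
    Type: String
    Description: Tag of the image to run, e.g. web-abc1234
Resources:
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: my-service
      ContainerDefinitions:
        - Name: web
          Image: !Sub 'my-repo.tld/my-service:${ImageTag}'
          Memory: 256
          Essential: true
  Service:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref ClusterName
      DesiredCount: 1
      TaskDefinition: !Ref TaskDefinition
EOF
```

Because the image is part of the task definition, every change of `ImageTag` makes CloudFormation register a new task definition revision and point the service at it.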

The ECS stack must be created before it can be updated in the deploy stage; this could be done with the CLI:

Stack initialization.
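
A one-off initialization could be sketched as follows, assuming the template has been saved locally and takes `ClusterName` and `ImageTag` as parameters. The stack name, the file name and the function name `create_ecs_stack` are placeholders for this example.

```shell
# Placeholders: stack name and local path of the template.
STACK_NAME="${STACK_NAME:-my-service-ecs}"
TEMPLATE_FILE="${TEMPLATE_FILE:-ecs-stack.yml}"

# Create the stack once, with the cluster to deploy to and the first image tag.
create_ecs_stack() {
  local cluster_name="$1"
  local image_tag="$2"
  aws cloudformation create-stack \
    --stack-name "$STACK_NAME" \
    --template-body "file://${TEMPLATE_FILE}" \
    --parameters \
      "ParameterKey=ClusterName,ParameterValue=${cluster_name}" \
      "ParameterKey=ImageTag,ParameterValue=${image_tag}" &&
    aws cloudformation wait stack-create-complete --stack-name "$STACK_NAME"
}
```

For instance, `create_ecs_stack my-cluster web-abc1234` would create the stack on a cluster named `my-cluster` with that first image version.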

Notes

for the sake of simplicity, the CloudFormation template that creates the other resources your application needs is not shown (e.g. the ECS cluster that the template refers to should already exist);
the example uses a single image, but you can extend it to multiple ones by adding as many ECS services as CloudFormation’s limits permit.

Perform a rollback

Sooner or later, the latest changes that land in production will present some problems not spotted by tests or QA. In this case the most obvious way to go is to perform a git revert and push to master. The problem is that you have to wait for all the stages of the CI/CD pipeline to complete: this could take minutes, and the process has a high chance of failing because tests may have been bypassed to save time.

A different approach

A quicker and safer way to roll back is to update the ECS services to use previous task definitions that are known to work. This can be done using the web UI or with aws-cli.
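
With aws-cli this could be sketched as below. The function names and the sample values are illustrative; `aws ecs list-task-definitions` and `aws ecs update-service` are the underlying commands.

```shell
# List the revisions of a task definition family, newest first.
list_revisions() {
  local family="$1"
  aws ecs list-task-definitions --family-prefix "$family" --sort DESC
}

# Point a service at a specific task definition revision, e.g. "my-service:41".
rollback_service() {
  local cluster="$1"
  local service="$2"
  local task_definition="$3"
  aws ecs update-service \
    --cluster "$cluster" \
    --service "$service" \
    --task-definition "$task_definition"
}
```

A call such as `rollback_service my-cluster my-service my-service:41` asks ECS to replace the running tasks with tasks from that revision, which works as long as the revision is still active.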

The problem

The downside of using stack updates is that, at the time of writing, CloudFormation marks the previous task definitions as inactive when it updates them with new image versions, which makes it really difficult to roll back your services. The only way to go is to create a new task definition whose only change is the image tag, set to the version that we want to roll back to, and then update the ECS services to use it. This is a very tedious process that is anything but quick to do safely when an emergency demands a rollback.
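
To give an idea of the manual steps involved, here is a rough sketch with aws-cli and `sed`. It is a simplification under stated assumptions: only the family and the container definitions are copied into the new revision (roles, volumes and other settings would have to be carried over too), and the image is matched by its repository prefix. Function names and arguments are hypothetical.

```shell
# Print a registration payload for the latest active revision of <family>,
# with the tag of every image from <image_repository> replaced by <new_tag>.
retagged_task_definition() {
  local family="$1"
  local image_repository="$2"
  local new_tag="$3"
  aws ecs describe-task-definition \
    --task-definition "$family" \
    --query 'taskDefinition.{family: family, containerDefinitions: containerDefinitions}' \
    --output json |
    sed "s|\"${image_repository}:[^\"]*\"|\"${image_repository}:${new_tag}\"|"
}

# Register the payload as a new active revision and move the service to it.
rollback_to_tag() {
  local cluster="$1"
  local service="$2"
  local family="$3"
  local image_repository="$4"
  local new_tag="$5"
  local payload
  payload="$(retagged_task_definition "$family" "$image_repository" "$new_tag")" || return 1
  aws ecs register-task-definition --cli-input-json "$payload" &&
    # Without an explicit revision, ECS uses the latest active one in the family.
    aws ecs update-service --cluster "$cluster" --service "$service" \
      --task-definition "$family"
}
```

Even in this reduced form the procedure has to be repeated for every service in the cluster, which is what makes it impractical during an emergency.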
Luckily enough, there is a possible solution.

Meet ecsundo

If you have followed the advice of tagging your images and of deploying by updating ECS services with task definitions that carry the new versions, ecsundo can help you. It can act on a whole cluster or on a specific service.

It can automatically handle rollback to inactive tasks.

Let’s see some usage examples:

Roll back all services in a cluster to the previous version:

$ ecsundo cluster <cluster-name>

Roll back a service to the previous version:

$ ecsundo service -c <cluster-name> <service-name>

ecsundo can also save all task versions in a cluster and restore them at a later time. To take a snapshot of all service versions in a cluster:

$ ecsundo cluster snapshot <cluster-name>

Restore a snapshot of the versions of all services in a cluster:

$ ecsundo cluster restore <cluster-name>

The online help (and the README) will give further details:

$ ecsundo help

Summary

We have shown the importance of having an immutable infrastructure and of using a tagging strategy for your container images. We have introduced ecsundo, which can not only roll back all services in an ECS cluster to their previous version, but also take cluster “snapshots” to bring the cluster back to a previously recorded point in time.