Continuous delivery — part 2

Emmanuel Ballerini
Engineering @ Wave
Feb 11, 2019 · 8 min read

In the first part of this series, we looked at what continuous delivery is and why it matters. Now let’s look at how we transitioned to a continuous delivery process at Wave. Here are the different aspects we are going to cover:

  • How Wave’s delivery process has changed over time.
  • How the deployment process works at a high level, and the technical details behind it.
  • Database migrations, deployments that span multiple systems, and libraries.
  • Where we are going next.

How we made the transformation

About 6 years ago, Wave did not follow a continuous delivery (CD) process. Deployments to production occurred every 3 weeks and required many people and teams to coordinate the effort. Systems had to be taken down for the duration of the deployment. Sometimes a deployment would be skipped and as a result, features and bug fixes would not ship for another 3 weeks. When the decision to move towards a CD process was made by leaders in the engineering organization, there were many challenges from both a business and technical perspective.

  • Business: the executive team had to be convinced that this was the right thing to do. One concern was the risk of deploying bad releases that could break functionality and hurt the business, based on the assumption that releasing code more often is riskier. In fact, the opposite is true: shipping less often means shipping more changes at once, and the more changes there are, the riskier the deployment. Conversely, if a deploy breaks functionality and we only shipped a one-line change, it is very easy to know what caused the breakage.
  • Technical: there were many manual steps that had to be automated (packaging the code, running the tests, stopping the running app, replacing the current app with the new one, and starting the new app).

Going from a 3-week release cycle to deployment on demand required about a year of effort. This was a significant but worthwhile investment: while engineers were working on changing the delivery pipeline, they weren’t working on other initiatives that could have boosted the company’s bottom line. However, we now deploy 22 times a day on average with a lead time as low as a few minutes.

With the ability to now ship new functionality and bug fixes continuously, we send release emails as part of our deployment process. This allows us to keep stakeholders up to speed on the current state of a given product.

Now, let’s dive into how we deploy a change to production.

How we deploy today

When a Wave developer is working on a change, they test it in a staging environment prior to deploying to production. To do this, they can simply log in to Jenkins (our deployment tool) and deploy the branch containing the change with a few clicks. The associated build must first pass, which includes running unit and integration tests, linting, and type checking. After some manual testing, typically done by the engineer and the product manager, the engineer asks a peer for a code review. The final two steps are to merge the branch into the master branch and deploy master to production.

If something accidentally breaks as a result of changes that were deployed, there is a Jenkins job that allows us to easily roll back to a known stable version of the app.

The latest version of the master branch should always be what is in production so that we know exactly what is deployed. This can be a challenge when rolling back to a previous release: after the rollback, the version in production differs from what is in master, so one must update the master branch to bring them back in sync.

How the deployment works behind the scenes

Our systems run on Amazon Web Services (AWS) and our apps run in Docker. We use Elastic Container Service (ECS) along with Convox for container orchestration. In order for our services to be highly available, they must each run on multiple nodes, which means there are many containers running each service. Container orchestration refers to how these containers get managed. There is typically a desired state (sometimes referred to as “configured”) and a running state. Container orchestration tools ensure that the running state matches the desired one (for example, if the desired state is to have 2 running instances of a given service and there is only 1 running, Convox will spin up a new instance of that service to match the desired state). ECS allows us to do zero-downtime deployments when a new version needs to get deployed.
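To illustrate the desired-vs-running reconciliation idea, here is a minimal sketch (not our actual tooling; the cluster and service names are made up) of how one could inspect a service’s state with boto3 against ECS:

```
# Sketch: compare an ECS service's desired state with its running state.
# "my-cluster" and "my-service" are hypothetical names.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

resp = ecs.describe_services(cluster="my-cluster", services=["my-service"])
service = resp["services"][0]

desired = service["desiredCount"]  # the configured (desired) state
running = service["runningCount"]  # the observed (running) state

if running < desired:
    # In practice the orchestrator (ECS/Convox) converges automatically;
    # this just illustrates the reconciliation idea.
    print(f"{desired - running} task(s) missing; the orchestrator will start replacements")
else:
    print("Running state matches desired state")
```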

For example, say we have 2 instances of our app running in v1, each on a separate EC2 instance, in front of our load balancer. When we are ready to deploy v2 and push the “build” button, the following happens:

  • The repository which contains the change is cloned
  • An archive of the app is created
  • The appropriate base image from our Elastic Container Registry is pulled (ECR is where our Docker images live): it contains the relevant OS, tools and packages needed for the application or service to run (e.g. python or ruby runtime, nginx)
  • A new image is created based on the base image and the code from the repository. Once built, this new image is pushed to our registry (ECR). This is version 2 of our app, tagged with a new release name (a sketch of this build-and-push step follows the list)
[Figure: first part of the deployment process]
  • This triggers Convox to start a new Docker container based on the new image. This happens on a third EC2 instance, already up but not running this specific app. Our v2 app is now started but is not yet serving traffic
  • One of the 2 containers running v1 is taken out of the load balancer (once its in-flight requests have all been processed) while the new container is added and begins serving traffic; the old container is then stopped. For a short period of time, the system serves both v1 and v2
  • On the EC2 instance that used to run v1, a new container is spun up and v2 starts.
  • The container running v2 is added to the load balancer and the one running v1 is removed
[Figure: second part of the deployment process]
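To make the first half of this pipeline more concrete, here is a minimal sketch of the build-and-push step using the Docker SDK for Python. The registry, repository, and tag names are hypothetical, and our real pipeline runs these steps inside Jenkins rather than in a standalone script:

```
# Sketch: build a release image and push it to a registry (hypothetical names).
import docker

REGISTRY = "123456789012.dkr.ecr.us-east-1.amazonaws.com"  # made-up ECR registry
client = docker.from_env()

# Build a new image from the cloned repository. The Dockerfile's FROM line
# points at the base image (OS, runtime, nginx, etc.) hosted in ECR.
image, build_logs = client.images.build(path=".", tag=f"{REGISTRY}/myapp:release-42")

# Push the tagged release to the registry so the orchestrator can deploy it.
client.images.push(f"{REGISTRY}/myapp", tag="release-42")
```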

At this point, the app is at v2. This type of deployment is often referred to as a “rolling update,” as existing containers are progressively replaced by new ones (as opposed to a “blue-green” deployment, where the new version of the app is deployed to a separate cluster and then a switch is flipped, directing all live traffic from the old cluster to the new one).
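For illustration, the load-balancer swap at the heart of a rolling update corresponds roughly to the following boto3 calls against an Application Load Balancer. The target group ARN and instance ids are hypothetical, and in our case Convox and ECS perform this dance automatically:

```
# Sketch: swap a v2 target in and drain a v1 target out of the load balancer.
import boto3

elb = boto3.client("elbv2", region_name="us-east-1")
TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app/abc123"  # made up

# Add the instance running v2 so it starts receiving traffic.
elb.register_targets(TargetGroupArn=TG_ARN, Targets=[{"Id": "i-0new2222"}])
elb.get_waiter("target_in_service").wait(TargetGroupArn=TG_ARN, Targets=[{"Id": "i-0new2222"}])

# Drain one v1 instance: it stops receiving new requests, finishes in-flight
# ones, and is then removed. During this window, both v1 and v2 serve traffic.
elb.deregister_targets(TargetGroupArn=TG_ARN, Targets=[{"Id": "i-0old1111"}])
elb.get_waiter("target_deregistered").wait(TargetGroupArn=TG_ARN, Targets=[{"Id": "i-0old1111"}])
```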

If the changes include database migrations, we do not apply them automatically. They could create long-lasting locks on the tables involved, and if a migration takes too long, transactions can fail, resulting in errors for our users. To do a zero-downtime migration, the general approach is to create the new tables or columns in the database as a first step, then deploy the code that relies on the updated schema. Deleting a column requires executing these steps in the reverse order. Either way, the new version of the code has to be compatible with the old version of the schema and vice versa. More details can be found in this video.
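As a hedged example of the “add first, deploy after” approach, a Django migration for the first step might look like this (the app, model, and field names are made up):

```
# Sketch: step 1 of a zero-downtime migration in Django -- add the new column
# as nullable so the currently deployed (old) code keeps working unchanged.
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [("billing", "0007_previous")]  # hypothetical app/migration

    operations = [
        migrations.AddField(
            model_name="invoice",
            name="due_date",
            field=models.DateField(null=True),  # nullable: old code never writes it
        ),
    ]
```

Only after this migration has been applied do we deploy the code that reads and writes the new column.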

While we can deploy at any time of the day and night, we don’t want to take unnecessary risks and deploy sensitive code right before going home. We typically deploy during normal business hours when most engineers are in the office. Generally speaking, communication is key: if we are about to deploy something potentially risky, we let other team members know so that they don’t get surprised should something go wrong.

How we release features in a controlled manner

One of the advantages of CD is shipping small changes often. If we work on a large feature (or new product), keeping it in a separate branch (i.e. a feature branch) until it’s ready presents integration risks and defeats the purpose of CD. Instead, we ship the changes but hide them behind a feature toggle so that they are inaccessible to most users while we can still test them internally.

We have a couple of ways of doing so. For our Django apps, we can use Waffle. Waffle has a set of database tables where it keeps track of feature statuses.
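For example, gating a Django view on a Waffle flag looks roughly like this (the flag name and view are hypothetical):

```
# Sketch: serve the new behaviour only to users for whom the flag is active.
import waffle
from django.http import HttpResponse

def invoices(request):
    if waffle.flag_is_active(request, "new-invoice-ui"):  # made-up flag name
        return HttpResponse("new invoice UI")  # shipped, but hidden behind the flag
    return HttpResponse("old invoice UI")      # what most users still see
```

The flag itself lives in Waffle’s database tables, so it can be turned on for staff, a percentage of users, or everyone without redeploying.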

We also have an internal system that lets us “segment” users based on their user id. This works very well when we migrate users from one (old) system to its replacement. As a result, some users see and use the old product while others use the new one.
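Our internal system isn’t public, but a minimal sketch of this kind of id-based segmentation could look like the following, where hashing the user id yields a stable bucket (the function and threshold are illustrative):

```
# Sketch: deterministically segment users by id. The same user always lands
# in the same bucket, so they consistently see either the old or new product.
import hashlib

def routed_to_new_system(user_id: int, rollout_percent: int) -> bool:
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

print(routed_to_new_system(42, rollout_percent=25))  # stable across calls
```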

How we orchestrate multiple deployments and internal libraries

Our codebase is organized in different repositories independent of one another. This allows us to deploy one application without having to worry about the other ones. Occasionally, changes span multiple systems, which means we have to run many deploys. We need to sequence those deployments in such a way that the overall system continues to work without interruption. For example, if we add a new endpoint to a system A that a system B needs to call, we need to deploy system A first so that when we deploy system B, the new endpoint is available.
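As a hypothetical sketch of why the ordering matters, consider system B written defensively against system A’s rollout (the URLs and endpoints are made up):

```
# Sketch: system B calling system A's new endpoint, with a fallback in case
# system A has not been deployed yet. Deploying A before B removes the need
# for the fallback entirely.
import requests

resp = requests.get("https://system-a.internal/api/v2/balance")  # hypothetical
if resp.status_code == 404:
    # The new endpoint isn't live yet; fall back to the old one.
    resp = requests.get("https://system-a.internal/api/v1/balance")
resp.raise_for_status()
print(resp.json())
```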

Our internal libraries are versioned, so releasing a new version does not typically affect systems: they keep using the version they depend on and are not aware of the new one. Each system that depends on the library can then bump to the latest version on its own schedule.

Where do we go next

We do CD at Wave but some organizations take it even further by doing continuous deployment. The difference between the two is that while continuous delivery allows us to deploy on demand, it still requires manual intervention to push the code to production. Continuous deployment automates everything and every commit to trunk/master triggers a deployment to production. This has the advantage of not requiring manual intervention, but does require more checks and validation in order to make sure what is deployed doesn’t break anything.

At the time of writing, we don’t do continuous deployment but this could change in the future as we always reevaluate and improve our processes.

Summary

Many companies have made the transformation from traditional, risky, and infrequent delivery to the continuous model, and so have we at Wave. While it was a challenging effort, it has paid off: we benefit from that investment many times a day, reliably delivering high-quality software to our customers faster.

Acknowledgments

Many thanks go to the following people for the many reviews and advice: Joseph Pierri, Matthew Montreuil, Ryan Wilson-Perkin, Nick Presta and Erica Pisani.
