All Aboard the Deploy Train, Stop 3 — Peas anyone?
We’d now taken some time to research our options and had done the necessary work to prepare Jenkins to meet our needs, so we returned to choosing the technology for our new deployment system.
In the end we went with running processes inside Docker containers on top of AWS’s Elastic Container Service (ECS), because:
- Amazon managed the scheduler for us, removing the need for us to do that work ourselves. We had tried to build our own Kubernetes cluster but it was more work to build and maintain than we had time for.
- We were already using Docker as part of our CI process and had infrastructure and code already set up to support that.
- This combination met our requirements for our new system.
Validating ECS
But we still wanted to do some further validation of our choice to use Docker and ECS before committing to it for our production traffic.
We decided to take advantage of the work we’d done on our Jenkins setup, with its identical pull request (PR) and dev builds, to build a tool that would allow developers to take a pull request and deploy it to an environment customers couldn’t access but that was hooked up to our production infrastructure.
This would allow them to test that the new code works in a realistic environment before our customers see it.
This would allow us to test deploying Docker images to ECS, work out how to get Conversocial running inside of Docker, and give others a chance to play with deploying artefacts and interacting with Docker containers to debug what’s happening.
We decided to call this tool Pealiver after the Poddington Peas. Developers pick a pea to deploy their PR to, we spin up a Docker container running the image from their Jenkins build, and ECS schedules it onto one of our servers in the cluster.
This allowed us to verify that ECS worked as advertised and provided a stable platform for scheduling and running containers.
We also learned a lot about how to configure the host servers and our Docker containers via what AWS calls task definitions: blobs of JSON that map configuration options in AWS onto docker run parameters.
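As a rough illustration of what that looks like in practice, here is a minimal sketch, in Python with boto3, of registering a task definition and asking ECS to schedule a Pea’s container. Every name, image tag and value here is a placeholder rather than our real configuration:

```python
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

# The task definition is the JSON blob that maps AWS configuration
# onto `docker run` parameters: image, memory, env vars, port mappings, etc.
task_def = ecs.register_task_definition(
    family="pealiver-pea",  # hypothetical family name
    containerDefinitions=[
        {
            "name": "web",
            "image": "our-registry/conversocial:pr-1234",  # image Jenkins built for the PR
            "memory": 512,
            "environment": [{"name": "PEA_NAME", "value": "pea-01"}],
            "portMappings": [{"containerPort": 8000, "hostPort": 0}],  # dynamic host port
        }
    ],
)

# Ask ECS to schedule the container onto one of the hosts in the cluster.
ecs.run_task(
    cluster="pealiver",  # hypothetical cluster name
    taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
    count=1,
)
```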
Pealiver has now become a critical tool for ensuring we can thoroughly test code before it gets released to customers, and we’ve had to grow the infrastructure supporting it to keep up with demand for Peas.
What should Fat Controller look like?
Confident that our choice of Docker and ECS to power the next generation of deployments was correct, we drew up plans for what it would look like:
- User fills in a form selecting:
  - What PR they wish to deploy.
  - What processes they wish to update the code running on.
- User joins a queue of people waiting to deploy.
- Once the user reaches the top of the queue, their code is merged into the master branch.
- Jenkins is informed about the merge by a webhook from GitHub and kicks off a build.
- The tool waits on Jenkins to finish the build and report the results.
- If the build passes, canaries are deployed for each of the processes being updated and the user is poked in Slack and asked to check that the canaries are working.
- If the build fails you can either retry the build (if you suspect it was a flaky test) or roll back, and the tool will unmerge your code so you can fix it whilst others continue with their deploys.
- The user can then proceed with the deploy or choose to roll back.
- If they choose to proceed we perform a rolling update, whereby we add some processes running the new version of the code and then scale down some of the old ones in small batches until all of the processes have been replaced with the latest code (a rough sketch of how this step maps onto the ECS API appears below).
- Once the update is complete, the user can confirm their deploy is done, or it will time out and complete automatically if they do nothing.
We also have emergency options that allow the check for the build being green to be skipped, for quickly getting code out in emergencies, and the ability to jump to the front of the queue rather than joining at the back for similar uses.
Each of the steps would come with a rollback option which would undo all of the steps preceding it.
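To make the rolling-update step concrete, here is a rough sketch, in Python with boto3, of how it can be driven through the ECS API: register a new task definition (as in the earlier sketch), point the service at it, and let ECS swap tasks over in batches. The cluster, service and deployment percentages below are placeholders rather than our production values, and the canary, Slack and rollback handling is left out entirely:

```python
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

def rolling_update(cluster, service, new_task_definition_arn):
    """Point the service at the new task definition and let ECS replace
    old tasks with new ones in small batches."""
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=new_task_definition_arn,
        deploymentConfiguration={
            # Allow ECS to start some extra tasks on the new version before
            # stopping old ones, so capacity never drops during the deploy.
            "maximumPercent": 150,
            "minimumHealthyPercent": 100,
        },
    )

def wait_until_stable(cluster, service):
    """Block until ECS reports the rolling update has finished."""
    waiter = ecs.get_waiter("services_stable")
    waiter.wait(cluster=cluster, services=[service])

# Hypothetical usage for one of the processes being updated:
# rolling_update("production", "web-workers", "arn:aws:ecs:...:task-definition/web:42")
# wait_until_stable("production", "web-workers")
```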
We mocked up what this flow would look like in InVision, a tool we used to produce quick, low-fidelity mockups of the user flow, and showed it to a bunch of people to get their feedback.
The feedback was positive. It was essentially an iteration on what we already had, tidying up some of the rough edges, and providing solutions to the problems that had been highlighted.
So we moved on to drawing up a detailed plan of what we were going to build and how.
We wanted to provide a wrapper around the ECS API that would make deploys easy for developers and meet our requirements. This would take the form of a web application that presented the queue of people already waiting to deploy, a form for creating a new deploy, details of in-progress deploys, and details of historical deploys to aid in debugging issues.
None of the interactions with the ECS API would take place directly in the web requests, though, to avoid the timeout issues we’d seen before with Kettle. Instead, the web application queues up work to be carried out by our worker process, which sits in a loop checking whether there is anything to do.
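As a simplified sketch of that split (with an invented queue interface and deploy callable, not our actual code), the worker side looks something like this: the web process only records that a deploy has been requested, and this loop picks jobs up and makes the slow ECS calls outside of any web request.

```python
import time

def worker_loop(queue, perform_deploy, poll_interval=5):
    """Run forever, picking up pending deploy jobs and executing them."""
    while True:
        job = queue.pop_next_pending()  # e.g. the oldest row marked "pending" in the database
        if job is None:
            time.sleep(poll_interval)  # nothing to do, check again shortly
            continue
        try:
            perform_deploy(job)  # register task definitions, update services, etc.
            queue.mark_complete(job)
        except Exception as exc:
            # Record the failure so the UI can surface retry / rollback options.
            queue.mark_failed(job, reason=str(exc))
```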
Handling errors correctly was critical for a tool that changes the code running our production application, so we mapped out the flow as a flow chart.
This gave us a clear view of what we were going to build, as well as a reference to come back to during implementation to clarify what was supposed to happen in each situation.
Follow me on Twitter @AaronKalair
Part 1 — https://medium.com/@AaronKalair/all-aboard-the-deploy-train-stop-1-anyone-for-tea-a5c12b984ed9
Part 4 — https://medium.com/@AaronKalair/all-aboard-the-deploy-train-stop-4-iterating-on-the-ui-b26e1962083f