All Aboard the Deploy Train, Stop 4 — Iterating on the UI
Last time we talked about how we validated our choice of ECS and Docker as the technologies for the future of deployments and then started building the new tool. This time we’ll look at how we tied up some loose ends around running containers and got the tool ready for production.
Building the Infrastructure
After a few months of development we were closing in on completion and began to look at what we’d need on the infrastructure side to make this work.
We wanted two ECS clusters: one for web-facing processes that would handle incoming connections from our ELB, and another for backend processes that don’t handle connections directly from the Internet.
These would run the ECS-optimised AMI from Amazon (an operating system image supplied by Amazon and specifically optimised to run Docker containers with their orchestrator). We then use Chef to install a minimal set of supporting services onto them, such as our SSH keys, NTP and system-level monitoring.
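For illustration, here’s roughly what that cluster split looks like in code. This is just a sketch using boto3 with placeholder cluster names and region, not our actual provisioning:

```python
# Minimal sketch of the two-cluster split, using boto3 against ECS.
# Cluster names and region are illustrative placeholders.
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

# One cluster for web-facing processes behind the ELB,
# one for backend processes that take no direct Internet traffic.
for cluster_name in ("web-facing", "backend"):
    ecs.create_cluster(clusterName=cluster_name)
```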
Logging
Next we had to work out what to do with the logs. In the old setup the processes simply logged to a file on disk and if you wanted to read them you’d SSH in to the appropriate server and look at the file. However, in the new Docker world this wasn’t an option as the containers go away along with their logs within 10 minutes of dying.
We decided to use our existing FluentD and Kibana setup to pipe the logs from our containers into Kibana and also off to S3 for long-term storage.
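As a rough sketch of how container logs can be pointed at FluentD, here’s what the relevant part of an ECS task definition might look like via boto3. The family, image and FluentD address are all placeholders rather than our real configuration:

```python
# Sketch: route a container's stdout/stderr to FluentD via Docker's
# fluentd log driver, declared in the ECS task definition.
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

ecs.register_task_definition(
    family="example-web",
    containerDefinitions=[
        {
            "name": "web",
            "image": "example/web:latest",
            "memory": 512,
            "logConfiguration": {
                "logDriver": "fluentd",
                "options": {
                    # A FluentD agent on the host, which forwards
                    # the logs on to Kibana and S3.
                    "fluentd-address": "localhost:24224",
                    "tag": "web",
                },
            },
        }
    ],
)
```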
Testing in Staging
With a working version of the tool and some staging infrastructure to deploy to, we set about putting it through its paces on staging.
We started testing with the simplest case possible: a single process, and used that to iron out the small niggly bugs. Fortunately, there weren’t many and we were able to quickly move on to stressing the tool with more realistic deploys.
This highlighted some issues: we were hitting the ECS APIs too hard and exceeding their rate limits, as well as spamming our own servers running Fat Controller with too many requests for updates on the state of the system.
Fortunately, many of the things we asked the ECS API for were immutable (e.g. task definitions) or did not change frequently, so they were prime candidates for caching. We spun up an ElastiCache Redis cluster and were able to dramatically reduce the number of API requests we made, fixing our rate limit problem.
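The caching itself is straightforward. Here’s a sketch of the idea for task definitions, assuming boto3 and the redis-py client; the key naming, TTL and Redis endpoint are illustrative rather than exactly what Fat Controller does:

```python
# Sketch: task definitions are immutable once registered, so cache
# describe_task_definition responses in Redis instead of asking ECS
# every time.
import json

import boto3
import redis

ecs = boto3.client("ecs", region_name="eu-west-1")
cache = redis.Redis(host="my-elasticache-endpoint", port=6379)


def get_task_definition(task_def_arn):
    key = f"task-definition:{task_def_arn}"
    cached = cache.get(key)
    if cached:
        return json.loads(cached)

    response = ecs.describe_task_definition(taskDefinition=task_def_arn)
    task_def = response["taskDefinition"]
    # The definition never changes, but a TTL stops the cache growing forever.
    cache.setex(key, 24 * 60 * 60, json.dumps(task_def, default=str))
    return task_def
```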
UI Testing
As well as this scaling work, we were doing user testing with developers to flush out any usability issues. We identified three major ones:
1. No one really understood the effect the “pause” button had on the deploy process.
It prevented any new deploys from taking place but had no effect on the currently running deploy.
Its placement in the top right of the app suggested to pretty much everyone we showed it to that it would also pause the currently running deploy, so after some discussion we moved it to above the deploy queue, and the feedback on this was much more positive.
2. The form for submitting a request to deploy new code has a section where you choose which processes you wish to deploy to. The design, which worked fine for a development environment with a small number of processes, was clunky and hard to read with a more realistic dataset.
We iterated on the design a few times and came up with one that was easier to scan through and more consistently laid out the process names.
The New Deploy form itself went through numerous iterations later on. We were able to improve the usability of this part of the app significantly, making it clear what exactly you were going to do to our production environment.
3. The deploy overview component worked nicely on the smaller test deploys but fell apart with a more realistic number of processes. A list of 30 processes doesn’t really work, so we reworked it to use multiple columns instead.
Process Overviews
It was now time to turn our sights to using the tool in production.
We wanted to ensure that everyone had at least a basic level of understanding of Docker and felt like they could interact with it and debug it in production, so we ran sessions explaining Docker, starting at the basics to bring everyone up to the same level and then moving on to more complex topics like schedulers and debugging processes running in containers.
These sessions brought up some missing features. Previously it was easy to find where a process was running in case you needed to debug it, but now ECS could schedule it anywhere on our cluster, and finding it required multiple clicks around the ECS console.
People also needed to remember how to construct the commands for exec’ing into containers and tailing their output.
To solve these problems we built a process overview into Fat Controller. It shows every ECS service we have running, and clicking through to an individual process drills down into more detail about it, such as the image it is running and, for every process running as part of that service, commands you can copy which SSH into the appropriate host and exec into that container or tail its logs.
Over time we added extra functionality to this page such as links through to the appropriate page on the ECS console.
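To give a flavour of how an overview page like this can work out where a process is running and produce a copyable command, here’s a rough sketch with boto3. The cluster and service names are placeholders, and the shell command built at the end is just an example, not the exact command Fat Controller generates:

```python
# Sketch: map each task in a service to the EC2 host it landed on and
# build a copy-pasteable command to exec into the container there.
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")
ec2 = boto3.client("ec2", region_name="eu-west-1")


def exec_commands(cluster, service):
    task_arns = ecs.list_tasks(cluster=cluster, serviceName=service)["taskArns"]
    if not task_arns:
        return []
    tasks = ecs.describe_tasks(cluster=cluster, tasks=task_arns)["tasks"]

    commands = []
    for task in tasks:
        # Resolve the container instance to its underlying EC2 host.
        instance = ecs.describe_container_instances(
            cluster=cluster, containerInstances=[task["containerInstanceArn"]]
        )["containerInstances"][0]
        reservation = ec2.describe_instances(
            InstanceIds=[instance["ec2InstanceId"]]
        )["Reservations"][0]
        ip = reservation["Instances"][0]["PrivateIpAddress"]

        for container in task["containers"]:
            # runtimeId is the Docker container ID where ECS reports it.
            container_id = container.get("runtimeId", container["name"])
            commands.append(f"ssh -t {ip} docker exec -it {container_id} /bin/bash")
    return commands
```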
Adhoc Tasks
Next, there were concerns about how to debug issues which only appeared in Docker containers, so we created Adhoc Tasks, which let people run any image they like with sleep <3 days>
as the command so that they can exec in and run up anything they like to replicate and debug issues.
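Under the hood an Adhoc Task is essentially a one-off ECS task with its command overridden to a long sleep. A rough sketch of that call with boto3, using placeholder cluster, task definition and container names:

```python
# Sketch: launch a one-off task from any image, with the command
# overridden to a long sleep so someone can exec in and poke around.
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

ecs.run_task(
    cluster="backend",
    taskDefinition="adhoc-debug",  # a task definition wrapping the chosen image
    overrides={
        "containerOverrides": [
            {
                "name": "adhoc",
                # Keep the container alive for roughly 3 days.
                "command": ["sleep", str(3 * 24 * 60 * 60)],
            }
        ]
    },
)
```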
With these blockers resolved and Fat Controller at V1, we wanted to make sure that everyone had used it and understood how things like rollbacks worked before they had to use it for real in production.
Fat Controller Parties
So we held “Fat Controller Parties”: two one-hour slots where everyone in the engineering team used our staging version of Fat Controller at the same time, trying deploys to our staging environment, initiating rollbacks and so on, to understand how the flows work and which buttons to press.
These went well, and a nice side effect was that we got to experience for the first time what happened when multiple people tried to use Fat Controller at the same time.
It was mostly fine, but we noticed that for understanding who did what it was handy to have people’s names recorded against actions (such as pressing deploy to production, rollback, etc.). We’d also just been using people’s Gravatars to identify users, and when multiple people have the same default Gravatar this becomes really confusing, so we added their name beneath the picture.
We were now confident that we were ready for production, and in the final part we’ll look at how we migrated over to using it.
Follow me on Twitter @AaronKalair
Part 1 — https://medium.com/@AaronKalair/all-aboard-the-deploy-train-stop-1-anyone-for-tea-a5c12b984ed9
Part 3 — https://medium.com/@AaronKalair/all-aboard-the-deploy-train-stop-3-peas-anyone-772d46a8b7ed