All Aboard the Deploy Train, Stop 5 — Arriving at the destination
We put together a plan that let us observe Docker containers being used by our customers in a real production environment, so we could validate that everything was working correctly, minimise the risk of a customer-facing incident, and cause as little disruption as possible to people trying to deploy to production.
The plan was:
- Move a single process powering a feature that would cause only limited disruption to users if, in the worst case, it went offline
- Move our workers
- Move our pollers
- Move our static and webs last, as issues here would have the largest customer-facing impact, and by this point we expected to have ironed out any problems during the earlier migrations
During the migrations we always ran the processes in parallel on both the old and new infrastructure, so that if anything went wrong we still had working processes on the old infrastructure and customers shouldn't see any impact.
This came with a downside: deploys now required using both tools, one for the old infrastructure and one for the new, which made life slightly more painful for people. We kept this transition period as short as possible, whilst making sure that by the time we made the hard switch over we were confident everything was working correctly.
The things you only see in production
The migration went pretty smoothly, with no customer-facing incidents, though we did run into some small issues with ECS and Docker.
As we started running a large number of processes on top of ECS we noticed the ecs-agent (the process you run on each server to communicate with AWS and make it part of your ECS cluster) would occasionally become disconnected and not automatically reconnect, which we fixed with an automated Sensu check.
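A Sensu check here is just a small script that exits 0 when everything is fine and 2 when something is critical. For illustration (this isn't our exact check, just a sketch of the idea, assuming boto3 credentials are available on the instance), it can read the cluster name and container instance ARN from the agent's local introspection endpoint and then ask the ECS API whether the agent is still connected:

```python
#!/usr/bin/env python
"""Sketch of a Sensu check that alerts when the local ecs-agent has
disconnected from AWS. Exit codes follow the Sensu convention:
0 = OK, 2 = CRITICAL."""
import sys

import boto3
import requests

OK, CRITICAL = 0, 2


def main():
    try:
        # The ecs-agent exposes the cluster name and container instance ARN
        # through a local introspection endpoint.
        meta = requests.get("http://localhost:51678/v1/metadata", timeout=5).json()
    except requests.RequestException as exc:
        print("CRITICAL: ecs-agent introspection endpoint unreachable: {}".format(exc))
        return CRITICAL

    ecs = boto3.client("ecs")
    resp = ecs.describe_container_instances(
        cluster=meta["Cluster"],
        containerInstances=[meta["ContainerInstanceArn"]],
    )
    instance = resp["containerInstances"][0]

    if instance["agentConnected"]:
        print("OK: ecs-agent is connected to ECS")
        return OK

    print("CRITICAL: ecs-agent is registered but disconnected from ECS")
    return CRITICAL


if __name__ == "__main__":
    sys.exit(main())
```

Wired into Sensu, this alerts (or triggers a handler that restarts the agent) whenever the agent drops its connection.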
We also saw issues where our memory allocations were too loose and a rogue process was able to consume all of the memory on a server, leaving the entire server unresponsive. This was particularly fun to debug, because once AWS killed the box the scheduler would relocate the process onto a new box, which would then eventually be killed for the same reason.
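One common way to guard against this on ECS (a sketch of the general approach rather than exactly what we ended up with; the names and numbers below are made up) is to give each container a hard memory limit alongside its soft reservation, so Docker kills a runaway container before it can take the whole box down:

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical task definition: memoryReservation is the soft limit used for
# scheduling, memory is the hard cap (in MiB) at which Docker kills the
# container instead of letting it exhaust the host.
ecs.register_task_definition(
    family="example-worker",  # made-up family name
    containerDefinitions=[
        {
            "name": "worker",
            "image": "example/worker:latest",  # made-up image
            "memoryReservation": 256,  # soft limit, MiB
            "memory": 512,             # hard limit, MiB
            "essential": True,
        }
    ],
)
```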
The final issue we saw was that the Docker daemon becomes unresponsive when a machine is under memory pressure. We hadn't configured the ECS_SYSTEM_RESERVED_MEMORY value correctly, so our processes were consuming all of the system's resources without leaving enough for the non-containerised processes.
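For reference, that reservation lives in the agent's config file on each instance, documented as ECS_RESERVED_MEMORY (in MiB). A minimal bootstrap-style sketch, with an illustrative value rather than a recommendation:

```python
# Illustrative bootstrap step, not our exact provisioning: reserve memory for
# the host's non-containerised processes (the Docker daemon, the ecs-agent,
# sshd and so on) so that tasks can never starve them.
RESERVED_MEMORY_MIB = 512  # example value, tune for your instance size

with open("/etc/ecs/ecs.config", "a") as ecs_config:
    ecs_config.write("ECS_RESERVED_MEMORY={}\n".format(RESERVED_MEMORY_MIB))

# The agent only reads its config at startup, so restart it afterwards,
# e.g. `sudo stop ecs && sudo start ecs` on the ECS-optimised AMI.
```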
With the majority of our processes migrated to Fat Controller, the project drew to a close: we'd replaced our deployment tool with one that met our goals and gave us a platform we could build upon.
Improving the Fat Controller
We now release small features at more regular intervals to solve the different use cases people have and to tidy up the rougher edges.
For example, builds occasionally flake, causing the deploy to roll back, so we added the option to retry your build directly into Fat Controller. This means people don't have to interact with different tools during a deploy and makes it easier for new people to learn how to retry a build.
People also wanted the ability to merge code into master without deploying it anywhere (for things like test improvements). We made this process completely hands-off, which has already saved hours of developer time that would otherwise have been spent actively waiting for a merge slot and checking on the status of their build.
And this is the end of our train journey: we've built a tool that meets our deployment needs for the foreseeable future, made sure it's usable, fixed people's pain points, and can iterate on it with small improvements quickly and easily!
Follow me on Twitter @AaronKalair
Part 1 — https://medium.com/@AaronKalair/all-aboard-the-deploy-train-stop-1-anyone-for-tea-a5c12b984ed9
Part 3 — https://medium.com/@AaronKalair/all-aboard-the-deploy-train-stop-3-peas-anyone-772d46a8b7ed
Part 4 — https://medium.com/@AaronKalair/all-aboard-the-deploy-train-stop-4-iterating-on-the-ui-b26e1962083f