Microservices Hygiene Day

How Hailo ensures its microservices platform is kept up to date.

Hailo launched its Go microservices platform over 12 months ago, and so far it has delivered on all that it promised in terms of agility (in all senses), availability and scalability. Our attention to tooling for provisioning, debugging and monitoring has, for the most part, made the platform a dream to develop on and support. With over 200 services running, and the number increasing every day, we always knew that at some point we would need to think about service lifecycle and decommissioning microservices. Last week we held our inaugural ‘Hygiene’ day to begin to address this. What follows are some details and observations about what we did.

Someone who knows about these things (aka Martin Fowler) said “You can usually only judge architectural decisions after a system has matured and you’ve learned what it’s like to work with years after development began. We don’t have many anecdotes yet about long-lived microservice architectures”. We have not quite reached that milestone, but we had begun to realise that some microservices on our platform were not being refactored, rebuilt or redeployed as often as we expected. Of course our vision is to build and deploy everything automatically and regularly, but that remains a medium-term goal, so what to do in the meantime? Having these old, stable microservices is good in one way (they just work), but it meant they were not taking advantage of new capabilities or bug fixes in our own or third-party libraries. There is nothing worse than having an incident and finding out that the particular edge case was solved 12 months ago but the service was never redeployed!

While we wait for our CI/CD nirvana we initiated a ‘Platform Hygiene Day’, when all of engineering takes a day out of their sprints and focuses on non-functional tasks. Merging fixes, resetting error levels and minor service refactoring all form a part of this, but for our inaugural session we decided to build using the latest and greatest of everything and redeploy as much of the platform as possible to live. Our CI/CD pipe is very capable of doing hundreds of deployments in a day, but we had not done so in live in quite a while, so there was a degree of apprehension. I certainly expected some low-level impact during the day!

So here is what we did. First, we divided our engineers into 3 teams made up of people who do not normally work closely together, and assigned each group a team leader for the day. Secondly, we divided the entire list of microservices running in live into 3 lists ordered by build age and assigned a list to each team. We removed some of the kernel services from the lists because, for this first pass, we didn’t want to risk platform-wide issues, and kernel services would have that impact. Some teams did some prep; some didn’t. Nothing was mandated.

On the day we met up first thing (10am is first thing, right?) and called out the ground rules. Firstly, pizza was to be provided for lunch. We assigned an engineer to the role of controller for the day, to be the point of contact for conflicts, advice and cross-team coordination on the code side. We assigned two operations-focused engineers as guardians of the live environment to manage the flow of change to live and stop the flow if instabilities were detected; you needed a positive acknowledgement from one of them before deploying. Each of the three teams was given a room for the day to co-locate, and 10 minutes later off they went. During the day the team came up with lightweight solutions to issues as they arose, and QA ran quick regression tests on services of concern.
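On the day the guardian rule was enforced by people talking to each other, not by code, but the same idea can be sketched in Go as a small gate in front of the deploy step. Everything here is hypothetical and assumed for illustration: the `Gate` type, its methods, the error values and the single-use acknowledgement are not part of our platform.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical errors returned when a deployment is not allowed through.
var (
	ErrFlowStopped = errors.New("deployments to live are stopped")
	ErrNoAck       = errors.New("no guardian has acknowledged this deployment")
)

// Gate models the guardians' control over changes going to live.
type Gate struct {
	mu      sync.Mutex
	stopped bool
	acks    map[string]string // service name -> acknowledging guardian
}

// NewGate returns a gate with the flow open and no acknowledgements.
func NewGate() *Gate {
	return &Gate{acks: make(map[string]string)}
}

// Acknowledge records a guardian's go-ahead for one deployment of a service.
func (g *Gate) Acknowledge(guardian, service string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.acks[service] = guardian
}

// Stop halts all deployments, for example when instability is detected.
func (g *Gate) Stop() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.stopped = true
}

// Resume reopens the flow of change to live.
func (g *Gate) Resume() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.stopped = false
}

// Deploy runs the deploy function only if the flow is open and a guardian
// has acknowledged the service. The acknowledgement is consumed on use.
func (g *Gate) Deploy(service string, deploy func() error) error {
	g.mu.Lock()
	if g.stopped {
		g.mu.Unlock()
		return ErrFlowStopped
	}
	if _, ok := g.acks[service]; !ok {
		g.mu.Unlock()
		return ErrNoAck
	}
	delete(g.acks, service)
	g.mu.Unlock()
	return deploy()
}

func main() {
	gate := NewGate()
	ship := func() error { return nil } // stand-in for the real deploy step

	fmt.Println(gate.Deploy("example-pricing", ship)) // refused: no acknowledgement yet
	gate.Acknowledge("guardian-1", "example-pricing")
	fmt.Println(gate.Deploy("example-pricing", ship)) // allowed

	gate.Stop()
	gate.Acknowledge("guardian-2", "example-receipts")
	fmt.Println(gate.Deploy("example-receipts", ship)) // refused: flow stopped
}
```

Making the acknowledgement single-use is a design choice in this sketch: a second deployment of the same service needs a fresh go-ahead, which mirrors asking a guardian each time.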

And the results…

We built and shipped 104 services on the day, with over 5000 bug fixes included. We retired a handful of services on the day, with more identified for retirement. Just as importantly, we had zero incidents, which is as much a testament to our platform as to the quality of our engineers. For the future, besides getting back to the minor refactoring and similar tasks mentioned earlier, we want to improve our tooling around mass deployments in the short term, to make these days less onerous ahead of getting our deployment pipe fully automated. Of course non-functional work needs to be part of teams’ normal backlogs, but having a day a month to focus on these tactical tasks is very beneficial to the team and helps guide those backlog items. Finally, we came out of the day encouraged by how well the platform and team performed, and confident we had avoided the incidents those unfixed bugs were going to cause. Overall, a great day.

Try the Hailo platform for yourself — head over to https://github.com/hailocab/h2