One to three deployments per day: push where it hurts
Doctolib is well known for its monolithic Ruby on Rails web application. We are happy to argue with you for hours about why we chose that, because we have been serving north of fifty million users per month with no load scaling problems. So far…
This application was deployed once a day, before 5 PM.
And, slowly, an issue arose: More developers (seventy plus today) means more commits sent to production by the end of the day.
No surprise, this translates to more opportunities for nasty bugs to slip through the cracks of our test suite.
As a rule of thumb, the more changes you push to your system at once, the more risk of failure you introduce.
Six months ago, we decided to investigate our options for reducing those risks, but this initiative got put on hold. It stayed at the good intention stage because we had what felt like more important tasks to do at the time.
Trying times and exceptional measures
Following the recent viral outbreak, lots of companies had to take exceptional measures. At Doctolib, we switched to an unprecedented crisis response mode (which we’ll cover in another blog post).
One consequence of this crisis response was a huge and sudden increase in the number of changes we deployed to production. We were building more and seeing more features pushed to production.
This highlighted a couple of things.
First, the risk introduced by a huge amount of product updates became greater than what we were ready to accept.
Next, the risk was not only theoretical. On a couple of occasions this situation led us to a rollback, where one bad commit held many good features hostage.
Three times a day and beyond!
In response to those failed deployments, we started adding a second rollout after the first one (or the morning after) that pushed all changes minus the faulty one.
This did not turn the way we used to deploy upside down. But it showed us that pushing a couple of deployments per day was possible.
So we tried adding a third one, around noon. And it worked!
This took a lot of time from our Engineering Efficiency team, which is responsible for all deployment-related tasks so that product teams can focus on building new functionality.
When you are doing something once a day and the task takes 10 minutes, it’s still acceptable to run it manually (depending on the ratio between the time needed to automate it and the time needed to do it manually).
But when you do it three times a day, that ratio suddenly changes, and you have to reconsider your position on automation.
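To make that ratio concrete, here is a quick back-of-the-envelope estimate (the numbers below are illustrative, not our actual figures):

```ruby
# Illustrative back-of-the-envelope estimate, not Doctolib's real figures.
minutes_per_run = 10
workdays_per_year = 230

[1, 3].each do |deploys_per_day|
  hours_per_year = deploys_per_day * minutes_per_run * workdays_per_year / 60.0
  puts "#{deploys_per_day} deploy(s)/day -> ~#{hours_per_year.round} hours of manual work per year"
end
```

Going from roughly 38 to roughly 115 hours a year of manual work makes the automation investment much easier to justify.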
While running the deploy mill, the Engineering Efficiency team went through the manual grunt tasks needed before and after pushing our web app to production, one after the other, and automated everything that could be automated.
Previously, the person in charge of the deployment had to:
- Check build status of every commit going to production
- Get in touch with all product managers and check that nobody had untested or problematic code on the production branch
- Check on the production branch that the translations are up to date
- Check on our internal crisis management tool that there is no production incident at the moment
- Check with the DevOps team that there is no infrastructure issue that could affect the rollout
All those operations were done manually by the person on duty for the production push, and they were time consuming.
So the Engineering Efficiency team chose a product-oriented approach, learned more about the problems of this process, and built this:
Now, instead of having to get in touch with a dozen people and check a massive number of commits, the person in charge of the delivery just has to check this page.
All green means that this person can deploy (or roll back) our production with a single button click and get production updated in 10 minutes tops.
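To give an idea of the kind of automation involved, here is a simplified sketch of an aggregated pre-deploy checklist in Ruby. The check names and the stubbed results are hypothetical, not our actual code:

```ruby
# Simplified, hypothetical sketch of an aggregated pre-deploy checklist.
# Each lambda would query a real source (CI API, incident tool, etc.);
# they are stubbed to `true` here so the example runs on its own.
CHECKS = {
  "CI green on every commit going to production"      => -> { true },
  "No untested or problematic code flagged by PMs"    => -> { true },
  "Translations up to date on the production branch"  => -> { true },
  "No open production incident"                       => -> { true },
  "No ongoing infrastructure issue"                   => -> { true },
}

def deployable?
  CHECKS.map { |name, check|
    ok = check.call
    puts "#{ok ? 'OK  ' : 'FAIL'} #{name}"
    ok
  }.all?
end

puts(deployable? ? "All green: safe to deploy" : "Deployment blocked")
```

The point is not the code itself but the shift: instead of a person chasing five teams, a single page aggregates the same signals and answers one question, deploy or not.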
Vision for our deployment process
Building a deployment process that can be run every 10 minutes is a great thing, but it’s only a positive side effect of migrating to a deployment with zero manual operations. Our target was always the smoothest deployment possible.
We’re still not there yet, but we are in a good position to achieve that goal.
What did we gain from this move?
For a start, we no longer fear multiple deployments a day.
Did we lose something? Not really, but having three different rollouts of our master branch changed the game for our Product Managers. Having to do a manual QA check of what is going to production three times a day can be time-consuming, and it is not the best way to ensure product quality.
This raised a question: should we still do manual testing on top of our automated test suite?
And it raised another idea for the future of this process: we need to invest massively in the smoothest possible feature flipping/switching system.
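To illustrate what we mean by feature flipping, here is a minimal sketch of a flag-guarded code path. The flag store and the flag name are illustrative, not our actual system:

```ruby
# Minimal, hypothetical feature-flag sketch (not Doctolib's actual system).
# Flags let code ship to production while staying dark until explicitly enabled.
class FeatureFlags
  def initialize(enabled = {})
    @enabled = enabled
  end

  def enabled?(name)
    @enabled.fetch(name, false)
  end
end

FLAGS = FeatureFlags.new("new_booking_flow" => false)

def booking_flow
  if FLAGS.enabled?("new_booking_flow")
    "render the new flow"      # deployed but only visible once the flag is on
  else
    "render the current flow"
  end
end

puts booking_flow
```

With this kind of switch in place, deploying three times a day no longer means exposing three batches of half-validated features a day.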
If you were to keep only a couple of ideas from this article:
- Push where it hurts. It’s a great way to kill a pain point by learning more about it.
- The more you deploy, the smaller the changes, the lower the risk. Increase your number of deployments as your number of contributors grows.
Big shoutout to Thomas Bentkowski and Bertrand Paquet, who helped me write this article.
If you want to learn more about our tech team, we write a weekly newsletter you can sign up for here, and if you want to join us, we are hiring!