In this first blog post of the series, you’ll learn how we crafted our MVP fail-over checklist and how it helps us improve both our practices and our infrastructure’s resiliency.
When I started working at Doctolib as a DevOps engineer, the availability of our production service was plagued by stability issues related to our historical hosting provider (provider H16). I’ll refer to the providers by invented names in order to avoid any blame or finger-pointing.
We had decided to leave provider H16 and build a redundant setup based on two separate new providers, in order to radically improve uptime. The plan was to implement a Disaster Recovery Plan allowing us to fail over production between providers 42A and C10 with minimal downtime.
We had already migrated all our production resources to provider 42A and started the decommissioning of provider H16, though we kept a few resources online as a backup.
At the time, I had been working there for a month and my job was to build our redundant infrastructure on provider C10 and to ensure it was ready for the Disaster Recovery Plan (no major cloud provider like AWS or GCP was HDH-certified back then).
When the trouble starts
The sun was shining, the birds were singing, and we were still setting up provider C10 while decommissioning the H16 infrastructure in parallel, when our Customer Support Manager rushed into our office to alert us, before our monitoring tool did (it was a long time ago 😉), that Doctolib was completely down: no patient or practitioner could access their favorite service.
After a quick investigation, we found out that we had lost all services from our hosting provider 42A. We felt the weight of the world crashing down on us, but we decided to stay focused in order to restore all services as quickly as possible. The only input we got from provider 42A was that they had lost the connection to their own Internet Service Provider and had no estimated time for a fix.
Without any ETA for a recovery, we realized we only had two options:
- do nothing and wait for the situation to go back to normal;
- attempt a temporary rollback to provider H16; temporary, because we knew it was not going to withstand a full production load the next morning.
Before taking any decision, we sent an email to provider H16 asking them to stop decommissioning any more of our servers, even though the process was already under way.
We wrote down a quick list of steps needed to roll back production to provider H16:
- inform both providers H16 and 42A that we were performing a fail-over of our production back to H16;
- promote the DB replica that we had kept in sync at provider H16;
- check all infrastructure and application services on provider H16;
- reconfigure the DNS of all production domain names to point back to provider H16 (incompressible delay: the time for the change to apply plus the old record’s TTL).
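Condensed into commands, the steps above might look something like this minimal sketch. Every hostname, path and DNS tool here is invented for illustration (not Doctolib’s actual setup), and the dry-run guard only prints what would be executed:

```shell
#!/usr/bin/env bash
# Minimal sketch of the emergency fail-over to H16; all names are invented.
set -euo pipefail

DRY_RUN=${DRY_RUN:-1}                        # default: only print the commands
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

# 1. Promote the PostgreSQL replica we had kept in sync at H16.
run pg_ctl promote -D /var/lib/postgresql/data

# 2. Check infrastructure and application services on H16.
run curl -fsS https://h16.internal.example/health

# 3. Point production DNS back at H16 (the provider CLI is hypothetical).
run dnsctl update --zone doctolib.example --record www --target lb.h16.internal.example

# Incompressible delay before every client follows: apply time + old record TTL.
apply_s=60 ttl_s=300
echo "worst-case DNS cutover: $((apply_s + ttl_s))s"
```

The last line is why lowering the TTL ahead of time matters: with a 300-second TTL, some resolvers keep serving the old address for up to five minutes after the change applies.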
One hour had passed since the beginning of the incident, Doctolib was still completely down, and we had no clue when the service was going to be restored at 42A. So we decided to set our plan in motion and, a few minutes later, Doctolib’s production was up and running again on provider H16 (our legacy infrastructure). We were very satisfied with our first disaster fail-over, and it is at that moment that the seed of our fail-over checklist was planted.
Of course, we weren’t out of danger yet. The next step was to find a good way to go back to our main provider 42A before the next morning; and the most important question was how to switch back to our main infrastructure (provider 42A) without any data loss.
Rollback all the things!
Provider 42A came back alive, and we knew that we had to stop all applications and services (web apps and async jobs) as quickly as possible. We needed to keep the database’s state as close as possible to its state at the moment the Internet connection was lost, so our main mission was to reduce write operations to the database as much as possible.
In fact, after the promotion of the DB replica at H16, some asynchronous jobs had still been processed at 42A and had performed write operations on a database that was no longer in production. We had therefore created a discrepancy between the database hosted at 42A and the one hosted at H16. So we decided to write a script to catch any discrepancies between those two databases; it could also have been used to catch up with the live master database on provider H16 if synchronization from H16 to 42A turned out to be impossible.
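The real tool was a Ruby gem, but the core idea can be sketched in a few lines of shell around psql. The table, column, host names and timestamp below are purely illustrative assumptions, not the actual schema:

```shell
# Sketch of a discrepancy check between the two primaries; all names invented.
# Rows written at 42A after the replica promotion are the delta to replay at H16.
SPLIT_AT='2018-01-01 18:00:00'   # illustrative: the moment the H16 replica was promoted

delta_count() {  # usage: delta_count <db-host>
  psql -h "$1" -U doctolib -At \
    -c "SELECT count(*) FROM appointments WHERE updated_at > '$SPLIT_AT';"
}

# Comparing the two counts (and then the rows themselves) shows what diverged:
#   delta_count db.42a.internal.example
#   delta_count db.h16.internal.example
```

A count per table is only a first pass; replaying the delta safely also needs the actual diverging rows and a conflict policy, which is exactly why a dedicated script was worth building.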
In order to roll back from H16 to 42A, we decided to split the first responders into two teams and one coordinator:
- team one was in charge of building a Ruby gem with scripts to catch up the delta between the databases;
- team two was in charge of designing the fail-back procedure;
- the coordinator was in charge of ordering pizzas (and keeping track of progress).
As stated before, we were pretty sure that provider H16 wouldn’t be able to withstand the next morning’s load. We absolutely had to roll back to 42A to avoid another incident.
The rollback process was a bit more complicated than the first fail-over, because each minute of downtime mattered.
We wrote the following fail-over checklist:
- Put Doctolib services in maintenance mode on both locations.
- Transfer the database (copying the data directory with rsync).
- Rebuild the database infrastructure (recreate read-only replicas and backups).
- Run a global checkup.
- Put Doctolib services back online (DNS reconfiguration and disabling maintenance mode).
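As with the first fail-over, the checklist boils down to a handful of commands. Here is a hedged sketch of what those steps could look like; every hostname, path and maintenance mechanism is an invented assumption, and the dry-run guard only prints the commands:

```shell
#!/usr/bin/env bash
# Sketch of the fail-back from H16 to 42A; all names are invented.
set -euo pipefail

DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

# 1. Maintenance mode on both locations (the flag-file mechanism is hypothetical).
for lb in lb.h16.internal.example lb.42a.internal.example; do
  run ssh "$lb" 'touch /etc/nginx/maintenance.flag && nginx -s reload'
done

# 2. Transfer the stopped database's data directory with rsync
#    (-a preserves metadata, -z compresses, --partial survives broken links).
run rsync -az --partial /var/lib/postgresql/data/ \
    db.42a.internal.example:/var/lib/postgresql/data/

# 3. Rebuild read-only replicas and backups from the new primary at 42A.
run ssh replica.42a.internal.example \
    'pg_basebackup -h db.42a.internal.example -D /var/lib/postgresql/data -R'

# 4. Global checkup; then 5. DNS back to 42A and maintenance mode off.
run curl -fsS https://42a.internal.example/health
```

The ordering matters: the database must be stopped and in maintenance mode before the rsync starts, otherwise the copy is inconsistent and the whole transfer time is wasted.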
We started the process at around 9:00 p.m. by putting Doctolib in maintenance mode.
We tried to reduce the transfer time between providers by asking H16 to increase our bandwidth; unfortunately, they weren’t able to support us.
We tried different transfer options to find the best strategy to optimize the database transfer speed while avoiding data loss and preserving consistency.
Once the transfer operation started, everybody went back home and set an alarm for around 5 a.m., based on the estimated transfer time, to monitor the transfer and continue the process of going back online. I crashed at a colleague’s place; after we woke up around 5 a.m., we split the remaining work and went to the office one at a time, so that we could work continuously on the subject and get back online as quickly as possible.
After a few hours, we got a database up and running. We unrolled the previous fail-over checklist to put all Doctolib services back online. The applications were back online around 8 a.m., as if nothing had happened for our customers.
This story was perfect material for a new, beautiful and useful post-mortem.
The post-mortem was very dense and long to write, and obviously fascinating to analyse in order to devise all the actions that should be taken. These actions were strongly oriented toward having a quick fail-over to a second data center. For example:
- Reduce DNS reconfiguration duration.
- Keep monitoring available even if a DC has been lost.
- Reduce the number of things to reconfigure in the Doctolib applications.
And obviously, work hard on the fail-over checklist!
We already knew that we needed to write a fail-over checklist, but it was regularly postponed for other priorities.
The positive impact of this story: we saw concretely how and why we should write that checklist.
The time (one hour, under pressure) we spent during that crisis writing that first draft of a process to repair our infrastructure was largely repaid, as we knew exactly how to put it back on its feet.
In fact, we based our work on our post-mortem timeline to begin what we would later call the “fail-over checklist”.
This checklist was, and still is, a Google Spreadsheet which describes step by step what the operators have to do to perform a fail-over of our production. Each task must be a command line which can be copy/pasted directly into the terminal, without any change.
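For illustration, one row of such a spreadsheet could pair a step description and an owner with the exact command to paste. Both the step and the command below are made up, not taken from the real checklist:

```shell
# One illustrative checklist row: step number | owner | exact command to paste.
row='#12 | DB operator | pg_ctl promote -D /var/lib/postgresql/data'
echo "$row"
```

The point of the “no change needed” rule is that, at 3 a.m. under pressure, an operator should never have to edit a placeholder before pasting.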
This fail-over checklist is developed like a product: we began with an MVP and planned periodic fail-overs to test and improve it.
- Practice often to be sure that your checklist still works fine.
- Improve it to reduce the delta between planned and disaster fail-overs, and the downtime of each.
Since we have had it, we have used it quarterly for training and for complicated component maintenance.
We have used our beautiful fail-over checklist once for a real disaster, and we succeeded in migrating our production from the broken infrastructure to the other one with only 10 minutes of downtime (counted from the moment we decided to fail over).
As a funny side note, it was just a tree that took us down. Here is the email from our provider summarizing what went wrong during the horror story above:
“As a reminder, yesterday afternoon, we experienced an exceptional climatic situation that led to a disruption of the power supply by ERDF in the A42 region in France.
In spite of the redundant internet links that we had set up, all our suppliers were affected, causing a break in the link between our Data Center and our IP (internet) forwarding agent.”
Don’t be scared to begin your own checklist with an MVP, YOLO style, and improve it through short iterations, as we love to do at Doctolib.
That was just the beginning. We’ll come back soon with other posts about building and improving this fail-over checklist.
Did you like this article? Go subscribe to our tech newsletter for more!