Back from disaster in (under) 15 minutes — Part III/III
Part 3: Feedback from the field — how did our first exercise go?
This three-part article presents a pan-European e-commerce platform consisting of hundreds of distinct products developed by independent teams, built on a microservice architecture and sharing a common technical platform. We want to show how to recover from a resource-failure disaster in (under) 15 minutes.
In Part 1, we discussed our design choices, our motivation, and the tools needed to make disaster recovery a reality. Part 2 detailed the switchover automation. This third part relates what happened when theory met practice.
Any disaster recovery plan looks good on paper. Testing it in a dedicated “proof of concept” environment shows it could work. But in the middle of a real-life crisis, when the time comes to hit the “emergency” button and fire the failover automation, plenty of questions will make you hesitate: will it work at scale? How long will it take to complete? What will users see? Will we face side effects? Can we really trust the plan?
Scope
The only way to remove the uncertainty is to test the DRP regularly. But how do you test disaster recovery with 200+ products running daily?
The easy first step was to test in our pre-production environment. Next, we decided to cover the products progressively. We worked closely with the feature teams in charge of the products most critical to our business — in store or on the website — such as product catalog, product search, mobile device capabilities, order management, payments, delivery, etc.
A panel of 15 products was selected for the first iteration, covering a large set of failure possibilities across all major functional contexts. This number is small enough to allow individual attention should things go wrong during the exercise.
Timeline
Finally, the moment had come to trigger the switchover in real life for the first time! We decided to run the automation on day one and roll back to our nominal region two days later. This way, we could also ensure that products run without any impact on the recovery infrastructure for more than a complete 24-hour cycle, and observe how our platform behaves as a whole while changing regions. During the test, all stakeholders were involved as observers and ready to intervene.
Performance data was captured throughout the exercise, under standard production load.
Forwards…
Technically, the failover automation ran in 15 minutes, but it did not run to completion. As expected, we had some unexpected events! They were mainly minor automation bugs, misunderstood error handling, and timeout management issues. Well, that is what exercises are for: identifying issues that a dry run cannot detect.
In the end, some teams had to run parts of the workflow manually to bring it to completion, and to work around some design issues. It took 2 hours to bring the platform back to a fully functional state.
All in all, user impact was limited since only a few products were affected. In production, the impact would have been errors on a few pages at specific points of the user journey, until the products were fixed.
…and back
The same workflow is used to revert the infrastructure back to the original region. Thanks to the lessons learned two days earlier, the identified bugs had been corrected. This time the workflow ran to completion without any human intervention, in less than 13 minutes.
For simulated website end users, the operation was much smoother, with only a few errors on some functions during the 2 minutes it took to actually switch over databases and traffic.
Usefulness of observability
In a disaster scenario, we need more than launching an automated workflow and then waiting to finally learn whether it went well or wrong. Our companion monitoring solution tells us exactly where we are in the workflow: what is done, what is ongoing, what has failed, and all of that in real time.
We already knew that time is critical when an issue occurs during a failover; we are now also convinced that monitoring drastically decreases stress. Beyond reducing recovery time, well-designed monitoring gives teams a sense of control and lets them think calmly!
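The idea of publishing every workflow status change as it happens can be sketched in a few lines of Python. This is a minimal illustration, not our actual tooling: the step names and the `publish` callback are hypothetical, and in practice the events would feed a live dashboard rather than a list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

class StepStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"

@dataclass
class Step:
    name: str
    action: Callable[[], None]
    status: StepStatus = StepStatus.PENDING

def run_workflow(steps: List[Step], publish: Callable[[str, StepStatus], None]) -> None:
    """Run steps in order, publishing every status change as it happens."""
    for step in steps:
        step.status = StepStatus.RUNNING
        publish(step.name, step.status)          # observers see progress live
        try:
            step.action()
            step.status = StepStatus.DONE
        except Exception:
            step.status = StepStatus.FAILED      # failures are visible immediately
        publish(step.name, step.status)

def fail() -> None:
    raise RuntimeError("simulated timeout")

# publish() would feed the monitoring dashboard; here we just collect events.
events = []
run_workflow(
    [Step("promote-recovery-db", lambda: None), Step("switch-traffic", fail)],
    lambda name, status: events.append((name, status.value)),
)
```

Because each transition is published, operators never have to guess whether a silent workflow is still progressing or stuck.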
What went well
Good news: the first takeaway from this exercise is that it works! Our disaster recovery plan did what it was designed to do, behaving at scale just as it did with a single app.
Our questions about the time needed to orchestrate the move of multiple apps were answered, and the figures are pretty good: a 13-minute failover, with errors and latency lasting only 2 minutes.
We also confirmed how helpful the monitoring tool is. As argued previously, we ensured good observability at every step of the automation, with the objective of always seeing what is happening in real time. Goal reached: teams were in full control all along the way.
But the best benefit of this exercise is the overall confidence gained in the solution! Trust me, the atmosphere was totally different between the forward and backward operations. While we were stressed during the first run, we were fully confident during the rollback. Knowing exactly how the solution behaves changes everything. The rollback was actually close to a non-event.
What went wrong: lessons learned
The benefits of this exercise also come from what did not go as planned!
For example, we found that some product configurations were incorrect: some DB services were not configured to accept connections from the recovery clusters. We also identified some minor workflow bugs that were not reproducible in dry runs.
Once identified, all of this was corrected quickly and did not reappear during the rollback.
We also faced some long-running operations, but the workflow was designed not to wait for these tasks beyond a defined timeout, provided they are not on the critical path. We just need to make it more obvious when such deferred tasks eventually fail.
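That "don't block the failover on non-critical tasks" pattern can be sketched with Python's standard `concurrent.futures`. This is an assumption-laden illustration (the helper name `run_or_defer`, the timeouts, and the dummy tasks are all invented for the example), not our production workflow engine:

```python
import concurrent.futures
import time

def run_or_defer(pool, fn, timeout, deferred):
    """Wait up to `timeout` seconds for fn; if it is still running and is
    off the critical path, defer it instead of blocking the failover."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        deferred.append(future)   # keep a handle so we can report on it later
        return None

pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
deferred = []

# A task that finishes within its timeout blocks normally...
fast = run_or_defer(pool, lambda: "done", timeout=1.0, deferred=deferred)
# ...while a long-running one is deferred and the workflow moves on.
slow = run_or_defer(pool, lambda: time.sleep(0.2) or "late", timeout=0.01, deferred=deferred)

# Later, asynchronously, surface any deferred tasks that actually failed:
pool.shutdown(wait=True)
failures = [f for f in deferred if f.exception() is not None]
```

The part we still need to improve is exactly that last step: making the asynchronous `failures` check loud and visible instead of something an operator has to remember to run.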
And finally, we found unexpected external dependencies on certain legacy apps. We do not want to deal with those while setting up a DRP; it is simply a factual reminder to decommission legacy infrastructure when moving to the cloud.
Limitations
What we achieved here is great: great for the people involved, great for the company, great for our customers. But it is only the beginning of the road. We did this in pre-production; the next challenge is, of course, the production exercise to come soon.
We did it with only 15 products, albeit critical ones; some 185 other products remain to be handled.
For this reason, we did not stop the nominal resources during the exercise, so as not to impact the products that are not yet onboarded. We know that, with only our 15 DRP-compliant products, business services could not have been delivered anyway.
And finally, we were not in a full crisis: a real event would also include the time to detect and scope the failure, evaluate the resumption delay, and decide to act. That time certainly will not be negligible!
One step at a time!
Next round expectations
The perfect solution cannot be found on the first try! So, following our iterative approach, we have pushed some topics to the next iteration.
Of course we plan to repeat the test in production with the additional products ready by then. But the disaster recovery solution itself can also be improved in several ways.
We need to make it more cost efficient by minimizing the footprint of recovery infrastructures, while guaranteeing their immediate availability when needed.
We will also change the exercise approach: rather than an exceptional exercise where we run on the recovery system for a few hours or days instead of the nominal one, it would be more beneficial to have an ‘A’ site and a ‘B’ site, switch from one to the other regularly, and stay on each for a month at a time.
In the mid term, we will invest in another cloud provider outside GCP to host our ‘B’ site, making our platform even more resilient. This should be achievable with little rework, thanks to our Kubernetes + DBaaS provider approach.
Other topics will then be studied, such as finer failover scopes (should, or could, a single product be switched over on its own?) or making both regions active and serving traffic simultaneously, for an even better availability rate.
In any case, continuously improving our Disaster Recovery Plan is probably the best way to keep it alive and fully ready for the day we will need it!

