Are we ready to recover from a disaster?
By Nayana Shetty
We spent a week deliberately breaking our services…
Why did we break things?
“On a long enough timeline the survival rate for everyone drops to zero” — Fight Club 1999.
The same is true for the systems a technology team looks after. Eventually everything breaks. So, how do you prepare for disasters? The infrastructure delivery team spent a week finding out.
Last year we built a wide range of monitoring services, adding to the many systems built or managed by other teams that we have inherited and are now responsible for. The more services and tools a team owns, the more it needs to share knowledge and document how they work, so that when something breaks, anyone in the team or in operations can recover it. Losing any of our monitoring services means losing visibility of the health of the FT’s key systems and services, which is a high risk for our business.
Disaster doesn’t always mean an earthquake or a fire; it can take many forms, from natural disasters to human error, as depicted in the picture below.
How did we organise breaking things?
To increase our knowledge of our services, and to find their vulnerable parts, we decided to spend a week breaking as many of them as we could.
It was important that the team understood why we wanted to do this work, so that everyone bought in to working towards a common goal. So we built a clear plan together.
We started off with a high-level plan covering the services where we thought failure would have the highest impact on the business, and came up with scenarios to test them. We used Google Apps to set up internal websites, issue trackers and detailed test plans. To ensure we captured learnings we used simple retrospective questions like: “What went well? What could be improved if we were to do this again? What would we do differently?”
Every morning we had a catch-up to go through the tests for the day and split the team into small, focused groups. We rotated the members of the groups every day, with a view to getting everyone pairing with each other at some point in the week.
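For anyone wanting to plan a similar rotation up front, here is a minimal sketch in Python of one way to do it. It assumes groups of two and uses the standard round-robin "circle method"; it is not the tool we used, and the function name and member names are hypothetical.

```python
def rotation_schedule(members):
    """Return a list of rounds; each round is a list of pairs.

    Uses the round-robin "circle method": one person stays fixed
    while everyone else rotates by one position per round.
    """
    people = list(members)
    if len(people) % 2:
        # Odd team size: whoever is paired with None works solo that round.
        people.append(None)
    n = len(people)
    rounds = []
    for _ in range(n - 1):
        # Pair the first with the last, the second with the second-to-last, etc.
        rounds.append([(people[i], people[n - 1 - i]) for i in range(n // 2)])
        # Keep people[0] fixed and rotate the rest one step.
        people = [people[0], people[-1]] + people[1:-1]
    return rounds


if __name__ == "__main__":
    # Hypothetical team of six: five rounds, one per working day.
    team = ["Ana", "Ben", "Cat", "Dev", "Eli", "Fay"]
    for day, pairs in enumerate(rotation_schedule(team), start=1):
        print(f"Day {day}: {pairs}")
```

With an even number of people this gives one fewer rounds than there are people, and every possible pair meets exactly once, so a team of six fits neatly into a five-day week.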
The week started off with us wiping our laptops to see how slick we were at setting them up again with all the tools required for the job.
We then moved on to actually breaking our systems and seeing how we recovered them. This included:
- Bringing down Graphite (our metrics tool that monitors and graphs numeric time-series data) in a region and running the service from a single AWS region.
- Deleting a couple of databases on AWS and recovering from snapshots.
- Deleting dashboards in Grafana (our data visualisation and monitoring tool) and recovering from JSON backups.
- Deleting users on systems and recovering them using set procedures.
- Running the Splunk edge forwarder (our log collection tool) from a single AWS region geographically farther from the actual traffic to see what the impact of latency is.
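To give a flavour of the dashboard recovery test, here is a minimal sketch in Python of the first step: working out which backed-up dashboards are missing and need restoring. It is an illustration under assumptions, not our actual procedure. It assumes each backup is a single JSON file with a top-level "uid" field, and that you have already obtained the set of uids that still exist; the function name and directory layout are hypothetical.

```python
import json
from pathlib import Path


def dashboards_to_restore(backup_dir, live_uids):
    """Return {uid: backup file path} for dashboards missing from live_uids.

    Assumption: every *.json file in backup_dir is one dashboard export
    with a top-level "uid" field. Files without a uid are skipped.
    """
    missing = {}
    for path in sorted(Path(backup_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            dashboard = json.load(fh)
        uid = dashboard.get("uid")
        if uid is not None and uid not in live_uids:
            missing[uid] = path
    return missing
```

Each file in the result can then be re-imported through whatever restore procedure the team has documented. Keeping the comparison separate from the import makes it easy to check what would be restored before changing anything.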
We uncovered a wide variety of gaps, from gaps in monitoring and documentation to vendor issues with tools we relied on. We also found that the team would benefit from a standard CI/CD procedure across all our services.
What did we learn from breaking things?
We found gaps in our systems. Reassuringly, however, the week also proved to us that our systems are mostly built to be failure tolerant. Our customers did not notice any issues in the services while we tested our resilience.
The week also gave the team a different perspective on building new systems. We can now build better disaster-tolerant systems by design rather than bolting on resilience as an afterthought.
Finally, the Disaster Recovery week proved to be a great opportunity for the team to learn from each other, and it boosted each team member’s confidence in their ability to handle service issues and disasters. We know our services better, we know we can recover them if they break and, more importantly, we know we can trust each other as a team to deal with a disaster.