A Chaos Test too far…
By Kathryn Downes & Arjun Gadhia
It was a warm summer’s afternoon. We were looking forward to an after-work beverage at The Market Porter in the sunshine. What better time to do some harmless chaos testing in our staging environment, we thought.
Our team has been building Spark — an in-house content management system for creating and publishing digital content. The Financial Times has roughly 500 journalists around the world, across a dozen news desks. We’ve been working closely with a couple of desks since an early prototype — and to date, we have around 80 users (journalists, editors/subeditors, picture/data journalists, etc) who publish between 10 and 20 stories a day using Spark.
Spark is a collaborative, web-based tool written in React/Node.js and deployed to Heroku. The editor is built on the open source library ProseMirror, the content is stored in MongoDB, and collaboration happens over WebSockets with a Redis caching layer for the changes coming through.
As the team has grown and the tool has become more fully fledged, we became aware of knowledge silos starting to form. Inspired by our colleagues in Customer Products, we decided to host our very own Documentation Day. This was a fun (there were drinks and snacks) and productive way to both spread knowledge and decrease our operational risk.
One of the things we learned from this exercise was that nobody in the team knew how to back up and restore the database. So the following day, two of us decided to go through the exercise in our staging environment and document the process.
This mini-chaos test taught us two valuable things that we wouldn’t have known otherwise:
- The database was set to back up every 24 hours. This wasn’t an acceptable timeframe for us or for editorial.
- The process of restoring the back-up caused a few minutes of downtime (understandable), and even after it finished our apps needed restarting (this was less obvious and would have bitten us in an out-of-hours scenario).
So all in all, a raging success. We increased our production back-up frequency to every two hours, and then as the day came to a close we quickly ran through the process again in staging so we could write it up.
It wasn’t staging.
It was production.
Holy moly, Production?!! But restoring to a back-up causes downtime?! And the last back-up was at 5am?!
Ok, this was bad. Spark was down and we had lost all the journalists’ changes from that day. This included important articles that were due to be published in the next couple of hours. They were going to be, understandably, very annoyed.
We also faced losing all of our users’ trust and confidence that we had been working so hard to gain over the last few months. We would be back to square one if we didn’t get this fixed fast.
So, what did we do?
Instinctively we wanted to try and stop the restore. Our staging test had shown us that the process took around 15 minutes, but we couldn’t find anything in the UI. We fired off an email to our MongoDB provider, but it was too late to stop it. Things were looking grim: the sinking feeling sank deeper, and any glimmer of hope of recovering the content was fading fast.
“Can we take a back-up of Redis?”, someone shouted. Despite the high stress and adrenaline levels, the team somehow managed to recall that we keep a temporary cache of our articles in Redis, to speed up our collaborative editing feature. This cache has a short time-to-live, so it was imperative that we pulled this down as soon as possible!
The restore had completed, and as we learnt to do earlier in the day, we restarted the app. Spark was now showing articles from 5am that morning. We had already told our stakeholders what was happening, and to their credit, they were doing a fantastic job of protecting us from the inevitable questions trickling in from users. They managed to compile a list of the missing articles from FT.com, and prioritised them based on urgency.
With a local Redis in place, we confirmed that all 37 missing articles were there! The data is stored in slightly different formats in Redis and Mongo, but we had enough to be able to manually recreate all the content. It was a slow and finicky process, but together we managed to bring everything back exactly the way it was, and most of our users were none the wiser! 😅
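To give a flavour of what that recreation involved, here is a minimal sketch of the kind of mapping step it takes — note that the key names, field names, and payload shape below are purely illustrative assumptions, not Spark’s actual schema:

```javascript
// Hypothetical sketch: reshape an article payload cached in Redis (a JSON
// string) into the document shape a Mongo collection might expect.
// "article:" prefix, "headline", "body", and "ts" are all assumed names.
function redisEntryToMongoDoc(key, json) {
  const cached = JSON.parse(json);
  return {
    _id: key.replace(/^article:/, ''), // strip the assumed cache-key prefix
    headline: cached.headline,
    body: cached.body,                 // e.g. ProseMirror document JSON
    updatedAt: new Date(cached.ts),    // assume the cache stores a unix timestamp
  };
}

// Example: one entry pulled from a local copy of the Redis cache.
const doc = redisEntryToMongoDoc(
  'article:abc123',
  JSON.stringify({
    headline: 'Markets rally',
    body: { type: 'doc', content: [] },
    ts: 1560000000000,
  })
);
console.log(doc._id, doc.headline); // abc123 Markets rally
```

In reality this was done by hand, article by article, rather than scripted — but the shape of the problem was the same: read each cached payload, translate it, and write it back to the database.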
It was late. We were mentally and emotionally drained. It was time for the Market Porter.
So, thankfully, the crisis was averted. Phew! As stressful as it was to deal with, when we reflected on the incident afterwards there were a lot of positives we could take from it. We had learned so much about what we did and didn’t know, and as a result were able to identify areas for improvement. It also demonstrated that we had been able to pull together and work really effectively as one team. We felt proud of how we responded.
These are some of the things we learnt and some of the changes we suggested:
- We now back-up the database much more frequently.
- Confusing Heroku database names had made the mistake too easy to make, so we renamed them to be super clear: STAGING and PRODUCTION.
- We updated our runbook to include instructions on data recovery, and improved our on-boarding wiki after realising that some newer team members didn’t have everything installed (e.g. a Mongo client).
- Having one person take charge of coordinating the response to a significant incident is useful. They assigned tasks and liaised with our stakeholders/users, which let the other engineers focus on the fix.
- We didn’t just need engineers to help us recover from this incident: the input and knowledge of our editorial stakeholders was vital, and good communication with all members of the team, not just the technical ones, was really important.
- Chaos testing is really useful and we plan to do more of it in the future!
Fortunately, we were able to respond quickly and fix this mistake, learning a lot and improving the app in the process. However, we all know that any one of us (even a Principal Engineer 😬) could make some other mistake in the future; we are only human!
At the FT we are lucky to work in a no-blame culture, where we recognise the complexity of our work and accept the inevitability of things going wrong. You are even encouraged to make mistakes, since they provide such rich learning experiences. When something does go wrong, we are supportive rather than rushing to pin blame on one individual. Try to look past the inevitable sinking feeling and look for opportunities for improvement.
Kathryn Downes — Engineer, Internal Products
Arjun Gadhia — Principal Engineer, Internal Products
P.S. We’re hiring!