Supporting our systems through incident workshops
How we are recreating old incidents to ensure our engineers are prepared for the unexpected
Within the Customer Products group at the FT (we work on FT.com and the FT mobile apps) we’ve faced an interesting situation this year, and one for us to be proud of. We have experienced only a small number of incidents that significantly impacted our products, in stark contrast to previous years when out-of-hours calls to engineers were much more frequent.
Alongside this, our group’s tech strategy includes a whole area of focus dedicated to “making our out-of-hours process sustainable”. Fewer incidents also means less practice, so to reduce the risk of being unprepared for our next major incident we set out to plan a training regime on how to manage incidents, with the aim of getting more engineers confident enough to join the out-of-hours support rota.
The workshops are designed to be mostly paper-based, and lean heavily on small teams collaborating and bringing together their knowledge of incident management and our systems. The format was borrowed from a workshop previously run by the Operations team, which prepares people across the business for what to do in the event of a significant company-wide incident.
Putting together the slides has typically taken one of us an entire afternoon (a good tip is to print them off as A3, one deck per team, with a blank page for note taking). We’ve based the first two workshops on actual incidents we’ve experienced, using the screenshots, incident timeline and other details recorded in the incident reports, which saves a lot of time!
Some fun things I’ve tried to include that we often experience:
- Red herrings: a list of irrelevant changes or Slack messages that happen to line up with the incident, or systems that people are biased towards thinking are always breaking
- Timezone offsets in some of the graphs: are we in UTC or BST?
- A final run-through of what actually happened in the incident
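To show why the timezone question matters, here is a minimal Python sketch of how far apart the two readings of the same graph can be. The timestamp is made up for illustration, and BST is modelled as a fixed UTC+1 offset rather than looked up from a timezone database, so treat it as a sketch rather than how our dashboards actually work.

```python
from datetime import datetime, timedelta, timezone

# Assumption: BST (British Summer Time) modelled as a fixed UTC+1 offset.
BST = timezone(timedelta(hours=1), name="BST")


def to_utc(graph_reading: datetime, graph_tz: timezone) -> datetime:
    """Interpret a naive timestamp read off a graph as being in graph_tz, then convert to UTC."""
    return graph_reading.replace(tzinfo=graph_tz).astimezone(timezone.utc)


# Hypothetical reading: a spike that appears at 14:05 on a summer afternoon.
spike = datetime(2023, 7, 12, 14, 5)

if_graph_is_utc = to_utc(spike, timezone.utc)
if_graph_is_bst = to_utc(spike, BST)

print(if_graph_is_utc.isoformat())        # 2023-07-12T14:05:00+00:00
print(if_graph_is_bst.isoformat())        # 2023-07-12T13:05:00+00:00
print(if_graph_is_utc - if_graph_is_bst)  # 1:00:00
```

An hour's difference is easily enough to make an unrelated release or Slack message look like the cause, which is exactly what makes it a good red herring.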
The workshops then work as follows…
1. Small teams are formed, 4–5 people in each
2. The moderator / “incident lead” hands out the first page of information, which has the background to the incident (one per team to encourage collaboration)
3. Teams are given 5–10 minutes to discuss what they are thinking, and are told they will be asked three questions afterwards: “What more do you want to know or find out?”, “Is there any action you would take right now?”, and “What are you doing to keep on top of the incident?”
4. The moderator pauses the discussions and puts the questions to each group, then hands out the next page of information
5. Repeat steps 3 and 4 until you’ve handed out the last page of information, which wraps up the incident
6. Ask the groups what follow-up actions would go into the incident report
People are encouraged to use whatever they think may be helpful. We’ve seen people look up our system runbooks, open up Grafana dashboards, and share their existing knowledge of how our systems work. It’s a brilliant way to spread knowledge, and there is very little pressure on the moderator to know anything at all!
After our first two sessions we asked for feedback and received some wonderful responses:
“I am keen to work towards being on the out-of-hours rota.”
“It was very encouraging to hear more senior staff say that they weren’t sure what was happening, even if they had a hunch.”
“[I’ve learnt] to focus initially on comms and customer experience, and less on finding the technical root cause.”
We took the opportunity to ask “how likely are you to help out during an incident?” both before and after the workshops. It was great to see such an immediate positive shift in attitudes.
During the second workshop two people also expressed interest in joining the out-of-hours rota, which was really rewarding to hear!
We now have workshops planned twice a month for the rest of this year. I’m calling it the ‘Winter of Incident’ Training 🥶.