Playing the blame-less game
Here at ASOS, we have over 60 engineering teams developing and supporting many APIs and services that keep the business running 24/7 globally. Alongside these teams we also have the centralised support of our Major Incident Management (MIM) team, who coordinate the people involved in the resolution of major incidents and report progress to the key stakeholders.
In recent months we have been rolling out a Blameless Postmortem process to help align our different teams, drive better outcomes and ultimately improve reliability. This is where we’re at on our journey of continuous improvement.
Over the years, inconsistencies crept into the information gathering associated with incident reviews, and the way reviews were communicated and tracked did not scale as ASOS Tech rapidly expanded. After an incident had been resolved, MIM would run a Post-Incident Review session with those involved, which would provide a Root Cause Analysis and try to ascertain the impact of the incident. The individual teams involved in the incident were left to decide whether there were any follow-up actions and to take responsibility for remediating them if necessary. However, little was done to follow up and ensure that remedies were in fact implemented. The other issue we found was that each team logged and actioned things in different ways. This was inconsistent and therefore nearly impossible to track across teams.
Site Reliability Engineering — the early days
A successful online experience is one that is fast and reliable, so ASOS Tech established an SRE team to provide dedicated support and guidance to all of the platforms with respect to areas such as reliability, availability and performance. The team is still growing (check out the end of this article for links to the opportunities we currently have). At its inception, however, it was mainly built on volunteers from other roles in Tech, and one of its first initiatives was to create and implement a Blameless Postmortem process.
(Blame)less is more
So what is a postmortem, I hear you ask?
“A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s) and the follow-up actions to prevent the incident from recurring”
This definition, along with the processes that we have adopted, has largely come from or been inspired by Google’s SRE Book, which is a fantastic resource for starting out on your own SRE journey.
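To make that definition concrete, here is a minimal sketch in Python of how such a written record could be modelled. It is purely illustrative: the class names, field names and the `missing_sections` helper are assumptions made for this example and do not describe any particular tool or template.

```python
from dataclasses import dataclass, field


@dataclass
class FollowUpAction:
    """One follow-up action intended to stop the incident recurring."""
    description: str
    owner: str  # hypothetical field: the team expected to carry out the action
    done: bool = False


@dataclass
class Postmortem:
    """The parts of a postmortem named in the definition above."""
    incident_ref: str  # hypothetical reference format, e.g. "INC-0001"
    impact: str = ""
    mitigation_steps: list[str] = field(default_factory=list)
    root_causes: list[str] = field(default_factory=list)
    follow_up_actions: list[FollowUpAction] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that have not been filled in yet."""
        sections = {
            "impact": self.impact,
            "mitigation_steps": self.mitigation_steps,
            "root_causes": self.root_causes,
            "follow_up_actions": self.follow_up_actions,
        }
        return [name for name, value in sections.items() if not value]
```

A draft created straight after an incident would report most sections as missing, and the list shrinks as contributors fill in the document.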
Many companies will look at the definition of the postmortem and think ‘we already have this’ and often that is true. The main difference with running a blameless postmortem is how this written record is created. It’s about culture; about not pointing the finger and instead focusing on actions and learning from our failures.
“The cost of failure is education”
Devin Carraway, Google
By making this shift, the output of incidents becomes very much focused on actionable solutions. ‘If we do these things, this incident should not occur again’.
A lot of the information gathered through carrying out a postmortem was previously just fed back to the Incident Management team for them to log in their issue tracking system and feed back to stakeholders. The learnings for the engineers were essentially ‘lost’, as there was nothing in an easy-to-read format that could be reflected on or referred to at a later date. This is one of the key issues we have sought to address… ‘How can we learn from this incident?’
We addressed this mainly by collaboration and knowledge sharing. These are the key principles we follow:
1. Key people involved in an incident from all impacted teams should contribute to the postmortem document
This allows us to build a centralised timeline of events, covering all affected areas. It gives us a better view of the sequence of events, how they were detected and how they impacted each area. We want to avoid each team running its own postmortem, as that creates gaps in information.
As an engineer, it’s good to understand how issues with your service can impact other dependencies, both upstream and downstream.
2. Postmortem meetings should be scheduled within 1–2 days of the incident being resolved
While things are fresh in our minds, we start working on the postmortem document as soon as possible after the incident is resolved and share it with the group so that others can start contributing. We do this for any P1, P2 or P3 incident. However, diving straight into a postmortem before the dust has settled will be inefficient, as all the facts may not yet be to hand (plus, if the incident occurred overnight, give those called out a chance to catch up on some sleep!).
3. Blameless Postmortems aren’t about pointing the finger or passing the responsibility (clue’s in the name!)
Blameless Postmortem meetings should focus on identifying root causes and creating actions to resolve them. Keep the discussion away from ‘you should have done x’ and focus on learning and preventative actions. You’re all working for the same organisation. You’re a team, so support each other and work towards a solution!
4. The process isn’t owned by the Incident Management team
We needed engineering teams to become more invested in improving reliability, and making engineers owners of the process has helped us build some accountability. For major incidents (P1 and P2), the Major Incident team initiates and runs the postmortem meeting, which ensures consistency across the most important issues. For lower severity incidents, the engineers themselves run the sessions and document the output. How and where things are documented remains the same regardless of who runs the session.
5. Share your learnings
During the postmortem meeting, the people involved get the opportunity to learn from others, understand more about the systems they look after and share their own knowledge and experience. Although we don’t record postmortems, we encourage people to come and join them (even if they weren’t on support that day or are in another team). Learnings should be shared back with the teams that were affected by the incident and, if they could benefit others, with the rest of Tech.
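Principles 2 and 4 boil down to two simple rules: who runs the session, and by when. The sketch below shows one way those rules could be expressed in Python. It is an illustration only: the function name, the returned keys and the team labels are assumptions, as is treating the upper bound of ‘1–2 days’ as exactly 48 hours after resolution.

```python
from datetime import datetime, timedelta

# Priorities run by the Major Incident team, per principle 4.
MAJOR_PRIORITIES = {"P1", "P2"}

# Assumption: the end of the "1-2 days" window is taken as 48 hours.
MAX_DELAY = timedelta(days=2)


def postmortem_plan(priority: str, resolved_at: datetime) -> dict:
    """Return who runs the postmortem meeting and the latest time to hold it."""
    if priority in MAJOR_PRIORITIES:
        run_by = "Major Incident Management"
    else:
        run_by = "Engineering team"
    return {
        "run_by": run_by,
        "hold_by": resolved_at + MAX_DELAY,
    }
```

For example, `postmortem_plan("P3", resolved_at)` would hand the session to the engineering team, with the same deadline that applies to a P1.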
Ready for action!
The other issue was around actions. In essence, there was no single trackable process for actions that held teams responsible for resolving them. When teams identified issues, sometimes they would go on a backlog, sometimes they might be in an email to the team, and sometimes nothing happened at all. There was no consistent approach, and just putting something on a backlog doesn’t mean it will get prioritised and actioned. So we needed a better solution.
One thing that is consistent with regards to incidents is that they all get logged in our issue management system. So when an incident occurs, everyone involved is given the reference number to track progress and add information to it. This system also has a feature that allows us to raise ‘Problem’ tickets, which can represent actions off the back of an incident and be assigned to the team or teams that are required to resolve them. It makes sense to use this for storing our postmortem actions.
The next step is to build on the culture of responsibility and make sure that the right people are accountable for closing down the actions. In our world at ASOS, that accountability sits with the Platform Leads, who look after each platform and ensure the teams within it are focused on the right priorities. There is no point making the engineers accountable for closing down actions if they can’t prioritise them themselves. The engineers are responsible for making it happen, but all the Platform Leads have a duty to ensure their platform is keeping on top of incidents and any resulting problems, so it felt right that they should be the people to prioritise taking action. Of course, the engineers themselves have a say in this process too. However, having the actions centralised allows the Incident Management and SRE teams to gain visibility of how each platform is responding to incidents and to put pressure on them from outside the platform itself.
The underlying message here is:
Centralise your actions and build accountability into completing them
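Once every action lives in one place, the visibility described above becomes a simple query. The following Python sketch shows the idea under stated assumptions: the `ProblemTicket` fields and the `open_actions_by_platform` function are hypothetical names invented for this example, and they do not reflect the data model or API of any real issue management system.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProblemTicket:
    """A postmortem action raised as a 'Problem' ticket (hypothetical fields)."""
    ticket_ref: str
    incident_ref: str  # the incident this action came from
    platform: str      # the platform whose lead is accountable for it
    is_open: bool


def open_actions_by_platform(tickets: Iterable[ProblemTicket]) -> dict[str, int]:
    """Count the actions still open for each platform."""
    counts = Counter(ticket.platform for ticket in tickets if ticket.is_open)
    return dict(counts)
```

A summary like this gives Platform Leads, Incident Management and SRE the same view of which platforms have outstanding actions, without chasing individual team backlogs.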
Now that we have our process in place, we are working on making sure that it is well adopted. At first we didn’t have MIM’s involvement, so it was hard to get teams to adopt the process independently. Now that MIM requires blameless postmortems to be carried out, we should see all teams becoming more familiar with the process and adopting it.
Beyond that, we will look at further building on the culture of postmortems and SRE in general, potentially introducing ideas like ‘Postmortem of the month’ to champion best practices and celebrate our learnings (from failures)!
Like what you see?
Blameless Postmortems are just a small part of the exciting and evolving role of Site Reliability Engineering. If this is an area you’d like to get into, or if you’re already doing this and want to help us grow and build upon what we already have, check out the two roles we are currently looking for below:
SRE Platform Lead: Are you a leader with a passion for collaborating and driving changes that fundamentally alter a business’ capability to scale and perform? Check out the Platform Lead role: link to details
Site Reliability Engineer: Are you an engineer looking to work in a varied role with others to deliver improvements in reliability and performance across loads of different areas in Tech? If so check out our opportunities: link to details
Other roles at ASOS can be found here
A little about me
I’m a Principal Software Engineer at ASOS with a passion for delivering quality, reliable and secure software. I work with teams to drive quality and best practices across various services that operate 24/7 at huge scale!
Google’s SRE Book: https://sre.google/sre-book/table-of-contents/
Building a Postmortem Culture: https://sre.google/sre-book/postmortem-culture/