How We Crafted Our Postmortem Culture

Isis Ramirez
Published in Yellowme
6 min read · Mar 13, 2020

A couple of months ago, one of our teams experienced its first production incident, one with the potential to affect hundreds of users and their overall experience with the product. Luckily, the engineering team quickly deployed a hotfix before any major user reports were made.

After the storm had passed, the team found it appropriate to meet and discuss what had happened and how to prevent it from happening again. We realized the root cause had less to do with an introduced bug and more with the way communication flowed between the development team and the quality assurance department, and with how the deployment process was organized as a whole.

During the meeting, we intentionally tried to create a blameless environment where we looked for solutions rather than for someone to blame. Without being completely aware of it, we were adopting an incident management technique better known as a postmortem meeting.

Our team works on a product with a two-nines availability target that ideally needs to operate 24/7. With changes and new features being deployed constantly, it would be naive to believe that no incident like the one we experienced will ever happen again.

We came to the realization that we cannot expect to make zero mistakes. Instead, if mistakes are going to be made, the best thing we can do is try to make only new ones and learn as much as we can so we can keep them from happening again.

Learning means sharing

We strive to create a culture where developers do not feel like they need to hide their mistakes from their own team. The impact of an incident is not measured by how many people at the company heard about it. It makes no sense to hide it. No developer should try to solve the issue in silence.

Making sure we learn as much as possible from each incident, even when we are not directly involved in it, became our top priority. If an issue occurs, after it is solved we want developers to feel encouraged to announce to their team: “Hey! We made a mistake and we want to share it because it looks like the root cause concerns us all.”

Writing a postmortem should never be seen as a punishment. As cliche as it sounds, mistakes have the power to turn you into something better than you were before.

Own your postmortems

As we got into investigating the topic of incident management, we found a couple of useful resources defining what a postmortem should look like. But just as we do every time we introduce new actions into our methodologies, we wanted our brand new postmortem process to fit the way we work.

Even though we are aware that modifying the existing development dynamic always demands effort and time that are hard to measure, we are always looking to make new processes as simple and repeatable as possible. We came together as a team to craft our first incident management process, one that best suited us.

One of the best resources we found during our quest was the Atlassian Incident Management Handbook. Our team is by no means as big as the ones described in the handbook, so we needed to adapt the suggested postmortem process to our team’s current needs and capabilities.

Team size, project lifetime, and existing documentation tools were some of the factors we had to take into consideration when structuring our protocols, always trying to make the process as easy to implement as possible.

Determining severity

Before proceeding to analyze any incident, we needed to define a severity classification in order to determine which incidents would require a postmortem and which would not. In our case, we decided to set a three-level severity scale:

  1. A critical incident with very high impact
  2. A major incident with significant impact
  3. A minor incident with low impact

All type 1 and type 2 incidents must be followed by a postmortem after being resolved; for type 3 incidents, the postmortem is optional.
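To make that rule explicit, here is a minimal sketch in Python (purely illustrative; the names are ours and not part of any tooling mentioned above) of how the severity scale maps to the postmortem requirement:

```python
from enum import IntEnum

class Severity(IntEnum):
    CRITICAL = 1  # very high impact
    MAJOR = 2     # significant impact
    MINOR = 3     # low impact

def postmortem_required(severity: Severity) -> bool:
    """Type 1 and 2 incidents always get a postmortem; type 3 is optional."""
    return severity in (Severity.CRITICAL, Severity.MAJOR)
```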

Our first postmortem

Our postmortem document template ended up looking like this:

Naming convention: <Incident issue key> — <Severity> — <Incident summary>

Incident summary: A brief description in a few sentences responding to what happened, why, and for how long. It must include the level of severity detected.

At around <time of the incident> on <date>, customers reported <failure event>. <Number of support tickets> related to this issue were raised. The event was triggered by a deployment by <team>. The deployment was intended to include <changes>. The bug caused <description of failure>. The event was detected by <team or system>. <Actions> were taken in order to mitigate the incident. The severity level was determined as <level>.

Lead-up: What circumstances might have led up to this incident? Was something deployed? Did any change happen?

At <date and time>, a change was introduced to <service or module> with the intention of <purpose of changes>. These changes caused <description of impact>.

Root cause: Identify the root cause that might cause similar incidents to occur in the future.

A change in <service> led to <affected modules>. Because this change was not perceived to have affected <service>, <incident description> happened.

Impact: How many clients were affected by this issue? Were any support tickets created because of it? Was there any internal impact?

On <date>, for <length of time>, <number of users> (<percentage> of total users) experienced <incident symptoms>. <Number of support tickets> related to this issue were raised.

Detection: How and when was the incident detected? Describe the process that made the issue discoverable.

This incident was detected when <system, service, or team> received <type of alert, tickets raised>. The incident was reported to <team or person paged> on <date and time>.

Response: Who was responsible for responding to this issue? How did the response happen? Was the person responsible available when the incident happened?

<Team or person> responded to the incident at <time>. However, the on-call engineer needed access to an unavailable key, so <team or person> was paged.

Recovery: Describe how much time was needed to restore the service and which actions were required.

At <date and time>, a deployment was made with <summary of fix> and the service was fully restored.

Recurrence: Has this incident (with the same root cause) occurred before? If so, why did it happen again?

Incidents <bug identifier> and <bug identifier> appear to be caused by the same root cause.

Lessons learned: Describe what was learned from the incident; what went well and what could have been improved.

1. Quality assurance verification is needed before all feature releases.

2. Feature releases should include a mechanism that allows disabling them.

Corrective actions: Identify what actions need to be taken in order to ensure this incident does not happen again. It might be a good practice to always include specific tasks that can be assigned and tracked.

1. Implement a feature flag strategy (a minimal sketch follows this list).

2. Introduce a QA review stage into the development process before merging.
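The sketch below shows, in Python and with entirely hypothetical names (FLAGS, FF_NEW_CHECKOUT_FLOW, checkout), one minimal way a feature flag could let us disable a freshly released path through configuration instead of an emergency deployment. It is an illustration of the idea, not our actual implementation:

```python
import os

# Hypothetical flag registry: each risky feature ships behind a flag that can
# be flipped through configuration, without deploying a new build.
FLAGS = {
    "new_checkout_flow": os.getenv("FF_NEW_CHECKOUT_FLOW", "off") == "on",
}

def is_enabled(flag_name: str) -> bool:
    """Return True only if the flag exists and is switched on."""
    return FLAGS.get(flag_name, False)

def legacy_checkout(cart: list) -> str:
    return "processed with the existing flow"

def new_checkout(cart: list) -> str:
    return "processed with the new flow"

def checkout(cart: list) -> str:
    # Call sites guard the new behavior; turning the flag off falls back to
    # the previous code path instead of requiring an emergency hotfix.
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)
```

Flipping the environment variable (or whatever configuration source backs the flag) reverts to the previous behavior without shipping a new build, which is exactly the escape hatch this corrective action asks for.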

Follow-up

For every new incident, it is ideal to assign a member of the team responsible for responding to the issue as the postmortem owner. Their role is to ensure the postmortem is written and that any meeting that needs to occur actually happens.

When considered necessary, a meeting following the postmortem is highly recommended to share the overall knowledge of the incident and the corrective actions to be taken, which might involve other teams.

After each postmortem is reviewed and approved, it should be added to the repository of all past incidents.

We want to make sure these resources are available to all team members at any time.

Scaling the process

In our attempt to make the process sustainable and able to scale, we made sure all team members were aware of the initiative and agreed with the way the process was going to be carried out from then on.

We made our template, a how-to guide and our first postmortem available as resources for the whole team to transform the initiative into a stable process.

It was crucial to communicate to everyone involved the importance of carrying out all corrective actions agreed by the team on each postmortem.

There is no use in filling up documentation with suggestions followed by zero action.

Postmortems should detail the immediate actions that will prevent similar incidents from happening again. To achieve this goal, we found it extremely useful to link every corrective action to a specific task on our task board, with an assignee, to ensure trackability and delegation.
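As a rough illustration of the shape such a tracked action can take, here is a small Python sketch; every field name and value is hypothetical and simply stands in for whatever the task board actually records:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CorrectiveAction:
    """One tracked follow-up item linked back to its postmortem."""
    incident_key: str   # the incident issue key used in the postmortem title
    description: str    # what needs to change
    assignee: str       # who owns the task on the board
    due_date: date      # when we expect it to be done
    done: bool = False

# Hypothetical example of a corrective action turned into a trackable task.
action = CorrectiveAction(
    incident_key="APP-231",
    description="Add a QA review stage before merging feature branches",
    assignee="on-call engineer",
    due_date=date(2020, 4, 1),
)
```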

Next actions

Postmortems take us one step closer to having a defined head-to-toe incident management process.

Our future work includes defining incident response protocols, establishing communication channels specialized in incident reports, formalizing the delegation of responsibilities, defining an incident report process, and determining how and when to initiate communication with clients to inform them that an incident is occurring.

We are not only proud of our mistakes, but we are also looking forward to learning from all the new ones to come.
