Triggered: Incident #1234 (incident process needs fixing)

Sara Rabiee
The PayPal Technology Blog
6 min read · Jan 18, 2021

This is the story of a small group who voluntarily got together to improve the incident-handling process at our company. It is not a “we did great, you should do the same” story, but rather a “this is what we did, what would you do?” one.

Our product development department consists of cross-functional teams who have high autonomy, with full ownership of what they deliver. For example, we have a team that is responsible for the product catalog feature, from frontend to backend, from design to deployment. This model has many advantages, such as having full control over all aspects of the feature, but a side-effect is that each team tends to have their own engineering culture and practices, and there is no central authority that prescribes what processes each team should follow.

So the question is: if you don’t have a centralised control mechanism, how do you make changes that are meant to propagate across teams? The exact solution will differ depending on your context, but there are some underlying principles that can help foster the change. Our approach was to focus on two areas: alignment and transparency.

The story

We begin in late 2019. I was an Agile Coach, a role that involves helping teams improve their working processes and culture. I personally care about incidents and believe that the learnings that come out of postmortems are very valuable. Because I had expressed that interest, different teams started inviting me to facilitate their incident postmortems.

One afternoon I sat at my desk and thought: “this is too much, everyone should be able to facilitate postmortems”. I wrote a three-page document on the subject and shared it with our CTO.

He said (paraphrasing): “It is great that you started this initiative. There are other people who have already started working on this topic. Maybe it is better if you team up with them.”

This approach turned out to be very revealing. After teaming up with various people from different parts of the company (Security, IT, Legal, Risk and Engineering), I realised that this was not as simple as learning how to facilitate postmortems. There was much more to it:

  • Technical incidents are important, but not only for the engineering department. As soon as an incident affects the end user, it becomes important for the risk and legal departments as well as customer support.
  • What constitutes an incident should be defined and communicated clearly, otherwise people will have a hard time deciding whether an anomaly is an incident or not.
  • Similarly, incident severity levels should be clear and well-defined (a rough sketch follows this list).
  • The value of an incident report and postmortem is limited if you only focus on the immediate actions. There is a lot to be gained from identifying patterns, finding links to other incidents and detecting underlying issues.
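
To make that concrete, severity definitions often end up looking something like the sketch below. The levels and criteria here are assumptions for illustration, not the ones we actually adopted:

```typescript
// Illustrative severity scale. The names and criteria are assumptions for
// the sake of the example, not the definitions we adopted internally.
enum Severity {
  Sev1 = "SEV-1", // Critical: core functionality down or user data at risk
  Sev2 = "SEV-2", // Major: significant degradation, a workaround exists
  Sev3 = "SEV-3", // Minor: limited impact, can be handled during working hours
  Sev4 = "SEV-4", // Low: cosmetic or internal-only issue
}

// Writing down a rule of thumb per level helps responders decide under pressure.
const responseGuideline: Record<Severity, string> = {
  [Severity.Sev1]: "Page on-call immediately, open an incident channel, notify stakeholders.",
  [Severity.Sev2]: "Page on-call, mitigate the same day, file a report in the portal.",
  [Severity.Sev3]: "Create a ticket, fix within the sprint, file a report in the portal.",
  [Severity.Sev4]: "Track in the backlog.",
};
```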

The work started by identifying the different incident categories. We split the team into sub-groups to create documents for each category. Following are the categories we came up with. Depending on the business you’re in, these can differ.

Alignment through documentation

There were two of us in the Operational Incident category. When we started the documentation we soon realised that this was potentially a deep rabbit hole. In the extreme we could end up going all the way to listing out what each team owns. This would be of little value since it would most likely go out of date and, more importantly, it would be encroaching on the teams’ domains of expertise. So we decided to keep it high-level and let the teams keep the more technical detailed documents internal.

This is the overview of what we documented for the Operational Incident category:

  • What is an incident?
  • Being on-call
  • Incidents escalated by users
  • Alerting principles
  • Runbooks
  • During an incident
  • After an incident
    - Postmortem Debriefing
    - Postmortem Reports
    - Tips for Effective Postmortems
  • Additional Resources

Detailed documentation is nice to have as a reference and a source of truth but it is too much to read at the time of an incident. So we created a procedure diagram on the first page of our documentation that is easy to follow under pressure.

Transparency through a shared list of incidents

To collect information about incidents, we needed a shared place to report them. So we created a portal where different categories of incidents can be reported as soon as the immediate incident has been mitigated. On the landing page the reporter can pick one of the four categories of incidents mentioned above. They are then presented with a form that is specific to that kind of incident. Here is an example of the form for Operational Incidents:
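
Sketched roughly as data, with field names that are assumptions for illustration rather than the portal’s exact schema, the report behind the form might look like this:

```typescript
// Illustrative shape of an Operational Incident report. The field names are
// assumptions for this example, not the portal's real schema.
interface OperationalIncidentReport {
  title: string;             // Short, searchable summary of what happened
  severity: "SEV-1" | "SEV-2" | "SEV-3" | "SEV-4";
  startedAt: string;         // When the impact began (ISO 8601 timestamp)
  mitigatedAt: string;       // When the immediate fire was put out
  detectedBy: "alert" | "user report" | "internal";
  impact: string;            // Who or what was affected, and how badly
  mitigation: string;        // What was done to stop the bleeding
  followUps?: string[];      // Attached later, e.g. actions from the postmortem
  postmortemLink?: string;   // Added once the debriefing has taken place
}
```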

The expectation is that the report is submitted as soon as the fire is out; the reporter can attach additional information later, such as postmortem outcomes.

In order to make the reports more transparent, a public Slack channel was created with an integration to the portal. Every time an incident report is created, a summary of the report is automatically posted in the channel. This is where people can stay informed about ongoing incidents, offer help, and share what they know about a particular incident.
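
The integration itself is simple in spirit. A minimal sketch of that kind of wiring, using a Slack incoming webhook (the URL, payload, and summary format below are assumptions for illustration, not our actual integration), might look like this:

```typescript
// Minimal sketch of a portal-to-Slack notification via an incoming webhook.
// The webhook URL and summary format are illustrative assumptions.
const SLACK_WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL ?? "";

interface IncidentSummary {
  id: string;
  category: string; // e.g. "Operational"
  severity: string; // e.g. "SEV-2"
  title: string;
  reporter: string;
}

// Slack incoming webhooks accept a JSON body with a "text" field.
async function notifyIncidentChannel(incident: IncidentSummary): Promise<void> {
  const text =
    `:rotating_light: New incident report ${incident.id} ` +
    `[${incident.category} / ${incident.severity}]: ${incident.title} ` +
    `(reported by ${incident.reporter})`;

  const response = await fetch(SLACK_WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });

  if (!response.ok) {
    throw new Error(`Slack notification failed with status ${response.status}`);
  }
}
```

Keeping the notification to a short summary keeps the channel readable, while the full details stay in the portal.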

A tour of the new process

Everything needed to announce the new way of handling incidents was in place. We decided to do more than send an email: we wanted face-to-face conversations where people could ask questions and we could gather feedback. So we hosted a tour for each team, demonstrating the whole process from end to end.

The agenda of the session was:

  1. Introduction
  2. Incident portal
    a) Security
    b) Operational
    c) Internal
  3. Test scenario/Slack notifications
  4. Where to find the documentation

Not long after the tours, people began joining the Slack channel and automated notifications started to appear.

Learnings

Here are my takeaways as a member of the core working group:

  • Keep the users close. In our case, our users were engineers and anyone else who might face an incident. For the Operational Incident category, we iteratively gathered feedback from senior engineers on portal usability and documentation.
  • Work with others. Working in a group brings different perspectives to the table, makes the solution more comprehensive, and is more fun.
  • Be wary of making the new process too regimented. Get people to buy into it and understand the value it brings. You want to have people willingly contributing, not feeling compelled to fill out the report as a chore.

Still left to do

Now that the new process has shipped, the improvement phase begins. Some areas we will look into in the near future are:

  • How to onboard new hires to this process? Should we rely on word of mouth or make an official onboarding process?
  • How to keep up momentum in running postmortems? The ideal scenario is to have people in each team who can facilitate postmortems and not rely on a couple of people in the company.
  • How to make it even more transparent and valuable to the company? One idea is to send out a quarterly email with a summary of incidents that are documented in the portal, and other important learnings.

Looking back at what we did, I realise that none of it would have been possible if the core working group had not felt empowered. Freedom and a sense of empowerment were this team’s main fuel. Our aim was to introduce a practice that creates a valuable resource for the wider company while preserving the autonomy of individual teams. We look forward to seeing how it develops.

Sara Rabiee

I like working with development teams, solving problems and travelling. Technical Manager @ Zettle/Paypal