Triggered: Incident #1234 (incident process needs fixing)

Sara Rabiee
Jan 18 · 6 min read

This is the story of a small group that voluntarily got together to improve the incident handling process within our company. It is not a “we did great, you should do the same” story, but rather a “this is what we did, what would you do?” one.

Image for post

Our product development department consists of cross-functional teams that have high autonomy and full ownership of what they deliver. For example, we have a team that is responsible for the product catalog feature, from frontend to backend, from design to deployment. This model has many advantages, such as having full control over all aspects of the feature, but a side effect is that each team tends to have its own engineering culture and practices, and there is no central authority that prescribes what processes each team should follow.

So the question is: if you don’t have a centralised control mechanism, how do you make changes that are intended to be propagated across teams? Depending on your context the exact solution will probably be different for your company, but there are some underlying principles that can help to foster the change. Our approach was to focus on two areas: alignment and transparency.

The story

We begin in late 2019. I was an Agile Coach, a role that involves helping teams to improve their working processes and culture. I personally care about incidents and believe that the learnings that come from postmortems are very valuable. Once I had expressed that interest, different teams started inviting me to facilitate their incident postmortems.

One afternoon I sat at my desk and thought: “this is too much, everyone should be able to facilitate postmortems”. I wrote a three-page document on the subject and shared it with our CTO.

He said (paraphrasing): “It is great that you started this initiative. There are other people who have started working on this topic. Maybe it is better if you team up with them.”

This approach turned out to be very revealing. After teaming up with various people from different parts of the company (Security, IT, Legal, Risk and Engineering), I realised that this was not as simple as learning how to facilitate postmortems. There was much more to it:

  • Technical incidents are important, but not only for the engineering department. As soon as an incident affects the end user, it becomes important for the risk and legal departments as well as customer support.

The work started with identifying the different incident categories. We split the team into sub-groups to create documents for each category. The following are the categories we came up with. Depending on the business you’re in, these can differ.

Image for post

Alignment through documentation

There were two of us in the Operational Incident category. When we started the documentation we soon realised that this was potentially a deep rabbit hole. In the extreme we could end up going all the way to listing out what each team owns. This would be of little value since it would most likely go out of date and, more importantly, it would be encroaching on the teams’ domains of expertise. So we decided to keep it high-level and let the teams keep the more technical detailed documents internal.

This is the overview of what we documented for the Operational Incident category:

  • What is an incident?

Detailed documentation is nice to have as a reference and a source of truth but it is too much to read at the time of an incident. So we created a procedure diagram on the first page of our documentation that is easy to follow under pressure.

Image for post

Transparency through a shared list of incidents

To collect information about incidents, we needed a shared place to report them. So we created a portal where different categories of incidents can be reported as soon as the immediate incident has been mitigated. On the landing page the reporter can pick one of the four categories of incidents mentioned above. They are then presented with a form that is specific to that kind of incident. Here is an example of the form for Operational Incidents:

Image for post

The expectation is that the report should be submitted as soon as the fire is out and the reporter can attach additional information later, such as postmortem outcomes.

To make the reports more transparent, we created a public Slack channel with an integration to the portal. Every time an incident report is created, a summary of the report is posted in the channel automatically. This is the place where people can be informed of ongoing incidents, offer help, and share what they know about a particular incident.
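For readers curious what such an integration could look like, here is a minimal sketch in Python. It is not our actual implementation: the `IncidentReport` fields, the function names, and the webhook URL are all assumptions made for illustration. The only external behaviour it relies on is that a Slack incoming webhook accepts a JSON body with a `text` field.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class IncidentReport:
    # Hypothetical fields; the real portal form may capture different data.
    category: str
    title: str
    reporter: str
    status: str = "mitigated"


def build_summary(report: IncidentReport) -> dict:
    """Build a Slack incoming-webhook payload that summarises a report."""
    text = (
        f"New incident report [{report.category}]: {report.title}\n"
        f"Reported by {report.reporter} (status: {report.status})"
    )
    return {"text": text}


def post_summary(report: IncidentReport, webhook_url: str) -> int:
    """POST the summary to a Slack incoming webhook; return the HTTP status."""
    body = json.dumps(build_summary(report)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,  # placeholder: supplied by whoever configures the channel
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

In a setup like this, the portal would call `post_summary` once a report has been saved. Keeping `build_summary` separate from the network call makes the message format easy to check without contacting Slack.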

Image for post

A tour of the new process

Everything needed to announce the new way of incident handling was in place. We decided to do more than an email. We needed a face-to-face conversation to let people ask their questions and gather feedback. So, we hosted a tour for each team, demonstrating the whole process from end to end.

The agenda of the session was:

  1. Introduction

Not long after the tours, people began joining the Slack channel and automated notifications started to appear.

Learnings

Here are my takeaways as a member of the core working group:

  • Keep the users close. In our case, our users were engineers and whoever might face an incident. In the Operational Incident category, we gathered feedback iteratively from the senior engineers for portal usability and documentation.

Still left to do

Now that the new process has been shipped, the improvement phase begins. Some areas we will look into in the near future are:

  • How to onboard new hires to this process? Should we rely on word of mouth or make an official onboarding process?

Looking back at what we did, I realise that this would not have been possible if the core working group had not felt empowered. Freedom and a sense of empowerment were our main fuel in this team. Our aim was to introduce a practice that creates a valuable resource for the wider company while preserving the autonomy of individual teams. We are looking forward to seeing how it develops.

Zettle Engineering

We build tools to help business grow — this is how we do it.
