Acing your incident management process

What we do at Crunch — Episode 1

As I’m sure most of you are aware, we practice Continuous Deployment (CD) at Crunch. This inherently carries a lot of risk if it isn’t mitigated effectively. When we first started, many of the team, myself included, had concerns about whether this was the right decision. However, I think everyone now agrees that it has improved how we do things and allows us to deliver value to our customers much earlier than we’ve been able to in the past.

We’re not finished on this journey yet, and we’re still working on establishing CD across our entire product base. We do make mistakes occasionally, of course — every human does! But what we do is make sure that we learn from every one of them and we become better as a team at ensuring we don’t make the same mistake again.

In this series of blog posts, we’ll talk about some of the processes and techniques we have in place at Crunch, kicking off with our incident process. This has become fundamental at Crunch, not just for the products we practice CD on but across everything the Technology team here looks after.

What is an incident?

We purposefully keep the definition of an incident quite loose. We understand that it’s often very subjective, and so we look at everything on a case-by-case basis.

We consider any event that negatively impacts either our clients or client-facing departments within Crunch to be an incident. Examples include a product outage, or users being unable to log in to one of our products.

When an incident is raised, we immediately assess what we believe to be the impact. Generally, a major incident would be something that affects a significant percentage of our client base or significantly detracts from our ability to serve our clients. A minor incident would generally impact a smaller percentage of our client base.
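As a rough illustration, the assessment above could be sketched as a simple rule. To be clear, the 20% threshold, flag, and function name below are hypothetical, for illustration only, and not Crunch’s actual criteria:

```python
# Hypothetical sketch of the major/minor assessment described above.
# The 20% threshold and the service-blocking flag are illustrative only.

def assess_severity(pct_clients_affected: float, blocks_client_service: bool) -> str:
    """Classify an incident as 'major' or 'minor'."""
    # A major incident affects a significant share of the client base,
    # or significantly detracts from our ability to serve clients.
    if blocks_client_service or pct_clients_affected >= 20.0:
        return "major"
    return "minor"

print(assess_severity(35.0, False))  # large slice of the client base -> major
print(assess_severity(2.0, True))    # small slice, but service-blocking -> major
print(assess_severity(2.0, False))   # small and contained -> minor
```

In practice this judgment is made by a person, not a rule, which is why we lean on Product Owners as described below.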

Because of this, we often rely on our Product Owners to decide whether something is an incident and what they believe the impact to be. Sometimes this is very obvious (e.g. downtime on one or more of our products), but often it’s more difficult to assess (e.g. users being unable to download business guides from the website).

Immediate incident response

An incident is raised via a JIRA ticket, which can be created manually or automatically by emailing a specific email address. Most of our internal stakeholders know how to do this, so raising an incident doesn’t rely solely on the Engineering team.

Once an incident has been raised, all major stakeholders and the Engineering team are notified. Synergy (the team in charge of the infrastructure) immediately starts investigating the cause and talks to the team that’s responsible for maintaining the product in question. We then make a decision on who should own the work required to restore normal service.

After normal service has been resumed (on average around 20 minutes after the incident is raised), the ticket is moved into governance, where the Scrum Master of the responsible team ensures all relevant stakeholders have been updated on the status of the incident. Alongside this, a timeline is created, including the cause and solution, and a post mortem is booked.
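The lifecycle described above can be pictured as a small state machine. The state names here are illustrative only; they are not our actual JIRA workflow statuses:

```python
# Illustrative sketch of the incident ticket lifecycle described above.
# State names are hypothetical; the real JIRA workflow may differ.

TRANSITIONS = {
    "raised": "investigating",            # stakeholders notified, cause investigated
    "investigating": "service_restored",  # owning team restores normal service
    "service_restored": "governance",     # Scrum Master updates stakeholders
    "governance": "post_mortem_booked",   # timeline written, post mortem booked
}

def advance(state: str) -> str:
    """Move a ticket to its next state, or raise if there is none."""
    if state not in TRANSITIONS:
        raise ValueError(f"No transition from {state!r}")
    return TRANSITIONS[state]

state = "raised"
while state in TRANSITIONS:
    state = advance(state)
print(state)  # post_mortem_booked
```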

Post mortem culture

After every incident we have a post mortem: a meeting involving everyone who was part of resolving the incident. The post mortem is used to discuss the cause of, and the response to, the incident, whilst also coming up with actions to help prevent it happening again and to improve how we respond to incidents in the future.

Typically a post mortem will follow this agenda:

  1. A brief explanation of what the incident was and how it impacted us.
  2. The timeline created is read through so everyone understands what happened.
  3. The causes of the incident are discussed, and actions are recorded on a spreadsheet.
  4. The response to the incident is discussed, and actions may be recorded dependent on how effective the response was.

The actions created are given a rank (discussed below) and assigned to the team responsible for addressing them.

This is a very important moment for us to inspect and adapt what we’re currently doing, and we find these sessions very useful.

We hold them for every incident, however big or small, even if it only lasted a short time and we already know what we need to do. This enables everyone involved to be in the same room together, ensuring everybody is on the same page about the next steps. Each post mortem usually lasts somewhere between 20 minutes and an hour, depending on the severity of the incident and the number of actions discussed to help prevent it recurring.

Ensuring actions are addressed

This is often the hardest part of the process, especially when a product team needs to address an action. To stay competitive in our industry, our teams need to continually improve the client experience, meaning new features are ranked highly.

This is very understandable. However, even with the drive to get new features implemented, we need to make sure that we don’t leave technical debt and incident actions to one side. Otherwise, we may find ourselves in a situation where major incidents could happen more regularly, ultimately leading to our clients having a poor experience.

To address this, every action raised post-incident is given one of the following three priorities:

  • Critical - Needs to go into the current sprint (if the team is using the Scrum framework) and should be addressed immediately.
  • Normal - Gets ranked in the Product Backlog accordingly and goes into a sprint where there is capacity.
  • Low - Usually just a quality-of-life improvement, typically involving the process, so it can usually be completed without disrupting the product teams.
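A minimal sketch of how these three levels route an action, assuming the hypothetical destinations below (the wording is illustrative, not a real tool we use):

```python
# Hypothetical sketch of routing post-incident actions by priority,
# following the three levels described above. Destinations are illustrative.

def route_action(priority: str) -> str:
    """Return where an action goes based on its priority."""
    routes = {
        "critical": "current sprint: address immediately",
        "normal": "product backlog: ranked, picked up when there is capacity",
        "low": "process improvement: done without disrupting product teams",
    }
    if priority not in routes:
        raise ValueError(f"Unknown priority: {priority!r}")
    return routes[priority]

print(route_action("critical"))
```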

We haven’t yet been in a situation where a critical action hasn’t been addressed within a week of being raised, which is excellent, since these actions usually aim to stop the incident occurring again in the near future.

However, we’re currently reviewing our process for Normal actions, since we find they are usually the ones outranked by user-facing features, often to the point where they don’t get done.

Our Product Owners are on board with this process, and fully understand why we need to do it.

How are things now?

When we first started recording metrics for the number of incidents occurring (around the beginning of 2018), we were having quite a few. These were also often recurrences of previous incidents, since we hadn’t properly analysed how to ensure they didn’t happen again. The graph below demonstrates how the number of incidents has changed over time:

You can see that the number of incidents has recently been greatly reduced, with many weeks having no incidents at all. This is partly due to the hard work of everyone in the Engineering team in ensuring our incident process is closely followed and the actions are addressed.

As mentioned in the post mortem section, we have a spreadsheet that we use to record every action, and my fellow Scrum Masters and I regularly meet to ensure we’re keeping on top of the outstanding actions. We also ensure that the spreadsheet is up to date and is available to everyone who needs to be aware of it.

Conclusion

We’ve done a lot of work to ensure that our products have as few incidents as possible, and we’ll always strive to improve our processes whilst completing the actions they produce.

If you’re reading this and have any suggestions for what we could do, or indeed have any questions, please feel free to comment on this article.

We understand that we aren’t going to be in a situation where we’ll be completely free of incidents, but we’ll continue to work towards having as few as possible.

This is only one of the steps we’ve taken to mitigate risk as a department and, in future articles, we’ll be going into more detail about some of the other things we’re doing to achieve this.

Please follow us and give this article a round of applause if you’d like to see more articles like this.


Written by Samuel Catt, Scrum Master at Crunch. In his spare time Samuel is a competitive English Pool player taking part in a number of competitions.

Find out more about the Technology team at Crunch and our current opportunities here.