Incident management at major sporting goods e-commerce

Valentin Martin
Published in Decathlon Digital · Apr 24, 2024

In the fast-paced realm of incident management, the absence of a structured process invites chaos. Before implementing our Incident Management process, we faced significant challenges, especially during highly critical, multi-team incidents where resolution is expected to be quick.

One of the main obstacles was the lack of incident classification. Without a clear method for categorizing and qualifying incidents, it was difficult to involve the relevant teams with the right priority, since they did not know the severity level. Each problem seemed unique, making coordination and resolution complex.

Additionally, our internal communication suffered from inconsistent formatting of the first message in our incident Slack channel: the level of detail and the wording depended heavily on whoever reported the incident.
Also, because each incident had a single thread, exchanges were fragmented. Multiple leads were explored in the same thread, making it challenging to track discussions and make informed decisions.

Finally, we had trouble extracting statistical information as simple as the number of incidents, the breakdown by criticality, the MTTR*, etc.

This inefficiency impeded our ability to swiftly resolve incidents and minimize their impact on our operations.

In this article, we’ll explore how we overcame these challenges by implementing a robust (yet still improving) Incident Management process. From incident reporting to resolution, via classification methods and coordination workflows, we’ll share the lessons learned and the best practices we’ve developed to ensure effective incident management.

*MTTR: Mean Time To Recover

Midjourney vision of an IT incident manager in a sport company (AI generated)

We then attended SREcon22, where we saw several companies using this kind of process to improve internal collaboration, and even to automate some actions during incident management.

After that, we tested a SaaS solution plugged into Slack, designed to create, edit, and handle incidents directly from Slack commands. Unfortunately, we were not mature enough to use all the provided features, and the investment was too big.
Still, this test was inspiring, and we managed to implement the main coordination actions using Slack workflows.

The key to coordination: Use a powerful collaboration tool

The core of our new process is the use of a collaboration tool. We had already been using Slack for many interactions, and even for incident management.

But we were not using it to its full potential, and we improved significantly when we started using Slack workflows. (Note: this is not an advertisement for Slack or Slack workflows ;))

Consistent incident announcements

Using Slack workflows to handle incidents made it easy to post a consistent message each time an incident is declared:

A screenshot of a slack message for incident declaration, showing structured information

This is the kind of message posted in our global incidents channel (something like #ecommerce-incidents).

Here you can easily find the information you’re looking for (priority, impacted country, …) because it’s always in the same place in the message.
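As an illustration, such an announcement can be thought of as a simple template. The field names and emoji below are assumptions (the real message is built by the Slack workflow form from the reporter’s answers); the point is that each field always lands in the same place:

```python
# Illustrative template for a consistent incident announcement.
# Field names and emoji are assumptions, not the actual workflow output.

def incident_announcement(priority: str, country: str,
                          journey: str, description: str) -> str:
    """Format the fields so they always appear in the same place."""
    return (
        ":rotating_light: *New incident declared*\n"
        f"*Priority:* {priority}\n"
        f"*Impacted country:* {country}\n"
        f"*User journey:* {journey}\n"
        f"*Description:* {description}"
    )

print(incident_announcement("P1", "FR", "Checkout", "Payment page errors"))
```

Because the structure is fixed, readers can scan for a field by position rather than re-reading free-form prose.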

One dedicated channel to work together

In the posted message, we can see that the Slack workflow has created a dedicated channel (“Join the incident”), whose name starts with “inc-deca”, then the date and the priority, and ends with a short description.
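The naming convention can be sketched as a small helper. The exact separators and slug rules below are assumptions based on the description above:

```python
# Hypothetical helper reproducing the channel naming convention:
# "inc-deca", then the date, the priority, and a short description.
from datetime import date

def incident_channel_name(day: date, priority: str, description: str) -> str:
    """Build a Slack channel name like inc-deca-20240424-p1-checkout-down."""
    slug = "-".join(description.lower().split())[:30]  # assumed slug rule
    return f"inc-deca-{day.strftime('%Y%m%d')}-{priority.lower()}-{slug}"

print(incident_channel_name(date(2024, 4, 24), "P1", "Checkout down"))
# -> inc-deca-20240424-p1-checkout-down
```

A predictable name lets anyone find (or sort) incident channels without joining them first.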

This dedicated channel is where we explore leads and manage the investigation and resolution with all the contributors.

In this dedicated channel, the first message provides guidelines on how to handle the incident:

Slack screenshot of the first message in the channel, giving guidelines to incident commander and on-call engineers

As you can see, we recommend, for example, using one thread per lead or topic.

This way, you can discuss multiple topics at the same time without polluting the channel (for example, one thread to check database usage, another to find the proportion of affected users, etc.)

After the incident, this will help write down the timeline in the incident report (aka post-mortem).

Defining role(s)

We also implemented a basic way to tell contributors who is in charge of the incident (the Incident Commander): clicking a button (see above) triggers a message in the channel.

Screenshot of slack message announcing the incident commander

This is a first step toward defining precise roles.
We usually assign the “Communication lead” too (but without using Slack workflow features).

Easily store ideas for later

Another small tip is using meaningful emoji reactions on messages, once again to ease the incident report writing.

Screenshot of conversation showing message “Idea for later : Need alert on having multiple API instances response time too high.”, with a lightbulb reaction

Indeed, Slack search makes it possible to retrieve them all afterward (with the “has:” filter):

Screenshot of search result for the search “in:#inc-channel has::bulb:” showing messages with the :bulb: emoji reaction

Let’s zoom out and see the whole process

This Slack workflow is where most of the fun takes place, but it’s a small part of our end-to-end incident management process.

This incident management process has multiple major steps, inspired by other articles, talks, or peer discussions.

Wide image showing the steps of the incident management workflow. From left to right, the major steps are: Detection, Creation, Classification, Troubleshooting, Resolution, Review

First: Classify, prioritize, and declare your incident

The first step of the incident journey is, of course, detection. We won’t dwell on it here.

Image zooming on steps 2 and 3: Creation (with step “Incident form in ITSM tool”), then Classification (with step “Prioritization / Categorization done by ITSM tool”)

Once a team has detected an incident, they create and classify it.
To do so, we use our ITSM tool, which provides multiple fields to declare the incident.

Image of the creation form in ITSM, with multiple fields to fill : Start time, affected Country, Impact, Criticality, Priority(calculated), affected User Journey

The Impact is a score from 1 to 5 reflecting how much the feature is impacted. For example, for the payment module, the more widely used the affected payment option (credit card / gift card / PayPal), the higher the Impact score.

The Criticality is a Low/Medium/High rank reflecting how critical the feature is for Decathlon overall. Each rank has a hidden numerical value.

We have a third parameter (not visible in the screenshot above) for the approximate proportion of users impacted (“Few” / “A lot” / “All”), which also corresponds to a hidden numerical value.

Then we multiply the three values, giving the Priority score (the grayed field in the image), from 0 to 300, which corresponds to a P5 to P1 priority rank.
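As an illustration, the calculation can be sketched as follows. The hidden numerical values and the score-to-rank bands below are assumptions (the article only states the 0–300 range), not Decathlon’s actual figures:

```python
# Sketch of the priority computation: Impact x Criticality x Users affected.
# The hidden values and the P-rank thresholds are illustrative assumptions.

CRITICALITY = {"Low": 2, "Medium": 4, "High": 6}      # assumed hidden values
USERS_AFFECTED = {"Few": 2, "A lot": 5, "All": 10}    # assumed hidden values

def priority_score(impact: int, criticality: str, users: str) -> int:
    """Multiply the three values to get a score up to 300."""
    assert 1 <= impact <= 5, "Impact is a score from 1 to 5"
    return impact * CRITICALITY[criticality] * USERS_AFFECTED[users]

def priority_rank(score: int) -> str:
    """Map the score to a P1 (most urgent) .. P5 rank (assumed bands)."""
    for threshold, rank in [(200, "P1"), (120, "P2"), (60, "P3"), (20, "P4")]:
        if score >= threshold:
            return rank
    return "P5"

# A checkout outage affecting all users of a critical feature:
score = priority_score(5, "High", "All")   # 5 * 6 * 10 = 300
print(priority_rank(score))                # -> P1
```

Whatever the exact hidden values, the multiplicative form means any single low factor (a minor feature, or very few users) pulls the whole priority down.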

Second: Trigger the war room

For P4 and P5 incidents, there’s no need to disturb people and make them stop what they’re doing. The creator transfers the incident to the affected team using the ITSM tool, and that team will take care of it later.

Image zooming on step 4, Troubleshooting: for high-priority incidents, “Fill homemade triggering page” performs the two other steps, “Trigger Slack workflow” and “Send communication”

For P1 to P3 incidents, we consider that the resolution needs to be prioritized, and we need to make people work together quickly.

We have developed a basic front end, based on Google Apps Script, that can:

  • run the Slack workflow (the one we described above)
  • send a communication via our communication tool
  • do both

Screenshot of the homemade tool: one input field for the incident ID, and a “Communication method” select with the options “Send only Slack war_room”, “Send Statuspal communication”, and “Do both”

Based on the previously created incident ID, the script fetches the information from the ITSM tool and uses it to send the communication and to trigger the Slack workflow via a webhook (a webhook trigger is one of the ways to start a Slack workflow).
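The triggering step amounts to an HTTP POST of the incident fields to the workflow’s webhook URL. Here is a minimal sketch in Python (the real front end is Google Apps Script); the payload field names and the ITSM record shape are assumptions:

```python
# Sketch of triggering a Slack workflow via its webhook URL.
# Field names and the incident record shape are hypothetical.
import json
import urllib.request

def build_payload(incident: dict) -> dict:
    """Select the fields the Slack workflow form expects (assumed names)."""
    return {
        "incident_id": incident["id"],
        "priority": incident["priority"],
        "country": incident["country"],
        "description": incident["description"],
    }

def trigger_war_room(incident: dict, webhook_url: str) -> int:
    """POST the payload to the workflow's webhook trigger URL."""
    data = json.dumps(build_payload(incident)).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 200 means the workflow started
```

Keeping the payload-building separate from the HTTP call makes the mapping between ITSM fields and workflow variables easy to test and adjust.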

Third: Solve the incident and plan the post-incident actions

Zoom on workflow parts : end of Troubleshooting (with step “Coordination in Slack”) then Resolution box with step “Incident closure in all tools”, then “Review box” with steps “Post-mortem and action plan”

Now it’s time to solve the incident, by exploring leads and collaborating in the Slack channel or on a Google Meet.

Even when we use a video call, we still recommend taking notes in the channel to keep a history.
Indeed, the message history is not saved in Google Meet by default, but more importantly, you’ll also need a place to store screenshots (of error pages, logs, metric dashboards, …).

Once the incident is solved, you can thank everyone for their involvement. We encourage you to send kudos to people who were particularly committed to the resolution.
And of course, your users will be grateful to know the incident is solved, so please communicate as soon as the situation is stable.

Depending on the criticality of the incident, you’ll want to write a post-mortem (aka incident retrospective, lessons learned), to improve the systems, the monitoring, or the incident management.
(Remember that the Slack channel can be a wonderful source of insights)

Now what?

This process made us stronger when responding to incidents, and mature enough to identify areas for improvement.

Main outcomes

Stock picture of a person shooting a basketball, about to score (representing success)
  • The mean time to acknowledge (i.e., the time for the responsible team to start working on the incident) has decreased from 15–20 minutes on most incidents to 5–10 minutes, saving us 10 to 15 minutes per incident.
  • The time to communicate has also improved. Before, the incident management team had to create the incident in the ITSM tool, post a message in Slack, and then communicate it to our users.
    They often had to write different messages each time, losing a lot of time duplicating information.
    With this process, once the incident is created in the ITSM tool, the information is sent to Slack and to our users at the same time.
  • For the post-incident report: each message is timestamped, which makes writing the timeline easier, and people can add reactions to mark messages as important.
    This way we can easily extract action plans from the thoughts people had during the incident.
    (Be honest: during incidents, you often think “Hey, if we had this, it would be way easier”, but you don’t say it because it’s not helpful at that moment, and you’ve forgotten it by the time the post-mortem comes.)

Limits and areas for improvement

Stock picture of a person fallen on the floor next to their bike (representing failure)
  • Today, even if a monitoring system does the detection, we still need a human to create the incident and trigger the war room.
    We’re working on making this automatic for major incidents detected by monitoring.
  • The classification is still highly subjective: the Impact and Criticality scores depend on who creates the incident.
    We will make it easier to set these values (e.g., the criticality could be predefined for each user journey/feature to avoid this).
  • As we said, after creating the incident in the ITSM tool, you still need another tool to trigger communication and war-room management, because our ITSM tool cannot push events on incident creation/modification. We need to find another way to do this.
  • We might want to improve how we handle tasks. Today, we send messages in the channel to ask people to do tasks, but these are only messages: even though people can easily react with emojis, tasks can be forgotten, and even when they are done we cannot fetch the timestamp.
    A more effective task manager inside Slack would help; this could be done with a Slack app we develop ourselves.
  • Some actions also need to be done periodically, like communication. We can imagine a new task created every hour and assigned to the designated communication lead, so we don’t forget to keep our users informed.
  • Speaking of the communication lead: we don’t have clear role management. Today it’s the same as for tasks: we assign roles via messages, but it would be better to keep track of who’s in charge of each role.
  • Post-incident management: we would like to make life easier for the people who write the incident retrospective.
    The timeline and the action plan could be written automatically (at least partially) from the messages and reactions (for example, detecting “:bulb:” messages to mark them as lessons learned or actions to take).
  • Finally, about the process itself: it is widely used at the e-commerce team level, but some other teams (like our Business Capabilities Platform teams, our e-commerce catalog provider, …) are not familiar with it, and it can be difficult for them to adapt. We need to decide whether we want this for the whole company (and I think we should), and help the surrounding teams adopt it.

Key takeaways

Stock picture, showing 2 persons running in a mountain scenery
  • Classify your incidents. Whether at the team or company level, you need a way to define a priority/severity, to make incidents easier to compare and to give teams an incentive to work on them accordingly. (Company level is better, of course, but that can be hard, depending on company size.)
  • Have a consistent way to list them. Declaring incidents in your Slack channel is helpful while working on them, but once they’re finished, you won’t be able to compute stats on the number of incidents, their severity, or resolution times.
    Most companies have an ITSM tool that is a perfect fit for that.
  • Use collaboration tool(s) for cross-organization incident management. Communication tools like Google Chat, Slack, or Microsoft Teams make it easier to dig into incidents, with several people working on multiple leads at the same time.
    Some incident management tools are even cleverly integrated with these collaboration tools, making them more powerful (for example, incident.io and Jeli have developed effective features to improve incident resolution).
  • Finally, don’t be afraid to start small and improve your process step by step: a big process can frighten people and slow adoption, while a smaller process is easier to implement and adopt. One of the hardest parts is getting software engineers to actually use the process; incident management is not an easy habit to build if you’re not used to it.

Tell us if this is something that could fit your needs.
We’re also very eager to know if you’re using this kind of process, and how you’ve handled its benefits and flaws!

Incident management process designed and article written with @sylvain.herail
