Incident Management at idealo

Gerrit Lutter
Published in idealo Tech Blog
6 min read · Jul 13, 2019
Photo by Hamza El-Falah on Unsplash

Every day at idealo, we work hard to deliver a great experience for our customers and partner shops. A big part of this is ensuring that our products run smoothly and function correctly. However, given the complexity of our technology and products, there will be glitches that we cannot anticipate. When such incidents happen, we want to fix them quickly and learn from them to make idealo even more robust.

Concepts for incident management have been around for a while. However, with the emerging DevOps approach, our focus has shifted from traditional operations departments to cross-functional teams that can build, ship, and also operate systems and products.

In such an environment, a different kind of incident management is needed to leverage the potential of cross-functional units.

At idealo, we have been developing and improving our very own incident process for a few years now. To find out more, I sat down with Oliver Effner, Agile Coach at idealo, who has been working a lot with this process.

Gerrit: Oliver, to start things off, tell us a bit about yourself and your connection to incident management.

Oliver: I have been working here as an Agile Coach for three years now, mostly with operations teams. It is standard for an operations unit to have an incident process in place.

And this is how I got involved in the whole topic of incident management. When I started, we had a very classical approach to incidents that was very much focused on the operations side of things. For a platform like idealo, run with a DevOps philosophy, this wasn’t a good fit.

Gerrit: Could you elaborate a bit on what a classical approach to incident management means to you?

Oliver: The classical incident management approach looks like this: the IT service desk receives a malfunction report, classifies the malfunction, identifies the affected system, and begins handling it. This means typical administration activities, such as clearing the cache, cleaning up or extending memory space, restarting the system, and so on. This classical process, as described for example in ITIL [Information Technology Infrastructure Library], has one central place where incidents are reported: the IT service desk. There, tickets are dealt with, often through 1st, 2nd, and 3rd level support. This is basically what existed at idealo: one central place to go. People in operations then found the right level of support to deal with an incident.

Gerrit: And what were the challenges with the old process?

Oliver: The big problem was that it took a long time to locate the source of a problem. Also, operations wasn’t always in a good position to act, because of the DevOps end-to-end philosophy. They can restart servers, but they cannot fix issues in the code. So they weren’t the right people to deal with an incident. On the other hand, there was no dedicated team to deal with incidents.

Gerrit: Gall’s Law states that “a complex system that works is invariably found to have evolved from a simple system that worked.” What were the humble beginnings of the new incident process at idealo?

Oliver: We first got rid of the idea that incidents were reported to operations, and that operations were the ones having to deal with them. This is where we started. So not every incident was reported to operations. Instead, we installed a process to make the incident visible to the whole department. This allows us to then find the right people to fix the problem.

In case of a malfunction, the idealo approach is: report, engage, fix, and learn. We first make the malfunction visible for everyone in the department. One person will be responsible for managing the incident. This includes finding the right people, i.e. assembling a team to remove the malfunction. Once it is removed, we dedicate time to learn. This helps us to make our systems and our process better.

The difference from the classical approach is already visible in step one. Instead of handing over responsibility to the IT service desk, the colleagues take care of the malfunction and organize the work themselves, without central control.
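The four steps Oliver describes (report, engage, fix, learn) can be read as a small state machine. The following is only an illustrative sketch, not idealo's actual tooling: the `Incident` class, its fields, and the method names are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    REPORTED = "report"
    ENGAGED = "engage"
    FIXED = "fix"
    LEARNED = "learn"


# Stages in the order named in the interview; an incident only moves forward.
ORDER = [Stage.REPORTED, Stage.ENGAGED, Stage.FIXED, Stage.LEARNED]


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    summary: str
    manager: str  # the one person responsible for managing the incident
    stage: Stage = Stage.REPORTED  # creating the record is the "report" step
    responders: list[str] = field(default_factory=list)
    learnings: list[str] = field(default_factory=list)

    def _advance(self, expected: Stage) -> None:
        # Refuse to skip steps, e.g. "fix" before a team has been engaged.
        if self.stage is not expected:
            raise ValueError(
                f"expected stage {expected.value!r}, but incident is at {self.stage.value!r}"
            )
        self.stage = ORDER[ORDER.index(expected) + 1]

    def engage(self, responders: list[str]) -> None:
        if not responders:
            raise ValueError("need at least one responder")
        self._advance(Stage.REPORTED)
        self.responders = list(responders)

    def fix(self) -> None:
        self._advance(Stage.ENGAGED)

    def learn(self, learnings: list[str]) -> None:
        self._advance(Stage.FIXED)
        self.learnings = list(learnings)


incident = Incident("checkout errors", manager="alice")
incident.engage(["bob", "carol"])
incident.fix()
incident.learn(["add an alert for checkout error rate"])
```

Modelling "learn" as a required final stage reflects the point made above that time is dedicated to learning once the malfunction is removed.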

Gerrit: You described how you shifted from a classical process to something new. Can you make out other development stages in the way the process evolved over time?

Oliver: Another big step was the distribution of responsibility across the whole Product & Technology department. The aim of our CTO was to have everyone report incidents and to help resolve them. This greatly increased the visibility of incidents, as many people are now watching the process. And incidents can no longer be ignored. What is more, it has become very easy to find the right people to resolve issues. Because we identify the systems responsible for an incident, and with it the people responsible for the system, we have much higher ownership for resolving incidents.
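The idea of going from the affected systems to the people who own them can be sketched as a simple lookup. This is a hypothetical illustration: the `OWNERS` registry, the team names, and the `responders_for` helper are assumptions and do not describe how idealo actually stores ownership.

```python
# Hypothetical ownership registry: system name -> owning team.
OWNERS = {
    "search": "team-search",
    "checkout": "team-checkout",
}


def responders_for(affected_systems, owners=OWNERS, fallback="incident-manager"):
    """Return the teams to engage, in first-seen order and without duplicates.

    Systems with no registered owner fall back to the incident manager,
    who then has to find the right people by other means.
    """
    teams = []
    for system in affected_systems:
        team = owners.get(system, fallback)
        if team not in teams:
            teams.append(team)
    return teams


teams = responders_for(["search", "checkout", "search"])
```

Keeping such a mapping explicit is one way to make ownership visible; the fallback covers systems nobody has claimed yet.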

Gerrit: What are your personal learnings from working with and developing the process?

Oliver: In my view, the new approach works very well for a lot of situations without necessarily rendering the classical approach obsolete. In this sense, it is a complement rather than an alternative, although we do focus more on the new approach at idealo.

Gerrit: What kind of challenges do you see for the future that aren’t yet addressed by the process?

Oliver: The process is limited to our Product & Technology department. This means that incidents that involve other departments cannot be handled with the same energy. We also have room for improvement when it comes to learning from incidents. We usually run post-mortem workshops after an incident is resolved. This is where we ask ourselves how we might improve the process, and what we can do from a technical point of view to prevent future incidents. We have those findings, but we could do a better job of reviewing them systematically, looking for patterns, and making the learnings available for similar incidents in the future. This is an area for growth.

Gerrit: If other organizations asked you for some advice, what are three things that you would tell them?

Oliver: Three pieces of advice … First, create an environment in which people can make incidents visible without necessarily having to deal with them themselves. Second, the process has to be as simple as possible. Everyone must be able to understand how to act. Third, you need to show unambiguously that management wants this process to work exactly this way, so that an incident is handled with high priority: it beats a deadline, and it is more important than other work, including your daily work.

Gerrit: Is there anything else you’d like to share with our audience?

Oliver: This kind of process will not spread automatically throughout your organization, even when it is simple. People need to be trained. For us this means running workshops, one every month, where we train up to 12 people in how to handle incidents. This is how we systematically build knowledge in the organization. And it doesn’t hurt to refresh that knowledge, say after two or three years. That is really important.

Another thing is that inside the incident process, we have teams made up of people who don’t normally work together. Those teams are cross-functional, spanning the whole Product & Technology department. This is crucial for resolving incidents effectively. At the same time, it helps colleagues get to know each other better and increases understanding across units. This fosters a culture of collaboration, which allows us to act faster when needed. It also helps build bridges between operations and development.

Gerrit: Those are some great closing words! Thank you very much for this interview and enjoy the rest of your day!

Oliver: Thank you!

Please let me know if you found this article useful (👏🏻) so others can find it too, and share it with your friends. You can follow me here on Medium (Gerrit Lutter) or on Twitter (@gerritlutter) to stay up-to-date with my work. Thanks a lot for reading!

Gerrit Lutter
Agile Coach, Scrum Master, Coach, Mediator, Facilitator, Amateur Chef. Changing the world of mobility with my amazing colleagues at SHARE NOW.