Stop counting incidents, learn from them!

Showpad Engineering
Product & Engineering at Showpad
3 min read · May 9, 2019

There was a time when we measured quality by the number of severe incidents on the platform. Fewer severe incidents means better quality, which makes total sense. So we built a whole program around it, called “2000 hours without an incident”. The objective was to keep the platform free from severe incidents for 2000 hours. Our best performance ever was 1,634 hours, which is quite an achievement given that we do multiple production deploys per day.

I was so proud that we had started this project, and of what it achieved, until I read the book “Accelerate” by Nicole Forsgren, PhD. Everyone who faces the challenge of scaling high-performing engineering teams should read this book. It explains that the absolute number of incidents is not your key metric; Mean Time to Restore (MTTR) is.

“If everything seems under control, you’re not going fast enough.” — Mario Andretti

When I told people we no longer consider the absolute number of incidents a key metric, they were surprised and asked me: “Is quality not a priority anymore for engineering?” Quality is still important! But it’s not the end goal. It allows us to go faster; it allows us to ship business value more frequently. If you want to go fast, failure is inevitable. What matters is how quickly you can restore the service when a severe incident happens.

So we started to measure the MTTR for severe incidents and added an extra question to the incident retrospective. Besides “How can we prevent the incident?”, we now also ask ourselves: “How can we resolve the incident faster?” The results after only 4 months are astonishing: the MTTR has dropped by 50%! We also discovered a correlation between overall software quality and MTTR: parts of the software with lower overall quality have a higher MTTR. And the higher the MTTR, the more improvements can be identified during the retrospective.
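To make the metric concrete, here is a minimal sketch of how MTTR and its change between two periods could be computed. It is not our actual tooling: the `Incident` record, the choice to measure from detection to restoration, and the sample timestamps are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record: when the incident was detected and when service was restored.
    detected_at: datetime
    restored_at: datetime

    @property
    def time_to_restore(self) -> timedelta:
        return self.restored_at - self.detected_at


def mean_time_to_restore(incidents):
    """Average time to restore over a list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.time_to_restore for i in incidents), timedelta())
    return total / len(incidents)


def percent_reduction(before, after):
    """Relative drop from `before` to `after`, as a percentage."""
    return 100.0 * (before - after) / before


# Made-up sample data, only to show the arithmetic.
t0 = datetime(2019, 1, 1, 9, 0)
before = [Incident(t0, t0 + timedelta(minutes=120)),
          Incident(t0, t0 + timedelta(minutes=60))]
after = [Incident(t0, t0 + timedelta(minutes=30)),
         Incident(t0, t0 + timedelta(minutes=60))]

mttr_before = mean_time_to_restore(before)  # 1:30:00
mttr_after = mean_time_to_restore(after)    # 0:45:00
print(percent_reduction(mttr_before, mttr_after))  # 50.0
```

Whatever start and end points you pick for an incident, the important thing is to apply them consistently, so that the trend over time stays comparable.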

“In complex adaptive systems, accidents are almost never the fault of a single person. Rather, accidents typically emerge from a complex interplay of contributing factors” — Nicole Forsgren

Each retrospective turned into a kind of air crash investigation. Incidents will keep happening, but the root cause will shift from obvious human mistakes to a complex interplay of different factors.

Learning is key in the whole process. Incidents are an opportunity to learn and to make your platform more robust. Outcomes of retrospectives are now presented to the whole engineering team of 100+ people on a regular basis: to share the learnings, the risks, and the improvements, and to ask for feedback. It turned out to be the best move ever.

You will learn more from your failures than your successes — so embrace those mistakes, as difficult as that sounds, and grow from them. When a project is successful, you’re never really sure why, because so many elements come into play. However, when you fail, you always know why. That is how you learn and grow — Lynda Resnick
