A “Fire” Incident case study

Lior Avni
May 30 · 4 min read

Here at Gett, we have several degrees of severity for production environment related issues.

In the following blog post, I'll review one such "Fire" incident we managed: how we analyzed it, what methodology we used to manage it, how it was resolved and what we learned from it.

It was a day like any other day…

It started as a normal day: office spirit was high, it was mid-August, summer vacation was on and the weather was great all around. We were just about to head out to lunch when the first call came in from our Customer Care department: a customer had called and complained that he couldn't order a taxi because the app told him he had an outstanding balance on his credit card.

As our internal monitors were not showing any signs of an anomaly, our first thought was that this was a temporary timeout between third-party components, which can momentarily cause such glitches in service. Then, as the second, third and fourth calls came in, we realized something was up.

Initial investigation begins

So we started gathering facts and creating a timeline of the event as it unfolded.

This is where the importance of the Incident Manager position comes into full effect. Our on-call Incident Manager took command of the incident and started executing our "Fire" protocol.

Root Cause ascertained

It was a nerve-wracking hour or so until the root cause was discovered. As with so many incidents, a tiny semicolon had been placed in the wrong location (and really, show me one developer who never misplaced a semicolon in production…), causing a chain reaction that created this specific issue.
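To illustrate the kind of bug we are talking about, here is a minimal sketch in C (illustrative only, not our actual code or language): the stray semicolon after the `if` terminates the statement, so the check guards nothing and every customer suddenly appears to have an outstanding balance.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical example of a misplaced semicolon causing a chain reaction. */
static bool has_outstanding_balance(double balance_due)
{
    if (balance_due > 0.0);   /* <-- misplaced semicolon: the check now guards nothing */
    {
        return true;          /* always reached, so every customer appears to owe money */
    }
    return false;             /* the intended "no debt" path, never reached */
}

int main(void)
{
    /* Both customers get blocked from ordering, debt or no debt. */
    printf("balance 0.00 -> blocked? %d\n", has_outstanding_balance(0.00));
    printf("balance 7.50 -> blocked? %d\n", has_outstanding_balance(7.50));
    return 0;
}
```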

The fix was multi-faceted: correct the offending code, restore the affected data, and verify with our customers that everything was back to normal.

The first part was easy, while the second part required knowing exactly when the original bug was introduced to the system. Thankfully, every deploy is logged in full detail, so it didn't take too long to find the correct time, recover the data from the right point in time and restore the system to an operational state. The third part involved our Customer Care department, which diligently went from customer to customer and verified that everything was now working as it should.
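To give a rough idea of why a detailed deploy log made this quick, here is a hypothetical sketch (the timestamps, service names and log format are invented for the example): walk the deploy log and keep the last deploy that happened before the incident started; that deploy is the prime suspect, and its timestamp is the candidate point in time to recover data from.

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Illustrative deploy-log entries and incident time, all made up for the
     * example. ISO-8601 timestamps in UTC compare correctly with plain strcmp. */
    const char *deploy_log[][2] = {
        { "2018-08-14T09:12:00Z", "payments v41" },
        { "2018-08-15T10:03:00Z", "payments v42" },
        { "2018-08-15T13:45:00Z", "dispatch v17" },
    };
    const char *incident_start = "2018-08-15T12:30:00Z";

    /* The last deploy before the incident started is the prime suspect, and
     * its timestamp is where the point-in-time data recovery should aim. */
    const char *suspect_time = NULL;
    const char *suspect_deploy = NULL;
    for (size_t i = 0; i < sizeof deploy_log / sizeof deploy_log[0]; i++) {
        if (strcmp(deploy_log[i][0], incident_start) < 0) {
            suspect_time = deploy_log[i][0];
            suspect_deploy = deploy_log[i][1];
        }
    }

    if (suspect_time != NULL) {
        printf("Suspect deploy: %s at %s -> recover data from just before this time\n",
               suspect_deploy, suspect_time);
    }
    return 0;
}
```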

This was a defining moment in Gett's incident handling methodology, as this was an incident that affected many of our customers. It felt like a successful space shuttle launch when we understood that the incident was behind us: cheers all around, hugs and pats on the shoulder with the VP of R&D and division chiefs (who were all huddled in the war room together with the engineers, standing shoulder to shoulder and trying to assist in resolving the incident). This was a joint effort, and we braved through it like a well-oiled machine.

We live, We learn, We move on

The last piece of this puzzle, and also the most important aspect of the Incident Manager's job, was the post-mortem, where we derived the needed action items from this incident.

A meeting was held the day after, where we went over the incident's timeline and derived the action items that came out of it.

Since that incident, we have had additional incidents, but as we grew and matured, so did our incident management process. We now have extensive business monitors that allow us to proactively catch 90% of potential business-affecting issues. We have reduced our incident resolution time by 50%, and our employees are all the more engaged in resolving these issues the minute they are triggered.
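To give a feel for what a business monitor means in practice, here is a deliberately simplified sketch (the metric, threshold and numbers are made up): compare a business-level signal, such as the share of failed orders, against a threshold and raise an alert before customers start calling.

```c
#include <stdio.h>

/* Example threshold: alert when more than 5% of orders fail in a window. */
#define FAILURE_RATE_THRESHOLD 0.05

static void check_order_failure_rate(int failed_orders, int total_orders)
{
    if (total_orders == 0) {
        return; /* nothing to measure yet */
    }
    double failure_rate = (double)failed_orders / (double)total_orders;
    if (failure_rate > FAILURE_RATE_THRESHOLD) {
        printf("ALERT: order failure rate %.1f%% exceeds %.1f%%, paging on-call\n",
               failure_rate * 100.0, FAILURE_RATE_THRESHOLD * 100.0);
    }
}

int main(void)
{
    check_order_failure_rate(2, 1000);   /* 0.2%  -> stays quiet */
    check_order_failure_rate(120, 1000); /* 12.0% -> raises an alert */
    return 0;
}
```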

So, until we meet again, this is your friendly Gett Incident Management team signing off…

Thanks to Shai Mishali.
