Published in Gett Tech

A “Fire” Incident case study

Here at Gett, we classify issues in the production environment by several degrees of severity:

  • 🔨 Trivial: A small issue that affects the occasional user in a remote aspect of the application

In this blog post, I’ll review one such “Fire” incident: how we analyzed it, what methodology we used to manage it, how it was resolved and what we learned from it.

It was a day like any other day…

It started as a normal day. Office spirits were high; it was mid-August, and summer vacation was on with great weather all around. We were just about to go to lunch when the first call came from our Customer Care department: a customer had called to complain that he couldn’t order a taxi because the app told him he had an outstanding balance on his credit card.

As our internal monitors were not showing any signs of an anomaly, our first thought was that this was a temporary timeout between third-party components, the kind that can momentarily cause such glitches in service. Then, as the second, third and fourth calls came in, we realized something was up.

Initial investigation begins

So we started gathering facts and creating a timeline of the event as it unfolded:

  • 🕐 When did it start?

This is where the importance of the Incident Manager position comes into full effect. Our on-call incident manager took command of the incident and started executing our “Fire” protocol:

  • Issue a notification to all critical communication channels. Gett operates in 3 global regions and this turned out to be a global issue

Root Cause ascertained

It was a nerve-racking hour or so until the root cause was discovered. As with many incidents, it came down to something tiny: a semicolon placed in the wrong location (and really, show me one developer who has never misplaced a semicolon in production…), causing a chain reaction that created this specific issue.

The fix was multifaceted:

  • Fixing the code and deploying the fix.

The first part was easy, while the second part required knowing exactly when the original bug was introduced to the system. Thankfully, every deploy is logged with full details, so it didn’t take too long to find the correct time, recover the data from that point in time and restore the system to an operational state. The third part involved our Customer Care department, which diligently went from customer to customer and verified that everything was now working as it should.
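To illustrate the "find the correct time" step, here is a minimal sketch of looking up the last deploy that went out before the first customer report. It is not Gett's actual tooling: the `Deploy` record, the `last_deploy_before` helper, the service names, versions and timestamps are all hypothetical, and a real deploy log would carry far more detail.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Deploy:
    """One entry in a (hypothetical) deploy log."""
    service: str
    version: str
    deployed_at: datetime


def last_deploy_before(deploys: Iterable[Deploy],
                       first_symptom_at: datetime) -> Optional[Deploy]:
    """Return the most recent deploy at or before the first reported symptom.

    That deploy is the first suspect, and its timestamp is a candidate
    recovery point. Returns None if nothing was deployed before the symptom.
    """
    candidates = [d for d in deploys if d.deployed_at <= first_symptom_at]
    return max(candidates, key=lambda d: d.deployed_at, default=None)


# Illustrative data only; none of these values come from the real incident.
deploy_log = [
    Deploy("billing", "1.4.0", datetime(2020, 8, 10, 9, 0)),
    Deploy("billing", "1.4.1", datetime(2020, 8, 12, 10, 30)),
    Deploy("billing", "1.4.2", datetime(2020, 8, 12, 15, 0)),
]
first_customer_call = datetime(2020, 8, 12, 12, 5)

suspect = last_deploy_before(deploy_log, first_customer_call)
if suspect is not None:
    print(f"Suspect deploy: {suspect.service} {suspect.version} "
          f"at {suspect.deployed_at:%Y-%m-%d %H:%M}")
```

The latest deploy before the first symptom is only a starting point: a bug can lie dormant for a while, so earlier deploys may need the same scrutiny before choosing a recovery point.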

This was a defining moment in Gett’s incident handling methodology, as it was an incident that affected many of our customers. When we understood that the incident was behind us, it felt like a successful space shuttle launch: cheers all around, hugs and claps on shoulders with the VP R&D and division chiefs (who were all huddled in the war room together with the engineers, standing shoulder to shoulder and trying to assist in resolving the incident). This was a joint effort and we got through it like a well-oiled machine.

We live, We learn, We move on

The last piece of this puzzle, and also the most important aspect of the Incident Manager’s job, was the “post mortem” portion, where we derive the needed action items from this incident.

We held a meeting the following day, where we:

  • Reviewed the chain of events.

Since that incident, we have had additional incidents, but as we grew and matured, so did our incident management process. We now have extensive business monitors which allow us to proactively catch 90% of possible business-affecting issues. We have reduced our incident resolution time by 50%, and our employees are all the more engaged in resolving these issues the minute they are triggered.
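The post does not describe how these business monitors are built, so the following is only a sketch of the general idea: watch a business outcome (here, whether an order attempt succeeded) rather than only technical signals, and alert when the failure rate over a recent window crosses a threshold. The `OrderFailureMonitor` class, its parameters and the threshold values are all assumptions made for illustration.

```python
from collections import deque


class OrderFailureMonitor:
    """Sliding-window monitor over order outcomes (hypothetical example)."""

    def __init__(self, window_size: int = 100,
                 max_failure_rate: float = 0.2,
                 min_samples: int = 20) -> None:
        # Only the most recent `window_size` outcomes are kept.
        self.outcomes = deque(maxlen=window_size)
        self.max_failure_rate = max_failure_rate
        # Avoid alerting on a handful of orders during quiet hours.
        self.min_samples = min_samples

    def failure_rate(self) -> float:
        """Fraction of failed orders in the current window."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self) -> bool:
        """True when there are enough samples and too many failures."""
        return (len(self.outcomes) >= self.min_samples
                and self.failure_rate() > self.max_failure_rate)

    def record(self, succeeded: bool) -> bool:
        """Record one order outcome and report whether to alert."""
        self.outcomes.append(succeeded)
        return self.should_alert()


# Small illustrative run with a tiny window.
monitor = OrderFailureMonitor(window_size=10, max_failure_rate=0.2, min_samples=5)
for succeeded in [True, True, True, True, False, False]:
    if monitor.record(succeeded):
        print(f"ALERT: order failure rate is {monitor.failure_rate():.0%}")
```

A monitor of this kind could have flagged an incident like the one above from the rising rate of rejected orders, even while infrastructure-level monitors showed no anomaly.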

So, until we meet again, this is your friendly Gett Incident Management team signing off…



Lior Avni

Global technical support & Incident manager at Gett. Working with customers for the better part of 20 years and enjoying every minute of it :-)