Here at Gett, we classify production issues into four severity levels:
- 🔨 Trivial: A small issue that affects the occasional user in a remote corner of the application
- ⚠️ Medium: A more inconvenient issue, but not one that disrupts the business
- ❗️ Critical: An issue that could affect the business if not treated within the next few days
- 🔥 Fire: A “Drop everything you’re doing and fix this issue now” type of situation. It has a critical effect on the business, and users are currently experiencing severe degradation while using the service
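For readers who like to see this in code, the four levels could be modeled as an ordered enumeration. This is only an illustrative sketch: the names `Severity` and `requires_immediate_response` are hypothetical and are not taken from Gett's tooling.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical encoding of the four levels; higher means more urgent."""
    TRIVIAL = 1   # occasional user, remote corner of the application
    MEDIUM = 2    # inconvenient, but not business disrupting
    CRITICAL = 3  # business affecting if untreated for a few days
    FIRE = 4      # drop everything and fix it now


def requires_immediate_response(severity: Severity) -> bool:
    """Only a Fire incident interrupts everyone's current work."""
    return severity == Severity.FIRE
```

Using an ordered enum means levels can be compared directly, for example `Severity.FIRE > Severity.CRITICAL`.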
In this blog post, I’ll review one such “Fire” incident: how we analyzed it, what methodology we used to manage it, how it was resolved and what we learned from it.
It was a day like any other day…
It started as a normal day. Office spirits were high; it was mid-August, and summer vacation was on, with great weather all around. We were just about to go to lunch when the first call came from our Customer Care department: a customer had called to complain that he couldn’t order a taxi because the app told him he had an outstanding balance on his credit card.
As our internal monitors were not showing any signs of an anomaly, our first thought was that this was a temporary timeout between third-party components, the kind that can momentarily cause such glitches in service. Then, as the second, third and fourth calls came in, we realized something was up.
Initial investigation begins
So we started gathering facts and building a timeline for the event as it unfolded:
- 🕐 When did it start?
- 🔍 Is there a common denominator among the complaints?
- 🌏 Is this a global issue or narrowed down to one region?
- 🤔 And most important: what is the cause?
This is where the importance of the Incident Manager position comes into full effect. Our on-call incident manager took command of the incident and started executing our “Fire” protocol:
- Issue a notification to all critical communication channels. Gett operates in 3 global regions, and this turned out to be a global issue.
- Issue a notification to upper management updating them on the ongoing issue. In situations like this, the most crucial thing for a regional manager is to be in the loop and to have answers ready if external parties ask for them.
- Assemble the correct engineers in one place. A direct result of the service mapping that was done during the past year was an immediate alert to the relevant engineering team(s). An impromptu “War Room” was set up where all teams huddled and worked diligently to resolve the issue as soon as possible.
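The three steps above can be sketched roughly as follows. Everything here is an assumption made for illustration: the `Incident` class, the `SERVICE_OWNERS` mapping and the team names are hypothetical, and the "notifications" are simply recorded in a list rather than sent anywhere.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Incident:
    summary: str
    affected_services: List[str]
    log: List[str] = field(default_factory=list)  # actions taken, in order


# Hypothetical service-to-team mapping (the "service mapping" in the text).
SERVICE_OWNERS: Dict[str, str] = {
    "payments": "billing-team",
    "ordering": "rides-team",
}


def run_fire_protocol(incident: Incident,
                      service_owners: Dict[str, str] = SERVICE_OWNERS) -> List[str]:
    """Record the protocol steps and return the teams to assemble."""
    # Step 1: notify all critical communication channels.
    incident.log.append(f"notify channels: {incident.summary}")
    # Step 2: keep upper management in the loop.
    incident.log.append(f"notify management: {incident.summary}")
    # Step 3: use the service mapping to find the relevant teams.
    teams = sorted({service_owners[s]
                    for s in incident.affected_services
                    if s in service_owners})
    for team in teams:
        incident.log.append(f"page team: {team}")
    return teams
```

Services missing from the mapping are skipped in this sketch; a real process would more likely escalate them to the incident manager.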
Root Cause ascertained
It was a nerve-racking hour or so until the root cause was discovered. As in so many incidents, a tiny semicolon was placed in the wrong location (and really, show me one developer who has never missed a semicolon in production…), causing a chain reaction that created this specific issue.
The fix was multifaceted:
- Fixing the code and deploying the fix.
- Fixing the parts of the database that were damaged by this problem.
- Dealing with the aftereffects of this incident on our customers.
The first part was easy, while the second part required knowing exactly when the original bug was introduced to the system. Thankfully, every deploy is logged with full details, so it didn’t take too long to find the correct time, recover the data from that point in time and restore the system to an operational state. The third part involved our Customer Care department, which diligently went from customer to customer and verified that everything was now working as it should.
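As a rough illustration of why a deploy log helps here, the sketch below picks the most recent deploy that happened at or before the first reported symptom. The function name, the log format and the sample data are hypothetical, and the result is only a candidate: the deploy that introduced a bug may well be an earlier one.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

Deploy = Tuple[datetime, str]  # (deploy time, version label)


def last_deploy_before(deploys: List[Deploy],
                       first_symptom: datetime) -> Optional[Deploy]:
    """Return the latest deploy at or before first_symptom, if any.

    Assumes `deploys` is sorted by deploy time, oldest first.
    """
    times = [when for when, _ in deploys]
    index = bisect_right(times, first_symptom)
    return deploys[index - 1] if index else None


# Made-up sample data, for illustration only.
deploy_log = [
    (datetime(2020, 1, 1, 9, 0), "v1"),
    (datetime(2020, 1, 1, 11, 30), "v2"),
    (datetime(2020, 1, 1, 15, 0), "v3"),
]
suspect = last_deploy_before(deploy_log, datetime(2020, 1, 1, 12, 45))
```

With the sample log above, `suspect` is the `"v2"` deploy, which gives a starting point for choosing the recovery time.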
This was a defining moment in Gett’s incident handling methodology, as it was an incident that affected many of our customers. When we understood that the incident was behind us, it felt like a successful space shuttle launch: cheers all around, hugs and claps on shoulders with the VP R&D and division chiefs (who had all been huddled in the war room together with the engineers, standing shoulder to shoulder and trying to help resolve the incident). This was a joint effort, and we got through it like a well-oiled machine.
We live, We learn, We move on
The last piece of this puzzle, and also the most important aspect of the Incident Manager’s job, was the “post mortem” portion, where we derive the needed action items from this incident.
We held a meeting the following day, where we:
- Reviewed the chain of events.
- Understood what was missing from our proactive monitors.
- Implemented these monitors.
- Issued a company-wide root cause analysis for this incident.
- Added the needed internal safeguards to prevent this from happening again.
Since then, we have had additional incidents, but as we grew and matured, so did our incident management process. We now have extensive business monitors that allow us to proactively catch 90% of possible business-affecting issues. We have reduced our incident resolution time by 50%, and our employees are all the more engaged in resolving these issues the minute they are triggered.
So, until we meet again, this is your friendly Gett Incident Management team signing off…