Incident management is a crucial part of any service-providing company.
It’s the process by which a critical service disruption is managed from start to finish, taking the following into account:
- In-time problem detection.
- Alerting the correct service owners of the problem.
- Coordinating the repair efforts.
- Alerting stakeholders (of all levels) of the problem and its effects on the company’s business and clients.
- Constantly updating upper management of the progress, keeping work transparent.
- And last, but not least, a Root Cause Analysis of the problem, covering what caused the issue and what can be done to prevent it from recurring.
In this post you’ll learn about the Incident Management process at Gett through its different stages, as well as our goals in this area.
In-time problem detection (Proactive vs. Reactive)
Our goal is to know about any major incident from stage 0, i.e. to detect the problem at its very beginning and not when it’s reported by a customer. For this purpose we use several APM (Application Performance Monitoring) tools:
- We use NewRelic to monitor our microservices and different gateways.
- Grafana is used to monitor flow-specific events that exceed a pre-configured threshold.
- And finally, we use DataDog to monitor the hardware components of our system (databases, load balancers, etc.).
Additionally, each APM tool is linked to a number of notification channels (Slack, PagerDuty, email, etc.), so each member of our incident team is notified of any anomalous event.
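To make the threshold-plus-notification idea concrete, here is a minimal sketch in Python. It is not how NewRelic, Grafana or DataDog are actually configured; the `Monitor` class, the `evaluate` function, the metric name and the threshold value are all hypothetical and only illustrate the pattern of fanning one alert out to several channels.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Monitor:
    """Hypothetical monitor: a named metric, a threshold and linked channels."""
    name: str
    threshold: float
    channels: List[str] = field(default_factory=list)


def evaluate(monitor: Monitor, value: float,
             notify: Callable[[str, str], None]) -> bool:
    """Notify every linked channel if the value exceeds the threshold.

    Returns True when an alert was sent, False otherwise.
    """
    if value <= monitor.threshold:
        return False
    message = f"{monitor.name}: {value} exceeded threshold {monitor.threshold}"
    for channel in monitor.channels:
        notify(channel, message)
    return True


# Usage: collect notifications in a list instead of calling real services.
sent: List[Tuple[str, str]] = []
monitor = Monitor("failed_orders_per_minute", threshold=50,
                  channels=["slack", "pagerduty", "email"])
alerted = evaluate(monitor, 72, lambda channel, msg: sent.append((channel, msg)))
```

In a real setup the `notify` callback would be replaced by each tool’s own integration; passing it in as a parameter simply keeps the sketch testable.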
“Our goal is to reach a quarterly proactive coverage rate of 90% of possible major incidents, and we’re almost there.”
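As a rough illustration of how such a rate could be tracked, the sketch below assumes it is the share of a quarter’s major incidents that were first detected by monitoring rather than reported by a customer. That definition, the function name and the sample numbers are assumptions made for the example, not Gett’s actual formula or data.

```python
def proactive_coverage(detected_by_monitoring: int, total_major_incidents: int) -> float:
    """Fraction of major incidents first caught by monitoring (assumed definition)."""
    if total_major_incidents <= 0:
        raise ValueError("total_major_incidents must be positive")
    if not 0 <= detected_by_monitoring <= total_major_incidents:
        raise ValueError("detected_by_monitoring must be between 0 and the total")
    return detected_by_monitoring / total_major_incidents


# Made-up quarter: 18 of 20 major incidents were caught by monitoring first.
rate = proactive_coverage(18, 20)
meets_goal = rate >= 0.90
```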
Alerting the correct service owner of the problem
We use a microservices strategy at Gett, and we’re talking about a lot of services. No single person can remember all the service owners, definitely not when woken up in the middle of the night by an alert 😉.
For this purpose, we’ve created an internal catalog of all service owners in the company, along with their backups for emergencies.
This alone reduced incident resolution time by 20%.
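A catalog like this can be as simple as a lookup from service name to a primary owner and an ordered list of backups. The sketch below is an assumption about its shape, not our actual implementation; the service names, people and the `who_to_page` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class ServiceOwner:
    """Hypothetical catalog entry: one primary owner plus ordered backups."""
    primary: str
    backups: Tuple[str, ...] = ()


# Illustrative entries only.
CATALOG: Dict[str, ServiceOwner] = {
    "payments-gateway": ServiceOwner("alice", ("bob", "carol")),
    "driver-matching": ServiceOwner("dave", ("erin",)),
}


def who_to_page(service: str,
                unavailable: FrozenSet[str] = frozenset()) -> Optional[str]:
    """Return the first reachable owner for a service, or None if there is none."""
    owner = CATALOG.get(service)
    if owner is None:
        return None  # unknown service: escalate manually
    for person in (owner.primary, *owner.backups):
        if person not in unavailable:
            return person
    return None
```

For example, `who_to_page("payments-gateway", frozenset({"alice"}))` falls through to the first backup, which is the whole point of keeping backups next to the owner.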
Coordinating the repair efforts
An incident with no designated person leading the efforts is a recipe for an ongoing disaster. Gett employs a Global Technical Support Team, each member of which is trained in managing an incident from start to finish. The team is dispersed across our regions of operation and runs 24/7 on-call shifts to cover all unanticipated incidents.
Additionally, the Global Incident Manager acts as an escalation point for major incidents, is accountable to upper management for ongoing updates and provides a final Root Cause Analysis report for each major incident.
The incident management process comprises:
- Reaching the correct service owner per issue.
- Coordinating the fix efforts.
- Making the tough decisions during the incident.
- Updating executive levels in a “what they need to know, when they need to know it” manner.
Once the incident is resolved and the system is restored to normal operation, the RCA (Root Cause Analysis) process begins:
- Initial RCA: A high-level description of the problem’s cause (DB issue, infrastructure issue, code issue, etc.), provided by the technical support agent who managed the incident. It is to be completed within 24 hours of incident resolution.
- R&D RCA: This is the most important part of the RCA process and is completed up to 48 hours after incident resolution (we also have Jira widgets in place to alert the relevant Team Lead if it wasn’t filled in on time). It includes a full technical root cause analysis of the issue: why it happened, what was done to fix it, why it wasn’t detected (missed by QA, no relevant monitor, etc.), how likely it is to recur and what can be done to prevent it from happening again.
- Action items: Once the preventative measures for this incident are determined, they are created as sub-tasks directly from the incident, and the incident is left open until all sub-tasks are completed. This motivates the relevant R&D team to complete the related work, since an open incident “dirties up” their Kanban board.
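The deadlines and the “stays open until all sub-tasks are done” rule can be sketched in a few lines of Python. This is only an illustration: the `Incident` class and its field names are hypothetical, it does not use the Jira API, and requiring both RCAs to be complete before closing is an assumption added for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

INITIAL_RCA_WINDOW = timedelta(hours=24)  # from the Initial RCA rule
RND_RCA_WINDOW = timedelta(hours=48)      # from the R&D RCA rule


@dataclass
class Incident:
    """Hypothetical post-incident record."""
    resolved_at: datetime
    initial_rca_done: bool = False
    rnd_rca_done: bool = False
    # One flag per action-item sub-task; True means completed.
    action_items_done: List[bool] = field(default_factory=list)

    def overdue(self, now: datetime) -> List[str]:
        """List the RCA stages that have missed their deadline."""
        late = []
        if not self.initial_rca_done and now > self.resolved_at + INITIAL_RCA_WINDOW:
            late.append("initial_rca")
        if not self.rnd_rca_done and now > self.resolved_at + RND_RCA_WINDOW:
            late.append("rnd_rca")
        return late

    def can_close(self) -> bool:
        """Closable only when both RCAs (assumed) and every sub-task are done."""
        return (self.initial_rca_done and self.rnd_rca_done
                and all(self.action_items_done))


resolved = datetime(2024, 1, 1, 12, 0)
incident = Incident(resolved, initial_rca_done=True, action_items_done=[True, False])
late_at_49h = incident.overdue(resolved + timedelta(hours=49))
```

Here the R&D RCA would be flagged as overdue at the 49-hour mark, and the incident cannot be closed while one sub-task is still open.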
“It is the Global Incident Manager’s responsibility to make sure this process is enforced and completed after every incident.”
So now you know how we deal with incident management here at Gett! What is the incident management procedure in your company? We’d be happy to hear about it, get your feedback and learn from each other.