Reducing Incident Time with Incident Management Tools
At Gett we strive to resolve our “fires” (high severity interruptions to our platform) as quickly as possible. In 2022, 50% of our fires were resolved in under 90 minutes. A respectable statline, but with the majority of incidents taking longer than an hour to resolve, there was plenty of room to improve. “Time is money”, as they say.
However, it was not simply resolution time that we sought to shorten; we were also concerned with the service that we, the Tech Support team, provided to external stakeholders during times of disruption. By external stakeholders I mean anybody who sits outside of the R&D and Product organisation and who uses the Gett platform and its underlying services. This covers both internal teams (Finance, Sales, Customer Care etc.) and external customers (from the man on the street to large enterprises).
To give an idea of one such area for improvement, we regularly received feedback that communication during fires was sporadic, or that the language used was too technical to be of use to customer-facing units. In addition, post-incident analysis was manual and time-consuming: we had to search through Slack threads or hold 1:1 conversations with team members to piece together a picture of who did what, when and how.
Therefore, we began to consider how we could improve our incident management processes. Our objectives were:
- To improve communications for all involved.
- To reduce our resolution time.
- To improve our ability to analyse and learn from Fires.
- To streamline Incident Management by removing manual efforts.
It was decided that a dedicated incident management tool would allow us to achieve these goals. The search for such a tool led us to Exigence, an incident management platform which offered many features, including:
- Centralising communications
- Easy management of on-call schedules and escalation plans
- Visualisation of the “War Room” with timelines, current members etc.
- Automation of manual tasks
- Insights via a reporting suite
In this article, I will describe how, once we had migrated to the platform, we were able to leverage these features to our advantage and progress towards our objectives.
Improving Communications
As I touched on earlier, communications during fires can be a pain point for stakeholders. Clear and regular updates during an incident are important for obvious reasons and also play a vital part in building trust across business units.
As a team, we used a combination of Slack, WhatsApp and email to send communications. Each channel reaches a different audience, and each audience expects a different level of detail. However, when you have to keep one foot in the war room whilst diving into logs etc., then with the best will in the world it is easy to miss publishing these regular updates or to forget to adapt the level of technical language for each audience.
The new platform gave us the ability to send the same update across both Slack and email to predefined distribution lists (which we defined within Exigence and configured alongside rules that determine the scenarios in which the different recipients are engaged). The result is that the incident manager can send out updates to several destinations with a literal single click, and no longer has to copy and paste over and over again. In addition, I created some templates to cover the majority of scenarios and the key stages of an incident. In this way, we could better update non-technical bodies, and engineers did not have to spend time carefully wording their updates.
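To make the idea of rules plus templates concrete, here is a minimal Python sketch of how one update could be fanned out to several destinations. It is not Exigence's API or configuration format: the `Rule` class, the severity labels, the destinations and the template wording are all hypothetical, chosen only to illustrate the pattern.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class Rule:
    """Hypothetical routing rule: which destination is engaged for which severities."""
    channel: str            # e.g. "slack" or "email"
    destination: str        # channel name or distribution list (invented examples)
    severities: frozenset   # severity labels for which this rule applies


# Invented rules: everyone in Slack hears about high and medium incidents,
# the customer-facing mailing list is only engaged for high severity.
RULES = [
    Rule("slack", "#incidents", frozenset({"high", "medium"})),
    Rule("email", "customer-care@example.com", frozenset({"high"})),
]

# Invented templates, one per key stage of an incident, in non-technical language.
TEMPLATES = {
    "investigating": Template(
        "We are investigating an issue affecting $service. "
        "Next update in $minutes minutes."
    ),
    "resolved": Template("The issue affecting $service has been resolved."),
}


def build_updates(stage, severity, **fields):
    """Render the template once and pair it with every destination whose rule matches."""
    message = TEMPLATES[stage].substitute(**fields)
    return [
        (rule.channel, rule.destination, message)
        for rule in RULES
        if severity in rule.severities
    ]


updates = build_updates("investigating", "high", service="SMS", minutes=30)
for channel, destination, message in updates:
    print(f"[{channel}] {destination}: {message}")
```

The point of the design is that the message is written once and the rules, not the incident manager, decide who receives it.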
Sharing The Knowledge
When looking at reducing our Time To Resolution (TTR), it will be no surprise to hear that the vast majority of time during an incident is spent on investigation: in 2022, we spent over 6,000 minutes getting to understand high severity issues.
We refer to this period as Time To Understand (TTU), which we define as the time from when we are first notified of a problem to when we find its underlying cause.
Whilst an alert from a monitor can give an indication of the underlying cause and therefore reduce TTU, what can be done to reduce it further? Especially in cases where there is no alert to offer a starting point… What if we had a way to retrieve historic incidents, with some helpful pointers as to what was wrong and how it was fixed?
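As a small illustration of that definition, the sketch below computes TTU per incident and sums it across incidents. The `Incident` class, its field names and the timestamps are hypothetical; they simply stand in for whatever your incident records contain.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    """Hypothetical incident record holding the two timestamps TTU needs."""
    notified_at: datetime      # when we were first notified of the problem
    cause_found_at: datetime   # when the underlying cause was identified

    def ttu_minutes(self) -> float:
        """Time To Understand: notification until the cause is found, in minutes."""
        return (self.cause_found_at - self.notified_at).total_seconds() / 60


def total_ttu_minutes(incidents) -> float:
    """Sum TTU across incidents, e.g. to get a yearly investigation total."""
    return sum(incident.ttu_minutes() for incident in incidents)


# Invented example data, not real Gett incidents.
incidents = [
    Incident(datetime(2022, 3, 1, 10, 0), datetime(2022, 3, 1, 10, 45)),
    Incident(datetime(2022, 6, 9, 23, 30), datetime(2022, 6, 10, 1, 0)),
]
print(total_ttu_minutes(incidents))
```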
One feature offered by Exigence is the ability to create a library of past incidents, detailing their causes and solutions, which can then be suggested to a user based on similarities between the historical and the current incident. For example, if users report an issue with opening the app, Exigence can present the user with a historical case which contains the phrase “cannot open the app” and indicate the root cause and how the issue was solved last time. Whilst it isn’t a silver bullet, there is a real benefit in having such a library of past experiences to draw from: it gives a clear starting point for investigation, as well as a list of potential fixes that have worked before.
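I do not know how Exigence scores similarity internally, so the following is only a sketch of the general idea using a simple stand-in technique: word overlap (Jaccard similarity) between the current report and the summaries of past incidents. The `PastIncident` class, the threshold value and the library entries are all invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class PastIncident:
    """Hypothetical library entry: what was reported, why it happened, how it was fixed."""
    summary: str
    root_cause: str
    fix: str


def tokens(text):
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Shared words divided by all distinct words across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest(report, library, threshold=0.3):
    """Return past incidents whose summary overlaps the report, best match first."""
    query = tokens(report)
    scored = [(jaccard(query, tokens(past.summary)), past) for past in library]
    matches = [(score, past) for score, past in scored if score >= threshold]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [past for _, past in matches]


# Invented library entries; causes and fixes are placeholders.
library = [
    PastIncident("Users cannot open the app", "placeholder cause A", "placeholder fix A"),
    PastIncident("SMS messages are delayed", "placeholder cause B", "placeholder fix B"),
]

for past in suggest("Customers report they cannot open the app", library):
    print(past.summary, "|", past.root_cause, "|", past.fix)
```

Word overlap is crude (it ignores synonyms and word order), but it shows why a well-written summary on each historical incident makes the library more useful.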
In a similar vein, we asked engineering teams to configure dynamic “To do” lists which are triggered when certain conditions are met during an incident. For instance, if we receive an alert from our SMS service, we automatically open a task for the on-call team member to switch the service to a different SMS gateway, along with documented steps to follow. In this way, a member of the team who has not dealt with this particular issue before can simply follow the instructions without having to spend precious moments understanding where the problem is, asking around the team for advice, and so on. Coincidentally, this feature complements our proactive monitoring cycle, which requires each new feature to be deployed to production with a runbook to follow if and when it becomes non-performant.
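The trigger-and-task pattern can be sketched in a few lines of Python. Again, this is not how Exigence is configured: the `Trigger` and `Task` classes, the shape of the alert dictionary and the step text are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """Hypothetical to-do item opened for the on-call team member."""
    title: str
    steps: list


@dataclass
class Trigger:
    """Hypothetical rule: when the condition holds for an alert, open the task."""
    condition: Callable[[dict], bool]
    task: Task


# Invented configuration mirroring the SMS example; the steps are placeholders
# standing in for a team's real documented runbook.
TRIGGERS = [
    Trigger(
        condition=lambda alert: alert.get("service") == "sms",
        task=Task(
            title="Switch the SMS service to a different gateway",
            steps=[
                "Placeholder step 1: confirm the current gateway is failing",
                "Placeholder step 2: switch to the backup gateway",
                "Placeholder step 3: verify that messages are delivered",
            ],
        ),
    ),
]


def tasks_for(alert):
    """Return every task whose trigger condition matches the incoming alert."""
    return [trigger.task for trigger in TRIGGERS if trigger.condition(alert)]


for task in tasks_for({"service": "sms", "severity": "high"}):
    print(task.title)
    for step in task.steps:
        print("  -", step)
```

Keeping the condition and the documented steps together means the knowledge lives with the trigger rather than in one engineer's head.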
Final thoughts
Looking back at our objectives I would like to think that we have made extremely positive steps towards reaching them:
- We have streamlined communications, removing manual work for Tier 1.
- We reduce time to understand through the use of a library of historical fires, their causes and solutions.
- We reduce time to resolve and the headcount involved in a Fire (reducing cost) via automated runbooks, giving issue-specific steps to restore normalcy.
Implementing these changes is an ongoing process of adjustments, analysis and further tweaks. I am confident that in time we will see a definite improvement in time to resolve incidents and I will share my findings in a future article… to be continued!