Streamlining and Implementing Incident Management at Dyninno

Published in Dyninno, Feb 9, 2024

Vladimirs Romanovskis, Incident Management Team Lead, Dyninno Group

Building on the groundwork laid out in our first article about the initial steps in Incident Management (IM) at Dyninno Group, this second installment will explore the practicalities of streamlining and implementing these strategies. Understanding the nuances of this transformation is critical, as it not only enhances our operational efficiency but also shapes the very fabric of our organizational culture.

Specific Protocols for Different Scenarios

We’ve developed a clear incident management structure, providing team members with specific protocols for different scenarios. This contrasts with the previous approach where they managed incidents independently.

Incident reports can originate from various sources, such as end-user reports or monitoring alerts. We distinguish incidents from other issues not only by impact scope, but also by other criteria, such as:

Impacted environment;
ASAP change requirements;
Suspicious activity within communication hubs, such as one team asking another to check something ASAP, or a new issue being discussed inside a past incident's chat. Such signals prompt Incident Managers to investigate further and, if an incident is confirmed, kick-start the recovery and resolution process.
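The criteria above can be sketched as a simple triage check. This is only an illustration: the `Report` fields, the `impact_threshold` default and the treatment of "production" as the sensitive environment are assumptions made for the example, not a description of our internal tooling.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """A hypothetical incoming report; every field name is illustrative."""
    source: str                         # e.g. "end_user" or "monitoring_alert"
    environment: str                    # e.g. "production" or "staging"
    users_affected: int                 # rough measure of impact scope
    needs_asap_change: bool             # an urgent change is being requested
    raised_in_past_incident_chat: bool  # new issue discussed in an old incident chat


def triage_signals(report: Report, impact_threshold: int = 10) -> list[str]:
    """Return the reasons a report deserves an Incident Manager's attention."""
    signals = []
    if report.users_affected >= impact_threshold:  # threshold is an assumed value
        signals.append("impact scope")
    if report.environment == "production":         # assumed sensitive environment
        signals.append("impacted environment")
    if report.needs_asap_change:
        signals.append("ASAP change requirement")
    if report.raised_in_past_incident_chat:
        signals.append("suspicious activity in communication hub")
    return signals


report = Report("end_user", "production", 3, True, False)
print(triage_signals(report))
```

An empty list means no criterion fired. Any non-empty list is a prompt for a person to investigate, not an automatic incident confirmation.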

Confirmation. Recovery. Resolution.

When an incident is identified, the team assesses and verifies the issue through issue replication, workflow checks, related monitoring tool alerts and log analysis.

We immediately address it and involve the responsible teams to begin working towards full recovery, providing regular status updates to stakeholders via various announcement channels and collaborating with teams of experts.

Major incidents prompt cross-team collaboration to gauge the impact and coordinate recovery efforts, which includes coming up with an action plan for service restoration. Following successful recovery, confirmed by experts, other stakeholders (reporters) and monitoring tools, the incident is resolved and marked for future reference. Sometimes confirming the resolution requires additional time, so the Incident Management team moves the incident into a monitoring status for an agreed period. During this time, status updates are less frequent until the incident is fully confirmed as resolved.

Subsequently, we engage in the problem management process through regular reviews of past incidents to ensure thorough documentation and to plan preventive measures. Quality assurance is overseen by incident managers and reviewed by team leads for completeness. We track the preventive tasks, and once they are implemented, the incident is permanently closed.

We recognize, however, that not all incidents require extensive preventive actions, especially if they are infrequent, minimally disruptive or very unlikely to happen again. In such cases, it's acceptable to close the incident without implementing additional measures.
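The lifecycle described in this section can be pictured as a small state machine. The sketch below is one possible reading of it: the status names and the `ALLOWED` transition table are assumptions chosen for illustration, not the exact statuses used in our issue tracker.

```python
from enum import Enum


class Status(Enum):
    REPORTED = "reported"
    CONFIRMED = "confirmed"
    RECOVERY = "recovery"
    MONITORING = "monitoring"  # optional stage while resolution is double-checked
    RESOLVED = "resolved"
    CLOSED = "closed"          # after review, with or without preventive tasks


# Assumed transition table; each status maps to the statuses it may move to.
ALLOWED = {
    Status.REPORTED: {Status.CONFIRMED},
    Status.CONFIRMED: {Status.RECOVERY},
    Status.RECOVERY: {Status.MONITORING, Status.RESOLVED},
    Status.MONITORING: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED},
    Status.CLOSED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Move an incident to the target status, rejecting skipped steps."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


status = Status.REPORTED
for step in (Status.CONFIRMED, Status.RECOVERY, Status.MONITORING,
             Status.RESOLVED, Status.CLOSED):
    status = advance(status, step)
print(status.value)
```

Making the transitions explicit means an incident cannot jump from reported to closed without passing through confirmation and recovery.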

Unified Approach to Incident Awareness

We introduced a dedicated status page to display ongoing incidents, leveraging our internal issue tracking system data for automatic updates. It enhances transparency across the company. The status page and the incident notification alerts in our internal instant messaging platform keep stakeholders informed about system statuses, which is vital for keeping everyone on the same page and fostering a better understanding of the importance of IM and its processes.

Communication templates were established to convey incident information in business terms, making it accessible to all employees, not just the techies. Automation was introduced to aid the incident managers, simplifying tasks like ticket formatting and providing a single point for information.
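A communication template of this kind can be as simple as a string with named placeholders. The wording, the field names and the sample service below are hypothetical and only show the idea of describing an incident in business terms.

```python
from string import Template

# Hypothetical announcement wording; the real templates are not shown here.
ANNOUNCEMENT = Template(
    "[$status] $service: $business_impact. "
    "Next update in $next_update_min minutes."
)


def format_announcement(status: str, service: str,
                        business_impact: str, next_update_min: int) -> str:
    """Fill the template so every announcement has the same shape."""
    return ANNOUNCEMENT.substitute(
        status=status.upper(),
        service=service,
        business_impact=business_impact,
        next_update_min=next_update_min,
    )


message = format_announcement(
    "recovery", "Payments service", "customers cannot complete payments", 30
)
print(message)
```

Because `substitute` raises an error when a placeholder is missing, an incomplete announcement fails early instead of reaching stakeholders with gaps.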

Unifying Monitoring and Alerting Systems

In the second stage, the focus was on implementing a custom monitoring system tailored to the health status of our business applications. This system gathers data from these applications and allows for setting alert conditions based on various criteria, such as threshold values and alert duration. Additionally, it integrates with our existing systems, enhancing the overall workflow without imposing extra burdens on the Incident Managers.
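To illustrate how a threshold and a duration combine into one alert condition, here is a minimal sketch. The `AlertRule` class, the `should_alert` function and the sample error-rate values are assumptions for the example and do not reflect the actual implementation of our monitoring system.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    threshold: float   # a sample above this value counts as a breach
    duration_s: float  # how long the breach must persist before alerting


def should_alert(samples: list[tuple[float, float]], rule: AlertRule) -> bool:
    """Check (timestamp_seconds, value) samples given in ascending time order.

    Returns True if the value stayed above the threshold continuously
    for at least rule.duration_s at any point in the series.
    """
    breach_start = None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= rule.duration_s:
                return True
        else:
            breach_start = None    # breach ended, start over
    return False


rule = AlertRule(threshold=0.05, duration_s=120)
sustained = [(0, 0.01), (60, 0.08), (120, 0.09), (180, 0.07)]
spike = [(0, 0.01), (60, 0.08), (120, 0.02), (180, 0.09)]
print(should_alert(sustained, rule), should_alert(spike, rule))
```

The duration requirement is what keeps a short spike from alerting, while a sustained breach of the same threshold still does.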

Shifting Mindsets

Challenges arose in motivating internal teams to adopt additional responsibilities for process improvement. We aimed to shift the mindset from quick fixes to long-term solutions, encouraging reporting and handing over monitoring activities to Incident Managers. The IM team's KPIs were designed to track:

registered incident count;
handover time to the responsible team;
incident formatting quality.

At first, these KPIs were inadvertently perceived as extra workload rather than as motivation, but over time the team became convinced of their value and of their contribution to the overall incident management process.
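Two of these KPIs, the registered incident count and the handover time, are straightforward to compute from ticket timestamps. The sketch below uses made-up sample records and hypothetical field names. Formatting quality is left out because it depends on review criteria not covered here.

```python
from datetime import datetime
from statistics import mean

# Made-up sample records; field names are illustrative only.
incidents = [
    {"registered": datetime(2024, 1, 8, 9, 0),
     "handed_over": datetime(2024, 1, 8, 9, 12)},
    {"registered": datetime(2024, 1, 9, 14, 30),
     "handed_over": datetime(2024, 1, 9, 14, 38)},
    {"registered": datetime(2024, 1, 11, 22, 5),
     "handed_over": datetime(2024, 1, 11, 22, 30)},
]


def handover_minutes(incident: dict) -> float:
    """Minutes between registration and handover to the responsible team."""
    delta = incident["handed_over"] - incident["registered"]
    return delta.total_seconds() / 60


registered_count = len(incidents)
average_handover = mean([handover_minutes(i) for i in incidents])
print(registered_count, average_handover)
```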

We’ve reached an agreement with senior management that a portion of development time should be dedicated to technical debt and system improvements. These tasks, often invisible in terms of immediate business value, can lead to more robust and secure systems in the long run.

The monitoring improvements that expert teams make for clearer alerts help not only the Incident Management team to understand the nature of each alert, but also the expert team members themselves. Even small changes add up to a big contribution to the continuous improvement of the Incident Management process, and I am thankful to our technical teams for that.

Training for an Effective IM Culture

Cultural changes also involved educating teams about the importance of incident management through presentations and Q&A sessions, with the support of our CIO. Training the Helpdesk was essential to align everyone with the goal of improved incident recognition and response.

It was crucial to encourage teams to report incidents rather than bypassing the process. We’ve made significant progress, as teams now actively engage with incident management for issues beyond their direct control, recognizing the value and support it provides for themselves.

Also, Incident Management offers teams the chance to hand over monitoring activities to an external party (the Incident Management team and Helpdesk) and therefore focus more on their business tasks.

Having navigated the complexities of implementing streamlined Incident Management processes, what lies ahead is our final installment about the crucial role of metrics and future enhancements, including AI integration. Join us as we venture into advanced analytics and futuristic solutions, shaping Incident Management at Dyninno Group.

Written by Dyninno Group