Refining Incident Management with Metrics and Looking to the Future at Dyninno Group

Published in Dyninno, Feb 20, 2024

Vladimirs Romanovskis, Incident Management Team Lead, Dyninno Group

The previous articles covered how Incident Management (IM) was established and streamlined at Dyninno Group. In this final chapter, we turn our focus to the future. We will look at the strategic use of metrics and the integration of advanced technologies like AI, which together represent not just a change in operations but a leap towards a more proactive, efficient, and future-proof organization.

Strategic Use of Metrics

We have established key metrics for incident management. Firstly, we began tracking the monthly incident count to identify gaps in our awareness. Secondly, we measured the proportion of incidents detected by monitoring tools versus other sources, like end-user reports.
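As a rough illustration of these first two metrics, the sketch below counts incidents per month and computes the share detected by monitoring. The record layout and the source labels `monitoring` and `end_user` are assumptions made for this example, not the actual data model used at Dyninno.

```python
from collections import Counter
from datetime import date

# Hypothetical incident records: (date opened, detection source).
# The field layout and source labels are assumptions for this sketch.
incidents = [
    (date(2024, 1, 5), "monitoring"),
    (date(2024, 1, 17), "end_user"),
    (date(2024, 1, 29), "monitoring"),
    (date(2024, 2, 3), "monitoring"),
    (date(2024, 2, 11), "end_user"),
]


def monthly_counts(records):
    """Count incidents per calendar month, keyed as 'YYYY-MM'."""
    return Counter(opened.strftime("%Y-%m") for opened, _ in records)


def monitoring_share(records):
    """Fraction of incidents first detected by monitoring tools."""
    if not records:
        return 0.0
    detected = sum(1 for _, source in records if source == "monitoring")
    return detected / len(records)


counts = monthly_counts(incidents)
share = monitoring_share(incidents)
print(dict(counts), f"{share:.0%}")
```

A falling monitoring share would suggest that more incidents are reaching end users before any alert fires, which is the kind of awareness gap these metrics are meant to expose.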

We also started evaluating incident resolution times and tracking whether preventive actions were implemented, since those actions are what keep issues from recurring.

This move aimed to transition from quick fixes to a more robust, future-proof system.

To measure the process’s efficiency, we’ve established SLA metrics, focusing on response times to monitoring alerts and end-user reports. This helps ensure immediate action on incidents rather than delaying until the next business day.
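A response-time SLA of this kind could be checked along the following lines. This is a minimal sketch: the 15-minute and 30-minute targets and the source names are placeholders chosen for illustration, since the article does not state the actual thresholds.

```python
from datetime import datetime, timedelta

# Assumed response-time targets per report source (illustrative values only).
SLA_TARGETS = {
    "monitoring_alert": timedelta(minutes=15),
    "end_user_report": timedelta(minutes=30),
}


def within_sla(source, reported_at, first_response_at):
    """True if the first response arrived within the target for this source."""
    return first_response_at - reported_at <= SLA_TARGETS[source]


def sla_compliance(records):
    """Fraction of records whose first response met the SLA target."""
    if not records:
        return 0.0
    met = sum(1 for record in records if within_sla(*record))
    return met / len(records)


# Hypothetical records: (source, reported at, first response at).
records = [
    ("monitoring_alert", datetime(2024, 2, 1, 22, 0), datetime(2024, 2, 1, 22, 10)),
    ("end_user_report", datetime(2024, 2, 2, 9, 0), datetime(2024, 2, 2, 9, 45)),
    ("monitoring_alert", datetime(2024, 2, 3, 3, 30), datetime(2024, 2, 3, 3, 45)),
    ("end_user_report", datetime(2024, 2, 3, 18, 0), datetime(2024, 2, 3, 18, 20)),
]

compliance = sla_compliance(records)
print(f"SLA compliance: {compliance:.0%}")
```

Because the check uses wall-clock time rather than business hours, an alert raised at 22:00 is measured the same way as one raised at noon, which matches the goal of acting immediately instead of waiting for the next business day.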

We also track handover times, detailing the timeframe from incident acknowledgment to escalation. Data on ‘time decomposition’ illustrates how time is spent during an incident’s lifecycle, emphasizing the importance of swift detection, reporting, reaction, handover, and resolution.
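The time decomposition described above can be sketched as a set of differences between consecutive lifecycle timestamps. The timestamp names (`started`, `detected`, and so on) and the sample values are assumptions for this example; only the handover stage, from acknowledgment to escalation, follows a definition given in the text.

```python
from datetime import datetime

# Each stage is the gap between two lifecycle timestamps.
# Stage boundaries other than handover are assumptions for this sketch.
STAGES = [
    ("detection", "started", "detected"),
    ("reporting", "detected", "reported"),
    ("reaction", "reported", "acknowledged"),
    ("handover", "acknowledged", "escalated"),
    ("resolution", "escalated", "resolved"),
]


def decompose(timeline):
    """Return minutes spent in each stage of an incident's lifecycle."""
    return {
        name: (timeline[end] - timeline[start]).total_seconds() / 60
        for name, start, end in STAGES
    }


# Hypothetical incident timeline.
timeline = {
    "started": datetime(2024, 2, 1, 10, 0),
    "detected": datetime(2024, 2, 1, 10, 4),
    "reported": datetime(2024, 2, 1, 10, 6),
    "acknowledged": datetime(2024, 2, 1, 10, 11),
    "escalated": datetime(2024, 2, 1, 10, 20),
    "resolved": datetime(2024, 2, 1, 11, 5),
}

breakdown = decompose(timeline)
total = sum(breakdown.values())
for stage, minutes in breakdown.items():
    print(f"{stage:<11}{minutes:>6.0f} min  ({minutes / total:.0%})")
```

Viewing each stage as a share of the total makes it easier to see whether time is being lost before anyone knows about the incident or after the right team has already picked it up.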

Problem Management and Resolution

Through centralized control, we discouraged teams from independently creating or closing incidents, instead promoting a standardized, manual review process to ensure thorough resolution. This rigorous approach to incident management took considerable effort to establish, but it has shown improvement, helped by upper management support for raising awareness, such as Q&A sessions.

Deming (PDCA) cycle for Problem Management

Within problem management, I conduct regular team meetings to review incident trends and required actions. Experts bring solutions for permanent fixes, and the IM team proposes additional improvements for better workflow and user experience. The IM team also gathers information and suggests preventive actions, such as improved monitoring, especially when incidents go undetected by monitoring tools until stakeholders notice them.

Future Enhancements and AI Integration

We continuously work on improving incident response times and appointing resolution leads. We are also enhancing the process with an incident management bot that generates links to the incident resolution chat, so stakeholders can join instantly.

Artificial intelligence aids in identifying anomalies, preempting incidents by analyzing trends and deviations. This predictive capability is a considerable time-saver.
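To make the idea of flagging deviations from a trend concrete, here is a deliberately simple statistical stand-in: a rolling z-score over a metric series. It is not the AI tooling referred to above, and the window size, threshold, and sample error rates are all assumptions for this sketch.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window.

    A point is flagged when it lies more than `threshold` sample standard
    deviations from the mean of the previous `window` points.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Flat baseline: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical error-rate samples (percent), one per monitoring interval.
error_rates = [1.0, 1.2, 0.9, 1.1, 1.0, 1.1, 0.9, 6.5, 1.0, 1.1]

flagged = find_anomalies(error_rates)
print(flagged)
```

In this sample only the spike to 6.5 is flagged. A real system would likely account for seasonality and gradual drift, which a fixed window and threshold do not handle.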

We’ve begun collaborating with security operations, foreseeing a future where incident management plays a pivotal role in security support.

Automation remains central to our strategy, supporting efficient incident identification through monitoring tools, as well as incident logging and resolution. We’re committed to refining these systems to keep incident management robust and responsive.

Throughout this series, we have journeyed from the inception to the refinement of Incident Management at Dyninno Group. For a more thorough understanding of our IM ecosystem, please check out the previous articles:

PART 1: Dyninno’s Incident Management: an Introduction

PART 2: Streamlining and Implementing Incident Management at Dyninno


Written by Dyninno Group