Dyninno’s Incident Management: an Introduction

Published in Dyninno · 4 min read · Jan 29, 2024

Vladimirs Romanovskis, Incident Management Team Lead, Dyninno Group

Incident management (IM) at Dyninno Group is a relatively new process. When I joined to enhance it three years ago, I realized IM lacked effectiveness due to diffused responsibility. Join me on this 3-part journey as we explore the beginnings and the inner workings of Dyninno Group’s incident management process, from the roles to the response strategy that leads to an effective resolution.

Rough Beginnings

A significant hurdle we faced was the lack of ownership of the incident management process. Initially, only Helpdesk members handled first-line incident response. When I arrived, the monitoring system was in disarray, and its alerts were unintelligible to anyone outside the originating teams.

I observed that the Helpdesk, juggling its primary user support duties with incident management, struggled with this additional responsibility. This lowered the quality of a process that already had no centralized control.

Key Goals of IM

But first, let’s reiterate what IM is all about.

The key goal of IM is not just to restore operations quickly but to coordinate effectively for faster system recovery. Also, IM provides transparency for all stakeholders.

Seemingly never-ending technical difficulties at corporations often result from teams making temporary fixes without implementing permanent improvements, such as better monitoring or system resilience, because of competing priorities or a lack of follow-up. This is where IM steps in.

Effective incident management varies with company size and system complexity. At Dyninno, the interconnected nature of individual team projects necessitates a cohesive approach, much like different teams being responsible for a car’s wheels and engine but sharing one goal: to keep the car running.

Defining the primary objectives of IM

My background ranges from support engineering under a strict Service Level Agreement (SLA) to project management, so I recognize the importance of understanding the current state before trying to improve incident management.

The most challenging aspects for me personally were the required cultural shifts within the company and a change in mindset towards incident management.

It’s human nature to prefer the aforementioned quick, temporary fixes over comprehensive analysis and prevention, especially under heavy workloads.

My approach included analyzing outstanding issues, cleaning up unattended logged incidents in our internal issue tracking system, and establishing clear incident logging practices to ensure accountability and a unified understanding of each issue’s impact and root causes.

Centralizing Ownership and Accountability

A pivotal change was centralizing ownership and accountability for incident resolution in a dedicated team. I took on the main responsibility for ensuring that incidents were reviewed and closed definitively and that preventative measures were implemented. The aim was to minimize the probability of an incident with the same root cause happening again (as you may know, in IT we can’t guarantee anything), and to improve monitoring and make alerts more descriptive.

In the beginning, I developed comprehensive training materials for the Incident Management team to ensure consistency and readiness. They are now maintained by the whole team.

Incident Managers are on call 24/7, but they’re not always at their desks, which is why we’ve tailored 16 hours of coverage to the peak business times when most of the group’s business teams operate.

We distinguish between these 16 high-activity ‘hot hours’ and 8 ‘cooldown hours’. A dedicated Incident Management team now fully covers the ‘hot hours’. During that time, Incident Managers work with alerts, confirm and verify reports, work with the security information and event management (SIEM) system, communicate with vendors, and supervise activities within development hubs.

During quieter times, we rely on Helpdesk support to monitor alerts and escalate as needed to the Incident Management team, which reviews the alert, confirms the impact, and kick-starts the incident management process.
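To make the split concrete, here is a minimal sketch of how an alert's first responder could be chosen by time of day. It is an illustration only: the 16/8 split comes from the description above, while the window start time (`HOT_START_HOUR`), the use of UTC, and the team labels are assumptions, not our actual configuration.

```python
from datetime import datetime, timezone

# Hypothetical values: only the 16/8 split is taken from the article.
HOT_START_HOUR = 6        # assumed start of the 'hot hours' window (UTC)
HOT_DURATION_HOURS = 16   # 16 hot hours, leaving 8 cooldown hours


def is_hot_hour(moment: datetime) -> bool:
    """Return True if `moment` falls inside the assumed 16-hour hot window."""
    hour = moment.astimezone(timezone.utc).hour
    # Modulo arithmetic keeps the check correct if the window wraps past midnight.
    return (hour - HOT_START_HOUR) % 24 < HOT_DURATION_HOURS


def first_responder(moment: datetime) -> str:
    """Pick the team that looks at a new alert first."""
    if is_hot_hour(moment):
        return "incident-management"
    # Cooldown hours: Helpdesk monitors and escalates to Incident Management as needed.
    return "helpdesk"
```

With these assumed values, an alert at 10:00 UTC would go straight to the Incident Management team, while one at 23:00 UTC would land with Helpdesk first.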

Gradual Improvements

To address the aforementioned chaos in the monitoring systems, the first step was to minimize alerting noise and focus the IM team on the important things first by delivering alerts to a single ‘source of truth’. In the first stage, we did this by integrating the existing monitoring tools, legacy systems, and external vendor systems with the internal ticketing system.
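As a rough illustration of the idea, the sketch below merges alerts from several sources into one queue with a shared shape, drops duplicates, and puts high-severity items first. Everything specific here is hypothetical: the `Ticket` fields, the raw alert keys (`service`, `severity`, `message`), and the severity mapping are assumptions for the example and do not describe our actual tools or ticketing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    source: str    # which monitoring tool or vendor raised the alert
    service: str   # affected service, as named by the source
    severity: str  # normalized to "high" or "low"
    summary: str


# Hypothetical mapping from source-specific severity labels to a shared scale.
SEVERITY_MAP = {"critical": "high", "error": "high", "p1": "high",
                "warning": "low", "info": "low", "p3": "low"}


def to_ticket(source: str, raw: dict) -> Ticket:
    """Convert one raw alert (a dict with assumed keys) into the common shape."""
    label = str(raw.get("severity", "")).lower()
    return Ticket(
        source=source,
        service=raw.get("service", "unknown"),
        # Unknown labels default to "high" so nothing is silently down-ranked.
        severity=SEVERITY_MAP.get(label, "high"),
        summary=raw.get("message", ""),
    )


def build_queue(alerts: list[tuple[str, dict]]) -> list[Ticket]:
    """Merge alerts from all sources: duplicates dropped, high severity first."""
    seen = set()
    queue = []
    for source, raw in alerts:
        ticket = to_ticket(source, raw)
        key = (ticket.service, ticket.summary)
        if key in seen:
            continue  # the same problem was already reported by another source
        seen.add(key)
        queue.append(ticket)
    # sorted() is stable, so arrival order is kept within each severity level.
    return sorted(queue, key=lambda t: t.severity != "high")
```

The point of the design is that an Incident Manager reads one ordered list instead of watching each tool separately; the real integration work lies in the per-source mapping, which this sketch reduces to a single dictionary.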

Also, to garner quick wins, we simplified processes and improved information gathering through our existing logging tools. This approach let Incident Managers oversee fewer systems and respond better to high-importance alerts.

We focused on defining and refining processes to handle incidents effectively. The key was ensuring adherence to these processes, as deviations prevent us from identifying and addressing gaps. Training helped align perceptions with the written procedures, which is essential for consistent incident handling.

As we’ve delved into the foundational aspects of Incident Management at Dyninno Group, we’ve only scratched the surface of its complexities and challenges. Stay tuned for the next installment, where we’ll navigate the intricate processes of streamlining and implementing these strategies. Discover how we transformed chaos into order in PART 2 of this series.

Written by Dyninno Group