Why bankers need automatic umbrellas. The story of one AIOps implementation
Obviously, banks cannot exist without data monitoring. The pain points here are the same as for any large company with a complex IT infrastructure, except that the cost of errors and downtime is much higher. Even well-tuned automation is often not enough for large players in the banking sector. They want not just to monitor their systems continuously and know about failures and their causes, but also to rely on AI so that systems monitor, predict and correct themselves. On top of that, a large number of technical departments and disparate systems produce a flood of alerts that is increasingly difficult to control. That is when it is time to open the umbrella. Preferably an automatic one.
Manual actions without active manuals
For one of our clients, a large Eastern European bank, the human factor still played a huge role despite the desire for automation. When an incident occurred in the IT infrastructure or a digital product, the on-duty staff diagnosed the problem manually, step by step, and alerted the responsible engineers. Nobody kept track of knowledge base updates, so new employees often made mistakes when following instructions, even though the banking sector leaves practically no room for error. With such dependence on people, the SLA suffered greatly, which for a bank is not just a pain point but a critical problem.
On top of everything else, the fragmented IT infrastructure added to the complexity: each of the many technical departments had at least one monitoring system of its own. Each system generated hundreds of alerts and bombarded everyone responsible with them (sometimes even across departments). It was difficult to keep track of every notification, and the sheer volume blurred the urgency and importance of each one.
So we had two areas of work:
· The first stage is umbrella monitoring, to increase the speed of incident resolution
· The second stage is the automation of manual processes to reduce the risks and costs of scaling the IT department.
When it's “raining events and alerts”, you need an umbrella
“IT monitoring solutions that process information from individual components of the IT infrastructure can be scaled for the infrastructure as a whole — like an umbrella that collects and analyzes data from all systems that are within its scope”.
- HRE Experts about umbrella monitoring
A feature of our AIOps solution is that it integrates with any monitoring system and presents data on the state of the whole IT landscape in a single window. By integrating the solution at the bank, we centralized data from different monitoring systems on a “single screen” so that engineers could see the big picture of what was happening. Based on this data, a single resource-service model was built that shows the health of every component in real time. Links between configuration items reduced the time needed to find the root cause of a problem and the type of failure. As a result, specialists could spend their time resolving the incident rather than searching for its cause.
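As a rough illustration of how links between configuration items can narrow down a root cause, here is a minimal Python sketch. It is not Acure's actual model or API: the item names, the `DEPENDS_ON` map and the `root_causes` helper are hypothetical, and the heuristic (an unhealthy item whose own dependencies are all healthy is the likely origin) is an assumption made for the example.

```python
# Hypothetical resource-service model: each configuration item (CI)
# maps to the CIs it depends on. All names are invented for illustration.
DEPENDS_ON = {
    "mobile-banking": ["api-gateway"],
    "api-gateway": ["payments-service", "auth-service"],
    "payments-service": ["core-db"],
    "auth-service": ["core-db"],
    "core-db": [],
}


def root_causes(unhealthy: set[str]) -> list[str]:
    """Return unhealthy CIs none of whose direct dependencies are unhealthy.

    Assumed heuristic: if a CI is failing and everything it depends on is
    healthy, the failure most likely originates in that CI itself.
    """
    return sorted(
        ci for ci in unhealthy
        if not any(dep in unhealthy for dep in DEPENDS_ON.get(ci, []))
    )


# Alerts from several monitoring systems, already mapped to CIs.
alerting = {"mobile-banking", "api-gateway", "payments-service", "core-db"}
print(root_causes(alerting))  # ['core-db']
```

Four alerts collapse into one candidate, which is the kind of shortcut the model gives an engineer looking at the single screen.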
Moreover, new elements and the links between them were now added to the resource-service model automatically through auto-discovery, so IT specialists no longer spent time on manual updates.
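Conceptually, auto-discovery boils down to merging the results of a scan into the existing model. The sketch below shows that idea only; the `merge_discovered` function and the data shapes are hypothetical and say nothing about how the platform implements discovery internally.

```python
def merge_discovered(model: dict[str, set[str]],
                     discovered: dict[str, set[str]]) -> list[str]:
    """Add newly discovered CIs and links to the model in place.

    Returns the sorted names of CIs that were not in the model before.
    """
    new_items = []
    for ci, deps in discovered.items():
        if ci not in model:
            new_items.append(ci)
            model[ci] = set()
        model[ci] |= deps
        # Make sure every dependency also exists as a node in the model.
        for dep in deps:
            if dep not in model:
                new_items.append(dep)
                model[dep] = set()
    return sorted(new_items)


# Invented example data: the existing model and the result of one scan.
model = {"api-gateway": {"payments-service"}, "payments-service": set()}
scan = {"payments-service": {"core-db"}, "fraud-check": {"core-db"}}
added = merge_discovered(model, scan)
print(added)  # ['core-db', 'fraud-check']
```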
And this was only the beginning of the processes that the machines took over.
Next, we configured event-based automation, which reduced incident response time and the number of errors made when executing instructions. Problem reports were sent to the responsible persons automatically, so the staff could spend their working time on solving real issues. Interactions with the knowledge base were automated as well: the engineer received the relevant knowledge base entry automatically.
Also, as I mentioned in my previous articles, Acure lets you configure auto-rules and auto-actions, up to and including auto-healing scripts. We made use of this in our case too and set up action scripts for recurring incidents of the same type.
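To make the idea of event-based automation more concrete, here is a small Python sketch of turning an incoming event into a routed problem report with a knowledge base entry attached. Everything in it is assumed for illustration: the `Event` fields, the `OWNERS` and `KNOWLEDGE_BASE` tables, the `duty-shift` fallback and the `build_report` function are hypothetical and do not describe the bank's real setup.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str  # monitoring system that raised the event
    ci: str      # affected configuration item
    kind: str    # failure type, e.g. "disk_full"


# Hypothetical lookup tables; a real deployment would query the
# resource-service model and a knowledge base instead of dicts.
OWNERS = {"core-db": "dba-oncall", "api-gateway": "platform-oncall"}
KNOWLEDGE_BASE = {"disk_full": "KB-1: free space, then extend the volume"}


def build_report(event: Event) -> dict[str, str]:
    """Build the problem report that would be sent automatically."""
    return {
        # Unknown CIs fall back to the duty shift (assumed policy).
        "to": OWNERS.get(event.ci, "duty-shift"),
        "subject": f"{event.kind} on {event.ci} (from {event.source})",
        "kb_entry": KNOWLEDGE_BASE.get(event.kind, "no matching entry"),
    }


report = build_report(Event("monitoring-a", "core-db", "disk_full"))
print(report["to"], "|", report["kb_entry"])
# dba-oncall | KB-1: free space, then extend the volume
```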
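The logic behind an action script for recurring incidents can be sketched in a few lines of Python. This is not Acure's rule syntax: the `ACTIONS` table, the `handle` function and the recurrence threshold of 3 are assumptions chosen for the example, and the healing action only returns a description instead of running a real script.

```python
from collections import Counter
from typing import Callable

# Hypothetical auto-healing actions keyed by (CI, failure type).
# Real actions would run scripts; these only return a description.
ACTIONS: dict[tuple[str, str], Callable[[], str]] = {
    ("report-service", "queue_stuck"): lambda: "restarted report-service",
}

RECURRENCE_THRESHOLD = 3  # assumed value, not taken from the project
seen: Counter = Counter()


def handle(ci: str, kind: str) -> str:
    """Count the incident and run a healing action once it recurs enough."""
    key = (ci, kind)
    seen[key] += 1
    action = ACTIONS.get(key)
    if action is not None and seen[key] >= RECURRENCE_THRESHOLD:
        return action()
    return "escalated to engineer"


results = [handle("report-service", "queue_stuck") for _ in range(3)]
print(results)
# ['escalated to engineer', 'escalated to engineer', 'restarted report-service']
```

Incidents without a registered action always go to an engineer, so the automation only takes over cases that have already been seen and scripted.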
Achieved results and conclusions
By implementing our platform and automating processes, we achieved impressive results. We reduced the processing time for critical incidents from 25 to 15 minutes (a 40% reduction) and the number of notifications per SRE engineer from 110 to 10 (roughly 91% fewer). These numbers answer the question about the benefits of AIOps and umbrella monitoring for both business and IT staff most eloquently.