Watson AIOps: AI for IT Operations Management

Rama Akkiraju
Aug 5, 2020


First there were distributed computing systems, then fault-tolerant systems, then autonomic computing, and now AI Operations. Someone once said that there is nothing new in Computer Science and that the same concepts keep coming back every few years. It’s like old wine being served in a new bottle!

Is it really? While the concept and the vision behind all these topics are the same, namely computer systems that are capable of self-management, the mechanisms, the means, and the standardization needed to fully achieve that vision are only coming into place now. Technologies such as Cloud computing, micro-service architectures, containerized software development (e.g. Docker), and open-source container orchestration systems (e.g. Kubernetes) for automating application deployment, scaling, and management are providing the levels of abstraction needed to scale self-management implementations. Even if the applications being managed have not yet made their way to the Cloud, building the operations management solution itself as containerized software managed by Kubernetes makes it more readily scalable to multiple environments. Also, the rise of Artificial Intelligence (AI), powered by advancements in hardware architectures, Cloud computing, natural language processing (NLP) via language models such as BERT, machine learning (ML) via deep learning (DL) algorithms and frameworks (such as TensorFlow and PyTorch), and deep neural network architecture optimization frameworks (such as Katib), has opened up new opportunities for optimizing business processes in various industries. Operations management of IT systems is one such area that is ripe for optimization. By leveraging the advancements in AI and Cloud computing, we can now set out to achieve the vision of self-managing computer systems. That’s where AI for IT Operations management, aka AIOps, comes into the picture.

Information Technology (IT) Operations management is a vexing problem for most companies that rely on IT systems for mission-critical business applications. Despite the best intentions of engineers, good designs, and solid development practices, the software and hardware systems that companies deploy in service of critical business applications are susceptible to outages, which cost millions of dollars each year in labor, lost revenue, and customer satisfaction issues. Even the best analytical tools fall short. This can be attributed to the complexity of the problem at hand. IT applications, the infrastructure that they run on, and the networking systems that support that infrastructure all produce large amounts of structured and unstructured data in the form of logs and metrics. The volume and the variety of data generated in real time pose significant challenges for analytical tools that must process it to detect genuine anomalies, correlate disparate signals from multiple sources, and raise only those alerts that need the attention of IT operations management teams. To add to this, data volumes continue to grow rapidly as companies move to modular micro-services-based architectures, further compounding the problem.

AI can help solve these problems. AI can help IT operations management personnel and Site Reliability Engineers (SREs) by detecting issues early, predicting them before they occur, reducing event and alert noise by grouping events and alerts related to the same incident, locating the specific application or infrastructure component that is the source of the issue, determining the scope of incident impact, and recommending relevant and timely actions based on mining prior incident records. All these analytics help reduce the mean time to detect an incident (MTTD) and the mean time to identify/isolate the cause of an incident (MTTI), and therefore the mean time to resolve an incident (MTTR). This, in turn, saves millions of dollars by preventing direct costs (lost revenue, penalties, opportunity costs, etc.) and indirect costs (customer dissatisfaction, lost customers, lost references, etc.). Below, we describe the AI in our Watson AIOps solution.

The IT operations environment generates many kinds of data. These include metrics, alerts, events, logs, tickets, application and infrastructure topology, deployment configurations, and chat conversations. Of these, metrics tend to be structured, logs, alerts, and events are semi-structured, and the content in tickets and chat conversations tends to be unstructured. Also, among all the data types, logs and metrics can sometimes be leading indicators of problems, while alerts, tickets, and chat conversations tend to be lagging indicators. An advanced IT operations management system can take all of this data as input, detect incidents early, predict when incidents may occur, offer timely and relevant guidance on how to resolve incidents quickly and efficiently, automatically apply resolutions when applicable, and proactively prevent incidents from recurring by enforcing the required feedback loops into the various software development lifecycles. This can increase the productivity of IT operations personnel and SREs and thereby improve the mean times to detect, identify, and resolve incidents.

Enter Watson AIOps. It does exactly that!

IBM already has strong products in the market for event management, topology management, and metric-based anomaly prediction via the Netcool Operations Manager product. These capabilities draw insights from alerts, events, and metrics. Building on these strong foundations, we introduced Watson AIOps 1.0 in June 2020, which brings together insights from both structured and unstructured data types. Watson AIOps includes anomaly prediction from logs, potential problem identification via fault localization analysis on topology, evidence and explanations to understand the problem, incident impact radius analysis to determine the scope of impact, and problem resolution suggestions via analysis of similar prior incidents. In Watson AIOps, insights such as anomaly predictions, event groupings, the probable cause of an incident, and next-best-action recommendations are all delivered in a ChatOps environment such as Slack, the place where IT operations management personnel and Site Reliability Engineers (SREs) work.

With Watson AIOps for Cloud Pak for Data 2.0, we are bringing these capabilities even closer together, as shown in Figure 1. Broadly speaking, Watson AIOps solution capabilities can be organized into four categories: event management, incident diagnosis, incident resolution, and insights delivery. These capabilities are supported by an ecosystem of connectors and by platform capabilities for managing AI model training and the model lifecycle. Below, we give a brief view of each of these capabilities, whose overall flow is highlighted in Figure 2.

Figure 1: Watson AIOps for Cloud Pak for Data 2.0 Components.

Event Management

An event indicates that something noteworthy has happened in an IT operations environment. For example, an application has become unavailable, or a disk is full or reaching capacity. Event management is the process that monitors and manages all events that occur in a business application or IT infrastructure. It involves event collection, classification, normalization, deduplication, enrichment for analytics, correlation, and grouping, either via manual rules or via automated means. The main goal of event management is not only to record and manage events but also to provide insights on those events that need operator attention, either because they are likely to turn into major incidents or because they are already major incidents and action must be taken. The goal of event grouping, classification, and deduplication is to reduce the noise for IT operations managers and to help them focus on the few important events that need their immediate attention. Event Manager in Watson AIOps 2.0 offers all of the event management capabilities noted above. The AI Manager complements this event grouping with entity-based correlation of events. Entity-based event grouping extracts entities, i.e., mentions of application and infrastructure component names that can be looked up in the topology, and correlates them to further inform the event grouping. These entity mentions also help in isolating the faulty components and in determining the incident impact scope.

In Watson AIOps 2.0 we bring together the capability to group events generated from structured, semi-structured, and unstructured data types. These include anomalies detected from metrics (structured) and logs (semi-structured), as well as tickets themselves (unstructured). For event grouping, Watson AIOps uses multiple algorithms, including temporal, spatial, and association rule mining algorithms.
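To make deduplication and temporal grouping concrete, here is a minimal sketch. It is not the Watson AIOps implementation: the `Event` fields, the rule "same resource and summary means duplicate", and the 60-second gap are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some reference point
    resource: str     # hypothetical field: component that emitted the event
    summary: str      # hypothetical field: short description


def deduplicate(events):
    """Keep only the earliest event for each (resource, summary) pair."""
    seen = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        seen.setdefault((ev.resource, ev.summary), ev)
    return sorted(seen.values(), key=lambda e: e.timestamp)


def group_by_time(events, max_gap=60.0):
    """Start a new group whenever the gap to the previous event exceeds max_gap seconds."""
    groups = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if groups and ev.timestamp - groups[-1][-1].timestamp <= max_gap:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return groups


events = [
    Event(0.0, "db-1", "disk full"),
    Event(5.0, "db-1", "disk full"),  # duplicate of the first event
    Event(20.0, "api-1", "latency high"),
    Event(500.0, "cache-1", "evictions high"),
]
unique = deduplicate(events)
groups = group_by_time(unique)
for group in groups:
    print([f"{e.resource}: {e.summary}" for e in group])
```

In this toy run the four raw events collapse into two groups, so an operator would look at two items instead of four. A production system would also need a time window for deduplication and the spatial and association-rule signals mentioned above.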

Static and Dynamic Topology Management

Application and network topology refers to a map or a diagram that lays out the connections between different mission-critical applications in an enterprise. Static topology refers to a map that is constructed from the build and deployment information on applications and infrastructure components. Dynamic topology, on the other hand, refers to a map that captures the resources and their relationships as the environment changes at run-time and provides near-real-time visibility of the same. Another important aspect of a dynamic topology is the ability to compare the current topology with a historical one. Real-time and historical views of the environment answer the questions “What happened?” and “What’s happening?”, reveal the details that led up to an incident, and show how the topology (and its status) changed over time. Watson AIOps’ Topology functionality is offered via Agile Service Manager. It supports observation and discovery of application and infrastructure dependencies, regardless of type, vendor, or source. Topology Manager also supports cross-layer application and infrastructure dependency mapping where the information originates from distinct, disjoint sources of truth, so that the solution provides a comprehensive dependency mapping up and down the stack. This topology functionality is tightly integrated with AI Manager and is leveraged in entity-correlation-based event grouping, in faulty component visualization, and in fault impact radius estimation.
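Comparing a current topology with a historical one can be pictured as a set difference over dependency edges. The sketch below is an illustration only; representing a snapshot as a set of `(source, target)` pairs and the component names are assumptions, not the Agile Service Manager data model.

```python
def diff_topology(before, after):
    """Return the edges added and removed between two topology snapshots.

    Each snapshot is a set of (source, target) dependency pairs.
    """
    return {"added": after - before, "removed": before - after}


# Hypothetical snapshots taken before and after a deployment.
snapshot_t0 = {("web", "api"), ("api", "db")}
snapshot_t1 = {("web", "api"), ("api", "cache"), ("api", "db-replica")}

changes = diff_topology(snapshot_t0, snapshot_t1)
print("added:", sorted(changes["added"]))
print("removed:", sorted(changes["removed"]))
```

A diff like this is one way to see which relationships changed in the period leading up to an incident.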

Incident diagnosis

Incident diagnosis involves identifying incidents early via anomaly prediction, isolating the faulty component, and determining the impact scope. Watson AIOps offers all of these capabilities. We examine them below briefly.

Anomaly prediction:

The goal of anomaly detection and prediction is to detect anomalies from logs and metrics. An anomaly is something that deviates from normal, standard, or expected behavior. Typically, organizations set either static thresholds or manual rules to define and manage deviations from normal behavior. These rules are usually set on log aggregation systems (such as LogDNA, Splunk, etc.) and metric monitoring systems (such as SysDig, Prometheus, etc.). The problem with static thresholds is twofold. First, it takes a long time for subject matter experts (SMEs) to distill them from their experience and create them. Second, they don’t adapt to changes and therefore tend to become outdated and irrelevant quickly. If not updated or deleted, these manual rules can start to flood SREs with irrelevant alerts. In our experience, approximately 30% of these threshold events are never actioned! Operations teams waste time and effort managing these thresholds and end up missing important clues. Learning what is normal, baselining it, and using the baseline to automatically detect anomalies can free SMEs from having to manage these rules manually. Watson AIOps offers anomaly detection from both metrics and logs.
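The difference between a static threshold and a learned baseline can be shown with a small sketch. This is a generic median/MAD (median absolute deviation) baseline, not the algorithm used in Watson AIOps; the sample values, the static limit of 150, and the multiplier `k` are assumptions for illustration.

```python
from statistics import median


def learn_baseline(values):
    """Learn a robust baseline: the median and the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med, mad


def is_anomaly(value, baseline, k=5.0):
    """Flag values that lie more than k MADs away from the learned median."""
    med, mad = baseline
    spread = mad if mad > 0 else 1e-9  # guard against a perfectly flat history
    return abs(value - med) > k * spread


# Hypothetical response-time history (ms) observed during normal operation.
history = [100, 102, 98, 101, 99, 103, 97, 100]
baseline = learn_baseline(history)

STATIC_LIMIT = 150  # a hand-written threshold an SME might have set long ago
for value in (104, 130):
    print(value, "static:", value > STATIC_LIMIT, "learned:", is_anomaly(value, baseline))
```

Here a reading of 130 stays under the hand-set limit but is far outside what the history says is normal, so only the learned baseline flags it. Re-learning the baseline on fresh data is what lets it adapt where a fixed rule would go stale.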

· Log Anomaly prediction: IBM’s Watson AIOps’ state-of-the-art, multi-patent-pending log anomaly detection technology, available in AI Manager, automatically parses IT application and infrastructure logs from log aggregation tools such as LogDNA, learns normal log patterns from training data, understands their semantic meaning, and detects anomalies in real time much sooner than traditional thresholding-based or error-string-matching alerting techniques can, thereby significantly reducing the mean time to diagnose an incident. We use deep-learning algorithms both to prepare features from logs during log parsing and to make anomaly predictions. Users don’t have to set static thresholds or manual rules to detect anomalies; the system detects them automatically. The detected anomalies are then explained with back pointers to the specific log messages in which they were noted. We have applied this log anomaly detection system to an internal field management application, run by IBM’s own CIO office, that sellers use to track their incentives. In a specific test that analyzed the Apache server logs, we were able to detect anomalies up to 20 hours, on average across five different major incidents, before a human opened incident tickets. In this experiment, training was done on one week's worth of aggregated access and error logs chosen to represent normal operation with no major business impact. The major incidents corresponding to these anomalies were not detected by any rules or existing thresholds and hence were missed until a major incident actually occurred and an IT operations management person created a ticket for it.

· Metric Anomaly prediction: Watson AIOps metric-based anomaly detection, available in Metric Manager, analyzes metrics data from various systems such as New Relic, AppDynamics, and SolarWinds to automatically learn the normal behavior of metrics in your company and automatically detect anomalies from them. It employs a set of time-tested time-series algorithms such as Granger Causality, Robust Bounds, Variant/Invariant, Finite Domain, and Predominant Range to capture seasonality and significant trends and to perform forecasting. Many metrics are seasonal. For example, what is normal for a metric at 2 pm in a time zone may not be normal for the same metric at 10 pm in that time zone. Therefore, taking the seasonality of a particular environment into account is critical to accurately predicting anomalies. The Metric Manager in Watson AIOps is equipped to do this. In a specific evaluation scenario, our metric anomaly predictor caught the problem two days before a server stopped collecting data and was rebooted as a result. In another evaluation scenario, our solution was able to detect memory leaks five days before the memory maxed out on the server, and so prevented an outage.
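The 2 pm versus 10 pm example can be sketched as a per-hour baseline. This is a deliberately simple stand-in, not one of the Metric Manager algorithms named above: the sample data, the mean plus or minus `k` standard deviations rule, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_hourly_baseline(samples):
    """Learn a (mean, standard deviation) pair for each hour of the day.

    samples is an iterable of (hour_of_day, value) pairs.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def is_seasonal_anomaly(hour, value, baseline, k=3.0):
    """Compare a value against what is normal for that hour only."""
    mu, sigma = baseline[hour]  # raises KeyError for an hour never seen in training
    return abs(value - mu) > k * max(sigma, 1e-9)


# Hypothetical requests-per-minute history: busy at 2 pm, quiet at 10 pm.
samples = [(14, 900), (14, 1000), (14, 1100), (22, 90), (22, 100), (22, 110)]
baseline = learn_hourly_baseline(samples)

print(is_seasonal_anomaly(14, 900, baseline))  # ordinary afternoon load
print(is_seasonal_anomaly(22, 900, baseline))  # same value late at night
```

The same reading of 900 is unremarkable in the afternoon and highly unusual at night, which a single global threshold could not express.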

Fault Localization & Blast Radius

Entity mentions are the names of the resources (e.g. service or application component names, server names, server IP addresses, pod ids, node ids, etc.) that are referenced in anomalous logs, alerts, tickets, and events. Once events are grouped and the entity mentions in anomalies, alerts and events are extracted, we perform entity resolution with topological resources to isolate the problem and to place the identified entities on the corresponding dynamic topology instances that match the time at which the mentions were noted. This enables us to map identified faults on topology. Traversing the topological graph in the application, infrastructure, and network layers enables us to map out the impacted components.
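Traversing the topology to map out impacted components is, at its core, a graph search. The sketch below uses a breadth-first search over a hand-written dependency map; the component names and the `dependents` structure (component to the components that depend on it) are assumptions, and a real topology would span application, infrastructure, and network layers.

```python
from collections import deque


def blast_radius(dependents, faulty):
    """Return every component that directly or transitively depends on the faulty one."""
    impacted = set()
    queue = deque([faulty])
    while queue:
        node = queue.popleft()
        for nxt in dependents.get(node, ()):
            if nxt != faulty and nxt not in impacted:
                impacted.add(nxt)
                queue.append(nxt)
    return impacted


# Hypothetical topology: key -> components that depend on the key.
dependents = {
    "db": ["orders-api", "billing-api"],
    "orders-api": ["web"],
    "billing-api": ["web"],
    "cache": ["web"],
}

impacted = blast_radius(dependents, "db")
print(sorted(impacted))
```

With the fault localized to `db`, the traversal reports both APIs and the web front end as impacted, while `cache` is left out because nothing connects it to the fault.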

Incident resolution

Watson AIOps ingests and mines prior ticket data to provide timely and relevant action recommendations for the currently diagnosed problem. Current incident symptoms are framed as a query against the indexed ticket data. We not only search and retrieve the top k relevant prior incident records but also extract important entity-action (aka noun-verb) phrases from each relevant record, so that SREs can get a quick glimpse of the suggested action. For example, from a long chat conversation that is pasted inside the ‘closing comments’ section of an incident record, we extract phrases such as ‘Scaled Compose data node’ and ‘Restarted Analytics pods’. In the first phrase, ‘Compose data node’ is the entity and ‘scaled’ is the action. In the second phrase, ‘Analytics pods’ is the entity and ‘restarted’ is the action. We apply various natural language processing techniques, including rule-based systems, to extract entity and action phrases.
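A rule-based extractor of entity-action phrases can be approximated with a regular expression. The sketch below is only an illustration of the idea: the verb list, the stop words that end an entity, the four-word limit, and the sample closing comment are all assumptions, and a real system would use richer NLP than this.

```python
import re

# Hypothetical list of action verbs worth surfacing to an SRE.
ACTIONS = ("scaled", "restarted", "rebooted", "deleted", "upgraded")

# An action verb, an optional "the", then up to four words that form the entity,
# ending at a stop word, punctuation, or the end of the text.
PATTERN = re.compile(
    r"\b(" + "|".join(ACTIONS) + r")\s+(?:the\s+)?"
    r"([A-Za-z][\w-]*(?:\s+[A-Za-z][\w-]*){0,3}?)"
    r"(?=\s+(?:and|to|after|because|which)\b|[.,;]|$)",
    re.IGNORECASE,
)


def extract_entity_actions(text):
    """Return (entity, action) pairs found in free-form ticket text."""
    return [(m.group(2), m.group(1).lower()) for m in PATTERN.finditer(text)]


closing_comment = (
    "We saw high latency. Scaled Compose data node and then "
    "restarted Analytics pods, which fixed it."
)
pairs = extract_entity_actions(closing_comment)
for entity, action in pairs:
    print(f"{action}: {entity}")
```

Applied to the sample comment, the rule picks out the two phrases used as examples above and ignores the surrounding narrative.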

Insights Delivery and Action Implementation

In Watson AIOps, all of the insights described above are delivered via both ChatOps and dashboards. Real-time, in-the-moment insights are delivered via ChatOps to SREs directly in the place where they work. Within ChatOps, SREs can explore the evidence behind the insights and share selected incident resolution suggestions with other collaborators. From ChatOps, SREs can launch log, metric, and ticket monitoring tools to explore further details. Similarly, SREs can launch interactive dashboards powered by the Event Manager, Metric Manager, and Topology features for detailed exploration of events, event groups, metric anomalies, and topology. Applicable actions/runbooks can then be run automatically via Runbook execution.

Quality Evaluations

Capabilities such as Event Manager, Metric Manager, and Topology are already fielded in many clients’ environments. Therefore, we focused our performance evaluations on the new AI-infused capabilities offered through AI Manager. We applied Watson AIOps analytical pipelines to various internal IBM applications and services to test-drive some of the latest features in AI Manager. We also tested some of the newer AI capabilities, such as the log anomaly predictor, entity-linking-based event grouping, and incident similarity, with some of our clients as part of beta testing the product. Our results indicate that we achieve significant reductions in the mean time to diagnose and the mean time to resolve incidents. In some instances, we detected anomalies 20 hours ahead of a human creating a ticket; in other cases, we reduced the mean time to resolve incidents from 6 hours to less than 15 minutes. We are excited about the time and cost savings we are set to deliver to our clients.

A note on AI model life-cycle management

The AI models in Watson AIOps are unsupervised machine learning models. They don’t need labeled data, but they do need data to learn the normal behavior of metrics and logs and to index and analyze prior incident ticket records. Therefore, Watson AIOps takes a representative set of metrics, logs, and ticket data for training and building models. Watson AIOps models are set up to learn continuously using up-to-date data from your environment and to improve based on user feedback. Watson AIOps is not a black-box AI-infused solution. We believe in full transparency of the inner workings of our AI models. While Watson AIOps is set up to automatically retrain the models at regular intervals, IT operations administrators have access to our model (re)training scripts and can retrain models on demand at any time.

Figure 2: Watson AIOps at a glance

What’s next for Watson AIOps?

We plan to keep enriching the various AI pipelines mentioned in this article. We are also excited to bring together the enterprise-grade event management, predictive insights, and dynamic application topology management capabilities that you are already familiar with from the Netcool Operations Insights portfolio with the latest AI-infused machine learning and natural language processing capabilities, which mine unstructured data sources such as tickets, logs, and chats, to offer an unparalleled IT operations management solution for our customers. We are looking forward to expanding our ecosystem of input connectors by integrating with various log, metric, and ticketing vendor products in the remainder of 2020 in an effort to bring out-of-the-box value to our customers. Netcool Operations Insights already offers more than 150 connectors to various open-source and vendor tools. We look forward to bringing them all together in Watson AIOps onto a single stack and to further expanding this set in 2020. Similarly, on the output front, we are expanding our ChatOps support from Slack to Microsoft Teams and other platforms. We believe in delivering insights where SREs work, which increasingly means ChatOps environments. So, we will continue to invest in improving our user experience and explanations within ChatOps. However, we do realize the value of rich dashboards that offer interactive what-if analysis, decision support, and off-line analysis of what has happened. Therefore, throughout the rest of 2020 and beyond, we will continue to bring these user interfaces together so that users can seamlessly move between them to derive the insights they need and to perform the actions they need to perform.

Furthermore, in the next generations of our Watson AIOps solution, we envision self-aware and autonomic IT operations environments that not only shift left in development-security-operations (DevSecOps) life cycles to influence deployment, test, build, code, and design processes but also close the loops with the operations phase through feedforward and feedback mechanisms. By doing so, we intend to equip each stage in the DevSecOps life cycle with full foresight and hindsight, enabling intelligent and consequence-aware decision-making at each stage. Our vision for shifting left in the DevSecOps life cycle while closing the loops with virtuous feedback and feedforward cycles for efficient operations management is shown in Figure 3. We envision the various stages of IT application development processes being equipped with the smarts to proactively keep issues from happening at run-time by not advancing IT application artifacts that do not meet the preset quality criteria to the next stage. For example, smart checks and gates prevent risky deployments from getting pushed to production, stop under-tested code modules from getting into deployment phases, block code with risky security vulnerabilities from getting to the deployment phase, and so on. We envision the Watson AIOps solution correlating past incidents with root causes that can be traced to under-tested deployment changes, security vulnerabilities, poor code test coverage, and the like. This information, when fed back, serves as a critical input for reinforcing the checks and gates in the earlier stages of the DevSecOps life cycle.

Figure 3: Shifting left in the DevSecOps life cycle while closing the loops with virtuous feedback and feedforward cycles for efficient operations management

So, after all, we do come full circle to fault-tolerant, autonomic distributed systems with Watson AIOps. It’s just that this time around, we have the compute power of Cloud computing, state-of-the-art AI algorithms thanks to the advances in Machine Learning and Natural Language Processing, standardized platforms for building scalable management systems via Docker and Kubernetes, and a standardized data and AI management platform for building solutions, powered by IBM’s Cloud Paks. To add to this, we have a wealth of IT operations management experience at IBM from having managed IT systems and infrastructure for our customers via various strategic outsourcing engagements, and we have deep product experience with our Netcool suite of products, which have been in the market for over twenty years. We are bringing them all together with a vision of optimizing IT operations management, not just in a reactive mode but by keeping issues from happening in the first place, designing the DevSecOps lifecycle activities for efficient operations right from the get-go. We can’t wait to shape the future and take you all with us on this journey!

Acknowledgments

A big shout out to all the global cross-organizational leaders and team members of IBM’s Watson AIOps team for all their wonderful contributions! You know who you are! There are too many to list here. Thank you!
