AIOps: Unleashing the hidden insights in unstructured IT data for better IT operations management

GargiB · IBM Cloud · Aug 11, 2021 · 8 min read

Co-authors: Amit Paradker, Rama Akkiraju

“There is no future of IT operations that does not include AIOps.” (Gartner Market Guide for AIOps Platforms, 2021)

This is due to the rapid growth in data volumes and a pace of change that cannot wait on humans to mine insights. IT data needs AI for insights, automation, and proactive actions.

The bulk of the data generated by applications that need operations management comes from multi-tiered distributed systems. Let us take an example of how much data is generated from a full-stack application.

An application of good to medium maturity, comprising 50–100 services and multiple client deployments, can generate 2,000–10,000 incidents in a year, of which close to 100 could lead to customer downtime, disruption or major outages. The outages can occur because of problems at the infrastructure layer (the server, network or storage), the middleware layer (the database or web server), or the application itself. The operational data generated around these outages can reach 300 GB per hour across logs and metrics. Traces help track calls across services and, when turned on, can generate 2–5 times more data. In addition, you need historical data about similar past problems, the resolutions that worked, and the associated logs and metrics to effectively troubleshoot the problem at hand.

AIOps is about applying AI to optimise IT operations management. It involves monitoring the IT data generated by business applications across multiple sources and layers of the stack, throughout the development, deployment and run lifecycles, for the purpose of generating insights. Sample insights that can be derived by monitoring and analysing this IT data include detecting anomalous behaviours, solving issues via root cause analysis of the detected anomalies, predicting outages before they cause revenue loss, and ultimately moving towards automated and autonomous remediation.

Observability is a prerequisite for AIOps. So is the collection and aggregation of data from multiple sources, which rests on the design principles and architecture of a big data system. Without these two functions in place, AIOps cannot be executed. In this blog we focus on analytics and AI, and on the net-new techniques needed to derive insights from the collected data.

When applied to applications, AIOps deals with maintenance of the full-stack application. The main artefacts here can be categorised along two dimensions:

  1. sources of the application data
  2. types of data emitted by each source

This is represented in Figure 1 below.

Figure 1

Servers, databases, applications, networks, and storage systems are all part of the multi-tiered architecture that AIOps needs to manage. These constitute the sources of data. Each source can emit different types of data, i.e. incidents, logs, metrics and traces. Incidents include human-generated descriptions of the issue being seen and the components it affects; sometimes they also contain actions, i.e. resolution comments on how the problem was solved and what did or did not work. Metrics reveal key health indicators of a component. Logs give a peek into the messages generated within a system, while traces reveal instances of API calls that span microservices.

Metrics need structured/time-series analysis, while incidents, logs and traces need both structured and unstructured data analysis.

Analytics across the stack needs techniques that can meaningfully derive insights from these data sources and lead to faster actions.

The cells of Figure 1 show samples of structured and unstructured data and where in the stack they are generated.

As this multitude of sources generates data from multi-tier systems, analytics are needed across sources and across tiers. The previous generation of AI software focused on machine learning/deep learning algorithms within a single layer of the stack or for one type of data source. There are many systems that specialise in anomaly detection for storage systems using metric anomalies, incident categorisation for server incidents, or root cause analysis for databases using transaction logs. Each has its own way of learning patterns and flagging deviant behaviour. With each layer flagging its anomalies in a disconnected manner, the result is information overload for what is really the same set of issues.
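To make this concrete, here is a minimal sketch of what one such single-source detector might look like: a rolling z-score check over a single metric series. The window size and threshold are arbitrary assumptions for illustration, not recommendations.

```python
from statistics import mean, stdev

def zscore_anomalies(values, window=30, threshold=3.0):
    """Flag points that deviate strongly from a rolling baseline.

    A minimal single-metric detector: each point is compared against the
    mean and standard deviation of the preceding `window` samples.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append((i, values[i]))
    return anomalies

# A disk-usage series that hovers around 40 % and then jumps towards 100 %.
series = [40 + (i % 3) for i in range(60)] + [98, 99, 100]
print(zscore_anomalies(series))   # the last three points are flagged
```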

However, many real issues manifest across the stack.

A storage pool that is over-subscribed may cause disk space to fill up, causing a database server to hang and the application to be unable to serve a POST request. Each of these anomalies can be captured through metrics, logs or traces in their independent layers of storage, database and application. Assuming each data source flags 10 anomalies for an ongoing issue, across the three layers (storage, database and application) and the three data sources (logs, metrics and traces), we would get 90 anomalies. However, when stitched together using the right associations, those 90 anomalies can point to just one root cause event. This is the power of cross-stack anomaly correlation, which groups anomalies across the layers and eventually points to the root cause.

The question, therefore, is how to exploit the right properties of the anomalies to correlate them and create causal links across these dependencies. The structured, semi-structured and completely unstructured data present in logs, traces and incidents contains descriptions of the problem, application behaviour details such as error codes and problem descriptions, server/virtual machine/container names, and trace IDs.

The challenge with unstructured data, however, is that it is voluminous and ambiguous. In this blog we assume that the right big data architecture is in place to handle the voluminous nature of logs and traces. Hence we focus on dealing with ambiguity.

Let us consider a set of anomalies generated across the application stack while an issue is brewing. We assume state-of-the-art anomaly detection techniques have flagged them across different layers of the stack. The problem now at hand is to pinpoint one root cause event.

  1. A1: Anomaly from an application log: Time T1: Component 10.1.150.63: Message: Error 4321 App pharmacy order unable to perform post — session timeout
  2. A2: Anomaly from a database log: Time T2: Component 10.1.150.51: Message: MSSQLSERVER(5084 — Server); Setting option OFFLINE for database payment_db
  3. A3: Anomaly from a server disk space metric: Time T3: Component 10.1.150.51: Message: File System /var/lib/mysql is full — Disk space free is 0 %
  4. A4: Anomaly from a storage available capacity metric: Time T5: Component 10.1.150.64: Message: vNX5200_QA Storage Pool_4 is oversubscribed and has used 98% of available capacity
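To keep the example concrete in the rest of this post, here is a minimal sketch of those four anomalies written as simple records. The field names (layer, source, time, component, message) are assumptions made for illustration rather than a standard schema.

```python
# Illustrative records for the anomalies above; the field names are
# assumptions made for this sketch, not a standard schema.
anomalies = [
    {"id": "A1", "layer": "application", "source": "log", "time": "T1",
     "component": "10.1.150.63",
     "message": "Error 4321 App pharmacy order unable to perform post - session timeout"},
    {"id": "A2", "layer": "database", "source": "log", "time": "T2",
     "component": "10.1.150.51",
     "message": "MSSQLSERVER(5084 - Server); Setting option OFFLINE for database payment_db"},
    {"id": "A3", "layer": "server", "source": "metric", "time": "T3",
     "component": "10.1.150.51",
     "message": "File System /var/lib/mysql is full - Disk space free is 0 %"},
    {"id": "A4", "layer": "storage", "source": "metric", "time": "T5",
     "component": "10.1.150.64",
     "message": "vNX5200_QA Storage Pool_4 is oversubscribed and has used 98% of available capacity"},
]
```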

What is the role of structured information in these anomalies, and what can we do with it?

Each of these anomalies has meta-data associated with it. The meta-data is mostly structured, comprising a timestamp (e.g., Time), the source emitter of the anomaly (e.g., Component), and so on. It could also include the container pods, virtual machines, process names, or parent services where the anomalies are generated. However, the presence of structured data is limited and depends on which aggregator tools are in use. Also, while correlating the structured data gives a good view of when and where things are happening, it does not give enough clues about the root cause of the issue.

Leveraging only the structured meta-data to pinpoint where the problem is, we could use (a) temporal correlations across the timestamps T1–T5 in the example above and (b) spatial relations across components, i.e. 10.1.150.63/51/64, or across service names or processes.
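A minimal sketch of both correlations, using only the structured meta-data: the concrete timestamps for T1–T5 and the small dependsOn topology map are invented for illustration, and the five-minute window is an arbitrary assumption.

```python
from datetime import datetime, timedelta
from itertools import combinations

# Invented timestamps for T1-T5 and a hypothetical topology map,
# used only to illustrate the two correlation mechanisms.
meta = {
    "A1": {"time": datetime(2021, 8, 11, 10, 4), "component": "10.1.150.63"},
    "A2": {"time": datetime(2021, 8, 11, 10, 2), "component": "10.1.150.51"},
    "A3": {"time": datetime(2021, 8, 11, 10, 1), "component": "10.1.150.51"},
    "A4": {"time": datetime(2021, 8, 11, 9, 58), "component": "10.1.150.64"},
}
# Hypothetical dependsOn relation: the database host's file system sits
# on the storage array.
depends_on = {"10.1.150.51": "10.1.150.64"}

WINDOW = timedelta(minutes=5)   # arbitrary correlation window

def correlated(a, b):
    """Link two anomalies if they are close in time or related in space."""
    temporal = abs(meta[a]["time"] - meta[b]["time"]) <= WINDOW
    ca, cb = meta[a]["component"], meta[b]["component"]
    spatial = ca == cb or depends_on.get(ca) == cb or depends_on.get(cb) == ca
    return temporal or spatial

print([(a, b) for a, b in combinations(meta, 2) if correlated(a, b)])
```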

However, the structured data is never rich enough to capture the different relations across the anomalies. In addition, it depends on other tools to create these structured labels or meta-data.

What does unstructured data bring in additionally?

Beyond the structured meta-data, the message portion of data types such as logs and incident tickets is unstructured and contains much richer information: error codes, error symptoms (e.g., file system is full), conditions (setting option offline), metrics embedded in the text that denote problem conditions, detailed component names, and regions. We can also use the metrics extracted from the unstructured text of anomalies A3 and A4. Each of these pieces of information can become an entity of interest in the domain.
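As a sketch of how such entities could be pulled out of a raw message, the snippet below uses a handful of regular expressions. The patterns are purely illustrative assumptions; real log and ticket formats vary widely, which is exactly the ambiguity problem discussed above.

```python
import re

# Illustrative extraction patterns; real IT data needs far richer extraction.
PATTERNS = {
    "error_code":  re.compile(r"\bError\s+(\d+)", re.IGNORECASE),
    "file_system": re.compile(r"(/[\w/.-]+)"),
    "percentage":  re.compile(r"(\d+)\s*%"),
    "database":    re.compile(r"database\s+(\w+)", re.IGNORECASE),
    "condition":   re.compile(r"\b(full|offline|oversubscribed|timeout)\b", re.IGNORECASE),
}

def extract_entities(message):
    """Return {entity_type: [values]} found in one unstructured message."""
    found = {}
    for name, pattern in PATTERNS.items():
        hits = pattern.findall(message)
        if hits:
            found[name] = hits
    return found

print(extract_entities("File System /var/lib/mysql is full - Disk space free is 0 %"))
# e.g. {'file_system': ['/var/lib/mysql'], 'percentage': ['0'], 'condition': ['full']}
```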

If we could represent the extracted information, i.e. the entities from different data types, in a spatio-temporal space, this would effectively capture what is happening in the entire operations stack across the layers, beyond where the symptoms are noted. Relationships across these entities can be captured by temporal dependencies (co-occurs, happens before, happens after, etc.) or by topological dependencies such as residing on the same host, processes running on virtual machines, or microservices calling each other via API calls.

Challenges: Linking unstructured data has been a long-studied problem made hard by the nuances of human language. While IT data is not as open-ended, complex, and nuanced as day-to-day human language, it can still contain non-standard ways of expressing the same content. For example, similar entities could be expressed in different ways, and different entities could seem similar when they are not. Because of the number of words in a log message, the feature space can potentially be very high-dimensional, so smart linking of entities and reduction of the feature space are essential pieces of this puzzle (a topic for a future article).

A possible solution: linking the rich data to pinpoint root causes

One possible way to represent this contextual information about anomalies happening at different layers of the stack is an issue-context graph. The nodes of the graph could be entities extracted from the structured and unstructured data in the anomalies. The relations between the nodes could be temporal or spatial.

When anomalies occur within the same time window, they can be related using temporal associations; Step 2 shows such an association between A1 and A3. When entity mentions within anomalies refer to the same component (say, via a sameAs relationship in the topology), the same type of component (say, via an isA relationship), or related components (say, via topological/geographical dependencies such as runsOn or dependsOn), these are called spatial associations; Step 3 shows such associations between A2 and A3, and between A3 and A4.

In other words, the anomalies across the stack can be connected in the spatio-temporal space as a graph that represents the brewing issue. This is the issue-context graph. The source anomalies could come from multiple systems specialised in detecting anomalies on metrics, logs, traces, etc. Representing all the data from the anomalies as a graph creates a path for different learning algorithms to be applied. Unstructured data brings a richness to the graph that would not have been possible with structured meta-data alone.
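Here is a minimal sketch of how such an issue-context graph could be assembled for anomalies A1–A4. It uses the networkx library purely for illustration; the entity node names, the edge labels, and the topology behind the sameHost/dependsOn edges are assumptions made for this example.

```python
import networkx as nx

G = nx.Graph()

# Nodes: the anomalies plus entities extracted from their messages
# (entity names shortened for readability).
G.add_nodes_from(["A1", "A2", "A3", "A4"], kind="anomaly")
G.add_nodes_from(["session timeout", "payment_db OFFLINE",
                  "/var/lib/mysql full", "Storage Pool_4 oversubscribed"],
                 kind="entity")

# Each anomaly mentions the entity extracted from its message.
G.add_edge("A1", "session timeout", relation="mentions")
G.add_edge("A2", "payment_db OFFLINE", relation="mentions")
G.add_edge("A3", "/var/lib/mysql full", relation="mentions")
G.add_edge("A4", "Storage Pool_4 oversubscribed", relation="mentions")

# Temporal association (Step 2): A1 and A3 fall in the same time window.
G.add_edge("A1", "A3", relation="co-occurs")

# Spatial associations (Step 3): A2 and A3 share a host; the host's file
# system is backed by the over-subscribed storage pool reported in A4.
G.add_edge("A2", "A3", relation="sameHost")
G.add_edge("A3", "A4", relation="dependsOn")

print(list(G.edges(data="relation")))
```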

The issue-context graph has the advantage of being the base for many graph algorithms that can be run as downstream analytics on it. Some use cases include, but are not limited to, the following:

  • Community detection/graph partitioning algorithms to analyse patterns across issues
  • Centrality algorithms that point to the probable root cause (see the sketch after this list)
  • Link prediction and path-finding techniques that help in outage prediction
  • Similarity algorithms, useful for finding similar issues
  • Graph embeddings that create learned representations of these issues
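As a sketch of the first two use cases on the toy issue-context graph built above (rebuilt here as a plain edge list so the snippet is self-contained), degree centrality surfaces a candidate epicentre and community detection groups the nodes of one brewing issue. Both are illustrative heuristics under the assumptions of this example, not a full root cause method.

```python
import networkx as nx

# The toy issue-context graph from the previous sketch, as an edge list.
edges = [("A1", "A3"), ("A2", "A3"), ("A3", "A4"),
         ("A1", "session timeout"), ("A2", "payment_db OFFLINE"),
         ("A3", "/var/lib/mysql full"), ("A4", "Storage Pool_4 oversubscribed")]
G = nx.Graph(edges)

# Centrality as one simple root-cause heuristic: the most central anomaly
# is a candidate epicentre. On this toy graph it surfaces A3 (the full
# file system), one hop from the true storage-pool cause A4, so causal
# direction from the topology would still be needed to finish the job.
centrality = nx.degree_centrality(G)
anomaly_nodes = ["A1", "A2", "A3", "A4"]
print(max(anomaly_nodes, key=centrality.get))   # -> A3

# Community detection: when several issues brew at once, each community
# would group the anomalies and entities belonging to one issue.
print(list(nx.algorithms.community.greedy_modularity_communities(G)))
```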

In summary, unstructured data carries a lot of context and meta-data which, when harnessed in a disciplined way, can yield more insights for troubleshooting and preventing failures of full-stack applications.
