AIOps — The Premise, Promise and the Prediction

Masaf Dawood
Published in The Startup
9 min read · Feb 22, 2020
Image Courtesy of Pixabay

Artificial Intelligence (AI) in general needs no media blitz. IT Operations (ITOps), on the other hand, is the least sexy and most often overlooked group within technology operations. However, the marriage of the two has created significant visibility for ITOps and elevated its profile. There has been significant and renewed interest in the holy grail of zero downtime and the efficient running of ITOps. Historically, ITOps has been all about keeping the lights on and making sure that the infrastructure and the applications running on it perform as expected. This post focuses on understanding the context, evolution and direction of AIOps, while a subsequent post will take a deep dive into the current market offerings.

“We are what we repeatedly do. Excellence, then, is not an act, but a habit.” Will Durant, paraphrasing Aristotle

What is this rage about IT operations and keeping the lights on? First, there is the burn rate of ITOps, which has been rising steadily and consumes up to 70% of the budget; this is no longer sustainable. Second, with the rise of digital-first enterprises and apps, point tools and platforms are unable to handle the required response times, scale and inherent complexity. The measurements built into them are more binary than granular (not to mention that they completely miss user experience). The compound effect is a muted and compromised client experience. When your end user calls the help desk to say that the app is not working, the database is down, and they can’t complete transactions or generate the Tableau report, the credibility of the ITOps team takes a hit. This is despite heavy spending on infrastructure, monitoring systems, and tools to keep the lights on. In a 2018 survey of 200 CIOs (at enterprises with more than $1B in revenue), Everest Group found:

  • 71% of the enterprises believed they lacked a meaningfully scalable model for infrastructure growth.
  • 73% of the respondents had identified intelligent automation as a key theme for infrastructure management as part of their broader IT adoption strategy.

Digital transformation often leads to a hybrid environment (at least in the interim), and that requires two different sets of tools, processes and response thresholds. This has set the stage for the rise of artificial intelligence for IT operations, or AIOps. At least that’s the premise.

Systems were simple, siloed and segmented

As we transitioned away from a self-hosted back office with localized operations, we were still in back-office mode as far as most end customers were concerned. Most of the productivity and efficiency gains were targeted at internal systems and rarely crossed the hard line to the end customer. The advent and acceleration of cloud computing (2000s) laid the foundation of this “Digital Divide” between the legacy and current states.

As ubiquitous connectivity and high-speed internet became mainstream, the pace of transformation picked up further with the advent of the app economy, rich media uptake and mobile transformation. For the first time, consumers were ahead of the enterprise and in control of technology adoption, uptake and industrialization. The elements of the digital economy started to emerge piecemeal from the dotcom bubble (2000s), with e-commerce players and distinct trends laying the foundations of modern-day customer experience and interaction. In 1999, Amazon patented its 1-Click service, which allows users to make faster purchases. The fundamentals and foundational drivers of digital transformation started to emerge rapidly with e-commerce adoption and acceptance.

While the freemium model, as discussed in Harvard Business Review (HBR), was very well received by consumers of services and software, the exchange of goods and services needed platforms on which to transact with trust. The ability to transact and collaborate very quickly replaced the 9–5 business day with a 24x7 model. Interaction in this new medium was primarily digital, and the existing suite of applications and infrastructure was not able to support and scale this new opportunity. Rather, legacy architecture stifled rapid growth for many enterprises, and some were driven out of business not by competition but by their own inaction.

Descent into Chaos … Local Scale to Planetary Scale

This was further compounded by the explosion in apps, interfaces, and devices. Let us look at Facebook: more than 1.39 billion people connect to Facebook’s infrastructure per month, of which 1.19 billion are on mobile alone. Nearly 1 billion photos are shared and more than 3 billion videos are viewed every day. Facebook’s services run on top of hundreds of thousands of servers spread across multiple geographically separated data centers. None of this can be managed at human scale! Things will break and require care and feeding at a rate much faster than eyeballs can follow and hands can hit the keyboard. The evolution from on-prem/local to web scale and now to planetary scale requires an out-of-the-box, unconventional approach to reacting and responding to events. Uptime and digital user experience are the lifeline of customer service, which would otherwise have degraded into chaos. Thanks to data analytics and machine learning technologies, we have a possible breakthrough here. This would not have been possible were it not for Google and its teams fine-tuning and industrializing the Site Reliability Engineering (SRE) approach at scale. The focus of this post is not to cover SRE, but to highlight the emergence of data science and machine learning and the possibilities that emerged as a result. That is the promise!

The Promise … Single Pane of Glass (SPOG)

Gartner research published in 2019 on augmenting decision making in DevOps states, “The growing need for organizations to analyze vast volumes of data in enabling rapid application delivery makes manual decision making a key bottleneck in DevOps. I&O leaders must leverage AI techniques to make data-driven decisions and automate actions to ensure business agility and stability”. Gartner estimates that only 5% of all large enterprises are currently combining big data and machine learning (the heart of an AIOps platform) to support and partially replace monitoring, service desk, and automation processes and tasks. However, Gartner expects that number to jump to 40% of all large enterprises by 2022. If this comes true, AIOps will create a massive shift in IT operations methodology and spending, and it benefits everyone to understand which vendors, products, and services make up the AIOps marketplace. Will there be a single pane of glass (SPOG)? Perhaps as likely as finding a unicorn in your neighborhood park! The silos between network, infrastructure, apps, servers, databases, security and end-user computing are deep, diverse and well fortified. Instead, the focus should be on breaking the silos by using business process availability (BPA) and, subsequently, digital experience monitoring (DEM) as key metrics. This becomes possible if we pivot to event-based viewing, dynamic discovery, real-time mapping, and event correlation, rather than triage organized around individual monitoring tools or resolver roles. The building blocks of a single pane of glass are accurate telemetry, aggregation of large amounts of data, minimal human input in the acknowledge-react loop, and noise reduction using algorithmic clustering.
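To make "noise reduction using algorithmic clustering" a little more concrete, here is a minimal sketch in Python. It is not how any particular AIOps product works: the `Alert` fields, the `cluster_alerts` name and the 120-second window are all assumptions made for illustration. The sketch simply folds alerts for the same service into one cluster while they keep arriving close together in time.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    service: str      # hypothetical field: the business service the alert maps to
    message: str


def cluster_alerts(alerts: List[Alert], window_seconds: float = 120.0) -> List[List[Alert]]:
    """Group alerts per service; an alert joins the open cluster for its
    service if it arrives within window_seconds of that cluster's last alert."""
    clusters: List[List[Alert]] = []
    open_cluster: Dict[str, List[Alert]] = {}  # service -> most recent cluster
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_cluster.get(alert.service)
        if current is not None and alert.timestamp - current[-1].timestamp <= window_seconds:
            current.append(alert)
        else:
            current = [alert]
            clusters.append(current)
            open_cluster[alert.service] = current
    return clusters


if __name__ == "__main__":
    storm = [
        Alert(0, "db", "replication lag"),
        Alert(30, "db", "connection pool exhausted"),
        Alert(40, "web", "5xx rate high"),
        Alert(500, "db", "disk usage high"),
    ]
    for group in cluster_alerts(storm):
        print(group[0].service, [a.message for a in group])
```

A real platform would likely cluster on topology and text similarity as well as time, but even this simple grouping shows how an operator could be handed a few events instead of every raw alert.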

Data as Crystal Ball Into Future State

AIOps is to service management what AI is to enterprise data. AIOps allows us to move from reactive to preventive and finally to predictive operations. The goal of AI and its current advances is to apply tools and techniques to data to prevent the inevitable, to predict future possibilities within the context of the use case, and to help optimize the performance of business processes and functions. We have collected a ton of data and stored it successfully and safely in the proverbial mile-deep vaults, but we haven't had the time or the tools to analyze it and use it as a crystal ball for making predictions. This is primarily ticket data, but it can be expanded to include all IT operations artifacts, including logs, root cause analyses (RCAs), runbooks, monitoring alerts, notifications, etc. That this data can help predict the future state of system health is the premise of AIOps!

AIOps Macro Trends — Possible Use Cases

While we are still far from realizing the benefits of full automation and a move towards NoOps, the following patterns have emerged as possible use cases that not only address the low-hanging fruit but also provide a foundation for building AI/ML-based IT operations practices.

  1. Prevent and Predict: An emerging use case is to predict the failure of the DevOps pipeline based on the release history, the magnitude of changes, the complexity of the build, etc. This avoids the toll of downtime as well as expensive regression testing.
  2. Anomaly/threat detection: Once the baseline behavior of the system is established, the AIOps tool watches for variance and flags outliers as they appear. This makes AIOps a valuable addition to a strong security management posture: heuristics and algorithms can mine network traffic for threats that could take out a network. If the anomalies turn out to represent a new baseline, the mechanism can update and revise its thresholds dynamically. This use case is gaining wider traction due to the rapid growth of cloud computing workloads.
  3. Event Correlation: Infrastructure teams face floods of alerts, yet only a handful are business-impacting. AIOps can mine these alerts, use inference models to group them together, and identify the upstream root-cause issues at the core of the problem. Often when an event occurs, multiple monitoring systems generate alert storms and users also open related tickets; all of these can then be triaged and tracked as one event.
  4. Intelligent alerting and escalation: After root-cause alerts and issues are identified, ITOps teams are using artificial intelligence to automatically notify subject matter experts or teams of incidents for faster remediation. Artificial intelligence can act like a routing system, immediately setting the remediation workflow in motion before a human being ever gets involved.
  5. Incident auto-remediation: AIOps is also being used as an end-to-end bridge between ITSM and IT operations. Traditionally, ITSM teams sift through infrastructure data to identify and remediate issues at the root cause. AIOps extracts root cause inferences from infrastructure alerts and sends them to an ITSM team or tool through API integration pathways.
  6. Capacity optimization: This can also include predictive capacity planning, and refers to the use of statistical analysis or AI-based analytics to optimize application availability and workloads across infrastructure. These analytics can proactively monitor raw utilization, bandwidth, CPU, memory and more, and help increase overall application uptime.
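The anomaly detection pattern in the list above (establish a baseline, flag outliers, let the baseline move when the "anomaly" turns out to be the new normal) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RollingBaseline` class, the window size and the three-sigma threshold are hypothetical choices, and a simple rolling z-score stands in for whatever models a real AIOps tool would use.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Flags values that deviate strongly from a rolling window of recent values."""

    def __init__(self, window: int = 30, threshold: float = 3.0, min_samples: int = 5):
        self.values = deque(maxlen=window)  # oldest samples fall out automatically
        self.threshold = threshold          # allowed distance from the mean, in standard deviations
        self.min_samples = min_samples      # do not judge until a baseline exists

    def observe(self, value: float) -> bool:
        """Return True if value is an outlier against the current baseline,
        then fold the value into the baseline."""
        is_anomaly = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:
                is_anomaly = abs(value - mu) > self.threshold * sigma
            else:
                is_anomaly = value != mu
        # Every sample is kept, so a sustained shift gradually becomes the new baseline.
        self.values.append(value)
        return is_anomaly


if __name__ == "__main__":
    detector = RollingBaseline(window=10)
    cpu_percent = [50, 51, 49, 50, 52, 50, 51, 95, 95, 95]
    for sample in cpu_percent:
        print(sample, detector.observe(sample))
```

Because outliers are folded into the window rather than discarded, a one-off spike is flagged while a lasting change in load stops being reported once it fills the window. That is a deliberate trade-off: it keeps thresholds dynamic, at the cost of eventually accepting a slow-burning problem as normal.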

AIOps — The Path Forward

As complexity continues to mount, failure potential increases exponentially while the pressure builds for IT teams to deliver business services with minimal/zero downtime.

“The journey of a thousand miles begins with one step”. Lao Tzu

AIOps is emerging as a leading-edge discipline for addressing operational issues and, in doing so, focusing effectively on true auto-remediation and root-cause (rather than repeat) resolutions. IT leadership and operations teams have started to recognize the potential and are carefully integrating the use cases into their operating models, with some early proofs of concept/proofs of value (POC/POV) delivering promising results. The competitive advantage of adopting and embracing AIOps comes not purely from resource-unit/opex savings; it also has the potential to bring continuous innovation to the enterprise. The ability to predict and future-proof the ITOps lifeline is very intriguing and offers a tactical use case. The potential of a rich payout and competitive advantage is very compelling. An ecosystem of startups and established players has emerged with distinct, use-case-focused offerings that bring multiple approaches to solving this complex problem. Some of the key players in the emerging AIOps space are StackState, OpsRamp, Opsani, Dynatrace, ScienceLogic, Moogsoft, BigPanda, SignalFx and DarwinAI. A subsequent post will cover the individual vendor capabilities and product focus areas.

References

  1. https://www2.everestgrp.com/reportaction/EGR-2018-29-V-2747/Marketing
  2. https://hbr.org/2014/05/making-freemium-work
  3. https://knowledge.wharton.upenn.edu/article/amazons-1-click-goes-off-patent/
  4. https://landing.google.com/sre/
  5. https://www.gartner.com/smarterwithgartner/how-to-get-started-with-aiops/

DISCLOSURE STATEMENT: Opinions are those of the individual author. Unless noted otherwise in this post, the author is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.
