AIOps — The new member in the DevOps Family

Soumyajit Dutta
Jul 13, 2018 · 11 min read

AIOps is the application of artificial intelligence for IT operations. It is the future of ITOps, combining algorithmic and human intelligence to provide full visibility into the state and performance of the IT systems that businesses rely on.

Successful digital transformation relies on AIOps to enable IT to operate at the speed that modern business requires. You can’t manage today’s dynamic, constantly changing IT landscape with yesterday’s tools; IT needs an AI platform for the next decade.

The ongoing evolution of IT infrastructure models — moving from static and predictable physical systems to software-defined resources that change and reconfigure on the fly — demands equally dynamic technology and processes for its management.

As network infrastructures evolve, old model-based systems take more and more effort to maintain, yet still fall further and further behind.

AIOps uses machine learning and data science to give IT operations teams a real-time understanding of any issues affecting the availability or performance of the systems under their care. Gartner first defined the term in 2016, positioning it at the intersection of monitoring, service desk, and automation.

What’s Driving AIOps?

The promise of Artificial Intelligence has been to do what humans do but do it better, faster, and at scale. AIOps will do this for IT Operations by addressing the speed, scale, and complexity challenges of digital transformation, including:

  • The difficulty IT Operations has in manually managing its infrastructure. It’s becoming a misnomer to use the term “infrastructure” here, as modern IT environments include managed cloud, unmanaged cloud, third-party services, SaaS integrations, mobile, and more. Traditional approaches to managing complexity don’t work in dynamic, elastic environments. Tracking and managing this complexity through manual, human oversight is no longer possible. Current IT Ops technology is already beyond the scope of manual management and it will only get worse in the coming years.
  • The amount of data that IT Ops needs to retain is exponentially increasing. Performance monitoring is generating exponentially larger numbers of events and alerts. Service ticket volumes experience step-function increases with the introduction of IoT devices, APIs, mobile applications, and digital or machine users. Again, it is simply becoming too complex for manual reporting and analysis.
  • Infrastructure problems must be responded to at ever-increasing speeds. As organizations digitize their business, IT becomes the business. The ‘consumerization’ of technology has changed user expectations for all industries. Reactions to IT events, whether real or perceived, need to occur immediately, particularly when an issue impacts user experience.
  • More computing power is moving to the edges of the network. The ease with which cloud infrastructure and third-party services can be adopted has empowered line of business (LOB) functions to build their own IT solutions and applications. Control and budget have shifted from the core of IT to the edge. More computing power (that can be taken advantage of) is being added from outside core IT.
  • Developers have more power and influence but accountability still sits with core IT. In DevOps organizations, programmers take more monitoring responsibility at the application level, but accountability for the overall health of the IT ecosystem and the interaction between applications, services, and infrastructure remains the province of core IT. IT Ops is taking on more responsibility just as digital businesses are getting more complex.

How Does AIOps Work and What Are Its Components?

  • Extensive and diverse IT data sources, from currently siloed tools and IT disciplines such as events, metrics, logs, job data, tickets, monitoring, etc.
  • A modern big data platform that permits real-time processing of streaming IT data. Examples include Hadoop 2.0, Elastic Stack, and some Apache technologies.
  • Rule application and pattern recognition that leverage and/or discover context while uncovering regularities and norms in the data. These can be, but don’t have to be, specific to the domain.
  • Domain algorithms that leverage IT domain expertise (specific to one environment or at the industry level) to intelligently interpret and apply the rules and patterns, as dictated by an organization’s data and its desired outcomes. These algorithms make it possible to achieve IT specific goals like eliminating noise, correlating unstructured data, establishing baselines, alerting on abnormalities, and identifying the probable cause.
  • Machine learning that can automatically alter existing algorithms or create new ones, based on the output of algorithmic analysis and new data introduced into the system.
  • Artificial intelligence that can adapt to the new and unknown in an environment.
  • Automation, which uses the outcomes generated by the machine learning and/or AI to automatically create and apply a response or improvement for identified issues and situations.

AIOps works with existing data sources, including traditional IT monitoring, log events, application and network performance anomalies, and more. All data from these source systems is processed by a mathematical model that identifies significant events automatically, without requiring laborious manual pre-filtering. A second layer of algorithms then analyses these events to identify clusters of related events that are all symptoms of the same underlying issue.
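As a rough illustration of these two layers, the Python sketch below first scores events by how rare their type is (a simple information measure standing in for the mathematical model) and then groups the surviving events by service and time proximity. The Event fields, the 2-bit cutoff, and the 120-second window are assumptions made for this example and do not describe any particular AIOps product.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    timestamp: float  # seconds; hypothetical field
    service: str      # affected service; hypothetical field
    kind: str         # event type, e.g. "heartbeat"; hypothetical field


def significant(events: List[Event], min_bits: float = 2.0) -> List[Event]:
    """Layer 1: keep events whose type is rare in the stream."""
    if not events:
        return []
    counts = Counter(e.kind for e in events)
    total = len(events)
    # Surprise in bits: rare event types score high, routine ones near zero.
    return [e for e in events if -math.log2(counts[e.kind] / total) >= min_bits]


def cluster(events: List[Event], window: float = 120.0) -> List[List[Event]]:
    """Layer 2: group events on the same service that occur close together."""
    clusters: List[List[Event]] = []
    for e in sorted(events, key=lambda ev: ev.timestamp):
        for c in clusters:
            last = c[-1]
            if last.service == e.service and e.timestamp - last.timestamp <= window:
                c.append(e)
                break
        else:
            # No existing cluster matched, so this event starts a new one.
            clusters.append([e])
    return clusters
```

Given nine routine heartbeats and three rare events, only the three rare events pass the first layer, and two database events that occur 30 seconds apart end up in the same cluster. A real platform would use far richer signals than event type and service name, but the shape of the pipeline is the same.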

This algorithmic filtering massively reduces the noise level that IT operations teams would otherwise have to deal with, and also avoids the duplication of work that can occur when redundant tickets are routed to different teams. Instead, virtual teams can be assembled on the fly, enabling different specialists to “swarm” around an issue that spans across technological or organizational boundaries. Existing ticketing and incident management systems can take advantage of AIOps capabilities, integrating directly into existing processes.

AIOps also improves automation, by enabling workflows to be triggered with or without human intervention. ChatOps capabilities make existing automation and orchestration functionality available as an integral part of the normal collaborative diagnostic and remediation process. As machine-learning systems become more and more accurate and reliable, it becomes possible for routine and well-understood actions to be triggered without human intervention, potentially resolving issues before users are impacted or even aware of any problem.
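One way to picture the choice between triggering a workflow with or without human intervention is a confidence gate. The sketch below is a minimal example under stated assumptions: the Diagnosis class, the runbook names, and the 0.9 threshold are all hypothetical, and a production system would add approvals, auditing, and rollback.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Diagnosis:
    issue: str         # hypothetical label produced by the analysis layer
    confidence: float  # model confidence in the range 0.0 to 1.0


def decide(
    diagnosis: Diagnosis,
    runbooks: Dict[str, str],
    auto_threshold: float = 0.9,
) -> Tuple[str, Optional[str]]:
    """Return (mode, runbook) where mode is 'auto', 'suggest', or 'escalate'."""
    runbook = runbooks.get(diagnosis.issue)
    if runbook is None:
        # Unknown issue: nothing well understood to run, hand over to a human.
        return ("escalate", None)
    if diagnosis.confidence >= auto_threshold:
        # Routine, well-understood issue: run without human intervention.
        return ("auto", runbook)
    # Known issue but lower confidence: propose the runbook to an operator.
    return ("suggest", runbook)
```

For example, with a runbook table of {"disk_full": "rotate_logs"}, a diagnosis of "disk_full" at 0.95 confidence returns ("auto", "rotate_logs"), the same diagnosis at 0.6 returns ("suggest", "rotate_logs"), and an unrecognised issue returns ("escalate", None).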

How Does AI Help Human Operators?

The pace and volume of change demand automation of routine tasks, to preserve valuable human intelligence for less frequent, unpredictable, and high-value activities. AIOps combines automation of tactical activities with strategic oversight by expert users, instead of wasting the time and expertise of skilled IT Operations personnel on “keeping the lights on”.

The “AI” in AIOps does not mean that human operators will be replaced by automated systems. Instead, humans and machines operate together, with algorithms augmenting human capabilities and enabling them to focus on what is meaningful.

How to Integrate AIOps with your Current Tools

AIOps integrates with existing tools and processes, bringing together information, insights, and capabilities that were previously locked in disconnected islands. Companies are using multiple different monitoring tools in different places and for different purposes. Each one is valuable to a specific team or function, but that value is not easily available to other interested parties. Instead of engaging laborious tool rationalisation initiatives that try to shoehorn individual needs into one-size-fits-all solutions, AIOps enables individual tools to thrive by delivering seamless shared visibility across all tools, teams, and domains.
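Shared visibility across tools usually starts with translating each tool's alerts into one common shape. The sketch below assumes two made-up tools, "tool_a" and "tool_b", with invented payload layouts; the field names and the severity mapping are hypothetical and only illustrate the adapter idea.

```python
from typing import Any, Callable, Dict

Alert = Dict[str, Any]


def from_tool_a(raw: Dict[str, Any]) -> Alert:
    # Hypothetical tool_a payload: flat keys with a numeric level.
    return {
        "origin": "tool_a",
        "service": raw["host"],
        "severity": raw["level"],
        "message": raw["msg"],
    }


def from_tool_b(raw: Dict[str, Any]) -> Alert:
    # Hypothetical tool_b payload: nested resource and a textual priority.
    severity_map = {"LOW": 1, "MEDIUM": 3, "HIGH": 5}
    return {
        "origin": "tool_b",
        "service": raw["resource"]["name"],
        "severity": severity_map[raw["priority"]],
        "message": raw["summary"],
    }


# Each team keeps its own tool; only a small adapter is registered here.
ADAPTERS: Dict[str, Callable[[Dict[str, Any]], Alert]] = {
    "tool_a": from_tool_a,
    "tool_b": from_tool_b,
}


def normalize(origin: str, raw: Dict[str, Any]) -> Alert:
    """Convert a tool-specific payload into the shared alert schema."""
    return ADAPTERS[origin](raw)
```

The design choice here mirrors the point made above: the individual tools are left alone, and the common view is built on top of them instead of forcing every team onto one product.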

In the same way, AIOps improves and enables ITSM by ensuring that only real, actionable incidents are created and avoiding duplication. There is no need to discard the experience embedded in each organisation’s ITIL-based processes. Instead, AIOps addresses and removes many of the frustrations that users have with ITSM, due to the inherently sequential nature of ITIL.

Finally, AIOps brings automation into the fold as well, integrating orchestration and run books and making them directly available to operators as partial or full automation. IT organizations have typically developed large libraries of automated solutions over the years, but need to ensure that they are triggered only by the correct conditions. AIOps ensures that this is the case, minimising risk and maximising value of existing investments in automation.

What are the Benefits of AIOps?

The main benefit of adopting AIOps is that it sets IT Operations up to operate with the level of speed and agility that end users expect and require. Reliance on brittle model-based processes, increasing specialization into disconnected silos, and above all, too much repetitive manual activity have made it difficult for IT Ops to keep up with the ever-increasing pace and volume of demands on their time.

Advanced machine learning captures useful information in the background and makes it available in context to further improve the handling of future situations.

What You Need to Know About AI & Machine Learning

The AI in AIOps is not general intelligence. Instead, it is a set of specialized algorithms, each narrowly focused on a specific task. Different algorithms can pick out significant alerts from a noisy event stream, identify correlations between alerts from different sources, assemble the correct team of human specialists to diagnose and resolve a situation, propose probable root causes and possible solutions based on past experiences, and learn from feedback in order to improve continuously over time.

Clustering and correlation is the most complex and crucial step, requiring multiple different approaches. A combination of historical pattern-matching and real-time identification helps IT Ops teams to identify both recurring and net-new issues. Raw monitoring events may be enriched by reference to an external data source, where available; this enrichment helps to deliver better correlation, as well as service impact information.
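The enrichment step can be sketched in a few lines. In the example below, a plain dictionary stands in for the external data source (something like an asset inventory); the keys "host", "service", and "owner" are assumptions chosen for illustration.

```python
from typing import Any, Dict


def enrich(event: Dict[str, Any], inventory: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Return a copy of the event with service and owner details attached."""
    record = inventory.get(event["host"], {})
    enriched = dict(event)  # copy, so the raw monitoring event stays untouched
    # Fall back to placeholder values when the external source has no entry.
    enriched["service"] = record.get("service", "unknown")
    enriched["owner"] = record.get("owner", "unassigned")
    return enriched
```

Once two raw events from different hosts carry the same "service" value, correlating them becomes much easier, and the service name also gives a first indication of business impact.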

AIOps Key Features

Gartner’s Market Guide for AIOps Platforms lists eleven key requirements for AIOps platforms. To be truly valuable, an AIOps platform should have strong capabilities in all of these areas. Single-purpose tools will only be useful for very narrowly defined use cases.

  • Stored: ingestion and indexing of historical data
  • Streaming: capture, normalization, and analysis of real-time data
  • Logs: capture and preparation of text data from log files generated by software or hardware
  • Metrics: data to which time series and more general mathematical operations can be immediately applied
  • Wire Data: packet data, including protocol and flow information, captured and made available for access and analysis
  • Document Text Data: ingestion, parsing, and syntactical and semantic indexing of human readable documents
  • Automated Pattern Discovery and Detection: the ability to identify mathematical or structural patterns within data streams that describe correlations, which can then be used to identify future incidents
  • Anomaly Detection: the use of patterns to first determine what constitutes normal system behavior, and then to identify departures from that normal system behavior
  • Causal Analysis: root cause determination, using automated pattern discovery to isolate genuine causal relationships and guide operator intervention
  • On Premises: capabilities defined above can be delivered on customers’ premises, without requiring access to any remote components
  • Cloud: capabilities defined above can be delivered in the cloud, without requiring on-premises installation of any components

Only solutions capable of ingesting all of these data types, applying these different types of analysis, and being deployed according to customers’ requirements, are considered to satisfy all of Gartner’s requirements for AIOps platforms.
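To make the anomaly detection requirement above more concrete, here is a minimal sketch that learns "normal" from a trailing window of a metric and flags values that depart from it. It uses a plain z-score, which is only one of many possible techniques; the window size of 20 and the threshold of 3 standard deviations are assumptions for this example.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(
    values: Sequence[float],
    baseline_size: int = 20,
    threshold: float = 3.0,
) -> List[int]:
    """Return indexes of values that deviate strongly from the recent baseline."""
    anomalies: List[int] = []
    for i in range(baseline_size, len(values)):
        # Step 1: describe normal behaviour using the preceding window.
        window = values[i - baseline_size:i]
        mu, sigma = mean(window), stdev(window)
        # Step 2: flag a departure from that baseline.
        # A perfectly flat window (sigma == 0) is skipped in this sketch.
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

For a CPU metric that hovers around 50 for twenty samples and then jumps to 95, the function reports the index of the jump. The samples that follow are not flagged, because they sit close to the baseline again.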

Who is Using AIOps?

Companies with extensive IT environments, spanning multiple technology types, are already facing issues of complexity and scale. When those are compounded by a business model that is heavily dependent on IT, AIOps can make a huge difference to the success of the company. Though these organizations may be in many different industries, they share a common scale, and a rapid and accelerating rate of change, as the need for business agility in turn creates more and more demand for IT agility.

DevOps Teams

Companies that are adopting a DevOps model, or have already done so, can struggle to maintain alignment between the different roles involved. Direct integration of Dev and Ops systems into an overall AIOps model smooths away much of the friction that can occur at that interface. By ensuring that Dev teams have a better understanding of the state of the environment, and in turn that Ops have full visibility of when and how developers are making changes and deployments into production, this holistic view ensures the success of the overall project and the achievement of its goals of increased agility and responsiveness.

Cloud Computing

A move to cloud computing can bring its own challenges, especially at scale, where it may not be possible (or desirable) to move IT wholesale to the cloud. These hybrid models, incorporating various forms of IT infrastructure delivery, can be hard to operate. By delivering a holistic view across all infrastructure types, and helping operators to understand relationships that change too quickly to be documented, AIOps removes much of the risk from the operation of a hybrid cloud platform.

Digital Transformation

Digital transformation initiatives can be defined in many different ways, but one common factor is a requirement for more speed and agility. This is a business requirement, but IT needs to be able to operate at the speed that the business requires if it is not to become a bottleneck, preventing the achievement of the wider goals. AIOps removes much of the friction that can otherwise prevent IT from delivering the level of IT support that successful digital transformation projects require.

Where does AIOps Fit into the Modern IT Environment?

When looking at AIOps for the first time, it is not immediately obvious how it fits into existing categories of tools. The reason is that AIOps does not replace existing monitoring, log management, service desk, or orchestration tools. Instead, it sits at the intersection of these different domains, consuming and integrating information across all of them and providing useful output to ensure a synchronized picture is available from every tool.

These tools are each valuable in their own right, but it can be hard to access the right piece of information at the right time, as long as they remain disconnected. Hard-coded integration logic struggles to keep pace with the rate of change of modern IT environments. AIOps provides a much more flexible approach to assembling all of these different partial views into a single comprehensive understanding of what is actually important for IT Ops teams to know about.

What next for IT Workers like us?

IT Ops personnel have been slow to adapt to AIOps-like environments because, out of necessity, our jobs have always been more conservative. It’s IT Ops’ job to make sure the lights stay on and to provide stability for the infrastructure that organizational applications ride on. However, due to the trends listed above, more IT Ops shops (especially those in the Enterprise) will need to implement AIOps strategies and technologies in the near future.


Faun

The Must-Read Publication for Aspiring Developers & DevOps Enthusiasts

Written by Soumyajit Dutta

I am a professional backend and DevOps developer. I mainly work on NodeJS and Python based backends. On the cloud, my forte is AWS and Azure.
