AIOps for AWS CloudWatch

Published in AIOps Blog (Moogsoft AIOps) · 4 min read · Mar 15, 2019

Author: Itai Njanji

Customers migrating workloads to Amazon Web Services (AWS) can choose from a very wide variety of tools to monitor their infrastructure. These tools generate valuable insights from the events and alarms emitted by services such as Amazon CloudWatch and AWS Config. Customers can derive further insights by applying machine learning techniques to this data, which helps them reduce operational incidents and improve service quality.

When Traditional Ops Fail

Modern applications that are being re-platformed or re-architected on AWS require modern techniques to operate them. Traditionally, support teams were staffed more or less in direct proportion to the number of tickets generated. This outdated approach does not scale well with the distributed nature of the cloud or with the rate of innovation that cloud computing enables. Simply put, the number of incidents surfacing in AWS CloudWatch, and in all of the other tools already in use, can be very hard to predict, which makes the expected incident count a much less dependable metric for hiring and planning in IT Operations.

One traditional approach to managing the number of tickets is extensive manual curation of alarms. For instance, a common practice is to suppress alarms that users do not expect to need on a regular basis. The risk is that rare but important alerts or insights are missed precisely because the conditions in which they occur are unexpected. The machine learning techniques gathered under the umbrella of Artificial Intelligence for IT Operations, or AIOps, offer a new way to think about reducing the number of tickets and providing remediation advice to busy operators.

Automating anomaly detection in this way helps operations management teams separate signal from noise, surfacing significant events together with all of the context required to accelerate root cause analysis and incident resolution. Combining CloudWatch logs with metric data from existing monitoring tools and custom metrics developed in-house gives operations teams full visibility into the true extent of technical issues and their business impact. By analyzing very large volumes of data, AIOps can detect early warning signs of incidents and route them to the right specialists, preventing incidents or minimizing their impact on end users.
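To make the idea of separating signal from noise concrete, here is a minimal sketch of one very simple anomaly detection technique: a trailing-window z-score over a metric series. It is an illustration only, not the algorithm Moogsoft AIOps or CloudWatch uses; the function name `find_anomalies`, the window size, the threshold, and the sample CPU values are all assumptions made for this example.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the mean of the preceding
    `window` points by more than `threshold` standard deviations.

    Hypothetical helper for illustration; not part of any AWS or Moogsoft API.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up CPU utilization samples (percent) with one obvious spike.
cpu_percent = [41, 43, 42, 40, 44, 42, 43, 95, 42, 41]
print(find_anomalies(cpu_percent))  # [7]
```

Only the spike at index 7 is flagged; the ordinary fluctuations around 42 percent stay below the threshold, so they would not turn into tickets. A production system would need to handle seasonality, missing data, and many metrics at once, which this sketch does not attempt.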

What Happens to ITIL?

A question that often comes up at this point is: what happens to ITIL when AIOps is incorporated into wider IT Operations processes? There is no simple answer, as customers implement ITIL differently. Generally speaking, AWS CloudWatch customers have found that using AIOps improves the quality of their tickets, making them more actionable, which in turn improves their ITIL processes and users' satisfaction with those processes. For example, a reduction in ticket volume gives service desk members more time to diagnose issues using AIOps insights, helping them achieve a lower MTTR (Mean Time to Resolution). The time freed up can then be dedicated to more strategic work such as ITSM hygiene: making sure that the CMDB is up to date, for instance, or analyzing recurring issues and best practices.

What About DevOps?

IT is no longer just about Operations. New technical architectures and development methodologies are coming together to blur the distinction between the previously separate roles of Development and Operations, a convergence commonly called DevOps. AIOps helps teams that follow DevOps methodologies deliver continuous service assurance as they accelerate their digital transformation efforts.

The key ways in which AIOps can enhance DevOps and increase the return on companies’ investments are as follows:

  • Increasing CI/CD frequency: continuous assurance without the need for time-consuming, manual changes to infrastructure or extensive and intrusive instrumentation of applications
  • Improving service quality: automated early detection and diagnosis of issues help ensure uptime and mitigate impact to the business
  • Reducing ticket volume: issues that require manual handling can be reduced by 40 percent on average, and the right teams are engaged automatically, which reduces escalations
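As a rough illustration of how ticket volume can fall without suppressing alarms, the sketch below groups alerts from the same source that arrive close together in time into a single incident. This is a deliberately simple time-window heuristic, not Moogsoft's correlation method; the `Incident` class, the `correlate` function, the five-minute window, and the sample alerts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """A group of related alerts (hypothetical structure for this example)."""
    source: str
    first_seen: int
    last_seen: int
    alerts: list = field(default_factory=list)


def correlate(alerts, window_seconds=300):
    """Group alerts that share a source and arrive within `window_seconds`
    of the previous alert in the same group.

    `alerts` is an iterable of (timestamp_seconds, source, message) tuples.
    """
    incidents = []
    open_by_source = {}
    for ts, source, message in sorted(alerts):
        incident = open_by_source.get(source)
        if incident is None or ts - incident.last_seen > window_seconds:
            # No recent incident for this source, so open a new one.
            incident = Incident(source, ts, ts)
            incidents.append(incident)
            open_by_source[source] = incident
        incident.last_seen = ts
        incident.alerts.append(message)
    return incidents


# Made-up alerts: two related database alarms, one web alarm, one later alarm.
sample_alerts = [
    (0, "db-1", "CPU high"),
    (60, "db-1", "Latency high"),
    (90, "web-1", "5xx rate high"),
    (1000, "db-1", "CPU high"),
]
incidents = correlate(sample_alerts)
print(len(sample_alerts), "alerts ->", len(incidents), "incidents")
```

Here four alerts collapse into three incidents, and every alert is still retained inside its incident, so nothing rare or unexpected is thrown away.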

How Do I Start?

Operations leaders have to balance building their Operations Data Science capabilities against keeping the lights on for their end users. Moogsoft AIOps, which holds the Cloud Management Tools Competency certification, offers AWS CloudWatch customers an easy path to full operational AIOps capability. Integration between Moogsoft AIOps and other AWS tools such as CloudFormation helps deliver a complete ML-enabled IT Ops toolchain to AWS customers, supporting continuous assurance of applications and services. Moogsoft AIOps is available directly from the AWS Marketplace for customers to evaluate and purchase.

Disclaimer: This post is my own opinion, and not the opinion of Amazon Web Services or any organization I am professionally associated with. The goal of the article is to prompt discussion and new ideas.

Moogsoft AIOps is the pioneering AI platform for IT operations, powered by purpose-built machine learning algorithms.