AI and Machine Learning Powered Correlations for Prometheus

Guy Fighel
SignifAI
Jun 14, 2018

SignifAI is an AI and machine learning platform that correlates Prometheus Alertmanager events, metrics and logs to reduce alert noise and help administrators efficiently determine root causes and deploy solutions faster.

Prometheus is an open-source monitoring system that provides flexibility and efficiency with a dimensional data model, third-party integrations for statistics and visualization, and a powerful query language. Alerts are sent to the Prometheus Alertmanager, which distributes them according to a limited set of routing rules, and the resulting notifications can quickly become overwhelming on top of the piles of notifications from other tools in your stack. At SignifAI, we believe there’s a better way to leverage Prometheus’ monitoring capabilities: AI and machine learning powered correlations.

Correlating Alertmanager and Non-Alertmanager Data: How does one go about correlating Prometheus metrics with the other operational data being generated by dependent applications and services, like New Relic, Datadog, Dynatrace and Elasticsearch?

More Alerts, More Problems: Finally, how can you achieve “full visibility” and make sure you are not missing anything important without drowning in alerts? No one wants to make the maintenance of alert grouping and notification rules a full-time job!

Challenges in monitoring and managing Prometheus alerts

Manageability: In a large deployment, the last thing you want to do is maintain a sophisticated alerting rules engine to serve a large team of on-call engineers. Alertmanager configuration is managed via command-line flags and a YAML file. The YAML file contains the routing, grouping, inhibition and notification rules, all of which must be configured and maintained by hand. Prometheus offers a very basic visual tool to help with editing routing rules. This manual coding gets clunky and unmanageable in dynamic environments with many metrics and many end users who need to be notified when certain conditions are met.

Alert Noise: In large, complex IT environments there are always a variety of monitoring tools in use. This happens when specialized tools are deployed to monitor specific elements of the stack, like networking or logging, or simply because an Ops team prefers one monitoring tool over another. So although Alertmanager’s grouping capability helps reduce Prometheus’ metric-centric “alert noise”, it does nothing to reduce the combined noise of alerts coming from other tools. If other tools are in use, the overall noise reduction is very limited.
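To make the limitation concrete, here is a minimal Python sketch of label-based grouping in the spirit of Alertmanager's `group_by` setting. It is an illustration only, not Alertmanager's actual implementation; the function name `group_alerts` and the sample alerts are made up for this example. Note that only alerts carrying Prometheus-style labels can be bucketed this way, so events from other tools never enter the picture.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels named in group_by.

    Hypothetical helper: mimics the idea behind Alertmanager's group_by,
    not its real code.
    """
    groups = defaultdict(list)
    for alert in alerts:
        # A missing label is treated as an empty value for grouping purposes.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Made-up sample alerts, all originating from Prometheus.
alerts = [
    {"labels": {"alertname": "HighLatency", "cluster": "eu-1", "instance": "api-1"}},
    {"labels": {"alertname": "HighLatency", "cluster": "eu-1", "instance": "api-2"}},
    {"labels": {"alertname": "DiskFull", "cluster": "eu-1", "instance": "db-1"}},
]

grouped = group_alerts(alerts, ["alertname", "cluster"])
for key, members in grouped.items():
    print(key, len(members))
```

The two `HighLatency` alerts collapse into one group, which is useful, but a related Datadog or New Relic event has no way into this grouping at all.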

Inefficient Root Cause Analysis: Identifying the root cause of an issue in a complex environment often requires correlating data from multiple systems, multiple parts of the stack, and multiple, incompatible data types. Alertmanager has no ability to automatically identify and correlate relevant data from other systems, regardless of the data type or timeframe. This is especially important when it comes to Operator Framework events, which are critical in troubleshooting Kubernetes-native applications and infrastructure.

Dumb Alerts: Alerts that are generated by Prometheus are just that, alerts. They are not enriched with suggested solutions, KB articles or similar issues resolved in the past, any of which could help engineers get to an accurate solution quickly and lower MTTR.

AI and machine learning powered correlations for Alertmanager to the rescue

SignifAI is a SaaS-based correlation engine that leverages AI and machine learning to help administrators identify root causes and solutions quickly in complex environments. SignifAI’s automated platform requires no previous AI or machine learning expertise to use. Connecting Prometheus Alertmanager, along with your other tools, to the SignifAI platform is accomplished via APIs or webhooks. There are no agents to install and no data tagging is required. SignifAI automatically transforms the data to find relationships, correlations, anomalies and trends. These results are then presented in a UI in the form of issue cards. Users also have the ability to input their own logic (“train the model”) to increase the accuracy of the correlations.
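As a rough idea of what a webhook integration has to handle, the sketch below parses the JSON body that Alertmanager posts to a webhook receiver and flattens it into one event per alert. The payload fields (`status`, `commonLabels`, `alerts`, `startsAt` and so on) follow Alertmanager's webhook format; everything else is an assumption for illustration: the function name `extract_events`, the output shape, and the sample values, including the receiver name and URL. How SignifAI ingests the data is not described here, so the forwarding step is left out.

```python
import json


def extract_events(raw_body):
    """Turn an Alertmanager webhook body into a flat list of event dicts.

    Hypothetical helper; the output shape is an assumption, not a
    SignifAI format.
    """
    payload = json.loads(raw_body)
    common = payload.get("commonLabels", {})
    events = []
    for alert in payload.get("alerts", []):
        # Per-alert labels override the labels shared by the whole group.
        labels = {**common, **alert.get("labels", {})}
        events.append({
            "source": "prometheus-alertmanager",
            "status": alert.get("status", payload.get("status")),
            "name": labels.get("alertname", "unknown"),
            "labels": labels,
            "summary": alert.get("annotations", {}).get("summary", ""),
            "started_at": alert.get("startsAt"),
        })
    return events


# Made-up sample body in the shape Alertmanager sends to webhook receivers.
sample = json.dumps({
    "version": "4",
    "status": "firing",
    "receiver": "example-receiver",
    "groupLabels": {"alertname": "HighLatency"},
    "commonLabels": {"alertname": "HighLatency", "cluster": "eu-1"},
    "commonAnnotations": {},
    "externalURL": "http://alertmanager.example.internal",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "HighLatency", "instance": "api-1"},
        "annotations": {"summary": "p99 latency above 2s"},
        "startsAt": "2018-06-14T10:00:00Z",
        "endsAt": "0001-01-01T00:00:00Z",
        "generatorURL": "http://prometheus.example.internal/graph",
    }],
})

events = extract_events(sample)
print(events[0]["name"], events[0]["status"], events[0]["labels"]["cluster"])
```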

Reduced Alert Noise: SignifAI takes the complexity out of grouping Prometheus alerts by using AI and machine learning to automatically correlate related events and reduce alert noise. On top of that, SignifAI also automatically groups related alerts that might be coming from other monitoring systems. This ends up delivering real “alert noise” reduction because it takes into account all the monitoring systems in a complex environment, not just Prometheus.
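To illustrate what grouping across tools means, here is a deliberately naive Python sketch that merges events from several monitoring sources into one issue when they concern the same service and arrive within a few minutes of each other. This is not SignifAI's algorithm, which relies on machine learning and is not specified in this post; the `correlate` function, the `service` field, the five-minute window and the sample events are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def correlate(events, window=timedelta(minutes=5)):
    """Group events per service when consecutive events fall within window.

    Naive stand-in for a real correlation engine (hypothetical helper).
    """
    issues = []       # every issue is a list of related events
    open_issue = {}   # service name -> most recent issue for that service
    for event in sorted(events, key=lambda e: e["time"]):
        issue = open_issue.get(event["service"])
        if issue is not None and event["time"] - issue[-1]["time"] <= window:
            issue.append(event)
        else:
            # Too far apart, or first event for this service: start a new issue.
            issue = [event]
            issues.append(issue)
            open_issue[event["service"]] = issue
    return issues


# Made-up events from different tools.
events = [
    {"source": "prometheus", "service": "checkout", "time": datetime(2018, 6, 14, 10, 0)},
    {"source": "datadog", "service": "checkout", "time": datetime(2018, 6, 14, 10, 2)},
    {"source": "elasticsearch", "service": "search", "time": datetime(2018, 6, 14, 10, 3)},
    {"source": "newrelic", "service": "checkout", "time": datetime(2018, 6, 14, 10, 4)},
    {"source": "prometheus", "service": "checkout", "time": datetime(2018, 6, 14, 11, 0)},
]

issues = correlate(events)
for issue in issues:
    print(issue[0]["service"], [e["source"] for e in issue])
```

Even this simple rule turns five notifications into three issues, with the Prometheus, Datadog and New Relic events for `checkout` landing together. A hand-written rule like this breaks down quickly in a real environment, which is the gap an automated correlation engine is meant to fill.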

Efficient Root Cause Analysis: SignifAI uses AI and machine learning to identify powerful correlations across all the operational data found in a complex environment. This gives administrators the full context surrounding an issue that they need to conduct efficient root cause analysis. SignifAI then enriches each issue with relevant data, regardless of the timeframe, data source or data type, including Operator Framework data plus logs, events or metrics from other systems. Contrast this with Prometheus’ limited view of metrics (or Elastic logs).

Enriched and Actionable Alerts: SignifAI enriches each issue with suggested solutions, KB articles and links to similar issues that have been successfully resolved in the past. These enrichments enable engineers to get to an accurate solution much more quickly than by manually searching for or formulating solutions from scratch. Intelligent and actionable alerts translate into a lower MTTR regardless of which tools you’re using in conjunction with Prometheus.

What’s next?

Originally published at blog.signifai.io on June 14, 2018.

Co-founder and CTO @SignifAI. Thinking machine intelligence, software engineering, scaling infrastructure, systems automation. Previously @TangoMe, @Vonage.