Monitor and Manage Alerts on OpenShift with AI and Machine Learning

Guy Fighel
SignifAI
Jun 7, 2018

SignifAI is an AI and machine learning platform that correlates OpenShift events, metrics and logs to reduce alert noise and help administrators efficiently determine root causes and deploy solutions faster.

Red Hat’s OpenShift is a Platform as a Service (PaaS) for developing applications rapidly and deploying them at scale, with the least amount of friction and the greatest amount of automation. Because OpenShift runs your applications and services on Docker and Kubernetes, it can pose unique monitoring challenges if you have primarily dealt with monolithic applications in the past. At SignifAI, we believe there’s a better way to monitor OpenShift: correlations powered by AI and machine learning.

While OpenShift’s tooling does provide some basic metrics like container health and resource usage, many administrators quickly realize that OpenShift creates a new set of monitoring problems they need to contend with.

Detailed Metrics: How does one go about getting detailed metrics without overloading the containers with instrumentation?

Correlating OpenShift and Non-OpenShift Data: How does one go about correlating OpenShift metrics with the other operational data generated by dependent applications and services, some of which may not even run on OpenShift, such as data from tools like Prometheus, New Relic, Datadog, Dynatrace and Elasticsearch?

More Alerts, More Problems: Finally, how can you achieve “full visibility” and be sure you are not missing anything important without drowning in alerts? No one wants to make the maintenance of alert grouping and notification rules a full-time job!

Red Hat’s recommendations for OpenShift monitoring and managing alerts

Red Hat recommends configuring Prometheus Alerts in conjunction with a Grafana dashboard to collect, visualize and alert on OpenShift metrics like memory, cluster changes and API calls. Many OpenShift administrators also make use of the Hawkular and Operator Framework open source projects to monitor and manage their deployments. Hawkular’s Alerting component provides a generic mechanism for triggering alerts and subsequent notifications, actions, escalations or suppressions. It should be noted that Hawkular also integrates with Prometheus.

Operator Framework is an open source toolkit used to manage Kubernetes native applications in an effective, automated and scalable way. Pulling metrics and events data from Operator Framework helps administrators achieve better control and visibility on OpenShift.

Challenges monitoring and managing alerts on OpenShift

Manageability: In a large OpenShift deployment, the last thing you want to do is maintain a sophisticated alerting rules engine to serve a large team of on-call engineers. With Alertmanager, configuration is managed via command-line flags and a YAML file. The YAML file contains the silencing, aggregating and notifying rules, all of which must be configured and maintained by hand. Prometheus offers only a very basic visual tool to help with editing routing rules. Similarly, Hawkular offers no visual UI, and all triggers, rules and actions must be manually coded. All this manual coding gets clunky and unmanageable in dynamic environments with many metrics and many end users who need to be notified when certain conditions are met.

Alert Noise: In large, complex IT environments there are always a variety of monitoring tools in use. This happens when specialized tools are deployed to monitor specific elements of the stack, like networking or logging, or simply because an Ops team prefers one monitoring tool over another. So although Alertmanager’s grouping capability helps reduce Prometheus’ metric-centric “alert noise”, it does nothing to reduce the combined noise of alerts coming from other tools. Similarly, Hawkular currently only supports Prometheus and Elastic. The result is very limited grouping overall if other tools are in use.

Inefficient Root Cause Analysis: Identifying the root cause of an issue in a complex OpenShift environment often requires correlating data from multiple systems, multiple parts of the stack and multiple, incompatible data types. Alertmanager and Hawkular have no ability to automatically identify and correlate relevant data from other systems, regardless of the data type or timeframe. This is especially important when it comes to Operator Framework events, which are critical in troubleshooting Kubernetes-native applications and infrastructure.

Dumb Alerts: Alerts that are generated by Prometheus or Hawkular are just that, alerts. There is no enrichment of the alerts with suggested solutions, KB articles or similar issues that have been resolved in the past that can help engineers get to an accurate solution quickly, enabling faster MTTR.
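
To make the Alert Noise point concrete, here is a minimal Python sketch of label-based grouping in the spirit of Alertmanager's group_by setting. It is an illustration only, not Alertmanager's implementation: the function name group_alerts, the alert names and the label values are all hypothetical. It shows that grouping collapses the Prometheus alerts into fewer notifications, while alerts from a tool that never passes through the grouper are untouched.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # A missing label is treated as an empty string (an assumption of this sketch).
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

# Hypothetical alerts that reach the grouper from Prometheus.
prometheus_alerts = [
    {"labels": {"alertname": "HighMemoryUsage", "namespace": "shop", "pod": "web-1"}},
    {"labels": {"alertname": "HighMemoryUsage", "namespace": "shop", "pod": "web-2"}},
    {"labels": {"alertname": "HighMemoryUsage", "namespace": "shop", "pod": "web-3"}},
    {"labels": {"alertname": "PodCrashLooping", "namespace": "shop", "pod": "web-2"}},
]

# Hypothetical alerts raised by a different tool; they bypass the grouper entirely.
other_tool_alerts = [
    {"labels": {"check": "latency", "service": "checkout"}},
    {"labels": {"check": "error_rate", "service": "checkout"}},
]

grouped = group_alerts(prometheus_alerts, group_by=["alertname", "namespace"])
notifications = len(grouped) + len(other_tool_alerts)
print(f"{len(prometheus_alerts)} Prometheus alerts -> {len(grouped)} groups")
print(f"total notifications on call: {notifications}")
```

In this toy data the four Prometheus alerts shrink to two groups, but the on-call engineer still receives the two alerts from the other tool separately.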

AI and machine learning powered correlations for OpenShift to the rescue

SignifAI is a SaaS-based correlation engine that leverages AI and machine learning to help administrators identify root causes and solutions quickly in complex OpenShift environments. SignifAI’s automated platform requires no previous AI or machine learning expertise to use. Connecting your OpenShift monitoring tools like Prometheus and Operator Framework to the SignifAI platform is accomplished via APIs or webhooks. There are no agents to install and no data tagging is required. SignifAI automatically transforms the data to find relationships, correlations, anomalies and trends. These results are then presented in a UI in the form of issue cards. Users also have the ability to input their own logic (“train the model”) to increase the accuracy of the correlations.
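
As a rough illustration of what a webhook-based integration involves, the Python sketch below flattens the JSON body that Alertmanager's webhook receiver sends into one record per alert. The payload fields used here (alerts, status, labels, annotations, startsAt) follow Alertmanager's webhook format. The output record shape, the function name normalize_alertmanager_payload and the sample alert are assumptions made for this example, and nothing here represents SignifAI's actual API.

```python
import json
from typing import Any, Dict, List

def normalize_alertmanager_payload(body: str) -> List[Dict[str, Any]]:
    """Flatten one Alertmanager webhook POST body into per-alert records."""
    payload = json.loads(body)
    events = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        events.append({
            "source": "prometheus",  # hypothetical tag naming the originating tool
            "name": labels.get("alertname", "unknown"),
            # Fall back to the group-level status if the alert has none.
            "status": alert.get("status", payload.get("status")),
            "labels": labels,
            "summary": alert.get("annotations", {}).get("summary", ""),
            "starts_at": alert.get("startsAt"),  # kept as the original string
        })
    return events

# Hypothetical sample body, shaped like an Alertmanager webhook notification.
sample_body = json.dumps({
    "version": "4",
    "status": "firing",
    "groupLabels": {"alertname": "HighMemoryUsage"},
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighMemoryUsage", "namespace": "shop", "pod": "web-1"},
            "annotations": {"summary": "Pod memory above threshold"},
            "startsAt": "2018-06-07T10:00:00Z",
        }
    ],
})

events = normalize_alertmanager_payload(sample_body)
print(events[0]["name"], events[0]["status"])
```

Once alerts from different tools are reduced to a common record shape like this, they can be compared and correlated regardless of where they came from.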

Reduced Alert Noise: SignifAI takes the complexity out of grouping Prometheus and Hawkular alerts by using AI and machine learning to automatically group related alerts and reduce Prometheus- or Hawkular-specific alert noise. On top of that, SignifAI also automatically groups related alerts that might be coming from other monitoring systems. This delivers real “alert noise” reduction because it takes into account all the monitoring systems in a complex OpenShift environment, not just Prometheus or Hawkular.

Efficient Root Cause Analysis: SignifAI uses AI and machine learning to identify powerful correlations across all the operational data found in a complex OpenShift environment. This gives administrators the full context surrounding an issue that they need to conduct efficient root cause analysis. SignifAI then enriches each issue with relevant data, regardless of the timeframe, data source or data type, including Operator Framework data plus logs, events or metrics from other systems. Contrast this with Prometheus’ and Hawkular’s limited view of metrics (or Elastic logs).

Enriched and Actionable Alerts: SignifAI enriches each issue with suggested solutions, KB articles and links to similar issues that have been successfully resolved in the past. These enrichments enable engineers to get to an accurate solution much more quickly than by manually searching for or formulating solutions from scratch. Intelligent and actionable alerts translate into faster MTTR whether you are using Prometheus, Hawkular or either of these tools in conjunction with other monitoring tools.
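
To give a feel for what grouping alerts across tools means, the Python sketch below uses a deliberately simple heuristic: events from any source are merged into one issue when they name the same host and arrive within five minutes of the previous event for that host. This is a hand-written rule chosen for illustration, not SignifAI's machine learning, and the function name correlate, the event fields and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta

def correlate(events, window=timedelta(minutes=5)):
    """Merge events that share a host and occur within `window` of the
    previous event already grouped for that host."""
    issues = []
    latest_issue = {}  # host -> most recent issue (list of events) for that host
    for event in sorted(events, key=lambda e: e["time"]):
        issue = latest_issue.get(event["host"])
        if issue is not None and event["time"] - issue[-1]["time"] <= window:
            issue.append(event)
        else:
            issue = [event]
            issues.append(issue)
            latest_issue[event["host"]] = issue
    return issues

t0 = datetime(2018, 6, 7, 10, 0, 0)
# Hypothetical events from several monitoring tools.
events = [
    {"source": "prometheus", "host": "node-1", "time": t0},
    {"source": "elastic", "host": "node-1", "time": t0 + timedelta(minutes=1)},
    {"source": "datadog", "host": "node-2", "time": t0 + timedelta(minutes=2)},
    {"source": "newrelic", "host": "node-1", "time": t0 + timedelta(minutes=3)},
    {"source": "prometheus", "host": "node-1", "time": t0 + timedelta(minutes=40)},
]

issues = correlate(events)
for issue in issues:
    sources = sorted({e["source"] for e in issue})
    print(issue[0]["host"], len(issue), sources)
```

A fixed rule like this has to be tuned and maintained by hand for every environment, which is the kind of upkeep that learned correlations are meant to remove.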

What’s next?

Originally published at blog.signifai.io on June 7, 2018.

Guy Fighel
SignifAI

Co-founder and CTO @SignifAI. Thinking machine intelligence, software engineering, scaling infrastructure, systems automation. Previously @TangoMe, @Vonage.