Intelligent Alert system for HEP experiments at CERN — Part 2

Rahul Indra
Aug 20, 2023


GSoC + CERN collaboration

Continuing from Part 1

AlertManager — one place for all alerts

Diagram showing various sources pushing alerts to AM and admins interacting with AM with various tools.

The alerting services we developed push GGUS and SSB alerts to AlertManager (AM) at defined time intervals. Grafana and Prometheus push their alerts to AlertManager as well. AlertManager offers many features for handling alerts, but it lacks a proper UI, so the Karma dashboard is used to fetch all alerts from AlertManager and display them in a cleaner interface. Slack channels are configured to log alerts when they fire in AlertManager.

Together, AlertManager, Slack and Karma give our operations teams all the information they need about alerts.

Use of Slack, Karma and Alert CLI Tool

Slack

  • Dedicated channels are defined for particular service alerts.
  • Users are notified when alerts fire.
  • AlertManager bots post the alerts to these channels.
GGUS alerts in Slack
SSB alerts in Slack

Karma

  • A dashboard which pulls all alerts from AlertManager.
  • Supports multi-grid arrangements based on filters.
  • A more concise and readable view than AlertManager's own.
  • A Dockerfile and Kubernetes config files were written for it.
Karma Dashboard view-1
Karma Dashboard view-2

Alert CLI Tool

  • gives a clean command-line interface for getting alerts; their details are printed on the terminal in either tabular or JSON format.
  • convenient option for operators who prefer command-line tools.
  • comes with several options:
  • service, severity, tag: filters
  • sort: sorting
  • details: detailed information about an alert
  • json: output in JSON format
Alert CLI Tool printing all alerts of the SSB service in AlertManager, sorted by the duration of each alert.
Alert CLI Tool printing all alerts in AlertManager whose severity is "high".
Alert CLI Tool printing a specific alert in detail.
Alert CLI Tool printing a specific alert in detail in JSON format.
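The filter and sort options above boil down to two small operations on a list of alerts. Here is a minimal Go sketch of that idea. The `Alert` type and the function names are hypothetical and only for illustration; they are not taken from the actual tool.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Alert is a hypothetical, simplified record; the real tool's types may differ.
type Alert struct {
	Name     string
	Service  string
	Severity string
	Duration time.Duration // how long the alert lasts
}

// filterBySeverity keeps only the alerts whose severity equals the given value.
func filterBySeverity(alerts []Alert, severity string) []Alert {
	var out []Alert
	for _, a := range alerts {
		if a.Severity == severity {
			out = append(out, a)
		}
	}
	return out
}

// sortByDuration returns a copy of alerts ordered from shortest to longest.
func sortByDuration(alerts []Alert) []Alert {
	out := append([]Alert(nil), alerts...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Duration < out[j].Duration
	})
	return out
}

func main() {
	alerts := []Alert{
		{"ssb-17", "SSB", "high", 3 * time.Hour},
		{"ggus-42", "GGUS", "medium", 48 * time.Hour},
		{"ssb-21", "SSB", "high", 1 * time.Hour},
	}
	for _, a := range sortByDuration(filterBySeverity(alerts, "high")) {
		fmt.Printf("%-8s %-5s %-7s %s\n", a.Name, a.Service, a.Severity, a.Duration)
	}
}
```

Sorting works on a copy, so the original list stays in the order it was fetched.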

Intelligence Module

Intelligence module diagram

The intelligence module is a data pipeline whose components are independent of each other. Each component receives data, applies its own logic, and forwards the processed data to the next component.

Why a data pipeline?

  • Low coupling
  • Freedom of adding or removing components on demand.
  • Power of concurrency
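To make the pipeline idea concrete, here is a minimal Go sketch in which every component is a goroutine that reads from one channel and writes to the next. The `Alert` type and the names `source`, `stage` and `runPipeline` are hypothetical; the real module's components are more involved.

```go
package main

import "fmt"

// Alert is a hypothetical, simplified alert used only for illustration.
type Alert struct {
	Name     string
	Service  string
	Severity string
}

// source emits the given alerts on a channel and closes it when done.
func source(alerts []Alert) <-chan Alert {
	out := make(chan Alert)
	go func() {
		defer close(out)
		for _, a := range alerts {
			out <- a
		}
	}()
	return out
}

// stage applies f to every alert read from in and forwards the result.
// Each stage runs in its own goroutine, so stages work concurrently.
func stage(in <-chan Alert, f func(Alert) Alert) <-chan Alert {
	out := make(chan Alert)
	go func() {
		defer close(out)
		for a := range in {
			out <- f(a)
		}
	}()
	return out
}

// runPipeline chains the steps in order and collects the final output.
// Adding or removing a component only changes the list of steps.
func runPipeline(alerts []Alert, steps ...func(Alert) Alert) []Alert {
	ch := source(alerts)
	for _, s := range steps {
		ch = stage(ch, s)
	}
	var result []Alert
	for a := range ch {
		result = append(result, a)
	}
	return result
}

func main() {
	setService := func(a Alert) Alert { a.Service = "SSB"; return a }
	setSeverity := func(a Alert) Alert { a.Severity = "high"; return a }
	for _, a := range runPipeline([]Alert{{Name: "ssb-17"}}, setService, setSeverity) {
		fmt.Printf("%+v\n", a)
	}
}
```

Because the stages only share channels, none of them needs to know which component comes before or after it.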

What does it do?

  • assigns proper severity levels to SSB/GGUS alerts, which helps operators understand how critical the state of the infrastructure is. For example, if the number of alerts with severity="urgent" exceeds some threshold, the infrastructure is in a critical state.
  • annotates Grafana dashboards when there are network or database interventions.
  • predicts the type of alerts and groups similar alerts with the help of Machine Learning.
  • attaches the applicable tutorial or instructions document to an alert, so that an operator can resolve the issue quickly.
  • deletes old silences for alerts with an open-ended lifetime (such as GGUS alerts and some SSB alerts that have no end time).
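The threshold example from the first bullet can be sketched in a few lines of Go. The `Alert` type, the function names and the threshold value are assumptions made for illustration.

```go
package main

import "fmt"

// Alert is a hypothetical, simplified alert used only for illustration.
type Alert struct {
	Name     string
	Severity string
}

// countBySeverity returns how many alerts carry the given severity.
func countBySeverity(alerts []Alert, severity string) int {
	n := 0
	for _, a := range alerts {
		if a.Severity == severity {
			n++
		}
	}
	return n
}

// infrastructureCritical reports whether the number of "urgent" alerts
// exceeds the threshold.
func infrastructureCritical(alerts []Alert, threshold int) bool {
	return countBySeverity(alerts, "urgent") > threshold
}

func main() {
	alerts := []Alert{{"ggus-1", "urgent"}, {"ggus-2", "urgent"}, {"ssb-3", "info"}}
	// Two urgent alerts against a threshold of 1.
	fmt.Println(infrastructureCritical(alerts, 1)) // prints true
}
```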

Building Blocks

  • Fetch Alerts
  • Preprocessing
  • Keyword Matching
  • Add Annotations
  • Machine Learning
  • Push Alert
  • Silence Alert
  • Delete Old Silences

Tools

Fetch Alerts

Fetch Alerts diagram
  • fetches all alerts from AlertManager.
  • bundles them and puts them on a channel.
  • channel (analogy): a baggage belt at an airport. You put data onto it, and the other party picks it up when needed.
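As a rough sketch of this step: AlertManager exposes its alerts as a JSON array at `/api/v2/alerts`, so fetching is an HTTP GET followed by decoding. The `Alert` struct below keeps only a subset of the fields, and `fetchStage` and the sample data are hypothetical. A local stand-in server is used so the example runs without a real AlertManager.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// Alert keeps a subset of the fields AlertManager returns; the rest are ignored.
type Alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
}

// sample is made-up data in the shape of an /api/v2/alerts response.
const sample = `[{"labels":{"alertname":"ssb-17","service":"SSB"},"annotations":{},"startsAt":"2023-08-20T10:00:00Z","endsAt":"2023-08-20T12:00:00Z"}]`

// parseAlerts decodes a JSON array of alerts.
func parseAlerts(body []byte) ([]Alert, error) {
	var alerts []Alert
	if err := json.Unmarshal(body, &alerts); err != nil {
		return nil, err
	}
	return alerts, nil
}

// fetchAlerts gets all alerts from the AlertManager at baseURL.
func fetchAlerts(baseURL string) ([]Alert, error) {
	resp, err := http.Get(baseURL + "/api/v2/alerts")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("alertmanager returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseAlerts(body)
}

// fetchStage fetches once and puts the whole bundle on a channel,
// the "baggage belt" the next component reads from.
func fetchStage(baseURL string) (<-chan []Alert, error) {
	alerts, err := fetchAlerts(baseURL)
	if err != nil {
		return nil, err
	}
	out := make(chan []Alert, 1) // buffered so the send does not block
	out <- alerts
	close(out)
	return out, nil
}

func main() {
	// Stand-in for AlertManager that always serves the sample data.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, sample)
	}))
	defer srv.Close()

	bundles, err := fetchStage(srv.URL)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	for bundle := range bundles {
		fmt.Println("alerts in bundle:", len(bundle))
	}
}
```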

Preprocessing

Preprocessing diagram
  • filters alerts based on the configuration.
  • only the filtered alerts are forwarded.
  • a map keeps track of actively silenced alerts to avoid redundant silences.
  • if an alert is already silenced, it has been processed by the intelligence module before.
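A minimal Go sketch of this step, assuming the configuration is a set of service names and the silence map is keyed by alert name. Both are assumptions for illustration; the real module may key these differently.

```go
package main

import "fmt"

// Alert is a hypothetical, simplified alert used only for illustration.
type Alert struct {
	Name    string
	Service string
}

// Preprocessor holds the configured filter and the silence bookkeeping.
type Preprocessor struct {
	Services map[string]bool // services to process, taken from configuration
	Silenced map[string]bool // names of alerts that already have an active silence
}

// Filter forwards only alerts of configured services that have not been
// silenced, and therefore processed, before.
func (p *Preprocessor) Filter(alerts []Alert) []Alert {
	var out []Alert
	for _, a := range alerts {
		if !p.Services[a.Service] {
			continue // not a service we were asked to handle
		}
		if p.Silenced[a.Name] {
			continue // already processed in an earlier run
		}
		out = append(out, a)
	}
	return out
}

func main() {
	p := &Preprocessor{
		Services: map[string]bool{"SSB": true, "GGUS": true},
		Silenced: map[string]bool{"ssb-17": true},
	}
	alerts := []Alert{{"ssb-17", "SSB"}, {"ggus-42", "GGUS"}, {"cpu-load", "Prometheus"}}
	fmt.Println(p.Filter(alerts)) // prints [{ggus-42 GGUS}]
}
```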

Keyword Matching

Keyword Matching diagram
  • analysis of the alerts showed repeated use of a few important keywords.
  • these keywords help in assigning severity levels.
  • this component searches for these keywords in alerts; when one is found, the severity level mapped to that keyword is assigned.
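The matching itself can be sketched as a lookup over an ordered list of rules. The keywords and severity names below are made up for illustration, and checking the rules in order (most severe first) is an assumption about how ties between several matching keywords could be resolved.

```go
package main

import (
	"fmt"
	"strings"
)

// rule maps a keyword to the severity it implies.
type rule struct {
	keyword  string
	severity string
}

// rules is a hypothetical keyword table, ordered from most to least severe
// so that the first match wins.
var rules = []rule{
	{"outage", "high"},
	{"degraded", "medium"},
	{"maintenance", "info"},
}

// assignSeverity returns the severity of the first keyword found in text,
// or fallback if no keyword matches. Matching is case-insensitive.
func assignSeverity(text, fallback string) string {
	lower := strings.ToLower(text)
	for _, r := range rules {
		if strings.Contains(lower, r.keyword) {
			return r.severity
		}
	}
	return fallback
}

func main() {
	fmt.Println(assignSeverity("Network OUTAGE in the data centre", "info"))
	fmt.Println(assignSeverity("Service running normally", "info"))
}
```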

Add Annotations

Add Annotations diagram
  • Grafana has dashboards which show metrics of running services in the form of graphs.
  • Grafana has an "add annotation" feature.
  • An SSB alert mentioning a network or DB intervention affects these services.
  • this component pushes information about such interventions into Grafana dashboards as annotations.
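As a sketch of what pushing an annotation can look like: Grafana's HTTP API accepts a POST to `/api/annotations` with a JSON body holding `time` and `timeEnd` (epoch milliseconds), `tags` and `text`. The helper names, the example URL, the token and the tag values below are assumptions for illustration. The request is only built here, not sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Annotation holds the fields of a Grafana annotation used in this sketch.
type Annotation struct {
	Time    int64    `json:"time"`    // start, epoch milliseconds
	TimeEnd int64    `json:"timeEnd"` // end, epoch milliseconds
	Tags    []string `json:"tags"`
	Text    string   `json:"text"`
}

// newAnnotation converts an intervention window into an annotation.
func newAnnotation(text string, start, end time.Time, tags ...string) Annotation {
	return Annotation{
		Time:    start.UnixMilli(),
		TimeEnd: end.UnixMilli(),
		Tags:    tags,
		Text:    text,
	}
}

// annotationRequest builds the POST request for Grafana's annotation API.
func annotationRequest(grafanaURL, token string, a Annotation) (*http.Request, error) {
	body, err := json.Marshal(a)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, grafanaURL+"/api/annotations", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	start := time.Date(2023, 8, 20, 10, 0, 0, 0, time.UTC)
	a := newAnnotation("Network intervention (from SSB)", start, start.Add(2*time.Hour), "ssb", "network")
	// URL and token are placeholders.
	req, err := annotationRequest("https://grafana.example.org", "API_TOKEN", a)
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path)
	// Sending it would be: http.DefaultClient.Do(req)
}
```

Dashboards can then pick up such annotations by tag, which is one way to show the same intervention on several dashboards at once.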

Push Alert

Push Alert diagram
  • alerts with modified information are pushed to AlertManager
  • incoming alerts are then forwarded to Silence Alert.
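A minimal sketch of modifying and pushing an alert, assuming the modification is a changed `severity` label. AlertManager accepts a JSON array of alerts via POST on `/api/v2/alerts`. The `withSeverity` helper and the example URL are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Alert holds the fields of an AlertManager alert used in this sketch.
type Alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations,omitempty"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
}

// withSeverity returns a copy of a with the severity label set.
// The label map is copied so the original alert stays untouched.
func withSeverity(a Alert, severity string) Alert {
	labels := make(map[string]string, len(a.Labels)+1)
	for k, v := range a.Labels {
		labels[k] = v
	}
	labels["severity"] = severity
	a.Labels = labels
	return a
}

// pushAlerts posts the alerts to the AlertManager at baseURL.
func pushAlerts(baseURL string, alerts []Alert) error {
	body, err := json.Marshal(alerts)
	if err != nil {
		return err
	}
	resp, err := http.Post(baseURL+"/api/v2/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("alertmanager returned %s", resp.Status)
	}
	return nil
}

func main() {
	start := time.Date(2023, 8, 20, 10, 0, 0, 0, time.UTC)
	old := Alert{
		Labels:   map[string]string{"alertname": "ssb-17", "severity": "info"},
		StartsAt: start,
		EndsAt:   start.Add(2 * time.Hour),
	}
	updated := withSeverity(old, "high")
	body, err := json.MarshalIndent([]Alert{updated}, "", "  ")
	if err != nil {
		fmt.Println("encoding failed:", err)
		return
	}
	fmt.Println(string(body))
	// With a reachable AlertManager (placeholder URL):
	// pushAlerts("http://alertmanager.example.org:9093", []Alert{updated})
}
```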

Silence Alert

Silence Alert diagram
  • alerts that are modified and pushed to AlertManager end up as new copies.
  • the older alert is then redundant.
  • this component silences the older one for the duration of its lifetime.
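AlertManager identifies an alert by its label set, so an alert pushed with a changed label shows up next to the original instead of replacing it. A silence is created by a POST to `/api/v2/silences` with matchers, a time window, `createdBy` and `comment`. Below is a hedged Go sketch that only builds that payload. The `silenceFor` helper, the choice to match on every label of the old alert, and the comment text are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// Alert is a simplified alert used only for illustration.
type Alert struct {
	Labels   map[string]string
	StartsAt time.Time
	EndsAt   time.Time
}

// Matcher and Silence follow the shape of AlertManager's silence payload.
type Matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
}

type Silence struct {
	Matchers  []Matcher `json:"matchers"`
	StartsAt  time.Time `json:"startsAt"`
	EndsAt    time.Time `json:"endsAt"`
	CreatedBy string    `json:"createdBy"`
	Comment   string    `json:"comment"`
}

// silenceFor builds a silence that matches every label of the old alert
// exactly, so the updated copy (which differs in at least one label) is not
// silenced. The silence ends when the old alert ends.
func silenceFor(old Alert, now time.Time, createdBy string) Silence {
	names := make([]string, 0, len(old.Labels))
	for name := range old.Labels {
		names = append(names, name)
	}
	sort.Strings(names) // stable matcher order
	matchers := make([]Matcher, 0, len(names))
	for _, name := range names {
		matchers = append(matchers, Matcher{Name: name, Value: old.Labels[name]})
	}
	return Silence{
		Matchers:  matchers,
		StartsAt:  now,
		EndsAt:    old.EndsAt,
		CreatedBy: createdBy,
		Comment:   "superseded by updated alert",
	}
}

func main() {
	start := time.Date(2023, 8, 20, 10, 0, 0, 0, time.UTC)
	old := Alert{
		Labels:   map[string]string{"alertname": "ssb-17", "severity": "info"},
		StartsAt: start,
		EndsAt:   start.Add(2 * time.Hour),
	}
	body, err := json.MarshalIndent(silenceFor(old, start, "intelligence-module"), "", "  ")
	if err != nil {
		fmt.Println("encoding failed:", err)
		return
	}
	fmt.Println(string(body)) // JSON body for POST /api/v2/silences
}
```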

Delete Old Silences

Delete Alert diagram
  • Alerts like GGUS and some SSB tickets have an open-ended lifetime (that is, we don't know how long they will stay in AM).
  • So we wait for those alerts to get resolved; once they are resolved, the alerting services delete them from AM.
  • But the silences will remain, right? This component takes care of such cases.
  • It deletes the silences of alerts that have been resolved.
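A small Go sketch of the cleanup logic, assuming each tracked silence remembers the name of the alert it was created for (a hypothetical bookkeeping choice). AlertManager removes a silence via DELETE on `/api/v2/silence/{silenceID}`. The request is only built here, not sent, and the URL is a placeholder.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// Silence is a hypothetical bookkeeping record linking a silence to its alert.
type Silence struct {
	ID        string
	AlertName string
}

// staleSilences returns the IDs of silences whose alert is no longer active.
func staleSilences(silences []Silence, active map[string]bool) []string {
	var ids []string
	for _, s := range silences {
		if !active[s.AlertName] {
			ids = append(ids, s.ID)
		}
	}
	return ids
}

// deleteRequest builds the DELETE request that removes one silence.
func deleteRequest(baseURL, id string) (*http.Request, error) {
	return http.NewRequest(http.MethodDelete, baseURL+"/api/v2/silence/"+url.PathEscape(id), nil)
}

func main() {
	silences := []Silence{{"s1", "ggus-42"}, {"s2", "ssb-17"}}
	active := map[string]bool{"ssb-17": true} // ggus-42 was resolved and removed
	for _, id := range staleSilences(silences, active) {
		req, err := deleteRequest("http://alertmanager.example.org:9093", id)
		if err != nil {
			fmt.Println("building request failed:", err)
			continue
		}
		fmt.Println(req.Method, req.URL.Path) // prints DELETE /api/v2/silence/s1
		// Sending it would be: http.DefaultClient.Do(req)
	}
}
```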

Future Work(s)

  • Use Machine Learning in the intelligence module to predict an alert's severity, priority and type. The logic can be added to the MLBox component of the intelligence module pipeline.

Tools Used

  • GoLang, Go+Lint, GitHub, git, Google Suite, Photoshop

Acknowledgements

I am thankful to my mentors for their invaluable guidance and support throughout my GSoC journey.

Mentor details

Valentin Kuznetsov
Cornell University (US)

Federica Legger
Universita e INFN Torino (IT)

Christian Ariza
Universidad de los Andes (CO)

Please raise any issues at: CMSMonitoring
