Managing Critical Alerts through PagerDuty’s Event Rules

Shrinidhi Kulkarni
6 min read · Jun 30, 2024


Everyone hates PagerDuty: not the tool itself, but getting paged.

PagerDuty is one of the most useful tools for incident response, and people have been using it for a long time to alert the on-call team about incidents. In this blog, I discuss alerting the entire group instead of just the on-call team, based on the severity of the alerts.

A rough overview of how PagerDuty alerts are generally configured

(The overall setup can be found via Claude/ChatGPT as well, and it is skippable if you don’t want to know about the generic setup.)

PagerDuty Alerts

PagerDuty's documentation is excellent, and you will find more details there. However, I will roughly outline the settings a team usually configures in a normal setup.

1. Create a Service:

  • In PagerDuty, create a service that represents the system or application you want to monitor.
  • This service will receive alerts and route them to the appropriate team.

2. Set up Escalation Policies:

  • Create an escalation policy that defines who should be notified and in what order.
  • This typically includes primary on-call, secondary on-call, and potentially management escalations.

3. Configure On-Call Schedules:

  • Create on-call schedules that define who is on-call at any given time.
  • These schedules are usually rotated among team members.

4. Integration Setup:

  • Set up integrations with your monitoring tools, such as Sysdig, Prometheus, or custom scripts.
  • PagerDuty provides integration guides for many popular monitoring systems (a minimal Alertmanager sketch follows this overview).

5. Alert Rules:

  • Configure alert rules in your monitoring system to send notifications to PagerDuty when specific conditions are met.

6. Notification Rules:

  • In PagerDuty, set up notification rules for how each person wants to be contacted (e.g., SMS, phone call, email, mobile app push notification).

7. Incident Creation:

  • When an alert is triggered, PagerDuty creates an incident and notifies the on-call person based on the escalation policy.

TLDR: the alerts go to a service, and the service has an escalation policy defining who should be alerted. Usually the on-calls are notified first, followed by others if the alerts are missed or not acknowledged.
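To make the integration step concrete, here is a minimal sketch of an Alertmanager configuration that forwards Prometheus alerts to a PagerDuty service, assuming you use Prometheus/Alertmanager as the monitoring tool. The receiver name is arbitrary and the routing key is a placeholder you would copy from the Events API v2 integration on your PagerDuty service.

```yaml
# alertmanager.yml (minimal sketch)
route:
  receiver: pagerduty-oncall        # by default, every alert goes to the on-call service

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <events-api-v2-integration-key>   # placeholder from the service's integration page
```

With this in place, any firing alert becomes a PagerDuty event on that service and follows its escalation policy.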

Let’s think of some interesting scenarios now:

  1. What if the person on-call acknowledges the alert but doesn’t do much to fix it and the system gets worse?
  2. What if the on-call person is bombarded with alerts and doesn’t know which is the important one to work on?

I have seen this before, and since I love and care about my product, I want more eyes on it to fix these issues. These are the few moments when I want to alert the entire team instead of just the on-call person.

I want to further divide my blog into two topics:

  1. Which alerts I want the on-call team to look into and which alerts I want the whole team to look into.
  2. When I want the whole team to look into an alert, how do I configure it in PagerDuty?

Monitoring tool configuration for triggering the alert

To explain which metrics the on-call must be alerted on versus those that require alerting the entire team, I will start with a simple query (I chose a Prometheus query for its simplicity, but it can be anything).

Prometheus (PromQL) for CPU usage:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

  1. node_cpu_seconds_total{mode="idle"}: This metric represents the total number of seconds the CPU has been idle.
  2. rate(...)[5m]: This calculates the per-second rate of change of the idle CPU time over the last 5 minutes.
  3. avg by(instance) (...): This averages the rate across all CPUs for each instance (node).
  4. ... * 100: This converts the rate to a percentage.
  5. 100 - (...): This subtracts the idle CPU percentage from 100 to get the CPU usage percentage.

In essence, this query calculates the average CPU usage percentage across all CPUs for each instance over the last 5 minutes. For example, if a node's CPUs were idle 25% of the time on average over that window, the query returns 100 - 25 = 75% usage.

Now, for severity-based queries:

Query1 (Warning) query:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70

This would trigger when the average CPU usage over the last 5 minutes exceeds 70%.

Query2 (Critical) query:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90

This would trigger when the average CPU usage over the last 5 minutes exceeds 90%.

The main difference between Query2 and Query1 here is the threshold. Query2 uses a higher threshold (90%) because it represents a more critical situation where CPU usage is extremely high, potentially causing severe performance issues or service disruptions. Query1 uses a lower threshold (70%) as an early warning that CPU usage is getting high but hasn’t reached critical levels yet.

Query1 is the scenario where I alert only the on-call, and Query2 is the scenario where I alert the whole team.
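Here is a minimal sketch of how Query1 and Query2 could be defined as Prometheus alerting rules. The rule names, the `for: 5m` hold duration, and the `severity` label values are my assumptions; the key point is that the critical alert carries something (here the `severity: critical` label) that PagerDuty can later match on.

```yaml
# cpu_alerts.yml (sketch; names and label values are illustrative)
groups:
  - name: cpu-usage
    rules:
      - alert: HighCpuUsageWarning
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
        for: 5m                    # stay above the threshold for 5 minutes before firing
        labels:
          severity: warning        # handled by the regular on-call service
        annotations:
          summary: "CPU usage above 70% on {{ $labels.instance }}"

      - alert: HighCpuUsageCritical
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: critical       # matched by the PagerDuty event rule described below
        annotations:
          summary: "CPU usage above 90% on {{ $labels.instance }}"
```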

The alerting flow for Query1 is covered above, but if you skipped that section (as I suggested you could), this is how it works:

Trigger from monitoring tool (PromQL query condition met) -> Alert sent to PagerDuty -> PagerDuty creates incident for the associated service -> Service’s escalation policy is activated -> Notification sent to on-call personnel

Now for the main topic: alerting the entire team when Query2 is triggered.

Alerting the entire team with the help of Event Rules

PagerDuty has a helpful documentation page about setting up event rules (rulesets), which can be found here: [https://support.pagerduty.com/docs/rulesets](https://support.pagerduty.com/docs/rulesets).

Basically, we will configure an event rule with a condition that captures Query2 when it fires. All triggered events already reach PagerDuty thanks to the earlier configuration, but this event rule ensures we capture only the specific type of alert we need. When we capture this event, we create an incident on a Service with its own escalation policy. The change here is setting up a separate Service whose escalation policy includes the entire team instead of just the on-call person.
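For the event rule to distinguish the critical alert, the incoming event needs a field it can match on. If the alerts come from Alertmanager, one possible approach (a sketch, assuming the `severity` label from the rules above) is to template the event's severity and surface it as a custom detail:

```yaml
# Alertmanager receiver sketch: surfaces the alert's severity to PagerDuty
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <events-api-v2-integration-key>
        severity: '{{ .CommonLabels.severity }}'     # "critical" when Query2 fires
        details:
          severity: '{{ .CommonLabels.severity }}'   # also visible as a custom detail on the event
```

In the PagerDuty UI, the event rule's condition can then check that the severity (or the custom detail) equals critical, and its action routes the event to the separate Service whose escalation policy notifies the whole team.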

[Screenshot: example of event rule conditions to match the alert]
[Screenshot: creating an incident on the service]

So, what ends up happening is: when a critical alert fires (such as Query2 in our case), it creates an incident on the Service that escalates to the entire team, ensuring everyone on the team is alerted for critical issues.

We have this system configured for IBM COS, and it has been working brilliantly. I hope every system has something similar to prevent potential hazards.

On a side note, this is my last day in COS, and I really wanted to write this because it was my hackathon idea, and we all worked hard to get it into production.

As I move on to my next adventure, I want to express my heartfelt gratitude to each and every one of you in the IBM COS team. Working with you has been an incredible journey of growth, innovation, and collaboration. From our successful hackathon idea to its implementation in production, your dedication and expertise have been inspiring.

I’m proud of what we’ve accomplished together, especially the robust system we’ve put in place to prevent potential issues. Your commitment to excellence and willingness to embrace new ideas have made COS not just a great product, but a fantastic team to be part of.

Thank you for your support, your insights, and the countless moments of both challenge and triumph we’ve shared. The lessons I’ve learned and the friendships I’ve formed here will stay with me throughout my career.

I wish you all the very best as you continue to innovate and lead in the cloud storage space. Keep pushing boundaries and supporting each other — I have no doubt that great things lie ahead for COS and for each of you.

It’s been an honor to work alongside you. Thank you for making my time here so meaningful and rewarding.
