Prometheus Alertmanager to Atlassian Statuspage

Nathan Deamer
Go City Engineering
7 min read · Mar 30, 2023

A custom solution for updating an Atlassian Statuspage from your Prometheus alerts

When I first joined Go City, we had engineers ‘on call’, but the only way they were alerted to problems in our stack was when a colleague noticed something wasn’t working or when our customer service team was contacted by our customers.

I wanted to turn this around so that, as an engineering team, we knew about problems in production before our customers noticed (or, at the very latest, at the same time). We spent time changing our ways of working to follow a DevOps mindset, ensuring we were prioritising non-functional requirements, especially observability, alongside all of our feature work.

This worked well: we introduced a new observability stack (see below), and the engineering teams were writing alerts and runbooks and running post-mortems after incidents.

The problem was that we were still being contacted directly by our colleagues about issues we already knew about, and in some instances had already resolved!

Solution: Let’s build a status page!

— A basic understanding of Prometheus, Alertmanager and Atlassian Statuspage is required for the rest of this article.

Our (current) observability stack

This is the tooling we are using for our monitoring and alerting.

Why Atlassian Statuspage?

Atlassian Statuspage was a tool I had seen multiple other well-known companies use, and it had everything we needed out of the box: multiple integrations and good API documentation. All of this meant I could get something up and running quickly.

However, there are many alternatives: https://github.com/ivbeg/awesome-status-pages

Attempt 1 — PagerDuty to Atlassian Statuspage integration

Our Prometheus Alertmanager is already configured to create incidents in PagerDuty to notify our on-call engineers.

One of the reasons I chose Statuspage was for its integrations, which included PagerDuty to Statuspage.

I hooked this up and did all the required configuration inside Statuspage.

Flow

What didn’t work so well?

  1. Every time a new service was created in PagerDuty, we also had to set up new automations inside Statuspage.
  2. The information that comes through in the PagerDuty-to-Statuspage webhook is missing important details from the Prometheus alert, which I would have liked to use as display text.
    (I believe this is because the integration uses v2 of the PagerDuty webhook; if the integration is updated to v3, this may change.)
  3. De-duplication of alerts. At Go City, multiple alerts may fire for the same domain (e.g. Checkout), but every alert created a new PagerDuty incident, which in turn created a new incident in Statuspage. This resulted in a lot of unnecessary noise, making things look worse than they actually were!

Attempt 2 — DIY Prometheus Alertmanager to Atlassian Statuspage

To solve the de-duplication problem from attempt 1, I looked in more detail at what was happening:

Scenario:
A problem preventing customers from checking out in our system was being detected in multiple ways (and sometimes by multiple teams):

  • Alert 1: Customer checkout has a high error rate coming from our APIs
  • Alert 2: Customer checkout has seen no orders being created in the last 10 minutes
  • Alert 3: Customer checkout synthetic tests are failing

But the root cause of these alerts, which were all firing around the same time, is probably the same: a resource or service required for checkout has fallen over.

I looked into Alertmanager alert grouping, and into PagerDuty, to see if there was a way I could de-duplicate for a status page but still page the teams for each firing alert. I could not, so I built my own service, prometheus-alerts-to-status-page (available here: Docker Hub, GitHub, Helm).

Flow (via Prometheus Alerts to Status Page)

How it works

Making use of the Alertmanager webhook config, the service receives grouped alerts and makes a decision on whether to create a new incident, update an existing incident, or resolve an incident on our Statuspage.
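
To make this concrete, here is a minimal sketch of receiving that webhook, assuming Go and the standard Alertmanager webhook payload. The handler and the decision logic are illustrative only, not the actual implementation of prometheus-alerts-to-status-page.

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Alert mirrors a single alert in the standard Alertmanager webhook payload.
type Alert struct {
	Status      string            `json:"status"` // "firing" or "resolved"
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
}

// WebhookMessage mirrors the grouped payload Alertmanager POSTs to the webhook.
type WebhookMessage struct {
	Status      string            `json:"status"` // "firing" while any alert in the group is firing
	GroupLabels map[string]string `json:"groupLabels"`
	Alerts      []Alert           `json:"alerts"`
}

func alertHandler(w http.ResponseWriter, r *http.Request) {
	var msg WebhookMessage
	if err := json.NewDecoder(r.Body).Decode(&msg); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// The route groups by statuspagePageId/statuspageComponentId (see the
	// configuration below), so one Alertmanager group maps to one Statuspage incident.
	pageID := msg.GroupLabels["statuspagePageId"]
	componentID := msg.GroupLabels["statuspageComponentId"]

	if msg.Status == "resolved" {
		// Every alert in the group has resolved: resolve the Statuspage incident.
		log.Printf("resolve incident for page %s, component %s", pageID, componentID)
	} else {
		// At least one alert is still firing: create the incident if it doesn't
		// exist yet, otherwise update it with the latest grouped alerts.
		log.Printf("create/update incident for page %s, component %s", pageID, componentID)
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/alert", alertHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}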

Not all Prometheus alerts are equal: when Alertmanager sends a group of firing alerts, prometheus-alerts-to-status-page will calculate the maximum values for status, impact and component status (more here).
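
Here is a sketch of how that “maximum value” reduction could work, again in Go. The severity orderings mirror the impact and component status values listed in the configuration below; the function names are illustrative, not the tool’s actual code.

package main

import "fmt"

// Alert repeats the webhook alert shape from the sketch above so this snippet stands alone.
type Alert struct {
	Status      string            // "firing" or "resolved"
	Annotations map[string]string // statuspageImpactOverride, statuspageComponentStatus, ...
}

// Statuspage impact values, least to most severe.
var impactRank = map[string]int{
	"none": 0, "maintenance": 1, "minor": 2, "major": 3, "critical": 4,
}

// Statuspage component status values, least to most severe.
var componentStatusRank = map[string]int{
	"operational": 0, "under_maintenance": 1, "degraded_performance": 2,
	"partial_outage": 3, "major_outage": 4,
}

// mostSevere returns the highest-ranked annotation value among the firing alerts.
func mostSevere(alerts []Alert, annotation string, rank map[string]int) string {
	best := ""
	for _, a := range alerts {
		if a.Status != "firing" {
			continue // resolved alerts no longer contribute to severity
		}
		if v := a.Annotations[annotation]; best == "" || rank[v] > rank[best] {
			best = v
		}
	}
	return best
}

// severestForGroup reduces a group of alerts to a single impact and component status.
func severestForGroup(alerts []Alert) (impact, componentStatus string) {
	return mostSevere(alerts, "statuspageImpactOverride", impactRank),
		mostSevere(alerts, "statuspageComponentStatus", componentStatusRank)
}

func main() {
	// The three checkout alerts from the scenario above, all firing at once.
	alerts := []Alert{
		{Status: "firing", Annotations: map[string]string{"statuspageImpactOverride": "none", "statuspageComponentStatus": "degraded_performance"}},
		{Status: "firing", Annotations: map[string]string{"statuspageImpactOverride": "major", "statuspageComponentStatus": "partial_outage"}},
		{Status: "firing", Annotations: map[string]string{"statuspageImpactOverride": "critical", "statuspageComponentStatus": "major_outage"}},
	}
	impact, status := severestForGroup(alerts)
	fmt.Println(impact, status) // critical major_outage
}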

Configuration/Usage

  1. Decorate your Prometheus alerts with some extra labels and annotations — with the Statuspage values you want to use (see documentation) and create a webhook.
- alert: [Alert Name]
  expr: [Alert Expression]
  labels:
    statuspage: true # Used by the alert route to send to the statuspage
    statuspagePageId: abc # The statuspage id you want to update (from Statuspage)
    statuspageComponentId: 123 # The statuspage component you want to update (from Statuspage)
  annotations:
    statuspageComponentName: [Component Name] # Used in the incident title on statuspage.
    statuspageStatus: identified # identified|investigating|monitoring|resolved
    statuspageImpactOverride: critical # none|maintenance|minor|major|critical
    statuspageComponentStatus: major_outage # none|operational|under_maintenance|degraded_performance|partial_outage|major_outage
    statuspageSummary: [Summary for statuspage] # Used for display text on statuspage

- receiver: statuspage-webhook
  groupBy: ['statuspagePageId', 'statuspageComponentId']
  groupWait: 30s # Initial wait to group any other alerts which may trigger for the same group. (Default: 30s)
  groupInterval: 1m # Don't send notifications about new alerts added to the group within this interval. (Default: 5m)
  repeatInterval: 4h # Only resend the notification after this interval. (Default: 4h)
  matchers:
    - name: statuspage
      value: "true"
...
- name: statuspage-webhook
  webhookConfigs:
    - url: "http://prometheus-alerts-to-statuspage:8080/alert"

— Note: Prometheus Operator format.

2. Set an environment variable STATUSPAGE_APIKEY with the API key for your Statuspage (a sketch of how this key might be used against the Statuspage API follows after these steps).

3. If you wish to override the default templates used for the incident title, create body, update body and resolved body, you can read more here.
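
To show how the API key and the annotation values might come together, here is a rough sketch of creating an incident against the Statuspage v1 REST API in Go. The function, its parameters and the placeholder IDs are illustrative only; check the Statuspage API documentation and the project README for the exact behaviour.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

// createIncident opens a new Statuspage incident for one component, using the
// API key set in step 2 (STATUSPAGE_APIKEY).
func createIncident(pageID, title, body, impact, componentID, componentStatus string) error {
	apiKey := os.Getenv("STATUSPAGE_APIKEY")

	payload := map[string]any{
		"incident": map[string]any{
			"name":            title,
			"status":          "identified",
			"impact_override": impact,
			"body":            body,
			"component_ids":   []string{componentID},
			"components":      map[string]string{componentID: componentStatus},
		},
	}
	buf, err := json.Marshal(payload)
	if err != nil {
		return err
	}

	url := fmt.Sprintf("https://api.statuspage.io/v1/pages/%s/incidents", pageID)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(buf))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "OAuth "+apiKey)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("statuspage returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Illustrative call only: the page and component IDs are the placeholders
	// from the alert labels above, not real Statuspage IDs.
	err := createIncident("abc", "Checkout is degraded",
		"Customer checkout has a high error rate coming from our APIs",
		"critical", "123", "major_outage")
	if err != nil {
		log.Fatal(err)
	}
}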

Example

Following the checkout scenario above, we have 3 alerts.

Alert 1: Customer checkout has a high error rate coming from our APIs

statuspageImpactOverride: none
statuspageComponentStatus: degraded_performance
statuspageSummary: Customer checkout has a high error rate coming from our APIs

Alert 2: Customer checkout has seen no orders being created in the last 10 minutes.

statuspageImpactOverride: major
statuspageComponentStatus: partial_outage
statuspageSummary: Customer checkout has seen no orders being created in the last 10 minutes.

Alert 3: Customer checkout synthetic tests are failing

statuspageImpactOverride: critical
statuspageComponentStatus: major_outage
statuspageSummary: Customer checkout synthetic tests are failing

As you can see, all 3 have different impacts and component statuses (getting more severe top to bottom).

Example Flow

1. Alert 1 fires and is added to a new group. Alertmanager waits for the groupWait period for any more alerts to be added to the group. None are added, so the webhook sends.
prometheus-alerts-to-status-page creates a new incident with the generated title and generated body (including the alert summary), an impact of none, and a component status of degraded performance.

2. Alert 2 fires and is added to the existing Alertmanager group, and the webhook sends.
The incident is updated with the higher impact of major and the higher component status of partial outage.
The incident summary has been updated to also include the new alert summary.

3. Alert 3 fires and is added to the existing Alertmanager group, and the webhook sends.
The incident is updated with the higher impact of critical and the higher component status of major outage.
The incident summary has also been updated to include the new alert summary.

4. Alert 1 resolves in the existing Alertmanager group, and the webhook sends.
The incident summary is updated to show that our high error rate API has now resolved.
The incident is still open because Alerts 2 and 3 are still firing.

5. Alerts 2 and 3 resolve in the existing Alertmanager group, and the webhook sends.
Given all the alerts in the group are now resolved, the incident is resolved and the component status is back to operational.

The incident summary is updated to show that alerts 2 and 3 are now resolved.

6. Some time later, the engineering team do a post-mortem and manually add the details they would like to share to the incident in Statuspage.

— Incident: https://nathandeamer.statuspage.io/incidents/7fklpbbszkzm

All of the Alertmanager webhook bodies used in this example are available here.

Conclusion

Hopefully prometheus-alerts-to-statuspage will help if you’ve found yourself in a similar situation, or give you a base to build something custom using these ideas that meets your needs.

Disclaimer: Our current Statuspage is internal only while we migrate our alerts to include the extra annotations needed. Watch this space.
