How we do Alerting and Escalation

Mike Gordon
Hippo Engineering Blog
Sep 7, 2022

When I first joined Hippo in late 2019, if there was an outage in one of our systems, our engineering team would normally hear about it from our production support team. One of our agents would be using the app and it would stop working, or they would discover a bug. This led to longer lead times to fix problems and less trust in our in-house developed systems.

Last year (2021) we were finally in a position to automate our outage detection in a way that would meaningfully benefit our business. We implemented a set of organization-wide alerts that could notify our engineering on-call through PagerDuty alerts to cell phones or through a Slack channel. We set up an alerts channel and defined a small number of P0 and P1 alerts to tell us if something was wrong before one of our users detected it.

It was imperative that we keep the number of alerts small, in order to build trust in alerting and monitoring and avoid too much noise. We capped organization-wide alerts at 10 important alerts of each severity (P0 and P1). Teams could still implement their own team-level alerts if they wanted to watch a service or system more closely.

Now we detect almost every technical problem or outage long before our business does, and we are often able to adjust and take care of it quickly. This is the story of what we did and how we did it.

Real-time Metrics

In order to alert our team about problems that occur with our systems, we need some way to monitor them in real time. We had been using a log ingestion system for a while, but the purpose of logs and the purpose of real-time monitoring are different. Logs are there to provide us with detailed information about problems. Some log management products also provide aggregation and alerting, but as our system grew, we outgrew the logging system’s ability to monitor our services on its own.

We started to use Prometheus for real-time metrics. Prometheus provides a time-series database that makes it easy to query rates and trends. Services expose an HTTP endpoint for Prometheus to scrape, so metrics are collected using a “pull” method. Many Prometheus libraries and plugins provide default metrics for HTTP services, and they also make it easy to create and instrument custom metrics for events.
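As a rough sketch of what that can look like in practice (assuming a Node.js service using Express and the prom-client library, which may or may not match your stack), exposing default metrics for Prometheus to pull takes only a few lines:

import express from 'express';
import client from 'prom-client';

// Register default process metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics();

const app = express();

// The endpoint Prometheus scrapes on its "pull" cycle
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

// The port is arbitrary for this sketch
app.listen(3000);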

For example, let’s say we want to capture an event every time a customer performs a certain action on our site, like getting a quote for insurance. We have a back end API that issues a quote. We could instrument that back end API to have a quoted event that’s recorded in real time through Prometheus. The pseudo-code would look something like:

metrics.createEvent('quoted', typeOfProperty, priceRange);

Notice that the example parameters are values that would generally be thought of as enums. That’s not a mistake. Prometheus has a limit on the cardinality of dimensions. We can record multiple dimensions within an event, but each dimension should have a small number of possible values. If we used a wide range of values, Prometheus would run out of space in the underlying database. Here is a deeper discussion of that cardinality.
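To make the pseudo-code above a bit more concrete, here is a hedged sketch using the prom-client library for Node.js; the metric name and label values are placeholders rather than our real ones, and both labels deliberately take a small, fixed set of values:

import client from 'prom-client';

// Counter for issued quotes; both labels are enum-like to keep cardinality low
const quotesIssued = new client.Counter({
  name: 'quotes_issued_total',
  help: 'Insurance quotes issued, by property type and price range',
  labelNames: ['type_of_property', 'price_range'],
});

// Inside the back end API's quote handler:
quotesIssued.inc({ type_of_property: 'single_family', price_range: 'under_500k' });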

One more thing to note is that our real-time metrics system is not meant to store data forever. We allocate enough storage space to cover roughly a 14- or 30-day window, and when the system runs out of space, older events are deleted. Some organizations might need to retain metrics for longer, so your retention window could vary.

Below is a simple summary of our monitoring and alerting systems. We have real-time alerts as described above. We also use a business-alerting system that notifies us when business metrics change, driven by queries against our data warehouse. The two combined let us watch both our distributed system and the health of the business at once:

A simplified summary of our monitoring and alerting at Hippo

Defining and tuning alerts

The first thing we needed to do was figure out what we wanted to monitor. Some of this was easy, since for every service we run — internal or external — we want to know:

  • Error Rate. What is the error rate? Do we see an error rate higher than our stated goal?
  • Latency. What’s the latency of the median call to that service (also called 50p for 50th percentile)? What about the 95p call? Is the latency higher than we designed?
  • Throughput. What’s the rate of requests to each service in our system? Do we see an unexpected jump or drop in that rate?

Most scalable, well-constructed APIs should have an error rate of 0.1% or less, depending on what kind of dependencies they have. I believe it’s OK to start with a much higher rate for alert thresholds and bring it down over time. For latency I like to use the guideline of no more than 1 second for 95p, but latency can be much more variable than error rate, depending on what your service does.
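To make the latency and error-rate side of this concrete, here is a sketch of request instrumentation, again assuming Express and prom-client; the metric name and bucket boundaries are illustrative. A histogram of request durations is what later lets you query the 50p and 95p, and the status code label feeds the error-rate and throughput checks:

import express from 'express';
import client from 'prom-client';

// Request duration histogram; buckets roughly bracket our latency goals (values are illustrative)
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['route', 'status_code'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

const app = express();

// Time every request; use the route template (not the raw URL) to keep label cardinality low
app.use((req, res, next) => {
  const stopTimer = httpDuration.startTimer();
  res.on('finish', () => {
    stopTimer({ route: req.route?.path ?? req.path, status_code: String(res.statusCode) });
  });
  next();
});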

Some more examples of metrics we might watch:

  • Queue size for async systems. If the queue grows, it could mean our system has stopped processing items (there’s a sketch after this list).
  • Dead letter queue size. Often indicates errors in async systems.
  • Memory/CPU. If a service could be memory or CPU bound.
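
Queue depth, for example, is a natural fit for a gauge. A hedged sketch, where getQueueDepth() is a hypothetical stand-in for however your queue exposes its backlog:

import client from 'prom-client';

// Hypothetical helper; replace with a real call to your queue (SQS, RabbitMQ, etc.)
async function getQueueDepth(): Promise<number> {
  return 0;
}

// The collect() hook runs on every Prometheus scrape, so the reported value stays current
const queueDepth = new client.Gauge({
  name: 'work_queue_depth',
  help: 'Items waiting in the async work queue',
  async collect() {
    this.set(await getQueueDepth());
  },
});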

I’ll introduce a very simple example of tuning an error rate alert. Your service and error rate may vary, and you will probably also want to gather a longer history of data than I’m using for this example. Let’s say we have a service error rate graph over the last 12 hours that looks like this:

Sample service error rate graph

We can draw a few conclusions to tune an alert:

  • Between 7:00am and 3:30pm the peak error rate is around 1.75%. That error rate lasts for less than 10 minutes.
  • There is a higher, short spike before 7am.
  • The typical sustained error rate for this service hovers below 1%.
  • I would start by configuring an alert to detect an error rate above 1% for a sustained 15 or 20 minutes (sketched below).
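
The real alert lives in our alerting rules (in Prometheus terms, an expression plus a “for” duration), but the logic being tuned is simply “over the threshold for the whole window.” A rough, hypothetical sketch of that check, just to make the two knobs explicit:

// Illustrative only: the alerting system does this evaluation for you
interface Sample {
  timestampMs: number; // sample time, assumed sorted ascending
  errorRate: number;   // e.g. 0.01 is 1%
}

// Fire only when samples span (most of) the trailing window
// and every sample in that window exceeds the threshold
function shouldFire(samples: Sample[], threshold = 0.01, sustainedMs = 15 * 60 * 1000): boolean {
  if (samples.length === 0) return false;
  const now = samples[samples.length - 1].timestampMs;
  const window = samples.filter((s) => now - s.timestampMs <= sustainedMs);
  const spansWindow = now - window[0].timestampMs >= 0.9 * sustainedMs;
  return spansWindow && window.every((s) => s.errorRate > threshold);
}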

We normally have this alert post to a Slack channel, then watch it for a few days. If it fires false alarms, we consider fixing the underlying service, or raising the alert threshold if that can’t be done.

If we’re not seeing it fire at all and we’re not satisfied with the service performance, we may adjust the thresholds of both rate and time down a little to detect outages more consistently.

Sometimes we have a service that usually has a zero or very low error rate, where any spike in error rate could indicate an outage. The example below has a short, high spike in error rate. Since it lasts only a few minutes, we probably want to ignore it as a blip or network hiccup and set the alert’s duration threshold longer than the spike:

Another service error rate example, with a spike in errors

Our escalation process

Once we had automated alerts firing, we had to define an escalation process. The basic idea of an escalation process is to have a script that our engineer on call can follow to notify the right people about a problem and fix it as quickly as possible. Sometimes when an unfamiliar issue pops up in our system, it’s easy to panic and not know who to contact next for help. The escalation process removes that doubt and gives clarity to our on-call.

We set up a simple set of steps to escalate:

  1. Investigate the problem to classify it as a real problem or a false alarm. Every alert in our P0 and P1 alert list has a playbook describing what to do if that alarm shows up.
  2. If the problem is a real problem, alert our user support team about the issue. If it can be fixed or mitigated quickly (within less than an hour), fix it.
  3. If it can’t be fixed and the issue is P0, open a Slack channel and a Jira ticket describing the issue.
  4. Add the right people to the channel to help fix it.

These simple steps give the on-call instructions on what to do when an alert shows up. They don’t need to know how to fix it, only what to do when they see it and how to contact someone to help fix it. Our team resolves most issues within a couple of hours.

We have other paths of escalation, such as user complaints about problems with the system, which get injected into this process in other ways but are probably beyond the scope of this short post.

What can you do as a small organization?

Not every organization has a 150-strong engineering team. Maybe you don’t even have 25. What can you do with a smaller team to achieve some of the same results?

Most log management systems have an alerting feature or function. With a small volume of traffic, it’s easy to set up alerts and thresholds all in one place. We use Loggly for processing logs, which has alerting functions that let you configure alerts to e-mail or Slack channels. That’s perfectly adequate for a 10-person team to set up some basic alerting. It’s also not that difficult to set up a Prometheus/Grafana combo and allow metric scraping from a running service. A good dev/devops team can prototype this in less than a day.
