This article will teach you how my software engineering teams have learned to cut through the lies and uncover the glorious truth that your metrics are begging to tell you.

Metrics exist because humans don’t have the capacity to fully understand how software systems behave. Two reasons for this: 1.) they are big, and 2.) we observe them indirectly (you can’t see inside a CPU).

Metrics are a proxy for reality. Here are the lessons I’ve learned to use metrics effectively to tell reality’s story.

Lesson 1: Use Percentiles

Alternate title: avoid averages like the plague.

Why do averages suck? 🤔

Consider this example: you have some code which measures how long your users wait for a page to load. You have collected 6 data points, in milliseconds: 62, 920, 37, 20, 850, and 45. If you average these load times, you get 322. But 322ms is not representative of your users’ experience. From this data, it’s clear some of your users are having a very fast experience (less than 70ms), and some are having a very slow experience (greater than 850ms). But none of them are having a mathematically average experience. Bimodal distributions like this are very common when measuring response times. Using an average can mislead you.

How to replace averages? 🤔

The way to avoid being misled is to use percentiles. A good place to start is P50 and P90. To compute P50, which is really just the median, sort the data points in ascending order: 20, 37, 45, 62, 850, 920. You get P50 by throwing out the bottom 50% of the points and looking at the first point that remains: 62ms. You get P90 by throwing out the bottom 90% of the points and looking at the first point that remains: 920ms.
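Here's a minimal sketch of that "throw out the bottom X%" method in Python (many production percentile implementations interpolate between points, but the idea is the same):

```python
# A toy percentile function using the index-based method described above.
def percentile(samples, p):
    """Drop the bottom p percent of samples and return the first point left."""
    ordered = sorted(samples)
    index = int(len(ordered) * p / 100)           # how many points to throw out
    return ordered[min(index, len(ordered) - 1)]  # clamp so p=100 stays in range

load_times_ms = [62, 920, 37, 20, 850, 45]

print(percentile(load_times_ms, 50))            # 62  -> the typical user
print(percentile(load_times_ms, 90))            # 920 -> the slowest users
print(sum(load_times_ms) / len(load_times_ms))  # ~322 -> matches nobody
```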

Using percentiles has these advantages:

  1. Percentiles aren’t skewed by outliers like averages are.

You can plot percentiles on a time series graph just like averages. My current team plots P50, P90, P99, and P99.9. This is pretty common. We also have separate alarm thresholds for each percentile. Our P50 latency, for example, tends to be less than one third of our P99 latency (and less than a fifth of P99.9), so we have different thresholds for alarming on these different percentiles.

Percentile metrics work for more than just latency. Imagine you have a set of 50 web servers, and you want to know how much memory they are using. If you look at the average memory utilization across all hosts, you won’t necessarily see a representative picture. Instead, use the P0, P50, and P100 of all the hosts’ memory. Then you have a nice picture that tells you the lowest usage, highest usage and median usage. As a bonus, you also get to see the difference between the high and low host, giving you valuable insights into how your application behaves on multiple hosts.
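As a quick sketch (the host names and utilization numbers here are made up), the fleet-wide view is just the sorted list of per-host values:

```python
# Made-up snapshot of memory utilization (%) across a few hosts.
memory_pct = {"web-01": 41, "web-02": 44, "web-03": 97, "web-04": 39, "web-05": 46}

values = sorted(memory_pct.values())
p0, p50, p100 = values[0], values[len(values) // 2], values[-1]

print(f"P0={p0}%  P50={p50}%  P100={p100}%  spread={p100 - p0}%")
# P0=39%  P50=44%  P100=97%  spread=58%  -> one host (web-03) is way out of line
```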

When you use percentile-based metrics, you get a much better sense for reality.

Lesson 2: Use Alarms

This one is obvious. A metric without an alarm is a manual chore you’ve assigned yourself, dooming you to one of two fates:

  • Fate 1: You’ll never look at it

  • Fate 2: You’ll waste time checking it by hand, forever

How to avoid these fates? 🤔

For any metric worth tracking, there should be some threshold that causes you to get notified.

Creating good alarms is the key to metric productivity, but alarms can also bury you in false positives (commonly called “spurious alarms”), eventually training you to ignore them entirely. There are ways to avoid this pitfall:

Preventing Alarm Noise: Grace Periods

For every alarm, you should not only have a threshold, but also a grace period before the alarm fires. In other words, the metric must show some “sticking power” when it crosses the threshold before it turns into an alarm. For example, when some disk or router stalls out for a few seconds, but recovers on its own, we don’t want our on-call engineer to get paged at 3am.

Grace periods can be time-based or sample-based. A time-based grace period requires that a metric stay past the alarm threshold for a given amount of time before the alarm fires. If you know that your CPU occasionally spikes to 90% for a few seconds (say, when a cron job runs), and you have an alarm threshold at 50%, you can add a 60-second grace period to the alarm so those brief spikes don’t trigger it.

Take the metric’s sampling period into account when setting the grace period. If your metric only reports every 5 minutes and you set a 5-minute grace period, the grace window never holds more than one sample, so you get no benefit.

Many alarm systems will allow you to set up two grace periods: one for getting into an alarm state, and one for getting out of an alarm state. An alarm will only clear when the metric moves back into safe territory, and stays there.
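Here's a rough sketch of how both grace periods might be implemented. The 60-second enter grace matches the CPU example above; the 50% threshold is the same, and the 5-minute clear grace is an arbitrary illustrative choice:

```python
import time

# A sketch of a time-based grace period with separate "enter" and "clear" windows.
class Alarm:
    def __init__(self, threshold, enter_grace_s=60, clear_grace_s=300):
        self.threshold = threshold
        self.enter_grace_s = enter_grace_s
        self.clear_grace_s = clear_grace_s
        self.breach_since = None  # when the metric first crossed the threshold
        self.ok_since = None      # when the metric moved back into safe territory
        self.firing = False

    def observe(self, value, now=None):
        now = now if now is not None else time.time()
        if value > self.threshold:
            self.ok_since = None
            self.breach_since = self.breach_since or now
            # Fire only once the breach shows some "sticking power".
            if now - self.breach_since >= self.enter_grace_s:
                self.firing = True
        else:
            self.breach_since = None
            self.ok_since = self.ok_since or now
            # Clear only once the metric has stayed safe for a while.
            if self.firing and now - self.ok_since >= self.clear_grace_s:
                self.firing = False
        return self.firing

cpu_alarm = Alarm(threshold=50)  # percent CPU, as in the example above
```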

Preventing Alarm Noise: Daytime Alarms

Some alarms are not important enough to get out of bed for in the middle of the night. A daytime alarm will send an alert, but only during normal working hours. If such an alarm happens at night, the system will wait to notify you until the next day. Smart daytime alarms can also avoid weekends!

My team uses daytime alarms for early warning indicators. For example, if available host memory drops below a warning level, but not to a dangerous one, the system will send us a daytime alarm. But if available memory drops to a dangerous level, the system will send us a regular alarm, regardless of the time of day or day of week.
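A sketch of the routing logic, assuming a simple 9-to-5, Monday-through-Friday definition of “working hours” (the severity names and print statements are stand-ins for a real paging integration, not any particular product’s API):

```python
from datetime import datetime, timedelta

WORK_START, WORK_END = 9, 17  # assumed working hours, local time

def next_working_time(now):
    """Roll forward to the next weekday at WORK_START."""
    t = now
    if t.hour >= WORK_END:
        t = (t + timedelta(days=1)).replace(hour=WORK_START, minute=0, second=0)
    elif t.hour < WORK_START:
        t = t.replace(hour=WORK_START, minute=0, second=0)
    while t.weekday() >= 5:  # skip Saturday (5) and Sunday (6)
        t = (t + timedelta(days=1)).replace(hour=WORK_START, minute=0, second=0)
    return t

def route_alarm(name, severity, now=None):
    now = now or datetime.now()
    if severity == "page":
        print(f"PAGE the on-call now: {name}")                # dangerous, any hour
    else:
        print(f"Notify at {next_working_time(now)}: {name}")  # early warning

route_alarm("memory-warning", "daytime", now=datetime(2024, 1, 6, 3, 0))  # Saturday, 3am
# -> Notify at 2024-01-08 09:00:00: memory-warning  (Monday morning)
```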

Lesson 3: Adaptive Thresholds

Most metrics have predictable cycles. Perhaps your traffic grows during daytime hours and shrinks during nighttime hours. A flat threshold won’t notify you if there is an unusual traffic spike at night.

My team has tools that predict your metric values based on historical data. Using that history, you can set up alarm thresholds that vary by time of day or day of week.
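I can’t share those tools, but here’s a rough sketch of the idea: bucket a few weeks of history by hour of the week and derive a threshold per bucket. The mean-plus-three-standard-deviations margin is an illustrative choice, not what my team actually uses:

```python
from collections import defaultdict
from statistics import mean, stdev

def hourly_thresholds(history):
    """history: iterable of (datetime, value) samples from the past few weeks."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {
        bucket: mean(values) + 3 * (stdev(values) if len(values) > 1 else 0)
        for bucket, values in buckets.items()
    }

def is_anomalous(ts, value, thresholds):
    # Hours we've never observed get no threshold, so they never alarm here.
    return value > thresholds.get((ts.weekday(), ts.hour), float("inf"))
```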

But even if it doesn’t make sense to adapt your alarm thresholds based on the time of day, you should still change your thresholds as your software evolves. More on this in Lesson 5.

Lesson 4: Missing Metrics

It’s natural to think about alarm thresholds, but what if a metric stops reporting? The absence of a metric can be just as bad as a metric that breaches a threshold. For every alarm you create, you should have a corresponding alarm that fires if the metric stops reporting. Like regular alarms, these should also have a grace period to avoid false alarms if the metric recovers on its own (or maybe use a daytime alarm in this case).
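A missing-metric check can be very simple; as a sketch, it boils down to “when did we last see a data point?” (the 15-minute cutoff and 5-minute grace are illustrative values):

```python
import time

MISSING_AFTER_S = 15 * 60  # how long with no data before we worry
MISSING_GRACE_S = 5 * 60   # extra slack so a brief gap can recover on its own

def metric_is_missing(last_seen_epoch_s, now=None):
    """True if the metric hasn't reported for longer than the cutoff plus grace."""
    now = now if now is not None else time.time()
    return (now - last_seen_epoch_s) > (MISSING_AFTER_S + MISSING_GRACE_S)
```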

Lesson 5: Regular Automated Reviews

Things change. Make sure your alarms change with them. For example, let’s say you recently shipped some new code which drastically improves the runtime speed of your application. You have some alarms that will tell you if performance gets worse, but their thresholds are based on the old (slow) performance. These need to be updated now, but how would you remember that? The answer is that you should build mechanisms to take care of this for you.

So how can you keep your thresholds up to date in a way that scales with the growing complexity of your software? My team has a tool that analyzes all our current alarms, inspects the metrics data over the past few weeks, and then proposes changes to our thresholds. If a metric has moved farther away from its alarm threshold, the tool proposes a tighter threshold. We hold a weekly review meeting, and this tool lets us know which alarm thresholds need to be reconsidered. All we have to do is answer “yes” or “no” to each proposal, and it will automatically update the threshold.
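I can’t share the tool itself, but the core idea fits in a few lines: look at recent data, estimate the metric’s worst case, and propose a tighter threshold when there’s a lot of unused headroom. The P99.9 “worst case” and the 1.5x headroom factor below are assumptions, not the tool’s actual rules:

```python
def propose_threshold(recent_samples, current_threshold, headroom=1.5):
    """Return a proposed (tighter) threshold, or None if the current one still fits."""
    ordered = sorted(recent_samples)
    worst = ordered[min(int(len(ordered) * 0.999), len(ordered) - 1)]  # ~P99.9
    proposed = worst * headroom
    # Only propose a change when the metric has clearly moved away from its threshold.
    return proposed if proposed < current_threshold else None

# In the weekly review, a human answers "yes" or "no" to each non-None proposal.
```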

Automation is not a silver bullet, though. You should also be familiar with your metrics’ regular shape and behavior. To achieve this, create dashboards that can be consumed in just a few minutes of scrolling. Most metrics systems provide a dashboarding feature that makes this easy. You and your team should set aside regular time where you can visually review your metrics and look for anomalies.

Lesson 6: What to Measure

I like the Etsy philosophy: measure everything. But sometimes I need a little inspiration to think of the specific metrics I should be measuring. To help get the creative juices flowing, I like to ask myself these kinds of questions:

  • What are my customers trying to achieve when they use my software?

If you’re building a web application or web service, here are some more specific ideas to get you started:

  • P50, P90, P99 latency. This is the amount of time the server spends processing each HTTP request, between the time the request arrives at your code, and the time your code generates the response. Slice this per URL pattern.
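If your framework doesn’t already expose this, a decorator around each handler is one rough way to get it. The emit_metric() function here is a stand-in for whatever your metrics client provides (a statsd timing call, for example); your metrics backend then computes P50/P90/P99 over those samples per pattern:

```python
import time
from functools import wraps

def emit_metric(name, value_ms):
    """Stand-in for a real metrics client (e.g. a statsd timing call)."""
    print(f"{name}: {value_ms:.1f}ms")

def timed(url_pattern):
    """Measure server-side processing time for one URL pattern."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return handler(*args, **kwargs)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                # One metric name per URL pattern, not per raw URL.
                emit_metric(f"latency.{url_pattern}", elapsed_ms)
        return wrapper
    return decorator

@timed("users.show")
def show_user(user_id):
    time.sleep(0.05)  # stand-in for real request handling
    return {"id": user_id}
```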

Lesson 7: Outsource Your Metric Infrastructure

Don’t build your own metrics infrastructure. There are tons of good options out there. I’ve used statsd from Etsy, Graphite, Librato, Splunk and many others. Don’t reinvent the wheel!
