How to metric

This article will teach you how my software engineering teams have learned to cut through the lies and uncover the glorious truth that your metrics are begging to tell you.

Metrics exist because humans don’t have the capacity to fully understand how software systems behave. There are two reasons for this: (1) the systems are big, and (2) we can only observe them indirectly (you can’t see inside a CPU).

Metrics are a proxy for reality. Here are the lessons I’ve learned about using metrics effectively to tell reality’s story.

Lesson 1: Use Percentiles

Alternate title: avoid averages like the plague.

Why do averages suck? 🤔

Consider this example: you have some code which measures how long your users wait for a page to load. You have collected six data points, in milliseconds: 62, 920, 37, 20, 850, and 45. If you average these load times, you get 322. But 322ms is not representative of your users’ experience. From this data, it’s clear some of your users are having a very fast experience (less than 70ms), and some are having a very slow experience (850ms or more). But none of them are having a mathematically average experience. Bimodal distributions like this are very common when measuring response times. Using an average can mislead you.

How to fix it? 🤔

The way to avoid being misled is to use percentiles. A good place to start is P50 and P90. To compute P50, which is essentially the median, sort the data points in ascending order: 20, 37, 45, 62, 850, 920. You get P50 by throwing out the bottom 50% of the points (three of the six) and looking at the first point that remains: 62ms. You get P90 by throwing out the bottom 90% of the points (5.4 of the six, rounded down to five) and looking at the first point that remains: 920ms.
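Here is a minimal sketch of that throw-out-the-bottom method in Python. The function name `percentile` and the choice to round the discard count down are my assumptions for illustration; many libraries interpolate between neighboring points instead, so their results can differ slightly from this one.

```python
import math

def percentile(samples, p):
    """Return the p-th percentile using the 'discard the bottom p%' method.

    Sort ascending, drop the lowest p% of points (rounded down), and return
    the first point that remains. Every result is a real observed sample.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Number of points to throw away; clamp so that P100 returns the maximum.
    discard = min(math.floor(len(ordered) * p / 100), len(ordered) - 1)
    return ordered[discard]

load_times_ms = [62, 920, 37, 20, 850, 45]
mean_ms = sum(load_times_ms) / len(load_times_ms)
p50 = percentile(load_times_ms, 50)
p90 = percentile(load_times_ms, 90)
print(round(mean_ms), p50, p90)  # 322 62 920
```

Notice that 62 and 920 both appear in the raw data, while 322 does not.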

Using percentiles has these advantages:

  1. Percentiles aren’t skewed by outliers like averages are.
  2. Every percentile data point is an actual user experience, unlike averages.

You can plot percentiles on a time series graph just like averages. My current team plots P50, P90, P99, and P99.9. This is pretty common. We also have separate alarm thresholds for each percentile. Our P50 latency, for example, tends to be less than one third of our P99 latency (and less than a fifth of P99.9), so we have different thresholds for alarming on these different percentiles.

Percentile metrics work for more than just latency. Imagine you have a set of 50 web servers, and you want to know how much memory they are using. If you look at the average memory utilization across all hosts, you won’t necessarily see a representative picture. Instead, use the P0, P50, and P100 of all the hosts’ memory. Then you have a nice picture that tells you the lowest usage, highest usage and median usage. As a bonus, you also get to see the difference between the high and low host, giving you valuable insights into how your application behaves on multiple hosts.
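To make the fleet example concrete, here is a small sketch that reports P0, P50, and P100 along with the host behind each number. The host names, the values, and the `fleet_summary` helper are all hypothetical.

```python
def fleet_summary(memory_pct_by_host):
    """Summarize per-host memory utilization as P0, P50, P100 plus the spread."""
    if not memory_pct_by_host:
        raise ValueError("need at least one host")
    # Sort (host, value) pairs by value so each percentile maps to a real host.
    ranked = sorted(memory_pct_by_host.items(), key=lambda item: item[1])
    p0 = ranked[0]
    p50 = ranked[len(ranked) // 2]
    p100 = ranked[-1]
    return {"p0": p0, "p50": p50, "p100": p100, "spread": p100[1] - p0[1]}

# Hypothetical fleet of five hosts; values are percent of memory in use.
fleet = {"web-1": 41.0, "web-2": 43.5, "web-3": 39.0, "web-4": 88.0, "web-5": 42.0}
summary = fleet_summary(fleet)
print(summary["p100"], summary["spread"])  # ('web-4', 88.0) 49.0
```

The average of this fleet is about 51%, which describes none of the hosts. The P100 line points straight at the one host that needs attention.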

When you use percentile-based metrics, you get a much better sense for reality.

Lesson 2: Use Alarms

This one is obvious. A metric without an alarm is a manual chore you’ve assigned yourself, dooming you to one of two fates:

  • Fate 1: You’ll never look at it (what’s the point of having the metric?)
  • Fate 2: You’ll force yourself to look at it on a routine schedule (boring)

How to avoid these fates? 🤔

For any metric worth tracking, there should be some threshold that causes you to get notified.

Creating good alarms is the key to metric productivity, but it can also bury you in false positives (commonly called “spurious alarms”), eventually causing you to ignore alarms. There are ways to avoid this pitfall:

Grace periods

For every alarm, you should not only have a threshold, but also a grace period before the alarm fires. In other words, the metric must show some “grit” or “sticking power” when it crosses the threshold before it turns into an alarm. This helps prevent false positives that inevitably crop up. For example, when some disk or router stalls out for a few seconds, but recovers on its own, we don’t want our on-call engineer to get paged at 3am.

Grace periods can be time based or sample based. A time based grace period ensures that a metric stays past the alarm threshold for a given amount of time before it turns into an alarm. A sample based grace period does the same thing by counting consecutive data points instead of seconds. If you know that your CPU occasionally spikes to 90% for a few seconds (say, when a cron job runs), and you have an alarm threshold at 50%, you can add a 60-second grace period to the alarm, to prevent this scenario from alarming.
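A time based grace period can be sketched in a few lines. This is an illustration, not how any particular alarm system implements it; the `first_alarm_time` function and the CPU readings are made up.

```python
def first_alarm_time(samples, threshold, grace_seconds):
    """Return the timestamp at which the alarm fires, or None if it never does.

    samples: iterable of (timestamp_seconds, value) pairs in time order.
    The alarm fires once the value has stayed above the threshold for at
    least grace_seconds without dropping back to or below it.
    """
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
            if timestamp - breach_started >= grace_seconds:
                return timestamp
        else:
            breach_started = None  # recovered on its own; reset the clock
    return None

# Hypothetical CPU readings, one every 10 seconds.
cron_spike = [(0, 30), (10, 90), (20, 92), (30, 35), (40, 33)]
sustained = [(t, 95) for t in range(0, 100, 10)]

print(first_alarm_time(cron_spike, threshold=50, grace_seconds=60))  # None
print(first_alarm_time(sustained, threshold=50, grace_seconds=60))   # 60
```

The short cron spike never pages anyone, while the sustained breach fires after a full minute above the threshold.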

Many alarm systems will allow you to set up a grace period for getting into an alarm state, and a separate grace period for getting out of an alarm state. This way, if a metric starts alarming, it will only stop when it moves back into safe territory, and stays there.
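Separate grace periods for entering and leaving the alarm state amount to a small state machine. The sketch below counts consecutive samples rather than seconds; the `Alarm` class and its parameter names are assumptions of mine, not any vendor's API.

```python
class Alarm:
    """Alarm with separate sample-based grace periods for entering and exiting."""

    def __init__(self, threshold, enter_samples, exit_samples):
        self.threshold = threshold
        self.enter_samples = enter_samples  # breaching samples needed to fire
        self.exit_samples = exit_samples    # healthy samples needed to clear
        self.in_alarm = False
        self._streak = 0  # consecutive samples that disagree with current state

    def observe(self, value):
        """Feed one sample and return whether the alarm is currently firing."""
        breaching = value > self.threshold
        if breaching != self.in_alarm:
            self._streak += 1
            needed = self.exit_samples if self.in_alarm else self.enter_samples
            if self._streak >= needed:
                self.in_alarm = breaching
                self._streak = 0
        else:
            self._streak = 0  # sample agrees with current state; reset the count
        return self.in_alarm

alarm = Alarm(threshold=50, enter_samples=3, exit_samples=5)
readings = [90, 90, 90, 40, 90, 40, 40, 40, 40, 40]
states = [alarm.observe(v) for v in readings]
```

In this run the alarm fires on the third breaching sample, ignores the single healthy reading that follows, and only clears after five healthy readings in a row.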

Daytime Alarms

Some alarms are valuable, but not valuable enough to get out of bed in the middle of the night. A daytime alarm will send an alert, but only during normal working hours. If such an alarm happens at night, the system will wait to notify you until the next day. Smart daytime alarms can also avoid weekends!

My team uses daytime alarms for early warning indicators. For example, if available host memory drops, but not dangerously low, the system will send us a daytime alarm. But if available memory drops to a dangerous level, the system will send us a regular alarm, regardless of the time of day or day of week.
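The routing decision behind a daytime alarm is simple enough to sketch. Everything here is an assumption for illustration: the 9:00 to 17:00 window, the Monday to Friday week, the `notify_at` name, and the use of naive local datetimes.

```python
from datetime import datetime, time, timedelta

WORK_START = time(9, 0)   # assumed working hours; adjust for your team
WORK_END = time(17, 0)

def notify_at(fired_at, urgent=False):
    """Return when the notification for an alarm fired at `fired_at` goes out."""
    if urgent:
        return fired_at  # regular alarm: notify immediately, any hour, any day
    is_weekday = fired_at.weekday() < 5  # Monday=0 ... Friday=4
    if is_weekday and WORK_START <= fired_at.time() < WORK_END:
        return fired_at
    # Otherwise defer to the start of the next working day.
    day = fired_at.date()
    if not (is_weekday and fired_at.time() < WORK_START):
        day += timedelta(days=1)  # today's window is over, or it is a weekend
    while day.weekday() >= 5:
        day += timedelta(days=1)  # skip Saturday and Sunday
    return datetime.combine(day, WORK_START)

friday_night = datetime(2024, 1, 5, 23, 30)
print(notify_at(friday_night))               # 2024-01-08 09:00:00
print(notify_at(friday_night, urgent=True))  # 2024-01-05 23:30:00
```

The low-memory early warning waits until Monday morning, while the dangerous-level alarm goes out on Friday night as usual.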

Lesson 3: Adaptive Thresholds

Most metrics have predictable cycles. Perhaps your traffic grows during daytime hours and shrinks during nighttime hours. A flat threshold tuned for the daytime peak won’t notify you if there is an unusual traffic spike at night.

My team has tools which will predict your metric values based on historical data points. Using historical data points, you can set up alarm thresholds which vary by time of day or day of week.
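I can’t share my team’s tools, but the idea can be sketched with nothing more than a per-hour mean and standard deviation. The three-sigma rule, the function names, and the traffic numbers below are assumptions chosen for illustration; real prediction tools are usually more sophisticated.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hourly_thresholds(history, sigmas=3.0):
    """Build an upper alarm threshold for each hour of the day.

    history: iterable of (hour_of_day, value) pairs from past weeks.
    Threshold = mean + sigmas * standard deviation of that hour's values.
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {h: mean(vals) + sigmas * pstdev(vals) for h, vals in by_hour.items()}

def is_anomalous(thresholds, hour, value):
    """True when value exceeds the threshold learned for that hour."""
    return value > thresholds[hour]

# Hypothetical requests-per-second history: busy at 14:00, quiet at 03:00.
history = [(14, 1000), (14, 1100), (14, 900), (3, 100), (3, 110), (3, 90)]
thresholds = hourly_thresholds(history)

print(is_anomalous(thresholds, 3, 500))   # True: unusual for 03:00
print(is_anomalous(thresholds, 14, 500))  # False: quiet for 14:00
```

A single flat threshold set just above the afternoon peak would have let the 03:00 spike through unnoticed.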

But even if it doesn’t make sense to adapt your alarm thresholds based on the time of day, you should still change your thresholds as your software evolves. More on this in Lesson 5.

Lesson 4: Missing Metrics

It’s natural to think about alarm thresholds, but what if a metric stops reporting altogether? The absence of a metric can be just as bad as a metric that crosses a threshold. For every alarm you create, you should have a corresponding alarm that fires if the metric stops reporting. Like regular alarms, these should also have a grace period to avoid false alarms if the metric recovers on its own (or maybe use a daytime alarm in this case).
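A missing-metric check is mostly a comparison of timestamps. In this sketch the function name and the default of three missed reports are assumptions; pick a grace period that matches how often your metric normally reports.

```python
def is_metric_missing(last_report_time, now, expected_interval, grace_intervals=3):
    """True when the metric has been silent for longer than the grace period.

    All times are in seconds. The metric is expected every expected_interval
    seconds; tolerate up to grace_intervals missed reports before alarming.
    """
    silence = now - last_report_time
    return silence > expected_interval * grace_intervals

# A metric that normally reports once a minute.
print(is_metric_missing(last_report_time=1000, now=1100, expected_interval=60))  # False
print(is_metric_missing(last_report_time=1000, now=1200, expected_interval=60))  # True
```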

Lesson 5: Regular Automated Reviews

Things change, and when they do, make sure your alarms change with them. For example, let’s say you recently shipped some new code which drastically improves the runtime speed of your application. You have some alarms that will tell you if performance gets worse, but their thresholds are based on the old (slow) performance. These need to be updated now, but how would you remember that? The answer is that you should build mechanisms to take care of this for you.

So how can you keep your thresholds up to date in a way that scales with the growing complexity of your software? My team has a tool that analyzes all our current alarms, inspects the metrics data over the past few weeks, and then proposes changes to our thresholds. If a metric has moved farther away from its alarm threshold, the tool proposes a tighter threshold. We hold a weekly review meeting, and this tool lets us know which alarm thresholds need to be reconsidered. All we have to do is answer “yes” or “no” to each proposal, and it will automatically update the threshold.
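The core of such a tool can be very small. The sketch below is not my team’s tool; the `propose_threshold` name, the 25% headroom above the recent peak, and the 10% minimum change are all assumptions picked for illustration.

```python
def propose_threshold(recent_values, current_threshold, headroom=0.25, min_change=0.10):
    """Suggest a tighter upper threshold, or None if the current one still fits.

    The proposal is the recent peak plus `headroom` (25% by default). It is
    only suggested when it is at least `min_change` (10%) below the current
    threshold, so small fluctuations do not generate noise in the review.
    """
    if not recent_values:
        return None
    proposed = max(recent_values) * (1 + headroom)
    if proposed <= current_threshold * (1 - min_change):
        return proposed
    return None

# Hypothetical latency in ms over the past few weeks, with a 500ms alarm.
after_speedup = [80, 95, 100]
unchanged = [380, 390, 400]

print(propose_threshold(after_speedup, current_threshold=500))  # 125.0
print(propose_threshold(unchanged, current_threshold=500))      # None
```

Each non-None result becomes one “yes” or “no” question in the weekly review.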

Automation is not a silver bullet, though. You should also be familiar with your metrics’ regular shape and behavior. To achieve this, create dashboards that can be consumed in just a few minutes of scrolling. Most metrics systems provide a dashboarding feature that makes this easy. You and your team should set aside regular time where you can visually review your metrics and look for anomalies.

Lesson 6: What to Measure

I like the Etsy philosophy: measure everything. But sometimes I need a little inspiration to think of the specific metrics I should be measuring. To help get the creative juices flowing, I like to ask myself these kinds of questions:

  • What are my customers trying to achieve when they use my software?
  • What decisions do customers have to make when using my software (click this or click that)? What do they choose?
  • What are my important business objectives?
  • When my software fails, how will that manifest?
  • What are the basic operating needs of my software (CPU, memory, network access, disk usage, performance of downstream dependencies)? How can I measure that those needs are being met?
  • In the code my team wrote last week, what would my boss ask me about that would make me seem super informed?
  • What are the performance requirements of my software?

If you’re building a web application or web service, here are some more specific ideas to get you started:

  • P50, P90, P99 latency. This is the amount of time the server spends processing each HTTP request, between the time the request arrives at your code, and the time your code generates the response. Slice this per URL and in aggregate.
  • Latency breakdown: time spent in application code, time spent waiting for the database, cache, or downstream services. There are great tools that can help with this like New Relic (but last I checked, New Relic only used averages and not percentiles, ugh).
  • Number of requests that result in response code 200–299, 400–499, and 500–599. Set alarms on the last 2. A 400-series alarm may tell you that your clients can’t figure out how to use your API. A 500-series alarm tells you you’ve got serious issues.
  • Number of GET, PUT, POST, DELETE, etc. requests.
  • Total number of requests. Set an alarm that tells you if traffic surges unexpectedly, or if request counts go to zero!
  • Application-specific load times (P50, P90, P99). Twitter uses time to first tweet. We use the performance API to determine how long it took before the user could start interacting with our page. What is the best measure of your application’s startup time?
  • P50, P90, P99 request and response payload sizes.
  • P50, P90, and P99 gzip compression ratio. How much is gzip helping your clients? Are your payloads actually compressible?
  • Number of load balancer spillovers. How often are your inbound requests rejected because your web servers are at capacity?
  • Cache hit and miss counts.
  • P0, P50, and P100 for the sizes of objects stored in your cache.
  • The basic host metrics: disk utilization, disk I/O rate, memory utilization, and CPU utilization. One very useful metric lots of people don’t think of is load average, which tells you how well your hardware (including disk and network) is keeping up with the demands of your software.
  • Service input/output operations. If you’re on AWS, you may be using services like ELB or DynamoDB which throttle you if you exceed a certain I/O threshold. If this happens, your app can slow to a crawl and even become unavailable (this happened to my team a few years ago, and it was a pain to diagnose — we added lots of alarms after this event).
  • Unhealthy host count. This is a common metric reported by web load balancers. It tells you how many hosts your load balancer currently considers unhealthy.
  • Number of connections open to each web server, database host, queue server, and any other service you have.
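As one example from the list above, counting responses by status class takes only a few lines. The `status_class_counts` helper and the sample codes are hypothetical; in practice your web framework or load balancer probably emits these counts for you.

```python
from collections import Counter

def status_class_counts(status_codes):
    """Count responses per status class: '2xx', '4xx', '5xx', and so on."""
    return Counter(f"{code // 100}xx" for code in status_codes)

# Hypothetical status codes from one reporting interval.
codes = [200, 201, 404, 500, 200, 503, 400]
counts = status_class_counts(codes)
server_error_rate = counts["5xx"] / len(codes)

print(counts["2xx"], counts["4xx"], counts["5xx"])  # 3 2 2
```

The 4xx and 5xx counts (or rates like `server_error_rate`) are the numbers to put alarms on.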

Lesson 7: Outsource Your Metric Infrastructure

Don’t build your own metrics infrastructure. There are tons of good options out there. I’ve used statsd from Etsy, Graphite, Librato, Splunk and many others. Don’t reinvent the wheel!