How to metric

Lesson 1: Use Percentiles

Alternate title: avoid averages like the plague.

  1. Percentiles aren’t skewed by outliers like averages are.
  2. Every percentile data point is an actual user experience, unlike averages.
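Here's a minimal sketch of both points, using made-up latency numbers: one pathological request drags the average far above anything a real user experienced, while the percentiles stay pinned to actual requests.

```python
def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p in 0..100)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(round(p / 100 * (len(ordered) - 1))))
    return ordered[index]

# 99 requests took 100 ms; one pathological request took 30 seconds.
latencies_ms = [100] * 99 + [30_000]

average = sum(latencies_ms) / len(latencies_ms)  # 399 ms -- no user saw this
p50 = percentile(latencies_ms, 50)               # 100 ms -- a real request
p99 = percentile(latencies_ms, 99)               # 100 ms -- also real
p100 = percentile(latencies_ms, 100)             # 30000 ms -- the outlier, visible
```

The average (399 ms) describes a request that never happened; every percentile value is a latency some actual user endured.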

Lesson 2: Use Alarms

This one is obvious. A metric without an alarm is a manual chore you’ve assigned yourself, dooming you to one of two fates:

  • Fate 1: You’ll never look at it
  • Fate 2: You’ll force yourself to review it on a schedule (boring)

Lesson 3: Adaptive Thresholds

Most metrics have predictable cycles. Perhaps your traffic grows during daytime hours and shrinks during nighttime hours. A flat threshold won’t notify you if there is an unusual traffic spike at night.
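One simple way to adapt: keep a per-hour-of-day baseline and alarm on deviation from it, instead of on one flat number. This is a hedged sketch with invented traffic counts; real systems (CloudWatch anomaly detection, etc.) do something fancier.

```python
from statistics import mean

def build_baseline(history):
    """history: list of (hour_of_day, request_count) samples."""
    by_hour = {}
    for hour, count in history:
        by_hour.setdefault(hour, []).append(count)
    return {hour: mean(counts) for hour, counts in by_hour.items()}

def is_anomalous(baseline, hour, count, multiplier=3.0):
    """Fire when traffic exceeds this hour's historical mean by the multiplier."""
    return count > baseline[hour] * multiplier

history = [(14, 1000), (14, 1100), (3, 50), (3, 60)]  # daytime vs. nighttime traffic
baseline = build_baseline(history)

# A flat threshold of 2000 would sleep through a 400-request spike at 3 a.m.
flat_threshold_fires = 400 > 2000                 # False
adaptive_fires = is_anomalous(baseline, 3, 400)   # True: 400 >> 55 * 3
```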

Lesson 4: Missing Metrics

It’s natural to think about alarm thresholds, but what if a metric stops reporting? The absence of a metric can be just as bad as a metric that breaches a threshold. For every alarm you create, you should have a corresponding alarm that fires if the metric stops reporting. Like regular alarms, these should have a grace period, so a metric that recovers on its own doesn’t wake anyone up (or perhaps make these alarms notify you only during daytime hours).
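The missing-metric check with a grace period can be sketched in a few lines; the timestamps are in seconds and the 5-minute grace window is an assumed value, not a recommendation.

```python
def missing_metric_alarm(last_seen, now, grace_seconds=300):
    """Fire only when the metric has been silent longer than the grace window."""
    if last_seen is None:
        return True  # the metric never reported at all
    return (now - last_seen) > grace_seconds

missing_metric_alarm(last_seen=1000, now=1200)  # False: only 200 s of silence
missing_metric_alarm(last_seen=1000, now=1400)  # True: silent past the grace period
```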

Lesson 5: Regular Automated Reviews

Things change. Make sure your alarms change with them. For example, let’s say you recently shipped some new code which drastically improves the runtime speed of your application. You have some alarms that will tell you if performance gets worse, but their thresholds are based on the old (slow) performance. These need to be updated now, but how would you remember that? The answer is that you should build mechanisms to take care of this for you.
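One such mechanism, sketched with made-up numbers: a scheduled job that recomputes each threshold from a recent window of data (say, P99 plus headroom), so thresholds tighten on their own after a speedup instead of waiting for you to remember.

```python
def recompute_threshold(recent_values, headroom=1.5):
    """New alarm threshold = recent nearest-rank P99 * headroom factor."""
    ordered = sorted(recent_values)
    p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
    return p99 * headroom

old_latencies_ms = [800, 850, 900, 950]  # before shipping the faster code
new_latencies_ms = [80, 85, 90, 95]      # after

recompute_threshold(old_latencies_ms)  # 1425.0 ms
recompute_threshold(new_latencies_ms)  # 142.5 ms -- the threshold followed the metric
```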

Lesson 6: What to Measure

I like the Etsy philosophy: measure everything. But sometimes I need a little inspiration to think of the specific metrics I should be measuring. To help get the creative juices flowing, I like to ask myself these kinds of questions:

  • What are my customers trying to achieve when they use my software?
  • What decisions do customers have to make when using my software (click this or click that)? What do they choose?
  • What are my important business objectives?
  • When my software fails, how will that manifest?
  • What are the basic operating needs of my software (CPU, memory, network access, disk usage, performance of downstream dependencies)? How can I measure that those needs are being met?
  • In the code my team wrote last week, what would my boss ask me about?
  • What are the performance requirements of my software?
  • P50, P90, P99 latency. This is the amount of time the server spends processing each HTTP request, between the time the request arrives at your code, and the time your code generates the response. Slice this per URL pattern.
  • Latency breakdown: time spent in application code, time spent waiting for the database, cache, or downstream services. There are tools that do this like New Relic (but last I checked, New Relic only used averages and not percentiles, ugh).
  • Number of requests that result in response code 2xx, 3xx, 4xx, and 5xx. Set alarms on the last 2. A 4xx alarm may tell you that your clients can’t figure out how to use your API. A 5xx alarm tells you you’ve got serious issues. Use both absolute counts and rates. Example: alarm if the 5xx rate exceeds 0.025% of all requests.
  • Number of GET, PUT, POST, DELETE, etc. requests.
  • Total number of requests. Set an alarm that tells you if traffic surges unexpectedly, or if request counts go to zero!
  • Application-specific load times (P50, P90, P99). Twitter uses time to first tweet. We use the browser’s Performance API to determine how long it took before the user could start interacting with our page. What is the best measure of your application’s startup time?
  • P50, P90, P99 request and response payload sizes.
  • P50, P90, and P99 gzip compression ratio. How much is gzip helping your clients? Are your payloads actually compressible?
  • Number of load balancer spillovers. How often are your inbound requests rejected because your web servers are at capacity?
  • Cache hit and miss counts.
  • P0, P50, and P100 for the sizes of objects stored in your cache.
  • The basic host metrics: disk utilization, disk I/O rate, memory utilization, and CPU utilization. One very useful metric lots of people overlook is load average, which tells you how well your hardware (including disk and network) is keeping up with the demands of your software.
  • Service input/output operations. If you’re on AWS, you may be using services like ELB or DynamoDB which throttle you if you exceed a certain I/O threshold. If this happens, your app can slow to a crawl and even become unavailable (this happened to my team a few years ago, and it was a pain to diagnose — we added lots of alarms after this event).
  • Unhealthy host count. This is a common metric reported by web load balancers. It tells you how many hosts your load balancer currently considers unhealthy.
  • Number of connections open to each web server, database host, queue server, and any other service you have.
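To make one of these concrete, here is a sketch of the 5xx-rate alarm described above: classify responses by status-code class, then fire when server errors exceed 0.025% of all requests. The request counts are invented for illustration.

```python
from collections import Counter

def status_class(code):
    """Map an HTTP status code to its class, e.g. 404 -> '4xx'."""
    return f"{code // 100}xx"

# 100,000 requests: 99,900 OK, 60 client errors, 40 server errors.
codes = [200] * 99_900 + [404] * 60 + [500] * 40
counts = Counter(status_class(c) for c in codes)

total = sum(counts.values())
rate_5xx = counts["5xx"] / total   # 0.0004, i.e. 0.04% of requests
alarm_fires = rate_5xx > 0.00025   # True: 0.04% exceeds the 0.025% budget
```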

Lesson 7: Outsource Your Metric Infrastructure

Don’t build your own metrics infrastructure. There are tons of good options out there. I’ve used statsd from Etsy, Graphite, Librato, Splunk and many others. Don’t reinvent the wheel!
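To show how little client code these tools demand, here's a sketch of emitting a timing metric in statsd's plain-text wire format ("name:value|type" over UDP). The host, port, and metric name are assumptions; in practice you'd reach for an existing statsd client library rather than hand-rolling even this much.

```python
import socket

def emit_timing(name, millis, host="localhost", port=8125):
    """Send one statsd timing datagram, e.g. b'page.load:320|ms'."""
    payload = f"{name}:{millis}|ms".encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))  # fire-and-forget UDP; no reply expected
    sock.close()
    return payload  # returned only so the sketch is easy to inspect

emit_timing("http.request.latency", 42)  # sends b"http.request.latency:42|ms"
```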



Dave Smith

VP of Engineering. Formerly software engineer on Amazon Alexa, co-host of the @SoftSkillsEng podcast. Posts are my own, but you can read them if you like.