Have You Been Using Histogram Metrics Correctly?

Abhishek Munagekar · Published in Making Mercari · Jun 1, 2022 · 10 min read

Metrics, tracing and logging are the three pillars of observability. At Mercari US we use Prometheus and Grafana for monitoring metrics. All of our microservices are instrumented with Prometheus clients, which expose various aggregate metrics such as the number of HTTP requests and request latency. These aggregate metrics are scraped by the Prometheus server and stored in its time series database, which Grafana then queries to visualize the metrics as graphs. The following diagram shows how we observe application metrics.

Note: Arrows represent requests made from one component to another.

The Prometheus client library runs alongside the main application process and acts as an HTTP server that is periodically scraped by the Prometheus server.
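
To make this concrete, here is a minimal, hypothetical sketch (assuming the Python client, prometheus_client) of how an application process exposes its metrics over HTTP so the Prometheus server can scrape them. The port and metric name are illustrative, not taken from our services.

import random
import time
from prometheus_client import Counter, start_http_server

# Illustrative metric: a counter of handled requests.
REQUESTS = Counter("http_requests_total", "Total number of HTTP requests received")

if __name__ == "__main__":
    start_http_server(8000)   # serves /metrics on port 8000 for the Prometheus server to scrape
    while True:               # stand-in for the main application doing work
        REQUESTS.inc()
        time.sleep(random.random())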

Monitoring Systems Background

This section briefly describes some of the background information relevant to this article including concepts related to monitoring systems, particularly Prometheus. If you are familiar with Prometheus or similar metric monitoring systems you can skip this section.

Types of Metrics

Prometheus has four core types of metrics:

  • Counter: Used for monotonically increasing counts, e.g. the number of HTTP requests received.
  • Gauge: Used for arbitrary values that can go both up and down, e.g. the number of concurrent requests.
  • Histogram: A histogram samples observations and counts them into pre-configured buckets, e.g. a histogram of request durations. A histogram is aggregatable and can be used in a distributed system.
  • Summary: A summary is similar to a histogram, but it calculates quantile values client-side for pre-configured quantiles. A summary is not aggregatable and cannot be used across a distributed system.

At Mercari US we run on Kubernetes, so our services form a distributed system and we do not use summaries. We mainly use histograms, counters and gauges.
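
The following sketch shows how these metric types are declared with the Python client; the metric names, bucket boundaries and handler are illustrative, not our production code.

from prometheus_client import Counter, Gauge, Histogram

HTTP_REQUESTS = Counter("http_requests_total", "Total number of HTTP requests received")
IN_FLIGHT = Gauge("http_requests_in_flight", "Number of requests currently being handled")
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0),   # pre-configured bucket upper bounds
)

def handle_request():
    HTTP_REQUESTS.inc()                    # counter: only ever goes up
    with IN_FLIGHT.track_inprogress():     # gauge: up on entry, down on exit
        with REQUEST_LATENCY.time():       # histogram: observation counted into a bucket
            ...                            # actual request handling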

Histogram Metrics

When the central Prometheus server scrapes or pulls a histogram metric from a microservice instrumented with the Prometheus client, it gets a response similar to the one shown below:

# TYPE fvec_latency histogram
fvec_latency_sum 134420.14452212452
fvec_latency_bucket{le="0.03"} 5849.0
fvec_latency_bucket{le="0.04"} 5850.0
fvec_latency_bucket{le="0.05"} 11326.0
fvec_latency_bucket{le="0.06"} 1.424461e+06
fvec_latency_bucket{le="0.07"} 2.242717e+06
fvec_latency_bucket{le="0.08"} 2.281547e+06
fvec_latency_bucket{le="0.09"} 2.283964e+06
fvec_latency_bucket{le="0.1"} 2.284831e+06
fvec_latency_bucket{le="0.125"} 2.285304e+06
fvec_latency_bucket{le="0.15"} 2.285367e+06
fvec_latency_bucket{le="0.2"} 2.285589e+06
fvec_latency_bucket{le="0.25"} 2.285592e+06
fvec_latency_bucket{le="1.0"} 2.285613e+06
fvec_latency_bucket{le="+Inf"} 2.285619e+06
fvec_latency_count 2.285619e+06
  • Here the metric is named fvec_latency; it measures the latency of calculating the feature vector, in seconds.
  • The response has various buckets: le (less than or equal to) 30ms, 40ms, 50ms, 60ms, 70ms, 80ms, 90ms, 100ms, 125ms, 150ms, 200ms, 250ms, 1000ms and infinity (catch-all).
  • Notice that the frequencies for each individual bucket as reported by the Prometheus client are cumulative. The count for the le="0.04" bucket also includes the count for the le="0.03" bucket, meaning there is only one observation with a request latency >0.03s and <=0.04s (5850 - 5849 = 1).
  • Each bucket counts observations that are less than or equal to its upper bound. In this example, an observation of 51ms would increment the cumulative frequency count for all buckets with an upper bound >= 51ms, i.e. 60ms, 70ms, ..., infinity, while the counts for the 30ms, 40ms and 50ms buckets would be unaffected. If we convert cumulative frequency counts to plain frequency counts, the 51ms observation lies in the 60ms bucket.
  • The response also has a count, i.e. the total number of observations made. Here 2,285,619 observations have been made since the last restart.
  • Finally, a total sum of observations, i.e. the sum of all observed latencies for calculating the feature vector, is also present.
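
Because the reported bucket counts are cumulative, converting them to per-bucket counts is just a matter of subtracting each bucket's predecessor. The following small sketch does this for the fvec_latency response above.

# Cumulative bucket counts copied from the scrape response above.
cumulative = [
    (0.03, 5849), (0.04, 5850), (0.05, 11326), (0.06, 1424461),
    (0.07, 2242717), (0.08, 2281547), (0.09, 2283964), (0.1, 2284831),
    (0.125, 2285304), (0.15, 2285367), (0.2, 2285589), (0.25, 2285592),
    (1.0, 2285613), (float("inf"), 2285619),
]

previous = 0
for upper_bound, count in cumulative:
    per_bucket = count - previous          # e.g. le=0.04 holds 5850 - 5849 = 1 observation
    print(f"le={upper_bound}: {per_bucket} observation(s)")
    previous = count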

Quantiles and Quantile Estimation with Linear Interpolation

In statistics, quantiles are cut points that divide the observations into equally sized portions. They have different names depending on the number of portions: with four portions each is called a quartile, and with one hundred portions, a percentile. If 25% of all observations lie at or below a value, that value is the 25th percentile, written p25.

Quantiles with histogram metrics are estimates. One common way to compute them is with linear interpolation. Prometheus provides a way to estimate quantiles for histogram metrics using the histogram_quantile function.

Consider the following hypothetical distribution of observations for 200 observations.

┌─────────────┬────────────────────────────┬────────────────────────┐
│ Bucket Size │ Cumulative Frequency Count │ Upper Bound Percentile │
├─────────────┼────────────────────────────┼────────────────────────┤
│ 50ms        │ 20                         │ p10                    │
│ 100ms       │ 70                         │ p35                    │
│ 250ms       │ 120                        │ p60                    │
│ 500ms       │ 150                        │ p75                    │
│ 1000ms      │ 200                        │ p100                   │
│ INF         │ 200                        │ p100                   │
└─────────────┴────────────────────────────┴────────────────────────┘

Along with the bucket size and cumulative frequency count, we add a calculated column, "upper bound percentile", to associate a percentile with each bucket. Since exactly 10% (20/200) of observations have a value less than or equal to 50ms, we can say that the upper bound (<=) for p10 is 50ms. Similarly, 35% (70/200) of observations are at or below 100ms, so the upper bound for p35 is 100ms.

How do we find p20 though? p20 is a value such that 20% of observations, i.e. 40 of them, lie at or below it. Looking at the cumulative frequency counts, if we sort the observations in ascending order:

  • The first 20 observations have latency <= 50ms.
  • The next 50 observations (21st to 70th) have 50ms < latency <= 100ms.
  • This means the 21st observation has 50ms < latency <= 100ms,
  • and so does the 40th observation.

So we have 50ms < p20 <= 100ms. While a range is useful, it doesn't indicate the most likely value. However, we do have some information about the distribution of observations: if we assume that observations are uniformly (linearly) distributed within a bucket, we can estimate a point value for p20.

With linear interpolation, we can assume that the 21st observation is 51ms, the 22nd is 52ms, and so on, giving a value of 70ms for the 40th observation. We can then estimate p20 = 70ms.
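
To make the arithmetic concrete, here is a small self-contained sketch of the same uniform-within-bucket estimation. It mirrors the idea behind Prometheus's histogram_quantile function but is a simplified illustration, not the actual implementation.

def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from (upper_bound_ms, cumulative_count)
    pairs, assuming observations are uniformly distributed within each bucket."""
    total = buckets[-1][1]
    rank = q * total                        # number of observations at or below the quantile
    prev_bound, prev_count = 0, 0
    for upper_bound, cumulative_count in buckets:
        if rank <= cumulative_count:        # the quantile falls in this bucket
            in_bucket = cumulative_count - prev_count
            return prev_bound + (upper_bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper_bound, cumulative_count
    return buckets[-1][0]

# The hypothetical distribution from the table above (bucket size in ms, cumulative count).
buckets = [(50, 20), (100, 70), (250, 120), (500, 150), (1000, 200)]
print(estimate_quantile(0.20, buckets))     # 70.0  -> p20 estimate of 70ms
print(estimate_quantile(0.10, buckets))     # 50.0  -> p10 upper bound of 50ms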

To get a better understanding of linear interpolation in quantile estimation, we can create a scatter plot of the interpolated latency value (y-axis) vs the i-th observation in ascending order (x-axis). Since the histogram only consists of frequency counts, there are no direct observations of latency values, and all latency values are interpolated.

The dark blue points represent upper bound percentiles associated with bucket sizes. The orange line shows other linearly interpolated values. To estimate p20, we find the estimated value at observation x=40, giving us the estimated value y=70ms.

In the following section we discuss some of the mistakes that we made while monitoring our applications with histograms.

#1: Correct Resolution and Bucket Configuration

A histogram is a summary of how many observations fall into various ranges. In Prometheus, every histogram metric has a cumulative count for each pre-configured bucket, along with an overall count and sum of all observations.

By default, for a histogram metric, the Python Prometheus client configures 15 buckets: le 5ms, 10ms, 25ms, 50ms, 75ms, 100ms, 250ms, 500ms, 750ms, 1000ms, 2500ms, 5000ms, 7500ms, 10000ms and INF. Note that the default buckets may differ depending on the client implementation: for example, the Golang Prometheus client configures these default buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1000ms, 2500ms, 5000ms, 10000ms.

The default buckets are designed to support a wide range of applications. However, they may not be suitable for monitoring your application. For example, Mercari is an e-commerce marketplace, and our SLO (Service Level Objective) for a number of microservices is around the 500ms range. The smaller buckets (5ms, 10ms, 25ms, 50ms) and the bigger buckets (2500ms, 5000ms, 7500ms, 10,000ms) are not useful to us.

Resolution is the ability of a measurement system to detect and faithfully indicate small changes in the quantity being measured. With the default buckets, the resolution is finer than we need at the small end and too coarse at the large end for most of our use cases.

Say all the observed latencies lie between 260ms and 280ms. The default buckets would be a bad choice. Since Prometheus determines quantile values for histogram metrics with linear interpolation, it will report p50=375ms, p90=475ms and p95=487.5ms, values that are significantly off from the true p50 and p90. This is known as quantile error with histograms in observability systems, and it can be as large as the difference between consecutive bucket sizes!
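
These reported values can be reproduced with the same uniform-within-bucket assumption: every observation lands in the le=500ms bucket (all latencies are above 250ms), so the estimate only depends on the two neighbouring bucket boundaries. A quick sketch:

# All observations fall between the 250ms and 500ms bucket boundaries, so the
# cumulative count is 0 at 250ms and N at 500ms. Interpolating linearly:
lower, upper = 250.0, 500.0                 # bucket boundaries in ms
for q in (0.50, 0.90, 0.95):
    reported = lower + (upper - lower) * q  # rank q*N out of the N observations in this bucket
    print(f"p{int(q * 100)} reported as {reported}ms")   # 375.0, 475.0, 487.5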

#2: Using Histograms to Measure Performance

Histograms are only as accurate as the underlying bucket resolution, which makes the histogram metric a poor tool for measuring performance changes precisely.

For example, let's say we want to measure the latency improvement from removing Istio. Each Istio proxy adds about 2ms to the p90 latency of a microservice. Unless the change moves some requests into a different bucket, the improvement will not show up in Prometheus.

On the other hand, histograms can also show massive performance improvements when the real improvements are minuscule. Consider the following hypothetical example:

┌──────────────┬─────────────────────────┬────────────────────────┐
│ Buckets (ms) │ Before (cum. frequency) │ After (cum. frequency) │
├──────────────┼─────────────────────────┼────────────────────────┤
│ le100        │ 0                       │ 0                      │
│ le250        │ 88                      │ 90                     │
│ le500        │ 100                     │ 100                    │
└──────────────┴─────────────────────────┴────────────────────────┘

The true improvement could be as small as 2ms, just enough to shift the latency distribution and move the bucket that p90 falls into. In this case the reported p50 barely changes, 185ms before optimization and 183ms after, but the reported p90 shows a massive drop from 291ms to 250ms.
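
The numbers in this example follow directly from linear interpolation over the cumulative counts in the table; the short sketch below reproduces them.

def interpolate(rank, lower, upper, cum_lower, cum_upper):
    """Linear interpolation inside the bucket (lower, upper] that contains the rank."""
    return lower + (upper - lower) * (rank - cum_lower) / (cum_upper - cum_lower)

# Before: cumulative counts le100=0, le250=88, le500=100 (100 observations in total).
print(interpolate(50, 100, 250, 0, 88))     # p50 ~ 185ms
print(interpolate(90, 250, 500, 88, 100))   # p90 ~ 291ms

# After: two requests move below 250ms, so le250=90.
print(interpolate(50, 100, 250, 0, 90))     # p50 ~ 183ms
print(interpolate(90, 100, 250, 0, 90))     # p90 = 250ms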

It is always important to keep in mind that Prometheus performs linear interpolation to determine quantiles. So if you have buckets at (less than or equal to) 250ms and 500ms and the p90 reported by Prometheus is 291ms, the true p90 could lie anywhere in the range 250ms < p90 <= 500ms.

Prometheus histograms are hence not an effective tool for precisely measuring latency, especially around bucket boundaries; they are better suited to observing latency across a wider range. Performance improvements can instead be measured with load testing tools like Locust.

One might argue that we can use a finer resolution by increasing the number of buckets and then use Prometheus for measuring latency. However, increasing the number of buckets increases the amount of data that has to be scraped and stored in the time series database. This slows down Prometheus and causes scaling issues, as it also needs to handle the computation and retention of metrics for several other microservices.

Following is a screenshot from one of our production services before and after optimization. We use the default buckets here; buckets of concern are 100ms, 250ms, 500ms and 750ms.

Despite an apparently significant improvement (150ms) in the p50 latencies (cyan), there isn't a correspondingly significant improvement (only 50ms) in the p90 latencies (yellow) due to quantile errors.

Post optimization, our latencies of interest, p50 and p90, both lie in the range between roughly 250ms and 500ms, close to bucket boundaries, so we are unable to make a good estimate of the true optimization due to the lack of resolution. The only correct conclusions that can be drawn in this case are as follows:

┌─────┬───────────────────────┬───────────────────────┐
│     │ Before                │ After                 │
├─────┼───────────────────────┼───────────────────────┤
│ p50 │ Between 250 & 500ms   │ Around 250ms          │
│ p90 │ Between 500ms & 750ms │ Between 250ms & 500ms │
└─────┴───────────────────────┴───────────────────────┘

Prometheus in this case can be used to confirm that the service was optimized and to give a qualitative answer, i.e., whether the optimization worked or not. However, only a load test in the development environment with a load testing tool can give a quantitative estimate, i.e., the exact latency decrease in milliseconds.

#3: Buckets and SLO

Say you have an SLO of p50=250ms and use buckets at 125ms, 250ms and 500ms. If the true p50 latency of your service is around 240ms, with most requests taking between 230ms and 245ms, your monitoring system will report p50=187ms. This gives you false assurance that your service is well within its SLO, while in reality it is almost at the margin.

Now suppose a developer adds another operation that increases the latency by about 30ms. All of a sudden the dashboard starts reporting a completely different p50 latency: although requests now take between 260ms and 275ms, the reported latency would be p50=375ms. The true change in p50 latency is around 30ms, yet on the dashboard it appears as a jump of roughly 187ms.

Determining quantiles using linear interpolation with a bad bucket configuration can also make it seem like you have breached your SLO even when you haven't. Continuing with the previous case, assume that the SLO is p90=375ms. None of the requests take more than 275ms, yet if a quantile were calculated, p90 would be reported as 475ms.

One solution is to always have a bucket boundary at your SLO latency. This way the dashboard will never report a breach of the SLO when it isn't actually breached. Furthermore, it is important to have several buckets above and below the SLO value so you can get a good idea of how well your service is doing with respect to its SLO.
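
As a sketch with the Python client, a histogram configured for a hypothetical 250ms SLO might look like the following; the exact bucket values are illustrative, not our production configuration.

from prometheus_client import Histogram

SLO_SECONDS = 0.250

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    buckets=(
        0.100, 0.150, 0.200, 0.225,
        SLO_SECONDS,                 # bucket boundary exactly at the SLO
        0.275, 0.300, 0.400, 0.500,
    ),
)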

#4: Histograms and Observing Zero

While optimizing our ML Listing Service we wondered whether one of our SQL queries would benefit from caching. To find out, we added a histogram metric around the SQL query in Django.

from prometheus_client import Histogram

SQL_LATENCY = Histogram('items_latency', 'Latency of the items SQL query')  # help text required by the client
with SQL_LATENCY.time():
    items = Items.objects.using(self.data_source_name).filter(id__in=ids)

To our surprise we found p50=2.5ms and p90=4.5ms, which led us to believe that there was no benefit to caching. In reality we were effectively observing latencies of a few microseconds, and after linear interpolation the estimated value came out as p50=2.5ms.

So if your smallest bucket is le 5ms, even an operation taking ~0ms will be reported with p50=2.5ms, making the operation appear far more expensive than it really is.

In our case, though, it masked the near-zero latency and made it harder for us to realize that Django lazily evaluates SQL queries, i.e. queries are not executed until the results are actually required. We had to update the code as follows to measure the true latency:

from prometheus_client import Histogram

SQL_LATENCY = Histogram('items_latency', 'Latency of the items SQL query')  # help text required by the client
with SQL_LATENCY.time():
    # list() forces the lazy QuerySet to execute inside the timed block
    items = list(
        Items.objects.using(self.data_source_name).filter(id__in=ids)
    )

Conclusion

Histograms are a very useful metric type for observing applications. They allow us to compress a large number of observations into a meaningful and easy-to-interpret metric.

However, it is important to keep their limitations in mind. Due to errors in quantile estimation, histograms are better suited to observing an application than to measuring quantitative improvements when optimizing it. Buckets must also be chosen carefully for the range of values you expect to observe, especially when monitoring SLOs. When interpreting a histogram metric, remember the resolution of the buckets and the uncertainty in the reported quantile values introduced by linear interpolation.
