This is a detailed topic worthy of its own independent post. I have many things to say about it.
Yes, alerting on specific percentiles going over thresholds (such as when latency exceeds 100ms at the 95th percentile) is a good way to detect real incidents.
You ask what the shortcomings are: the biggest one I have seen is that you often end up monitoring the wrong thing if you don’t have a good SLI in the first place. To give an example:
“This system provides an SLO of 10ms at the 50th percentile, and 50ms at the 95th percentile.”
But suppose your system is precariously balanced between three types of queries: 70% are on a fast path because the data is always cached (sub-10ms); 28% are on a slow path where a consistent read between datacenters is required but no data processing (sub-50ms); and 2% are totally cold, requiring the data to be recalculated and pushed back into the storage system, which takes 1–2 seconds.
So what you’ve actually said is: “Alert when less than 50% of requests are cached, or when more than 5% require reprocessing”, but in an extremely obtuse way.
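To make that concrete, here is a small sketch (the latency ranges and query mix are the hypothetical ones from the example above) showing how the 50th percentile really just tracks the cache hit rate, and blows past the 10ms SLO the moment fewer than half of requests are cached:

```python
import random

# Hypothetical query mix from the example above (latencies in ms).
def sample_latency(cached_frac=0.70, cold_frac=0.02):
    r = random.random()
    if r < cached_frac:
        return random.uniform(1, 9)       # fast path: cached, sub-10ms
    if r < 1 - cold_frac:
        return random.uniform(15, 45)     # slow path: consistent read, sub-50ms
    return random.uniform(1000, 2000)     # cold: full recalculation, 1-2s

def percentile(samples, p):
    ordered = sorted(samples)
    return ordered[int(p / 100 * (len(ordered) - 1))]

random.seed(0)
healthy = [sample_latency() for _ in range(10_000)]
print(percentile(healthy, 50))  # lands in the fast path: under 10ms
print(percentile(healthy, 95))  # lands in the slow path: under 50ms

# Drop the cache hit rate below 50%: the 50th percentile is now a
# slow-path latency, well over the 10ms SLO.
degraded = [sample_latency(cached_frac=0.45) for _ in range(10_000)]
print(percentile(degraded, 50))
```

The same exercise with `cold_frac` pushed above 5% shows the 95th percentile jumping from tens of milliseconds to over a second, which is exactly the “more than 5% require reprocessing” alert in disguise.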
As for the maths on how to calculate percentiles for monitoring: it’s actually quite easy. In the trivial case where you just want the mean of a value like latency, you record the total count of requests and the total time elapsed, and divide the latter by the former.
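As a minimal sketch, that’s just two monotonic counters (the variable names here are illustrative, not from any particular monitoring library):

```python
# Two counters, incremented on every request.
request_count = 0
total_latency_ms = 0.0

def record(latency_ms):
    global request_count, total_latency_ms
    request_count += 1
    total_latency_ms += latency_ms

for ms in (4.0, 8.0, 30.0, 1500.0):
    record(ms)

mean_ms = total_latency_ms / request_count
print(mean_ms)  # (4 + 8 + 30 + 1500) / 4 = 385.5
```

Note how one cold 1.5s request drags the mean far away from what a typical request experiences, which is why percentiles matter in the first place.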
To get percentile figures, you pivot on this technique: you count the total number of requests that took place, and also how many requests took between 0ms and 1ms, between 1ms and 2ms, 2ms and 3ms, etc. This data can then trivially be used to get an approximation of any percentile, or give you a histogram of latencies, or, if plotting a timeseries, a heatmap.
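A minimal sketch of those bucketed counters, assuming 1ms-wide buckets up to 100ms plus a catch-all for anything slower (real systems usually use exponentially sized buckets instead, so the choice here is purely illustrative):

```python
import bisect

# Upper bounds of the buckets: 1ms, 2ms, ..., 100ms, plus an implicit
# final catch-all bucket for anything over 100ms.
BOUNDS = [float(ms) for ms in range(1, 101)]

class LatencyHistogram:
    def __init__(self):
        self.counts = [0] * (len(BOUNDS) + 1)  # one counter per bucket
        self.total = 0

    def record(self, latency_ms):
        # bisect_left finds the first bucket whose upper bound covers
        # this latency; out-of-range values land in the catch-all.
        self.counts[bisect.bisect_left(BOUNDS, latency_ms)] += 1
        self.total += 1

    def percentile(self, p):
        """Approximate the p-th percentile as the upper bound of the
        bucket containing that rank."""
        rank = p / 100 * self.total
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= rank:
                return BOUNDS[i] if i < len(BOUNDS) else float("inf")
        return float("inf")

h = LatencyHistogram()
# 70 fast, 28 slow, 2 cold requests, echoing the query mix above.
for ms in [0.5] * 70 + [30.5] * 28 + [1500.0] * 2:
    h.record(ms)
print(h.percentile(50))  # 1.0  (rank 50 falls in the 0-1ms bucket)
print(h.percentile(95))  # 31.0 (rank 95 falls in the 30-31ms bucket)
```

The answer is only accurate to the width of the bucket, but for alerting that is almost always fine, and the counters are cheap to export and aggregate across many servers.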
I have not started my commentary on chapter 5: Monitoring Distributed Systems, but it’s going to be interesting when I get there: https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html