Lowering Microservice Latency: A tale of misinterpreted percentiles

Dhananjay Sathe
Nov 9 · 6 min read
Photo by Lucas Santos on Unsplash

Introduction

Thumbtack is a technology leader building the modern home management platform. Millions of customers visit Thumbtack to find professionals to fix, improve and maintain their homes. Our search ranking algorithms rank the professionals in categories like “Emergency Plumbing” based on factors that improve the overall customer experience. Other than a well-ordered list of professionals, customers also care about how quickly that list loads. We strive to provide the best customer experience by continuously working on improving the latency of our search.

A search request is generated when a customer interacts with our search bar and submits a search. The fulfillment of this search request goes through multiple steps. Service enrichment is one of the highest latency steps involved. In this step, we call multiple microservices concurrently to collect data about professionals, this data is used to rank and filter professionals in further steps. The components that communicate with these microservices and get the data are called enrichers. In this post, we will share our approach to methodically improve service enrichment latency. We will talk about how we discovered the right set of candidate enrichers to work on, by logging some smart metrics. In the process of this discovery, we will share an interesting example of how percentile latencies can get misinterpreted.

A percentile is a value, below which a given percentage of measurements in a set falls. For example, a 95th percentile value (P95) of 150 milliseconds (ms) means that 95% of all values are 150 ms or less. Similarly, the 50th percentile value (P50/Median) of 45 ms means that 50% of all values are 45 ms or less.
It is a leading industry practice to use percentile latencies instead of average latencies. Average latencies tend to get skewed by outliers when the dataset is small. And when the dataset is large they hide important signals like “5% of users are experiencing slow responses”. On the other hand, tail latencies i.e 95th or 99th percentile latencies tell us the best response time for 5% or 1% of our slowest requests, thus giving us meaningful targets to work towards.

Approach

As the enrichment process was the slowest component in our flow, we decided to improve it. As we were running enrichers in parallel we had the notion that unless we fix the slowest enricher we would not make significant latency gains. So the hunt for the slowest enricher began.

We generally look at percentile latency numbers to determine the performance of any component. The graph below shows that the total P95 latency for the overall enrichment phase is ~96.8ms and the P95 latency for the slowest enricher is ~53ms.

Theoretically, as we run enrichers in parallel we were expecting the total latency to be closer to the slowest enricher latency.

We were well aware of the fact that max(P95(of each enricher latency across searches)) is not always equal to P95(max(of enricher latencies within a search)). However, we did not expect to see a 44ms (> 40% of total) difference between max(P95) and p95(max) in our latency metrics. This made us think that we might be missing some component in our tracking and something was unaccounted for.

To investigate this latency gap, and to make sure that the untracked portion of code is not the reason for this gap, we added latency measurements for the code before running enrichers in parallel (Pre Processing) and for the code, after completion of all enrichers (Post Processing), as that was the only individually untracked code in the enrichment stage.

This investigation showed that the P95 latencies for pre-processing and post-processing add up to just 7ms and we still have unaccounted 37ms. To investigate this further, we explicitly measured the latency of the slowest enricher by taking the maximum across all enrichers in each request. This led us to determine whether there was any unaccounted latency between the slowest enricher and the total enrichment latency for a given search.

Plotting this showed us that the total latency is always governed by the slowest enricher as expected theoretically and the gap we saw was not due to untracked code but due to the way we looked at percentile latencies.

In this real-world example, the exercise of adding additional measurements on the slowest enricher (i.e max) helped us prove that we are seeing a case of

In addition, this also showed that these numbers are not only different but could be far apart from each other. Assuming these numbers are interchangeable would be a mistake.

Finding Optimization Candidates

Now coming back to our initial hunt for enrichers for performance improvements. The graph below shows that Enricher A is the enricher with the highest individual Percentile latency numbers.

Should we necessarily only look for the enricher which has the worst individual latency numbers?

To answer this, using the measurements that we started logging in for comparing max(P95) and P95(max), we plotted the statistics on how often an enricher becomes the candidate that slows down the enrichment process.

The plot below shows the number of times a particular enricher was the slowest for each request out of slow requests where the total enrichment latency was greater than our P95 aggregate enrichment latency.

This showed us that, even though enricher A has the highest individual P95 percentile number it is not the only enricher contributing to or responsible for the highest percentile latencies for the overall enrichment process. This was an eye-opener. Until now, it seemed like the enricher A was the bottleneck for our overall P95 numbers but in fact, it is not exactly the only enricher we need to optimize. Multiple enrichers contribute to the overall P95 number, and the contributions are higher for Enricher G and Enricher D.

Conclusion

Before this, Enricher G and Enricher D seemed to have lower latency numbers (~28.9ms and ~18.2ms P95) and thus were never considered as optimization candidates. Further investigation showed that these enrichers have very high tail latencies (P99.x) and hence were responsible for slowing down a fraction of the overall slowest 5% requests individually.

This exercise showed us that to improve overall P95 we will need to fix extreme tail latencies of multiple individual enrichers instead of focusing on improving the P95 of a single enricher.

Future Work

The Tail at Scale, a paper published by Google employees in 2013 talks about 7 techniques to deal with tail latency: Hedged requests, Tied requests, Micro partitions, Selective replication, Latency-induced probation, Good enough responses, and Canary requests. We are currently trying out the hedged request technique, and it has helped us improve tail latencies in some of our enrichers. We plan to further explore other techniques to improve latencies across different components at Thumbtack. If the problem of optimizing complex distributed systems seems interesting, come join us! We would love to have you aboard.

Acknowledgment

This work would not have been possible without help from Oleksandr Pryimak, Mark Andrew Yao, as well as others in the Marketplace org at Thumbtack. Special thanks to Navneet Rao, Richard Demsyn-Jones, Bharadwaj Ramachandran, and Karen Lo for feedback on this post.

About Thumbtack

Thumbtack (www.thumbtack.com) is a local services marketplace where customers find and hire skilled professionals. Our app intelligently matches customers to electricians, landscapers, photographers, and more with the right expertise, availability, and pricing. Headquartered in San Francisco, Thumbtack has raised more than $400 million from Baillie Gifford, Capital G, Javelin Venture Partners, Sequoia Capital, and Tiger Global Management among others.

Thumbtack Engineering

From the Engineering team at Thumbtack