Upgrading Pinterest operational metrics

Published in

Pinterest Engineering Blog

9 min readAug 15, 2019

Jonathan Fung | Software Engineer Intern, Visibility
Colin Probasco | Software Engineer, Visibility

Recently, Pinterest hit a major milestone: 300 million monthly active users around the world. To reliably serve this user base with our large content corpus, Pinterest’s engineering team maintains thousands of services all working together in tight cohesion. Each service emits metrics that are crucial for health monitoring and alerting systems. The Visibility team is responsible for maintaining the systems that support metrics collection, reporting, processing, and display.

Why Upgrade?

Up until now, Pinterest has used Twitter’s Ostrich library to collect metrics from Java services. Ostrich has been a source of tech debt for us due to its recent deprecation, forcing us to use old versions of Finagle. Most importantly, it hinders us from achieving accurate service-level aggregation of metrics because Ostrich only exposes summaries of histogram distributions. Consider the following example of a three-machine fleet. Each host reports the 90th percentile of their response time latencies (colored pink — 16, 6, and 8). It’s impossible to recover the true 90th percentile of the distribution (12) from the three sampled data points.

Three distributions from three hosts, each reporting one p90

The true p90 across all three hosts cannot be recovered from the sampled p90s

This poses multiple issues:

Operational metrics are tagged by-node, so the number of metrics to store grows linearly with Pinterest’s metrics use cases. This has led to a large amount of year-over-year growth in metrics storage requirements.
We currently average percentiles metrics across distributions to estimate the true percentile. This doesn’t necessarily produce an accurate representation of the true percentile metric, and critical alerts may depend on the metrics generated from these services. Incorrect percentile metrics may lead to false positives or, even worse, false negatives.

The solution to this is twofold:

We must deprecate the current way of collecting metrics and create an in-house metrics collector and reporter where we can control the internal implementation. Histogram metrics will be backed by the T-Digest: a histogram data structure that is compact, serializable, accurate and, most importantly, mergeable.
A language-agnostic metrics aggregation pipeline must be created to provide service-level aggregation support. In case by-node metrics are desired, there must also be an opt-in option to preserve node-level metrics for a short period of time.

These two projects will enable much more control over Pinterest’s metrics system. Metrics storage requirements will be massively reduced due to the elimination of unnecessary by-node metrics, and query performance will be improved due to less data being fetched in queries. Finally, accurate service-level percentile metrics will be available throughout Pinterest.

Pinterest StatsCollector

To deprecate Ostrich, we built the Pinterest StatsCollector, a Java metrics collector and reporter library. The StatsCollector needs to be able to thread-safely collect the three types of metrics (Counters, Gauges, and Histograms) and push them through the aggregation pipeline every minute to persist them in a time series database (TSDB). During development, we optimized for CPU performance and API compatibility with the legacy system in order to ease migration steps.

At a high level, the library is composed of the following:

StatsCollector: An interface to create counters, gauges, and histograms.
Stats: A thin static singleton class wrapper around a single StatsCollector. One per JVM. User makes calls to this object.
Metrics: Thread-safe implementations of Counters, Histograms, and Gauges
LogPusher: We use a push-based model for our metrics. This class pushes metrics from a StatsCollector to metrics-agent, a sidecar running on every Pinterest machine that is responsible for post-processing metrics and forwarding them to the appropriate destination.
ThreadLocalStats: A convenience interface for thread-local, batched metrics. Provides performance benefits.

API decisions

When designing the StatsCollector and Stats interface, we extensively profiled our Java codebase to find existing metrics code patterns. A few critical use cases and performance bottlenecks were found, and we designed our library to alleviate those issues. Here, we detail a few application-level and user-level optimizations.

Optimization 1: Caching synchronized hashmap lookups

First, we modified the process by which metrics were reported to a StatsCollector. To illustrate, consider incrementing a counter. The previous method involved calling incr directly on the Stats singleton. Thus, there is exactly one synchronized hashmap lookup during each incr call to access the metric name to Counter mapping, which leads to performance problems when executed for high queries-per-second (QPS) functions. Since there is exactly one mapping from metric name to Counter for each JVM, all the synchronized lookups are performed on one hashmap, and lock thrashing occurs because locking requires an expensive process-level switch from user mode to kernel mode.

To optimize, we moved towards a new API which effectively “caches” the synchronized lookup. The user now maintains one reference to the Counter object that the Stats singleton dispenses, and all metrics operations are performed on that object.

Optimization 2: Thread-Local Stats

However, our first optimization breaks down when it comes to the dynamic naming of metrics. Consider a use case where the user must increment dynamically named counters in a batched operation, such as a loop which processes many events. The previously mentioned method would not result in a performance benefit since a synchronized hashmap lookup is required each function call.

To make this case performant, we can remove all synchronization from the hashmap lookup and the metrics increment. We built Thread-Local Stats, a thread-local version of StatsCollector and its internal metrics, which does not possess any locking behavior. The Thread-Local Stats essentially acts as a bucket that stores metrics in a batch until the user flushes the Thread-Local Stats back into a StatsCollector. Since the Thread-Local Stats is thread-local, there is no synchronization during any metrics operations, optimizing out the resources necessary in the switch from user mode to kernel mode.

Optimization 3: Gauge API design

As opposed to a counter, which is used to measure the occurrence rate of an event, a gauge can be thought of as a function that monitors a certain value. The classic example of a gauge is a function that monitors the size of a list. After a gauge is initialized, its value will be polled and reported each minute when LogPusher pushes metrics.

The legacy way to initialize a Gauge is to inline an anonymous class of type Java Supplier, or a Scala Function0. These can be thought of as parameterless lambdas outputting one double value. This is cumbersome for the user and not performant, increasing the memory footprint of the program being monitored. A reference to the object being monitored must be maintained inside the anonymous inline class, which prevents the JVM from ever marking the object as eligible for garbage collection since an object is only garbage collected when all references to the object are lost. This is not desired behavior, as monitoring systems should never influence performance of the client.

To optimize, we borrowed inspiration from the Micrometer library and introduced the Java WeakReference object to our Gauges. Weak references to an object don’t protect from garbage collection. It is the responsibility of the library user to maintain a strong reference to the object being monitored by a gauge. When the user is done monitoring and drops the strong reference, the JVM garbage collector is allowed to collect the object. With this improvement, gauges in our monitoring system are completely transparent to the user and don’t take up additional memory.

Metrics Aggregation by Service (MABS)

With metrics flowing through our in-house Pinterest StatsCollector, we can now exercise more control over how metrics are processed. This allows us to develop the language-agnostic MABS pipeline to provide a way for services to aggregate metrics. The completed pipeline will allow for all nodes in a service to report their metrics through MABS, which will aggregate the by-node metrics and produce one data point per metric. We also allow users an option to store by-node metrics.

At a high level, MABS is composed of:

Metrics-Agent: The Pinterest sidecar must accept a new command statement and perform necessary processing before forwarding to the appropriate Kafka topic.
Kafka topics: Kafka is used as the streaming buffer between the components of MABS.
Spark aggregator: A constantly running job that aggregates metrics.
Ingestors: Services that push data to storage systems.
Time series database: The Storage & Caching team at Pinterest maintains Goku, our in-house TSDB. Goku supports multiple storage tiers, perfect for our short-term by_node metrics.

Usage

A service communicates to Metrics-Agent by sending a plaintext MABS command over a TCP connection to the local Metrics-Agent. Counters and Gauges are sent over the wire as integers and doubles, respectively. Histograms are sent as a Base64-encoded serialization of the backing T-Digest. An optional by_node=True command tag will flag host-level metrics for short term retention.

After input, Metrics-Agent performs the appropriate post-processing and forwards the metrics to the appropriate Kafka topic. The Kafka topic buffers metrics into a Spark aggregation job, and the output of the Spark aggregation job is sent to an Ingestor that writes data points to Goku TSDB.

Impact

Pinterest StatsCollector and MABS provide substantial benefits to our metrics pipeline. By migrating services to the MABS pipeline, we’re able to vertically aggregate metrics across an entire service. This drastically reduces the amount of data required to be stored in our TSDBs.

Only the true p95 is stored, rather than one p95 for every fleet machine.

By deploying MABS to internal Pinterest metrics, we were able to remove the excess host tag in our metrics. The resulting reduction in metrics dimensions produced up to a 99% savings in metrics storage.

Apart from storage volume and operational cost reductions, MABS finally gives Pinterest access to accurate Histogram metrics. Percentile metrics are incredibly important to monitoring the health of services. With accurate metrics, false positive and negative alerts can be reduced. The below graph is a real example of latency measurements that were improved through MABS. The red line is the legacy way of reporting histogram metrics, aggregated by the max function, while the blue line is the true metrics reported through MABS. MABS metrics are less jagged, provide a more accurate representation of the true picture, and, most importantly, won’t falsely trigger alerts.

MABS pipeline does not have false spikes

Conclusion

Metrics play an important role in any software company. Without a dependable way to report and display metrics, software engineers would be left blind — imagine flying an airplane without any of the speed, heading, or altitude gauges! MABS is the next step in making our metrics system more scalable, robust, and accurate. Our infrastructure currently supports a product that inspires more than 300 million Pinners a month to lead a life they love. As Pinterest moves toward the next 300 million Pinners, the Visibility team will constantly strive to build upon and improve the infrastructure that provides Pinners with a seamless, inspiring browsing experience.

Spending a summer as an Engineering Intern at Pinterest has been an incredible experience. There was an extraordinary amount of learning, responsibility, and fun packed into a short three months. Interns get the opportunity to work on real, impactful projects (like MABS!) that provide endless opportunities for growth and valuable exposure to a multitude of technologies. If you are a university student reading this, definitely consider applying. Feel free to reach out to me with any questions!

Acknowledgments

Colin Probasco for mentorship and support during the project. Thanks to Brian Overstreet, Naoman Abbas, Wei Zhu, Peter Kim, Humsheen Geo, and Dai Ngyuen of the Visibility team. Thanks to James Fish, the Storage & Caching Team, and the Logging team for inputs during design review. Thanks to Kevin Lin for helping with dev tooling.