Distributed tracing at Pinterest with new open source tools

By Suman Karumuri, Pinterest engineer, Visibility

The Pinterest backend consists of hundreds of microservices deployed across tens of thousands of machines. A simple request to the Pinterest home feed can make hundreds of network calls to dozens of backend services. While we know how long a request takes to execute, previously, we didn’t know why a request was slow. To collect this data, we built Pintrace, a distributed tracing pipeline that tracks a request as it’s processed across our Python and Java backend services. We also built Pintrace collector to help us debug performance issues. Pintrace collector is a Spark job that reads spans from Kafka, aggregates them into traces and stores them in an Elasticsearch backend. Today, we’re open sourcing Pintrace collector on Github, and contributing the code to the OpenZipkin community. By releasing it to the community, we hope others can benefit from the project as much as we do.

Background

Tracking down the source of latency can be very challenging because the slowness can be caused by any one of our backend services. Our metrics (OpenTSDB) and logging (ELK) infrastructure provide a coarse-grained view of how the requests are processed on the backend, but before Pintrace, it was hard to pinpoint why a request was slow.

To solve this challenge, we turned to distributed tracing, which tracks latency issues and shows where a request spends its time. It provides fine-grained visibility into request execution by capturing request causality information (i.e., which downstream requests were executed as part of a request) in addition to request latency information. Pintrace is our in-house distributed tracing pipeline that tracks a request as it’s processed across our Python and Java backend services.

Before we look into the details, let’s define some terminology. Annotations are events occurring during a request. For example, sending a request to the server is an annotation. A span represents a logical operation and contains a set of annotations that belong to that operation. For example, a RPC call consisting of “client send” and “client receive” annotations is represented as a span. A trace is a graph of spans. A request execution is usually captured as a trace. All spans in a trace have the same trace ID. For more details on how traces are structured, check out OpenTracing’s introduction.

Pintrace: a tracing pipeline

Building a distributed tracing infrastructure involves setting up a tracing pipeline. A tracing pipeline consists of two main components–source code instrumentation and trace processing backend.

The source code instrumentation is responsible for tracking the request across services and generating events (spans) as the request is processed along various services. The trace processing backend is responsible for aggregating the generated spans, processing them into traces and storing them. The stored traces can be searched and visualized in a UI.

The Pintrace tracing pipeline consists of several components shown in the following diagram. The components in blue are built in-house.

Source code instrumentation

Source code instrumentation is the bulk of the work when setting up a tracing pipeline. The instrumentation is responsible for generating spans that belong to the same request with the same trace ID. To achieve this, the instrumentation does three tasks: (a) propagating the trace ID along with every request (b) generating the spans for every logical operation performed by that component and © sending the generated spans to the backend. The first two responsibilities are implemented by a tracer. A logger implements the third responsibility. At Pinterest, we log the spans to Kafka using an analytics tool called Singer.

Pintrace backend consists of Thrift services written in Python, Java, Go, Node and C++, each using different frameworks. To trace a request across these services, we implemented a tracer for each language framework combination. Pinterest has a common framework per language, and most services we’re interested in are written in Python and Java, so we only had to instrument Python and Java frameworks. When we first started building Pintrace, the initial goal was to track down network request latency, so we focused solely on capturing the network activity of a request as spans in Python and Java applications.

Python instrumentation

Our Java services are fronted by a monolithic Python application called ngapi, a HTTP service that handles all incoming requests to our backend. To serve a request, ngapi talks to several Java services over Thrift and implements a custom web framework based on the Python gevent library. Since it’s a custom framework, we implemented our own tracer to trace the requests passing through it.

We wrote a tracer for our web framework using the open tracing API. The tracer propagated the trace ID from the HTTP interface to the Java Thrift service using a custom Thrift wrapper provided by Finagle. In addition, the tracer generated spans for all Thrift, Memcache and HTTP requests. To report these spans, we wrote a logger that took the generated spans, converted them to Zipkin format and logged them to Kafka using Singer.

Since ngapi is managed in a monorepo and deployed as a monolithic application, the change to enable tracing is deployed in a single step across all frontend services.
 Java tracer and logger

Our Java services use an in-house framework called “service framework,” a wrapper around Twitter’s Finagle library. Since Finagle ships with a Zipkin tracer, we didn’t build a new tracer for our Java services. By default, Finagle tracer writes the gathered spans to scribe or a log file, but since we wanted to write our spans to Kafka using Singer, we wrote a span logger that logs the spans to Singer.

Once the tracer and logger are in place, we enabled them in our service framework. However, our service framework lacked a way to enable tracing in all apps at once. To avoid duplicating this logic in more than 100 applications, we updated the service framework to have a global config and enabled tracing in it. Then we updated our Java apps to use this global config from the updated service framework. The hardest parts of enabling tracing in Java was making the change in hundreds of apps and working to deploy the changes across the organization.

Sampler

Even though our client instrumentation has really small latency overhead, there’s double digit computation overhead added by capturing the spans and logging them. To reduce this overhead, as well as the cost of storing the generated traces, we only trace a small set of the overall requests.

A sampler component decides which requests should be traced and randomly selects between 0 to 1 percent (typically 0.3 percent) of all of our backend requests. This rate can be dynamically adjusted using the decider framework. We also sample all requests on development machines so developers can use traces as part of their day-to-day activities.

Trace processing backend

The trace processing backend, as shown in the diagram above, is responsible for aggregating the spans from thousands of machines in our cluster, processing them into traces and storing and visualizing them.

Singer-Kafka pipeline

The Singer-Kafka pipeline is responsible for aggregating the spans generated by several processes across thousands of machines and writing them to a Kafka topic. We use Singer as our span aggregation pipeline. The Python and Java loggers log the spans to local files. The Singer daemon installed on each host tails these files and writes the spans to a Kafka topic.

Pintrace collector

We used the OpenZipkin backend to ingest the spans, but the captured traces revealed a few issues. For example, some of our instrumentation can be faulty. In those cases, we’d like to fix the faulty span or drop the span until we fix the instrumentation. Even with careful sampling, at our scale, we see a lot of similar traces. Instead of storing those traces, it’d be more cost-effective to further sample the traces at ingest time and only store high-value traces.

These features were challenging to add to our backend, so we built Pintrace collector, a Spark job that reads spans from Kafka, aggregates them into traces and stores them in an Elasticsearch backend. We chose Spark for implementing the pipeline since it’s flexible and scalable enough to implement the needed filtering, grouping spans by trace ID and aggregating spans over time windows for analytics purposes. As an added bonus, the Spark job allows us to run real-time analytics on the spans without storing them.

The image below shows the internal architecture of Pintrace collector. StreamFactory is a extensible interface that ingests data from Kafka or any other transport. The ingested spans are then processed to ensure they’re properly formed using a data quality checker. Optionally, the filtering step filters spans based on criteria like service name or annotations on the span. The aggregation phase groups the spans by time or trace ID. The final consumer stage persists the data to a storage system like ElasticSearch service.

Zipkin UI

As shown in the architectural diagram, we use Zipkin UI to search and view the traces stored in the ElasticSearch cluster. We’ve also contributed a few bug fixes to the Zipkin UI.

Open sourcing Pintrace collector

Over the last year, we’ve noticed others in the OpenZipkin community looking for the same solutions we implemented in Pintrace, such as more advanced sampling and stream processing to support Vizceral visualizations. We wanted to share our work with the community while also leveraging its collective expertise. In order to be as open as possible, we’re contributing our code directly to OpenZipkin under zipkin-sparkstreaming code base. We encourage you to watch or collaborate with us in building Zipkin-spark. We hope you find the Pintrace collector as useful as we do. We can’t wait to see new ideas from the community.

Acknowledgements: The contributors to this project are Naoman Abbas, Phoebe Tse, Ashley Huynh, Alejandro Garcia Salas, Emmanuel Udotong, Brian Overstreet, Xiaoqiao Meng and Suman Karumuri. We would also like to thank Adrian Cole for his feedback and advice during the project.