Distributed Tracing at Massive Scale

LiveRamp Engineering · Aug 5, 2019

When a developer thinks about monitoring and observability of their production application, two things generally come to mind: metrics and logs. While those are really useful for debugging and monitoring purposes, a critical monitoring element is still missing.

Metrics and logs do not (easily) show us how a request moves across multiple services, packages, and infrastructure components; this is where distributed tracing comes in. Distributed tracing gives us granular data about a request’s journey through traces and spans. Traces are records of the entire lifecycle of a request, including each service and package it interacts with (e.g., NGINX, the application itself, BigTable, etcd, Pub/Sub). Since LiveRamp’s application is built on an HTTP middleware stack, it is even more useful to see the request flow through these middlewares.

Here’s an example of a trace, visualized as a flame graph in Datadog APM:

Each trace is made up of one or more spans: a span is a unit of work that is up to you to define; in our case, each span represents an HTTP middleware. Spans can include information such as the status of an HTTP request, the name of the service executing the operation, timestamps, links to other spans, and any additional metadata that the developer thinks can be useful for debugging at a later stage.

The spans (light green) show the order in which the request flows through the different middlewares, as well as the time spent in each of them. Imagine having such visibility available to you while you are debugging a production incident: traces make it incredibly easy to identify bottlenecks and diagnose high latency.
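
To make traces and spans a bit more concrete, here is a minimal sketch of how a per-middleware span could be created. We’re assuming Go and the OpenCensus tracing library for illustration; the `WithSpan` helper and the attribute keys below are hypothetical, not our production code.

```go
package middleware

import (
	"net/http"

	"go.opencensus.io/trace"
)

// WithSpan wraps an HTTP handler (or middleware) in its own child span so
// that each hop in the middleware chain shows up as a separate bar in the
// flame graph. The middleware name ("auth", "rate-limit", ...) becomes the
// span name.
func WithSpan(name string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// StartSpan creates a child of whatever span is already in the
		// request context (e.g. the server span created by the HTTP plugin).
		ctx, span := trace.StartSpan(r.Context(), name)
		defer span.End()

		// Attach metadata that may be useful when debugging later.
		span.AddAttributes(
			trace.StringAttribute("http.path", r.URL.Path),
			trace.StringAttribute("http.method", r.Method),
		)

		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```

Wrapping each middleware this way is what produces one span per middleware in a trace like the one above.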

Let’s have a quick Q/A session to learn more about Distributed Tracing (DT):

Q: So far we’ve heard a lot of great things about you, DT, and nothing bad. Are you really that good?

DT: Oh thank you, but let me be really humble here. I’ve got a lot of gotchas that not many people know about, but I’m gonna tell you. First things first, I’m quite expensive in memory and CPU, so you can’t just hope to trace every incoming request, especially if I’m instrumented on a high-traffic application. You’ve gotta choose a small fraction of your request volume to trace.
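
(As an aside for the reader: here is a rough sketch, in Go with OpenCensus, of what sampling only a small fraction of requests looks like in practice. The 1-in-1,000 rate and the trivial handler are purely illustrative, not what our pixel server actually runs.)

```go
package main

import (
	"log"
	"net/http"

	"go.opencensus.io/plugin/ochttp"
	"go.opencensus.io/trace"
)

func main() {
	// Keep roughly 1 in 1,000 requests. The decision is made when the root
	// span is created, i.e. before we know how the request will end.
	trace.ApplyConfig(trace.Config{
		DefaultSampler: trace.ProbabilitySampler(0.001),
	})

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	// ochttp.Handler starts the root server span for every incoming request.
	log.Fatal(http.ListenAndServe(":8080", &ochttp.Handler{Handler: mux}))
}
```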

Q: So we can only trace a limited number of requests!? How do we choose which ones? And since you’re so damn expensive, we’ve gotta make that sampling decision as the request comes in, not when it’s on its way out of the application. My application serves 200 OK responses 99.9% of the time, so we’re far more interested in tracing the requests we answer with a 4XX or 5XX. Are we just supposed to roll a die and get lucky?

DT: Until recently, that’s exactly what you were supposed to do: get lucky.

Q: What changed recently?

DT: Earlier, you had to make the sampling decision at the very beginning of the request flow, which is referred to as “head-based sampling”. Recently, OpenCensus, an open-source framework for metrics collection and tracing, came out with a solution for intelligent tail-based sampling in its collector, the `oc-collector`. In contrast to head-based sampling, the sampling decision here is made after the entire trace has been collected. There are currently a few ways to enable this (source):

  • rate limiting: the maximum number of spans per second to export
  • string tag filter: traces with the specified key/string-value tags are exported
  • numeric tag filter: traces with the specified key/numeric-value tags are exported
  • always sample: send all traces as complete traces

Hence, you can now tag any request of your choosing with specific key-value pairs and send the matching traces to any supported backend for analysis!
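
To make that concrete, here is a hypothetical Go/OpenCensus snippet that attaches such key-value pairs to the active span. The attribute keys and the `X-Partner-ID` header are purely illustrative, and the matching string/numeric tag-filter policies would live in the collector’s configuration.

```go
package tracing

import (
	"net/http"

	"go.opencensus.io/trace"
)

// TagForTailSampling adds attributes that tail-sampling policies in the
// collector (string or numeric tag filters) can later match on.
func TagForTailSampling(r *http.Request, statusCode int) {
	span := trace.FromContext(r.Context())
	if span == nil {
		return // no span in the request context
	}
	span.AddAttributes(
		// A numeric tag filter could export every trace with status >= 400.
		trace.Int64Attribute("http.status_code", int64(statusCode)),
		// A string tag filter could export traces for a specific partner.
		trace.StringAttribute("partner", r.Header.Get("X-Partner-ID")),
	)
}
```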

Q: Wow DT, you just got a major upgrade with tail-based sampling. We hear companies like Datadog have started offering similar solutions, where you can essentially filter the traces you want directly in their UI, saving you the trouble of setting this up via the `oc-collector`. What is your take on that?

DT: Good question. Yes, it is effectively the same, but it comes at a cost — the cost of egressing large amounts of trace data from your application’s infrastructure over to those companies. `oc-collector` saves you on that egress cost by filtering out traces in your own infrastructure, sending over only the traces that you need to those backends.

Well, thanks for your time, DT. We wish you loads of tail-sampled success.

Great, now that we know a bit about Distributed Tracing, what does LiveRamp use it for?

At its peak, LiveRamp’s pixel server receives close to 200,000 requests/sec. We’re currently operating our Tracing Infrastructure successfully, and have a host of use cases for distributed tracing:

  • Used to power an internal application that allows our Technical Services staff to dynamically enable traces for particular types of requests. For example, tracing requests from particular clients/partners by injecting specific tags into requests coming from those partners.
  • Used as a primary debugging tool when fighting production issues. This is particularly helpful since we’ve set up thorough tracing of 4XX/5XX requests, thanks to tail-based sampling. For example, if our server is issuing more 5XXs than it normally does, we can look at the traces and pinpoint problems directly in the code itself.
  • Debugging third-party services not owned by LiveRamp. For example, if our production app needs to do a real-time lookup against Google’s BigTable, we want to know and take appropriate action when BigTable latency increases beyond a reasonable threshold. Tracing allows us to associate metadata with each request so we can correlate latency values with pertinent metadata, e.g. certain keys could be causing longer lookups (see the sketch after this list).
  • Looking at normal request flow: it’s just great to actually be able to see the flow of a normal request through your application, and to be able to visualize the flow in terms of parent and child spans.
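
As promised above, here is a sketch of how a traced third-party lookup could be annotated with metadata. It is a Go/OpenCensus illustration under our own assumptions: the `TracedLookup` helper, span name, and attribute key are hypothetical, not our actual BigTable client code.

```go
package lookup

import (
	"context"

	"go.opencensus.io/trace"
)

// TracedLookup wraps an external lookup (e.g. a Bigtable read) in its own
// span and records metadata we can later correlate with the span's latency.
func TracedLookup(ctx context.Context, rowKey string,
	read func(context.Context, string) error) error {

	ctx, span := trace.StartSpan(ctx, "bigtable.ReadRow")
	defer span.End()

	// Key prefixes (or any other request metadata) can be compared against
	// span durations to find out whether certain keys cause slow lookups.
	span.AddAttributes(trace.StringAttribute("bigtable.key_prefix", keyPrefix(rowKey)))

	if err := read(ctx, rowKey); err != nil {
		span.SetStatus(trace.Status{Code: trace.StatusCodeUnknown, Message: err.Error()})
		return err
	}
	return nil
}

func keyPrefix(key string) string {
	if len(key) > 8 {
		return key[:8]
	}
	return key
}
```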

Why did we opt for OpenCensus?

  • OpenCensus is pluggable: you can always swap out tracing backends (see the sketch after this list).
  • OpenCensus is easy to use: its plugins let you instrument your entire server with simple configuration instead of having to create custom traces for everything from the get-go.
  • OpenCensus is driving forward the binary wire format used to propagate trace context for distributed traces.
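
As a quick illustration of that pluggability (again a hypothetical Go sketch, not our setup): anything that implements OpenCensus’s `trace.Exporter` interface can act as a backend, so switching backends is just a matter of registering a different exporter.

```go
package exporters

import (
	"log"

	"go.opencensus.io/trace"
)

// logExporter is a stand-in backend: any type with an ExportSpan method
// (the trace.Exporter interface) can receive finished spans.
type logExporter struct{}

func (logExporter) ExportSpan(sd *trace.SpanData) {
	log.Printf("span=%s trace=%s duration=%s",
		sd.Name, sd.TraceID.String(), sd.EndTime.Sub(sd.StartTime))
}

func init() {
	// Swapping backends means registering a different exporter here
	// (Datadog, Jaeger, Zipkin, the OpenCensus agent, ...); the
	// instrumentation code itself does not change.
	trace.RegisterExporter(logExporter{})
}
```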

OpenTracing and OpenMetrics are other open standards and frameworks, but neither puts together the entire observability feature set, whereas OpenCensus covers both tracing and metrics and is working on logging, too.

Recently, the leadership teams of OpenTracing and OpenCensus announced a joint effort to merge the two projects, and they are migrating their respective communities to a single, unified initiative called OpenTelemetry. As it matures, we look forward to adopting this new standard within our systems as well.

We really hope this post was useful for you, and motivates you to try out Distributed Tracing yourself!

Keep an eye out for how we implemented distributed tracing in Part 2!

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

This post was written by Gurkanwal Singh Batra in collaboration with E-Lynn Yap and Julian Modesto, software engineers. Curious to learn more? We’re always hiring.
