X-Ray’s service map, great for seeing the status of all of your services quickly

Why We Chose AWS X-Ray (for now)

Jim Shields
YipitData Engineering

--

This is the second post in a series about YipitData’s ongoing experiment with distributed tracing.

In the first post, I made the case for why distributed tracing is valuable even for a small engineering team.

In this post, I’ll make the case for why we chose AWS X-Ray as the lowest-cost, lowest-friction (if not necessarily most powerful) solution for storing and visualizing distributed traces for our webapps and async workers.

A Brief History of Observability at YipitData

Sentry for Exceptions

We’ve long relied on Sentry to understand exceptions in our apps. This has been effective but limited: Sentry only shows us exceptions, not what happens when the code is working, so it tells us nothing about, for example, latency.

(Sentry has recently added APM / distributed tracing to their offering. We haven’t investigated it yet, but it may be an idea for the future!)

Structured Logs for Scrapers

To collect publicly available web data to answer investors’ key questions, we run lots of web scrapers. For observability into these systems, we’ve relied on metrics and structured logs.

These logs give visibility into scraper performance and errors to our analysts, who can query & visualize them using SQL & PySpark in Databricks.

Structured Logs for Webapps

Recently, our engineering team has increasingly focused on building more apps for our teams and for our clients. As the number and complexity of these apps increased, we recognized the lack of app visibility as a problem.

At first, we built a lightweight, custom solution, very similar to our web scraping solution: a middleware for Django & Flask to stream structured logs of our app requests & responses to our data lake. This was potentially very powerful and provided a lot of raw data we could analyze.
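A minimal sketch of this kind of middleware, not the actual YipitData code. It assumes a plain WSGI app (Flask and Django can both be served this way) and uses a hypothetical `sink` file-like object to stand in for the stream to the data lake. The field names in the log row are illustrative.

```python
import io
import json
import time


class StructuredLogMiddleware:
    """WSGI middleware that writes one JSON row per request/response."""

    def __init__(self, app, sink):
        self.app = app    # the wrapped WSGI application
        self.sink = sink  # hypothetical: any file-like object standing in for the log stream

    def __call__(self, environ, start_response):
        started = time.time()
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            # Status arrives as e.g. "200 OK"; keep only the numeric code.
            captured["status"] = int(status.split(" ", 1)[0])
            return start_response(status, headers, exc_info)

        body = self.app(environ, recording_start_response)
        # Note: this measures time until the app returns its response iterable,
        # not the time taken to stream the body to the client.
        row = {
            "method": environ.get("REQUEST_METHOD"),
            "path": environ.get("PATH_INFO"),
            "status": captured.get("status"),
            "duration_ms": round((time.time() - started) * 1000, 3),
        }
        self.sink.write(json.dumps(row) + "\n")
        return body


def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


sink = io.StringIO()
wrapped = StructuredLogMiddleware(demo_app, sink)
body = wrapped(
    {"REQUEST_METHOD": "GET", "PATH_INFO": "/reports"},
    lambda status, headers, exc_info=None: None,
)
logged = json.loads(sink.getvalue())
```

Each request becomes exactly one flat row, which is easy to query in SQL but leaves no natural place for the database calls or downstream requests made while handling it.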

Though this approach worked great for our scrapers, it was limited for our apps, which have more dependencies and complexity, and rely on async workers to do a lot of data processing.

We found a few areas for improvement:

  1. You had to know how to query the logs, resulting in a “pull-based” workflow for engineers
  2. The data structure was restricted to rows, so we had to flatten everything that happens in a request / response
  3. Only web requests were logged, but much of our most complex and slowest code happens in our async workers

Distributed Tracing

We started investigating tracing as a solution to those problems late last year. We thought it provided a few key advantages over structured logs:

  1. While structured logs were restricted to rows, traces have a flexible nested structure, which makes them ideal for storing data about a request / response
  2. We knew we could apply the trace data structure to the code executed in asynchronous workers, where we have a major visibility gap
  3. Many of the tools that support tracing come with pre-built visualizations of service maps, request status across services, and easy querying that enable more frictionless visibility
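To make the first point concrete, here is a small sketch of the difference between a nested trace and the rows it has to be flattened into. The `Span` class and the span names are hypothetical and much simpler than what a real tracing backend stores.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Span:
    """One timed unit of work; children are the work it triggered."""
    name: str
    duration_ms: float
    children: List["Span"] = field(default_factory=list)


def flatten(span: Span, parent: Optional[str] = None, depth: int = 0) -> Iterator[Dict]:
    """Yield one row per span, depth-first: the shape a row-based log forces."""
    yield {"name": span.name, "parent": parent, "depth": depth, "duration_ms": span.duration_ms}
    for child in span.children:
        yield from flatten(child, span.name, depth + 1)


# A single request, stored the way a trace stores it.
trace = Span("GET /reports", 120.0, [
    Span("SELECT reports", 45.0),
    Span("s3.get_object", 60.0, [Span("decompress", 5.0)]),
])

# The same request as rows: the hierarchy survives only as parent/depth columns.
rows = list(flatten(trace))
```

With rows, reconstructing "what happened inside this request" means joining on the parent column; with a trace, that structure is the data itself.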

Finding a low-cost, low-friction solution

First, we researched more full-fledged observability & monitoring tools (usually branded as APMs, or Application Performance Monitoring).

We tested Honeycomb, Datadog, and New Relic, and found that while some can be very powerful (to varying degrees), they’re not cheap: often as expensive as, or more expensive than, running the machines themselves!

We’ve spent lots of time cutting our tech spend, and have oriented our team culture around cost awareness, so a potentially expensive new platform would be a hard sell.

In addition, tracing and observability are new concepts for our team, and we haven’t (even now) fully bought into their value for our small team and relatively small apps. So we recognized that using a new platform outside of AWS, on top of these new concepts, would introduce user friction we couldn’t afford.

AWS X-Ray (+ a little Yipit code)

We landed on AWS X-Ray as a low-cost, low-friction first step toward tracing. It might be less powerful than other APMs, but we decided it was the easiest option for proving the value of distributed tracing.

Though X-Ray introduced some small infrastructure challenges (they could easily fill another post), we got it working well with our internal infra platform (YAWS, aka Yipit AWS), so that turning on the infra integration is a simple checkbox for our engineers.

AWS has a Python SDK for sending traces from web frameworks. It’s easy to use and, critically for us, automatically adds calls to AWS services to traces. But it requires some standard config that we don’t want our engineers to have to remember every time, so we created a very small library that exposes a one-line Python installation for Django.
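As an illustration of what such a one-line installer might wrap, here is a sketch of a helper that adds the X-Ray SDK’s Django entries to a settings namespace. The `enable_xray` function and its defaults are hypothetical, not the library described above. The app path, middleware path, and `XRAY_RECORDER` keys are the ones the AWS X-Ray SDK for Python documents for Django.

```python
XRAY_APP = "aws_xray_sdk.ext.django"
XRAY_MIDDLEWARE = "aws_xray_sdk.ext.django.middleware.XRayMiddleware"


def enable_xray(settings: dict, service_name: str) -> dict:
    """Hypothetical helper: add X-Ray SDK config to a Django settings namespace.

    Intended to be called at the end of settings.py as
    `enable_xray(globals(), "my-app")`.
    """
    # Register the SDK's Django app once.
    apps = list(settings.get("INSTALLED_APPS", []))
    if XRAY_APP not in apps:
        apps.append(XRAY_APP)
    settings["INSTALLED_APPS"] = apps

    # Put the X-Ray middleware first so it wraps the rest of the stack.
    middleware = [m for m in settings.get("MIDDLEWARE", []) if m != XRAY_MIDDLEWARE]
    settings["MIDDLEWARE"] = [XRAY_MIDDLEWARE] + middleware

    # Assumed defaults; anything the app already set in XRAY_RECORDER wins.
    recorder = {
        "AWS_XRAY_TRACING_NAME": service_name,
        "AUTO_INSTRUMENT": True,
        "AWS_XRAY_CONTEXT_MISSING": "LOG_ERROR",
    }
    recorder.update(settings.get("XRAY_RECORDER", {}))
    settings["XRAY_RECORDER"] = recorder
    return settings


settings = enable_xray(
    {
        "INSTALLED_APPS": ["django.contrib.auth"],
        "MIDDLEWARE": ["django.middleware.common.CommonMiddleware"],
    },
    "reports-app",
)
```

A real installer would likely also patch client libraries so that AWS calls appear in traces, for example with the SDK’s `patch_all()` from `aws_xray_sdk.core`; that step is left out here to keep the sketch free of dependencies.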

This was a good start, but didn’t solve the hardest and most important problem: tracing our workers.

Tracing Workers

Typically, traces are sent from webapps, spanning a synchronous request & response. But some of our most critical, slow, and complex work happens in asynchronous workers, outside of a request & response.

A few years ago, we developed an open-source library, pyqs, to be our simple task manager, backed by SQS. We use this to run most of our asynchronous work, so it’s a good place to add a tracing integration.

To give ourselves the option to switch to other tracing solutions in the future, in case X-Ray isn’t powerful or easy enough, we added hooks to the start and end of each task, rather than adding an X-Ray-specific integration.
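The hook idea can be sketched roughly as follows. This is not pyqs’s actual API: `register_hook`, `run_task`, and the event names are hypothetical, and the hooks here only record strings so the example runs without any tracing backend.

```python
from typing import Any, Callable, Dict, List

Hook = Callable[[Dict[str, Any]], None]

# Hypothetical registry: the task runner knows about hooks, not about any tracer.
_hooks: Dict[str, List[Hook]] = {"pre_task": [], "post_task": []}


def register_hook(event: str, hook: Hook) -> None:
    _hooks[event].append(hook)


def run_task(name: str, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run a task, calling the registered hooks before and after it."""
    context: Dict[str, Any] = {"task_name": name, "error": None}
    for hook in _hooks["pre_task"]:
        hook(context)
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        context["error"] = exc  # let post-task hooks see the failure
        raise
    finally:
        for hook in _hooks["post_task"]:
            hook(context)


# Stand-in tracer. An X-Ray version might instead call
# xray_recorder.begin_segment(...) in the pre hook and
# xray_recorder.end_segment() in the post hook.
events: List[str] = []
register_hook("pre_task", lambda ctx: events.append(f"start:{ctx['task_name']}"))
register_hook(
    "post_task",
    lambda ctx: events.append(f"end:{ctx['task_name']}:{'error' if ctx['error'] else 'ok'}"),
)

result = run_task("add", lambda a, b: a + b, 2, 3)
```

Because the runner only knows about generic hooks, swapping tracing backends means registering different hook functions rather than changing the task manager.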

Then, similar to Django, we created a small library for a one-line pyqs integration, so users can turn on X-Ray in Django and pyqs with ~5 lines of code in the same place.

This is important for us — it makes this experiment a much easier two-way door if we ever want to disable X-Ray or move to another tracing solution.

Results: In Progress

If a user adds the above to their app (and checks the box on YAWS), they’ll get some useful visualizations for free in X-Ray & ServiceLens:

  1. ServiceLens service map for a webapp that talks to a database and S3 and has an async worker
  2. Async worker dashboard: latency, requests, and number of successes / failures
  3. Async worker trace view (also works for webapp request / response)

This is still a work in progress, but has already given a few teams better visibility into their app performance, errors, and dependent services.

X-Ray has its downsides (30 days of data retention, a maximum 6-hour lookback, somewhat difficult querying, two very similar consoles), but the visualizations and pre-built dashboard (through CloudWatch ServiceLens) clearly show the potential of distributed tracing.
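One way to live with the lookback limit, assuming it applies to each individual query, is to split a longer period into windows of at most six hours and query them one at a time. The `query_windows` helper below is a hypothetical sketch of that splitting; the dates are made up.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def query_windows(
    start: datetime,
    end: datetime,
    max_window: timedelta = timedelta(hours=6),  # assumed per-query limit
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (start, end) pairs, each no longer than max_window."""
    cursor = start
    while cursor < end:
        window_end = min(cursor + max_window, end)
        yield cursor, window_end
        cursor = window_end


# A 15-hour period becomes two full windows and one 3-hour remainder.
windows = list(query_windows(datetime(2020, 6, 1, 0, 0), datetime(2020, 6, 1, 15, 0)))
```

Each pair could then be passed as `StartTime` and `EndTime` to a trace query such as boto3’s X-Ray `get_trace_summaries`, with the results merged afterwards.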

We’re excited to see whether X-Ray can make app performance debugging much easier for our teams, especially as we build more apps to make our entire company and our clients more effective.
