Distributed Tracing for a Small Team

Jim Shields
YipitData Engineering
4 min read · Aug 5, 2020

Is distributed tracing valuable for a small engineering team?

This is the first in a series of posts about our ongoing experiment with distributed tracing at YipitData. There will be at least two more, one about why we chose AWS X-Ray for now, and one about how we’re integrating X-Ray into our developer experience.

Key Learnings

  1. Introducing an entirely new concept (like distributed tracing) to an engineering team is especially challenging
  2. It’s even harder when the value of the new concept isn’t apparent without seeing it work reliably for your own app in production
  3. Reducing friction as much as possible with a new concept is key to driving usage and value
  4. Sometimes the “best” solution isn’t necessary

Our Engineering Team: Empowering YipitData

At YipitData, our product includes data, research, and analysis for investors and companies. Because of this, we have very few public apps and small (<100) user bases, so our engineering problems tend to be different from larger-scale consumer app companies.

Our engineering team primarily focuses on empowering the rest of the company with technology by building internal tools for data collection, data analysis, publishing, data delivery, and client engagement, among other areas.

New Teams, New Apps, No Visibility

We have about 20 engineers, most of whom are on 5 small teams focusing on different use cases (e.g., data collection, data engineering, publishing, engineering infrastructure). We recently reorganized our teams, and multiple engineers took on new, unfamiliar codebases.

While this was an exciting change, engineers new to their codebases had lots of questions about what’s actually happening in their apps. This was especially true for apps that rely on workers to do data and report processing asynchronously. While these workers are valuable to keep the apps interactive and reliable, they can be especially slow and hard to debug.

In addition, our product team relies on our tools to be able to quickly and iteratively analyze and publish our data, so it’s important that engineers can understand the causes of failures and slow responses in those tools, especially as the tools grow in complexity.

This problem was well-suited to our infrastructure team (which I’m a part of), whose mission is to “enhance engineers’ ability to quickly develop & release solutions that are secure, reliable, scalable, and cost-efficient.” Ultimately, our goal with this project has been to enable faster debugging and bugfixes, especially of performance issues, which should enable better reliability and usability for our end users and easier maintenance and development for our engineers.

Can Tracing Help?

As we researched this problem space, it became clear that many engineering teams use distributed tracing to get this kind of app observability, especially when they’re running many microservices. But distributed tracing has a difficult reputation. Here’s a perfectly concise summary from Cindy Sridharan, an expert on distributed systems:

“Distributed Tracing is often considered hard to deploy and its value proposition questionable at best.”

(from Distributed Tracing — We’ve Been Doing It Wrong)

We have a small infrastructure team (currently 3 engineers), and a small engineering team focused on building tools for their users, so a hard-to-deploy and manual-to-instrument solution would almost definitely go unused.

That said, we were convinced that if tracing was easy to deploy and required little to no upfront instrumentation, engineers would be able to unlock value quickly. So we focused on making the deployment as frictionless as possible.

The most difficult roadblock, which only became apparent after testing with some of our engineers, was communicating the value of tracing: we realized the only way to do this effectively would be to deploy it on their apps in production.

Goal: Easy Observability With <10 Minutes of Setup

After learning about the potential value and challenges of tracing, we set a stretch goal of enabling teams to visualize both individual traces and their app’s service dependencies without having to manually instrument every service the app talks to (aiming for <10 minutes of setup).
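
To make that concrete, here’s a rough sketch of what that level of instrumentation can look like with the AWS X-Ray SDK for Python. The Flask app and the service name here are illustrative assumptions, not our actual setup:

    # Sketch only: assumes a Flask app; "internal-tool" is a made-up service name.
    from flask import Flask
    from aws_xray_sdk.core import xray_recorder, patch_all
    from aws_xray_sdk.ext.flask.middleware import XRayMiddleware

    app = Flask(__name__)

    xray_recorder.configure(service="internal-tool")
    XRayMiddleware(app, xray_recorder)  # record a segment for every incoming request
    patch_all()  # auto-instrument supported libraries (requests, boto3, etc.) as subsegments

That’s roughly the level of effort we’re aiming for: a few lines at app startup, with no per-endpoint or per-call instrumentation.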

Viewing individual traces is valuable for debugging performance issues and understanding the dependencies of a single request (and, we hope, of a single task execution for async workers):

Trace view (actual trace view from Jaeger, an open-source tracing tool)

The service map view is valuable as a shared visualization of what’s actually happening, to reinforce a mental model of the app’s architecture.

Service map view (basic example I made up)

The hope is that providing both views, along with easy querying and filtering of individual traces, all with minimal instrumentation, will enable significantly faster and more reliable debugging and development.
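
As one example of what that querying could look like (again a sketch, assuming X-Ray and a made-up latency threshold), trace summaries can be pulled programmatically with boto3 and a filter expression:

    import datetime
    import boto3

    # Sketch only: list traces from the last hour that took longer than 5 seconds.
    xray = boto3.client("xray")
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(hours=1)

    paginator = xray.get_paginator("get_trace_summaries")
    for page in paginator.paginate(
        StartTime=start,
        EndTime=end,
        FilterExpression="responsetime > 5",
    ):
        for summary in page["TraceSummaries"]:
            print(summary["Id"], summary.get("Duration"))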

In the next post, I’ll talk about how we evaluated the existing solutions, and why we chose AWS’s tracing offering, X-Ray, as the most frictionless intro to tracing for us right now.

Thanks to Hugo Lopes Tavares (https://twitter.com/hltbra) for his thorough and thoughtful review.

(We’re not hiring on the engineering team right now, but if alternative data analysis sounds interesting, we’re hiring for lots of other positions!)
