Distributed Tracing for Ruby on Rails Microservices with OpenCensus/OpenTelemetry (part 1)
This series of articles is based on a talk I gave at RailsConf 2019, titled “Troubleshoot Your RoR Microservices with Distributed Tracing.” In part 1 of the series, I will introduce distributed tracing: why you need it, what it is, and how it helps in a microservices architecture. In part 2, I will introduce OpenCensus and show how to use it in Rails.
Microservices are great, but too hard?
Before I go into distributed tracing, let me first talk a little bit about microservices. If you work on a microservices architecture, or any sort of distributed system, I presume you have suffered while debugging or fixing it whenever a problem happens. Everyone talks about microservices, and yes, it is in fact a way of scaling a system and a team. At the beginning it sounds great, but once you start running microservices in production, you realize how difficult it is to do things right.
If you work on a monolithic Rails app, you are lucky: you can stay productive in your codebase, unless it’s a gigantic 10-year-old Rails app. But if your product is successful, your system and your engineering team can grow very quickly, and whether you like it or not, your system tends to become a distributed system.
Let me show you something real from our system.
This is how our system looked when it launched back in 2012, when we were a small startup and only three software engineers worked at the company. Everyone must be familiar with this setup and know how easy it is to work on. We only had a single monolithic Rails app with one tiny database. Obviously, the only programming language we used was Ruby.
After 7 years, we are now a public company and have offices in 4 countries. We provide many mobile apps on the platform, and we have about 100 services and 20 databases. Our entire system is now a polyglot architecture where we use 5 different languages: Ruby (for human productivity), Go (for computer productivity), Python (for data scientists’ needs), Node (for frontend lovers) and Rust (for a random reason). The system looks like the graph above. You can tell how complicated it is.
Hard to tell which components are involved in a single end-user request
One of the biggest challenges with a microservices architecture is understanding an entire system architecture.
When a problem happens, figuring out specifically which microservices are involved in an end-user interaction is very often what you need to do.
In this example, you see our app, Wantedly People, on the left-hand side. This app helps you manage your contacts by reading business cards. When you scan business cards, the app detects the cards and sends their images to the backend. The backend then extracts text from the images, categorizes it into fields like first name, last name, email, company name, address, and phone number, and fetches information about the company and recent news articles to send back to the user. So it does a lot of things under the hood.
The sequence chart on the right-hand side illustrates exactly which microservices and databases are involved, in what sequence, and from which microservice each is called.
This chart was made by hand, by carefully reading through the source code of many different microservices. We made this particular one because it is so important for new engineers on the team to understand this flow. However, you can’t do this for every API endpoint you have. While that is certainly not an effective way to spend your time, it is also true that you need this type of information whenever your users see your system malfunctioning. I’ll explain why.
For instance, let’s assume you have 7 microservices (A, B, C, D, E, F and G) plus 2 databases, serving 2 features, X and Y. For feature X, an app makes an API call to service A; service A calls C and D; C runs a query on a database and calls F; and D calls G.
Let’s say your monitoring tool shows that service G is raising an unusual number of errors. You look at recent changes deployed to service G, but it turns out there was no code change. So you have to start investigating what caused the errors.
If you don’t have a clear picture of the dependencies between your microservices, all you can do is guess which microservices could affect G and review any changes made to each of them, in order to find the root cause.
Even with a better understanding of the dependencies, it is still difficult to tell the implications for users. When something goes wrong, you want to know the exact user impact, so you need to understand which features are affected. Maybe feature X is not that important, while feature Y is much more critical to users; if there is an issue with Y, you want to fix or mitigate it as soon as possible.
Similarly, when your users experience slowness in a certain feature and you want to investigate the root cause, it is not obvious which microservice or database is causing the problem.
Distributed tracing helps you!
These are the problems that distributed tracing solves. Distributed tracing is a technique that captures causal links of operations involved in a single request.
This is a simplified view of an example trace for feature X. From this trace you can tell which components are involved, in what order they are called, how long each operation takes, and which operation is failing or taking more time.
Just to explain some terminology of distributed tracing:
- A trace is a set of operations performed within a single end-to-end request.
- A trace can contain multiple spans. A span represents an operation.
- Spans can be nested to make a tree of spans, which represents causality.
- A span can be created by a function call within a process, a remote procedure call (over HTTP or gRPC, for example), or anything else you would like to instrument.
- A span has a name, and records start and end times along with other annotations.
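The terminology above can be sketched with a toy model in plain Ruby. This is not the OpenCensus API (part 2 covers that); it is just a minimal illustration of how nested spans with start and end times form a trace:

```ruby
# A minimal toy model of a trace: NOT the OpenCensus API, just an
# illustration of spans, nesting, and timing.
class Span
  attr_reader :name, :children, :start_time, :end_time

  def initialize(name)
    @name = name
    @children = []
  end

  # Run a block as this span's operation, recording start/end times.
  # Nested child spans created inside the block are collected in @children.
  def record
    @start_time = Time.now
    yield self
    @end_time = Time.now
    self
  end

  # Create a nested span, representing a causal child operation.
  def in_span(name, &block)
    child = Span.new(name)
    @children << child
    child.record(&block)
  end

  def duration
    @end_time - @start_time
  end
end

# A "trace" for feature X: the root span calls services C and D,
# and C runs a database query.
trace = Span.new("GET /feature_x").record do |root|
  root.in_span("call service C") do |c|
    c.in_span("db query") { sleep 0.01 }
  end
  root.in_span("call service D") { sleep 0.01 }
end

puts trace.children.map(&:name).inspect
# => ["call service C", "call service D"]
```

Each span knows its name, its timing, and its children, which is exactly the tree a tracing UI renders as a waterfall chart.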
Here are screenshots of the distributed tracing UIs provided by Datadog and Stackdriver. These are just examples; most other distributed tracing backends provide similar views.
In the Datadog APM trace view, each color corresponds to a service, which means there are 6 microservices involved in the trace above. The right-hand side shows where most of the time is spent. If you scroll down, you’ll see that many requests are made concurrently. When you click a span, you’ll see more information, like the URL path, HTTP method, parameters, request size, response code, stack trace and so on.
Datadog also provides what it calls a “service map”, which is effectively a dependency graph of microservices and databases, similar to the one I showed earlier. A circle represents a component, either a microservice or a database. An arrow represents a dependency between them, which can be an API call or a DB query. What’s good about this is that it is automatically created from live tracing data, so it tells you the truth about your production system. If you click any of the circles, you can see the upstream and downstream dependencies of that component.
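Conceptually, building such a map from trace data is straightforward: every span records which service it ran in, and every parent-child span pair that crosses a service boundary becomes an edge. The sketch below illustrates that idea in plain Ruby; it is not Datadog’s implementation, and the span hashes are a made-up minimal shape:

```ruby
# Sketch: derive a service dependency graph from span data.
# Each span hash records the service it ran in and its parent span id.
# (Hypothetical data shape for illustration, not Datadog's format.)
spans = [
  { id: 1, parent_id: nil, service: "app" },
  { id: 2, parent_id: 1,   service: "service-a" },
  { id: 3, parent_id: 2,   service: "service-c" },
  { id: 4, parent_id: 3,   service: "postgres" },
  { id: 5, parent_id: 2,   service: "service-d" },
]

by_id = spans.to_h { |s| [s[:id], s] }

# An edge exists wherever a child span ran in a different service than
# its parent: that boundary is an API call or a DB query.
edges = spans.filter_map do |s|
  parent = by_id[s[:parent_id]]
  next unless parent && parent[:service] != s[:service]
  [parent[:service], s[:service]]
end.uniq

edges.each { |from, to| puts "#{from} -> #{to}" }
# app -> service-a
# service-a -> service-c
# service-c -> postgres
# service-a -> service-d
```

Because the edges come from real request traffic, the resulting graph reflects what actually happens in production rather than what the architecture diagram claims.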
This one is from Stackdriver on Google Cloud Platform, and it’s pretty similar to Datadog. You can filter traces by name and latency, for example, “traces that took more than 1000ms.” It visualizes the distribution of trace latency in a chart at the top left, and you can click a dot on the chart to see the details of a trace.
Each distributed tracing provider and open source solution offers more features and views, which I recommend you research individually to see which one meets your team’s needs.
OpenCensus (will be OpenTelemetry)
There are many choices of distributed tracing backends, including paid SaaS solutions and open source solutions. They generally provide similar UIs, but there certainly are differences in features. Also, the supported programming languages may vary depending on your choice.
Here is where OpenCensus fits in.
OpenCensus is a set of vendor-neutral libraries that collect and export traces and metrics. It provides libraries for major programming languages like Java, Go, Node, Python, C++, C#, PHP, Erlang/Elixir, and of course Ruby.
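As a quick preview (part 2 covers this in detail), wiring OpenCensus into a Rails app is roughly a Gemfile entry plus an initializer. Treat the snippet below as a sketch based on the opencensus-ruby README; the exact configuration keys and exporter depend on your backend, so check the gem’s documentation before copying it:

```ruby
# Gemfile
gem "opencensus"

# config/initializers/opencensus.rb
# Sketch only: loads the Rails integration (a Railtie that traces
# incoming requests) and configures sampling.
require "opencensus/trace/integrations/rails"

OpenCensus.configure do |c|
  # Sample every request while experimenting; switch to a
  # probability sampler before sending production traffic.
  c.trace.default_sampler = OpenCensus::Trace::Samplers::AlwaysSample.new
end
```

By default traces are logged; to ship them to a backend like Stackdriver or Datadog, you configure the corresponding exporter in the same initializer.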
In the next article, I will introduce OpenCensus and show how you can use it in Ruby.
As a side note, OpenCensus and OpenTracing (which is also a set of libraries for tracing) announced their merger in March 2019 and published a merger roadmap in April. The new project will be called OpenTelemetry.
You may wonder whether you should hold off for a while. But no: this merger should not stop you from adopting OpenCensus for distributed tracing, as the projects clearly state that backward compatibility will be provided for some period.
As of this writing, it seems like OpenTelemetry is in a very early stage of development, and there will be good opportunities to contribute to the ecosystem. You can look at OpenTelemetry’s GitHub organization to find out more.
Here is a slide deck of my talk in case you want to take a look at it!