Take OpenTracing for a HotROD ride

Yuri Shkuro
May 6, 2017 · 20 min read
Photo by Trekphiler

Update (21 March 2019): my book Mastering Distributed Tracing has a newer version of this tutorial.

Update (July 2017): watch this demo live at OpenShift Commons Briefing.

OpenTracing is a new, open standard for instrumenting applications and OSS packages for distributed tracing and monitoring. In this post we explore the features and capabilities of OpenTracing using HotROD, a demo application from Jaeger, a distributed tracing system recently donated by Uber Technologies to the Cloud Native Computing Foundation. The presentation is structured as a walkthrough / tutorial showcasing how OpenTracing concepts can be applied in practice to monitor a microservices-based architecture and perform root cause analysis of performance issues. It also emphasizes the key feature of OpenTracing: being a vendor-neutral API, it allows other open source frameworks, such as those used for RPC, to implement distributed tracing without locking themselves into a single vendor. The HotROD demo app's source code contains no explicit use of the OpenTracing API around the RPC calls; instead, all of that instrumentation comes for free from open source frameworks.

Prerequisites

  1. The recommended way to run the Jaeger backend is with a Docker image, thus a Docker installation is required.
  2. The complete source code of the demo is available on GitHub:
    https://github.com/uber/jaeger/tree/master/examples/hotrod
    Please see instructions in the README on how to download and run both the Jaeger backend and the HotROD demo app.

Agenda

  1. Use the tracing UI to understand the app’s architecture and data flow.
  2. Compare standard logs vs. OpenTracing logs.
  3. Identify sources of latency and errors in the call graph.
  4. Attribute time spent in the leaves of the call tree to information only available at the root of the tree via “baggage” propagation.
  5. Obtain RPC endpoint metrics without any additional instrumentation.
  6. Look at code examples of how each of the above features was achieved. Most of them come for free, i.e. without tedious manual instrumentation, because we use OSS frameworks that are already integrated with OpenTracing.

Start Jaeger

docker run -d -p6831:6831/udp -p16686:16686 jaegertracing/all-in-one:latest

Once the container starts, open http://127.0.0.1:16686/ in the browser to access the Jaeger UI.

Jaeger UI front page, the Search screen.

The container runs the Jaeger backend with an in-memory store, which starts out empty, so there is not much we can do with the UI until we have some traces. Let's generate a few using the HotROD app.

Meet the HotROD

git clone git@github.com:jaegertracing/jaeger.git jaeger
cd jaeger
go run ./examples/hotrod/main.go all

EDIT (2018-03-31): starting with version 1.3, the Jaeger project includes a Docker image for the HotROD example (see the README), which is good for a quick test run of tracing the app, but in this post we will also be making small changes to the source code, so running from source is recommended.

The all command tells the app to run all microservices from a single binary. In the logs written to standard output we can see these microservices starting several servers on different ports:

2017-05-03T23:53:19.693-0400 INFO cmd/all.go:31 Starting all services
2017-05-03T23:53:19.696-0400 INFO log/logger.go:43 Starting {"service": "route", "address": "http://127.0.0.1:8083"}
2017-05-03T23:53:19.696-0400 INFO log/logger.go:43 Starting {"service": "frontend", "address": "http://127.0.0.1:8080"}
2017-05-03T23:53:19.697-0400 INFO log/logger.go:43 Starting {"service": "customer", "address": "http://127.0.0.1:8081"}
2017-05-03T23:53:19.697-0400 INFO log/logger.go:43 TChannel listening {"service": "driver", "hostPort": "127.0.0.1:8082"}

Let’s navigate to the main service at http://127.0.0.1:8080:

HotROD single-page web app

We have four customers, and by clicking one of the four buttons we summon a car to arrive at the customer's location, perhaps to pick up a product and deliver it elsewhere. Once a request for a car is sent to the backend, it responds with the car's license plate number and the expected time of arrival:

We ordered a car that will arrive in 2 minutes.

There are a few bits of debugging information we see on the screen.

  1. In the top left corner there is a web client id: 9323. It is a random session ID assigned by the Javascript UI; if we reload the page, we get a different session ID.
  2. In the line about the car we see a request ID req: 9323-1. It is a unique ID assigned by the Javascript UI to each request it makes to the backend, composed of the session ID and a sequence number.
  3. The last bit of debugging data, latency: 782ms, is measured by the Javascript UI and shows how long the backend took to respond.

This additional information has no impact on the behavior of the application, but will be useful when we look under the hood.

The Architecture

Empirically constructed service dependency diagram

It turns out the single HotROD binary is actually running four microservices and, apparently, two storage backends. The storage nodes are not real: they are simulated by the app as internal components, but the top four microservices are indeed real; we saw each of them log the network address of the server it runs. The frontend microservice serves the Javascript UI and makes RPC calls to the other three microservices. The graph also shows the number of calls made to handle the single request for a car, e.g. the route service was called 10 times, and there were 14 calls to Redis.

The Data Flow

The first trace

The system found one trace and displayed some metadata about it, such as the names of the services that participated in the trace and the number of spans each service emitted to Jaeger. The top-level endpoint name is displayed in the title bar: HTTP GET /dispatch. On the right side we see that the total duration of the trace was 774.85ms. This is shorter than the 782ms we saw in the HotROD UI, which is not surprising, because the latter was measured from the HTTP client side by Javascript, while the former was reported by the Go backend. The 7.15ms difference between these numbers can be attributed to network latency. Let's click on the trace title bar.

Trace timeline view

The timeline view shows a trace as a time sequence of nested spans, where a span represents a unit of work within a single service. The top-level span, also called the root span, represents the main HTTP request from the Javascript UI to the frontend service, which in turn called the customer service, which in turn called a MySQL database. The width of each span is proportional to the time the operation took, whether that is the service doing actual work or waiting on a downstream call.

From this view we can see how the application handles a request:

  1. The frontend service receives the external HTTP GET request at its /dispatch endpoint.
  2. The frontend service makes an HTTP GET request to the /customer endpoint of the customer service.
  3. The customer service executes an SQL SELECT statement in MySQL and returns the results to the frontend service.
  4. The frontend service then makes an RPC request Driver::findNearest to the driver service. Without drilling further into the trace details we cannot tell which RPC framework is used to make this request, but we can guess it is not HTTP (it is actually made over TChannel).
  5. The driver service makes a series of calls to Redis. Some of those calls are highlighted with a pink background, indicating failures.
  6. After that the frontend service executes a series of HTTP GET requests to the /route endpoint of the route service.
  7. Finally, the frontend service returns the result to the external caller.

If we click on any span in the timeline, it expands to show more details, including span tags, process tags, and logs. Let’s click on one of the failed calls to Redis:

Expanded span details

We see that the span has the tag error=true, which is why the UI highlighted it as failed. It also contains a log statement that explains the nature of the error as "redis timeout", as well as the driver_id that the driver service was attempting to retrieve from Redis.
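
As an aside, producing such a span takes only a couple of calls with the OpenTracing API for Go. The snippet below is a hedged sketch rather than HotROD's actual code (the helper name and the log keys are illustrative), using only the opentracing-go package and its ext and log helpers:

import (
    "context"

    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
    otlog "github.com/opentracing/opentracing-go/log"
)

// markRedisTimeout is a hypothetical helper that annotates the current span
// the way the failed GetDriver span above is annotated.
func markRedisTimeout(ctx context.Context, driverID string) {
    span := opentracing.SpanFromContext(ctx)
    if span == nil {
        return
    }
    // The error=true tag is what makes the UI render the span as failed.
    ext.Error.Set(span, true)
    // A timestamped log attached to the span explains what went wrong.
    span.LogFields(
        otlog.String("event", "redis timeout"),
        otlog.String("driver_id", driverID),
    )
}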

Contextualized logging

Typical logging output

It is quite difficult to follow the application logic from these logs, and we are only looking at the output from a single request. Imagine many concurrent requests going through the system: the logs become nearly useless in that case. So let's take a different approach and view the logs collected by the tracing system. For example, click on the root span to expand it and its logs.

Logs recorded by the tracing system in the root span.

This view shows only the logs that happened during the execution of the root span. We call them contextualized because they are captured in the context of a specific request, and even a specific span within the trace. We saw earlier an error log from the timed-out Redis call: in stdout it would have been mixed with many other statements, but in the tracing system it is neatly isolated to the service and span where it occurred. Contextualized logs allow us to drill into the behavior of the application without worrying about logs from other parts of the program or from other concurrent requests.

Span Tags vs. Logs

Tags and Logs

In the customer span we can see a tag http.url showing that the request to the /customer endpoint had the parameter customer=123. In the mysql span we see an sql.query tag showing the exact SQL query that was executed.

How do we decide whether a piece of data should go into span tags or span logs? The OpenTracing API does not dictate this; the general principle is that information that applies to the span as a whole should be recorded as a tag, while events that have a timestamp should be recorded as logs.

The OpenTracing specification repository defines semantic data conventions that prescribe certain well known tag names and log fields for common scenarios. Instrumentation is encouraged to use those names to ensure that the data reported to the tracing system is well defined and portable across different tracing backends.
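
To make the tags-vs-logs distinction concrete, here is a small, hedged sketch (not HotROD's code; the values are borrowed from the spans described above) of how such tags and logs could be written with opentracing-go, using the ext package for the well-known tag names:

import (
    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
    otlog "github.com/opentracing/opentracing-go/log"
)

// annotateSpans is purely illustrative; in HotROD the HTTP tags live on the
// customer span and the SQL tag on the mysql span.
func annotateSpans(httpSpan, sqlSpan opentracing.Span) {
    // Tags: properties of the operation as a whole, using the well-known
    // names from the semantic conventions where they exist.
    ext.HTTPMethod.Set(httpSpan, "GET")
    ext.HTTPUrl.Set(httpSpan, "http://127.0.0.1:8081/customer?customer=123")
    sqlSpan.SetTag("sql.query", "SELECT * FROM customer WHERE customer_id=123")

    // Logs: events that happen at a specific point in time within a span.
    sqlSpan.LogFields(otlog.String("event", "acquired DB connection"))
}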

Identifying sources of latency

Going back to the first trace, a few observations stand out:

  1. The call to the customer service is on the critical path, because no other work can be done until we get back the customer data, which includes the location to which we need to dispatch the car.
  2. The driver service retrieves the 10 nearest drivers given the customer location and then queries Redis for each driver’s data in a sequence, which can be seen in the staircase pattern of redis GetDriver spans.
  3. The calls to the route service are neither sequential nor fully parallel: we can see that at most three requests are in progress at any given time, and as soon as one of them ends another one starts. This behavior is typical of a fixed-size executor pool.

What happens if we make many requests to the backend simultaneously? Let’s go to the HotROD UI and click on one of the buttons repeatedly (and quickly).

Executing many requests simultaneously shows increasing latency.

As we can see, the more requests are processed concurrently, the longer it takes for the backend to respond. Let's take a look at the trace of the longest request. We could find it in two ways: we could simply search for all traces and pick the one with the highest latency, represented by the longest cyan-colored title bar.

Multiple traces returned in the search results.

Another way is to search by tags or logs on the spans. The root span emits a final log in which it records the ID of the winning driver as a field:

Driver ID recorded in the log field.

The Jaeger backend indexes all spans by tags and log fields, and we can find that trace by specifying driver:T716217C (no spaces) in the Tags search box:

Searching for trace by tags

Let's open this long, 1.8-second trace and compare it with the 700ms trace we saw earlier.

High latency trace.

The most apparent difference is that the mysql span takes a lot longer than before, 1.37s instead of 305ms. Let’s expand that span and try to find out why.

Inspecting very long mysql span.

In the logs we see that execution was blocked waiting for a lock for over a second. This is clearly a bottleneck in the application, but before we dive into it, let's look at the preceding log record, evidently emitted before getting blocked on the lock. It tells us how many other requests were already queued for this lock, and even gives us the identity of those requests. It is not too hard to imagine a Lock implementation that keeps track of how many goroutines are blocked, but where would it get the identity of the requests? If we expand the preceding span from the customer service, we can see that the only data passed to it via the HTTP request was the customer ID 392. In fact, if we inspect every span in the trace, we will not find any remote call where a request ID like 5038-3 was passed as a parameter.

Customer 392 is the only info passed by the caller.

This "magic" appearance of concurrent request IDs in the logs is due to a feature of the OpenTracing standard called baggage. Distributed tracing works because certain metadata is propagated across thread and process boundaries, throughout the whole call graph, by the tracing instrumentation using the OpenTracing API. One example of such metadata is the trace and span IDs. Another example is baggage, a general-purpose key-value store embedded in every inter-process request. HotROD's Javascript UI stores the session ID and request ID in the baggage before making the request to the backend, and that baggage is made available to every service involved in handling the request, without the need to pass the information explicitly as request parameters. It is an extremely powerful technique that can be used to propagate various useful pieces of information (e.g. security tokens) in the context of a single request throughout the architecture, without needing to change every service to understand what it is propagating. The Pivot Tracing project shows fascinating examples of using baggage for dynamic monitoring.
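
The baggage API itself is tiny. Here is a hedged sketch of how baggage is written and read with opentracing-go; the helper functions are hypothetical, the "session" key matches the one used later in the route service, and the "request" key is assumed. In HotROD the real values are set by the Javascript UI, not by Go code:

import (
    "context"

    "github.com/opentracing/opentracing-go"
)

// setBaggage is a hypothetical helper: in HotROD the real values are set by
// the Javascript UI before the request reaches the Go backend.
func setBaggage(ctx context.Context, sessionID, requestID string) {
    if span := opentracing.SpanFromContext(ctx); span != nil {
        // Baggage travels with the trace context to every downstream span,
        // across both thread and process boundaries.
        span.SetBaggageItem("session", sessionID)
        span.SetBaggageItem("request", requestID) // key name assumed for illustration
    }
}

// requestIDFromBaggage can be called in any service handling the request,
// without the ID ever being passed as an explicit parameter.
func requestIDFromBaggage(ctx context.Context) string {
    if span := opentracing.SpanFromContext(ctx); span != nil {
        return span.BaggageItem("request") // key name assumed for illustration
    }
    return ""
}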

In our example, knowing the identities of the requests stuck in the queue ahead of our long tail request allows us to find traces for those requests and analyze them as well. In real production systems this could lead to the unexpected discovery of a long running request spoiling a lot of other requests that are normally very fast. Later we will see another example of using baggage.

Now that we know that the mysql call gets stuck on a lock, we can easily fix it. As we mentioned earlier, the application does not actually use a MySQL database, only a simulation of one, and the lock is meant to represent a single database connection shared between multiple goroutines. We can find the code in the file examples/hotrod/services/customer/database.go:

// simulate misconfigured connection pool that only gives
// one connection at a time
d.lock.Lock(ctx)
defer d.lock.Unlock()
// simulate db query delay
delay.Sleep(config.MySQLGetDelay, config.MySQLGetDelayStdDev)

Notice how we pass the ctx parameter to the lock object. context.Context is the standard way in Go to pass request-scoped data throughout the application. The OpenTracing span is stored in the context, which allows the lock to inspect it and retrieve the Javascript UI's request ID from the baggage.
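
To illustrate how a lock can know who is waiting on it, here is a minimal, hypothetical sketch of a context-aware mutex. HotROD's actual implementation differs in details, but the key idea is the same: pull the request identity out of the span's baggage before blocking.

import (
    "context"
    "sync"

    "github.com/opentracing/opentracing-go"
)

// trackedMutex is an illustrative sketch, not HotROD's lock type.
type trackedMutex struct {
    mu      sync.Mutex
    waiters sync.Map // request ID -> struct{}, i.e. who is currently waiting
}

func (m *trackedMutex) Lock(ctx context.Context) {
    reqID := "unknown"
    if span := opentracing.SpanFromContext(ctx); span != nil {
        if id := span.BaggageItem("request"); id != "" { // baggage key assumed
            reqID = id
        }
    }
    m.waiters.Store(reqID, struct{}{}) // lets a logger report the waiters ahead of us
    m.mu.Lock()
    m.waiters.Delete(reqID)
}

func (m *trackedMutex) Unlock() { m.mu.Unlock() }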

Let's comment out the lock statements, to pretend that we fixed the code to use a connection pool with enough capacity that our concurrent requests do not have to compete for connections. Let's also decrease the delay config.MySQLGetDelay (in the services/config/config.go file) from 300ms to 100ms, then restart the HotROD app and repeat the experiment.

Improved latency after fixing mysql bottleneck.

The latency still increases as we add more requests to the system, but it no longer grows as dramatically as with the single mysql bottleneck before. Let’s look at one of the longer traces again.

Trace after removing mysql bottleneck

As expected, the mysql span stays at around 100ms regardless of the load. The driver span is not expanded, but it takes the same time as before. The interesting change is in the route calls, which now take more than 50% of the total request time. Previously we saw these requests executing in parallel three at a time, but now we see periods without even a single request running. Clearly, we have contention with other goroutines over some limited resource. We can also see that the gaps occur between the spans of the frontend service, which means the bottleneck is not in the route service itself, but in how the frontend service calls it. Let's look at the code in file services/frontend/best_eta.go, function getRoutes():

// getRoutes calls Route service for each (customer, driver) pair
func (eta *bestETA) getRoutes(
    ctx context.Context,
    customer *customer.Customer,
    drivers []driver.Driver,
) []routeResult {
    results := make([]routeResult, 0, len(drivers))
    wg := sync.WaitGroup{}
    routesLock := sync.Mutex{}
    for _, dd := range drivers {
        wg.Add(1)
        driver := dd // capture loop var
        // Use worker pool to (potentially) execute
        // requests in parallel
        eta.pool.Execute(func() {
            route, err := eta.route.FindRoute(ctx, driver.Location, customer.Location)
            routesLock.Lock()
            results = append(results, routeResult{
                driver: driver.DriverID,
                route:  route,
                err:    err,
            })
            routesLock.Unlock()
            wg.Done()
        })
    }
    wg.Wait()
    return results
}

This function receives a Customer record (with address) and a list of Drivers (with their current locations), and calculates the ETA for each driver. It calls the route service for each driver using an executor pool. So as long as we have enough executors in the pool we should be able to run all calculations in parallel. The size of the executor pool is defined in services/config/config.go:

RouteWorkerPoolSize = 3

The default value of 3 explains why we saw at most three parallel requests in the very first trace we inspected. Let's change it to 100 (remember that goroutines are cheap), restart HotROD, and rerun the test:

Latency results after fixing worker pool bottleneck

We have to click the buttons really fast now, because the requests come back in less than half a second.

Trace after fixing worker pool bottleneck

And as expected, the calls from frontend to route service are all done in parallel now, thus minimizing the overall request latency.
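
For completeness, a fixed-size executor pool of the kind eta.pool appears to be can be sketched with a channel of jobs consumed by a fixed number of worker goroutines. This is a hypothetical illustration, not HotROD's actual pool implementation:

// pool is an illustrative fixed-size executor: Execute blocks until one of
// the size workers is free, which is exactly the queuing behavior visible in
// the first traces when RouteWorkerPoolSize was 3.
type pool struct {
    jobs chan func()
}

func newPool(size int) *pool {
    p := &pool{jobs: make(chan func())}
    for i := 0; i < size; i++ {
        go func() {
            for job := range p.jobs {
                job()
            }
        }()
    }
    return p
}

// Execute hands the job to the first idle worker, blocking if all are busy.
func (p *pool) Execute(job func()) {
    p.jobs <- job
}

With a pool size of 100 and only ten route calculations per request, Execute never has to block, which is why the route spans now run fully in parallel.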

We leave the final optimization of the driver service as an exercise for the reader.

Resource Usage Attribution via Baggage

We mentioned earlier that baggage propagation makes request-scoped metadata available throughout the whole call graph without passing it explicitly as request parameters. One practical use is attributing resource usage deep in the call tree to information only available at the root of the tree, such as the customer or the session that originated the request. To demonstrate this approach, the route service contains code that attributes the CPU time of its calculations to the customer ID and session ID it reads from the baggage. In the services/route/server.go file we can see this code:

func computeRoute(
    ctx context.Context,
    pickup, dropoff string,
) *Route {
    start := time.Now()
    defer func() {
        updateCalcStats(ctx, time.Since(start))
    }()
    // actual calculation
}

As before, we don't pass the customer or session ID explicitly, because they can be retrieved from the baggage via the context. The updateCalcStats function uses the expvar package from Go's standard library to accumulate and expose the results.

var routeCalcByCustomer = expvar.NewMap(
    "route.calc.by.customer.sec",
)
var routeCalcBySession = expvar.NewMap(
    "route.calc.by.session.sec",
)
var stats = []struct {
    expvar  *expvar.Map
    baggage string
}{
    {routeCalcByCustomer, "customer"},
    {routeCalcBySession, "session"},
}

func updateCalcStats(ctx context.Context, delay time.Duration) {
    span := opentracing.SpanFromContext(ctx)
    if span == nil {
        return
    }
    delaySec := float64(delay/time.Millisecond) / 1000.0
    for _, s := range stats {
        key := span.BaggageItem(s.baggage)
        if key != "" {
            s.expvar.AddFloat(key, delaySec)
        }
    }
}

If we navigate to http://127.0.0.1:8083/debug/vars, we can see route.calc.by.* entries that give us the breakdown of time (in seconds) spent in the calculation on behalf of each customer and some UI sessions.

route.calc.by.customer.sec: {
Amazing Coffee Roasters: 1.479,
Japanese Deserts: 2.019,
Rachel's Floral Designs: 5.938,
Trom Chocolatier: 2.542
},
route.calc.by.session.sec: {
0861: 9.448,
6108: 2.530
},

How Do We Instrument with OpenTracing

Let's see how this instrumentation looks in the code, starting with the driver service, which serves its endpoint over TChannel:

// NewServer creates a new driver.Server
func NewServer(
    hostPort string,
    tracer opentracing.Tracer,
    metricsFactory metrics.Factory,
    logger log.Factory,
) *Server {
    channelOpts := &tchannel.ChannelOptions{
        Tracer: tracer,
    }
    ch, err := tchannel.NewChannel("driver", channelOpts)
    if err != nil {
        logger.Bg().Fatal("Cannot create TChannel", zap.Error(err))
    }
    server := thrift.NewServer(ch)

The only line related to OpenTracing is the one passing the tracer in the channel options. The function implementing the endpoint does not have any reference to OpenTracing either (services/driver/server.go):

// FindNearest implements Thrift interface TChanDriver
func (s *Server) FindNearest(
    ctx thrift.Context,
    location string,
) ([]*driver.DriverLocation, error) {
    s.logger.For(ctx).Info("Searching for nearby drivers",
        zap.String("location", location))
    driverIDs := s.redis.FindDriverIDs(ctx, location)

This is because TChannel, an open source RPC framework, directly integrates with OpenTracing, another open source project. When we create an application, we just need to instantiate a concrete Tracer and pass it in.

We can see a similar lack of instrumentation in the HTTP endpoints. In fact, the only place the OpenTracing instrumentation is used explicitly is in the functions that simulate Redis and MySQL, because there we are not actually making any RPC requests over any framework.
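
What does that explicit instrumentation look like? The sketch below is a hedged approximation rather than the actual simulator code (the function name, tags, and simulated delay are illustrative): since no RPC framework is involved, the span has to be started and finished by hand around the simulated work.

import (
    "context"
    "time"

    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
)

// getDriver sketches an explicitly instrumented, simulated Redis lookup.
func getDriver(ctx context.Context, driverID string) string {
    // Create a child span of whatever span is already carried in ctx.
    span, _ := opentracing.StartSpanFromContext(ctx, "GetDriver")
    defer span.Finish()
    ext.SpanKindRPCClient.Set(span) // pretend this is a client-side RPC
    ext.PeerService.Set(span, "redis")
    span.SetTag("param.driverID", driverID) // tag name assumed for illustration

    time.Sleep(10 * time.Millisecond) // simulated Redis latency
    return "T700000C"                 // simulated driver record
}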

What About Vendor Neutrality

Since OpenTracing is just an API, the application only refers to a concrete tracer implementation in the place where it instantiates one. In HotROD this is a single initialization function that creates a Jaeger tracer; switching to another OpenTracing-compatible tracer would mean changing only this function:

// Init creates a new instance of Jaeger tracer.
func Init(
    serviceName string,
    metricsFactory metrics.Factory,
    logger log.Factory,
) opentracing.Tracer {
    cfg := config.Configuration{
        Sampler: &config.SamplerConfig{
            Type:  "const",
            Param: 1,
        },
        Reporter: &config.ReporterConfig{
            LogSpans:            false,
            BufferFlushInterval: 1 * time.Second,
        },
    }
    tracer, _, err := cfg.New(
        serviceName,
    )
    if err != nil {
        logger.Bg().Fatal(
            "cannot initialize Jaeger Tracer",
            zap.Error(err),
        )
    }
    return tracer
}

What About Logging

In the FindNearest function above we saw an example of a log statement:

s.logger.For(ctx).Info(
    "Searching for nearby drivers",
    zap.String("location", location),
)

The expression zap.String(k, v) is part of the structured logging API of Zap (https://github.com/uber-go/zap), a fast logging library for Go, so there is nothing specific to OpenTracing here. The trick we use in HotROD is a wrapper around the Zap logger that does not expose the normal logging methods directly. Instead it provides two methods:

  • For(context.Context) returns a contextualized logger for request-scoped logs;
  • Bg() returns a background logger for logging events that are not scoped to a request, such as application start-up sequence.

Accessing a contextualized logger always requires a Context object. Once we have the context, all the normal log methods can be delegated to the regular Zap logger and, at the same time, translated into calls to the logging methods on the Span. The source code can be found in pkg/log/spanlogger.go, e.g.:

func (sl spanLogger) Info(msg string, fields ...zapcore.Field) {
    sl.logToSpan("info", msg, fields...)
    sl.logger.Info(msg, fields...)
}
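
The other half of the wrapper is the factory that decides which logger to hand out. Below is a hedged, self-contained sketch of what For and Bg could look like; HotROD's real Factory and spanLogger types in pkg/log differ in details.

import (
    "context"

    "github.com/opentracing/opentracing-go"
    otlog "github.com/opentracing/opentracing-go/log"
    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

// Logger is the reduced interface such a wrapper could expose (sketch).
type Logger interface {
    Info(msg string, fields ...zapcore.Field)
}

// Factory is an illustrative stand-in for HotROD's pkg/log factory.
type Factory struct {
    logger *zap.Logger
}

// Bg returns a plain logger for events not tied to any request.
func (f Factory) Bg() Logger { return f.logger }

// For returns a logger that mirrors every statement onto the span carried in
// ctx, if there is one; otherwise it falls back to the background logger.
func (f Factory) For(ctx context.Context) Logger {
    if span := opentracing.SpanFromContext(ctx); span != nil {
        return ctxLogger{span: span, logger: f.logger}
    }
    return f.Bg()
}

// ctxLogger is a simplified analog of the spanLogger shown above.
type ctxLogger struct {
    span   opentracing.Span
    logger *zap.Logger
}

func (l ctxLogger) Info(msg string, fields ...zapcore.Field) {
    l.span.LogFields(otlog.String("event", msg)) // simplified: the real code also converts the fields
    l.logger.Info(msg, fields...)
}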

OpenTracing Is Not Just for Tracing

The Jaeger tracer for Go has an option to emit metrics for RPC, which is enabled in the HotROD demo. If we go back to the expvar endpoint http://127.0.0.1:8083/debug/vars, we can see various metrics collected for all endpoints in the application.

RPC metrics for /customer endpoint

For example, we can see that there were 24 successful requests to the /customer endpoint and zero errors. The expvar /debug/vars endpoint is a somewhat primitive way to display RPC metrics, so here is another example of the same kind of metrics charted on Uber's internal dashboard. All of these metrics are emitted by the OpenTracing instrumentation and the Jaeger tracer.

Example of charting metrics emitted by Jaeger tracer

Conclusion

In this tutorial we used the HotROD demo application and the Jaeger backend to demonstrate the following capabilities of OpenTracing and distributed tracing:

  • Off the shelf instrumentation & vendor neutrality
  • Monitoring transactions spanning many microservices
  • Measuring latency of operations
  • Root cause analysis with span tags and logs
  • Contextualized span logs
  • Baggage propagation
  • Automatic generation of RPC metrics

The source code for the demo, as well as the Jaeger backend, is on GitHub: https://github.com/uber/jaeger. If something in this tutorial does not work, please open an issue.
