Tracing HTTP request latency in Go with OpenTracing

In Go 1.7 we have a new package net/http/httptrace that provides a convenient mechanism for observing what happens during an HTTP request. In this article I will show how it can be used in the context of distributed tracing, by using OpenTracing API to instrument a client and a server and visualize the results in Zipkin UI.

First, a bit of an introduction.

What is Distributed Tracing and OpenTracing?

Distributed tracing is a technique for monitoring and profiling systems built on microservices architecture, popularized by such systems as X-Trace, Google’s Dapper and Twitter’s Zipkin. At its foundation is the idea of distributed context propagation, which involves associating certain metadata with every request as it enters the system, and propagating that metadata across thread and process boundaries as the request execution fans out to other microservices. If we assign a unique ID to each inbound request and carry it as part of the distributed context, we can then stitch together various profiling data from multiple threads and multiple processes into a single “trace” that represents the execution of the request by our system.

Distributed tracing requires instrumentation of the application code (or the frameworks it uses) with profiling hooks and a context propagation mechanism. When we, at Uber, started building our distributed tracing system, we soon realized that there was no good API we could offer to developers that was internally consistent across programming languages, and was not strongly tied to a particular tracing system. It turned out we were not alone in that thinking, and in October 2015 a new community was formed that gave birth to the OpenTracing API, an open, vendor-neutral, language-agnostic standard for distributed tracing. You can read more about it in Ben Sigelman’s article about the motivations and design principles behind OpenTracing.

Show me the code already!

Alright then. Sheeesh…

import (
"net/http"
"net/http/httptrace"

"github.com/opentracing/opentracing-go"
"github.com/opentracing/opentracing-go/log"
"golang.org/x/net/context"
)

// We will talk about this later
var tracer opentracing.Tracer

func AskGoogle(ctx context.Context) error {
// retrieve current Span from Context
var parentCtx opentracing.SpanContext
parentSpan := opentracing.SpanFromContext(ctx);
if parentSpan != nil {
parentCtx = parentSpan.Context()
}

// start a new Span to wrap HTTP request
span := tracer.StartSpan(
"ask google",
opentracing.ChildOf(parentCtx),
)

// make sure the Span is finished once we're done
defer span.Finish()

// make the Span current in the context
ctx = opentracing.ContextWithSpan(ctx, span)

// now prepare the request
req, err := http.NewRequest("GET", "http://google.com", nil)
if err != nil {
return err
}

// attach ClientTrace to the Context, and Context to request
trace := NewClientTrace(span)
ctx = httptrace.WithClientTrace(ctx, trace)
req = req.WithContext(ctx)

// execute the request
res, err := http.DefaultClient.Do(req)
if err != nil {
return err
}

// Google home page is not too exciting, so ignore the result
res.Body.Close()
return nil
}

There are a few things to note here:

  • For now I side-stepped the question of the tracer variable initialization. I will come back to it later.
  • The AskGoogle function accepts context.Context object. This is the recommended way in Go for developing distributed applications, as the Context object is what allows distributed context propagation.
  • We assume that the context already contains a parent tracing Span. A Span in the OpenTracing API is used to represent a unit of work performed by a microservice. An HTTP call is a good example of an operation that can be wrapped in a tracing Span. When we run a service that handles inbound requests, the service normally would create a tracing span for each request and store it in the Context, so that it is available when we make a downstream call to another service (in our example to google.com).
  • We start a new child Span to wrap the outbound HTTP call. It is ok if the parent span is missing, by the way.
  • Finally, before making the HTTP request we instantiate a ClientTrace and attach it to the request.

ClientTrace struct is the basic building block of httptrace. It allows us to register callbacks that will be executed by the HTTP client during the lifecycle of an HTTP request. For example, ClientTrace struct has methods like these:

type ClientTrace struct {
...
// DNSStart is called when a DNS lookup begins.
DNSStart func(DNSStartInfo)
        // DNSDone is called when a DNS lookup ends.
DNSDone func(DNSDoneInfo)
...
}

We create an instance of this struct in the NewClientTrace method:

func NewClientTrace(span opentracing.Span) *httptrace.ClientTrace {
trace := &clientTrace{span: span}
return &httptrace.ClientTrace {
DNSStart: trace.dnsStart,
DNSDone: trace.dnsDone,
}
}

// clientTrace holds a reference to the Span and
// provides methods used as ClientTrace callbacks
type clientTrace struct {
span opentracing.Span
}

func (h *clientTrace) dnsStart(info httptrace.DNSStartInfo) {
h.span.LogKV(
log.String("event", "DNS start"),
log.Object("host", info.Host),
)
}

func (h *clientTrace) dnsDone(httptrace.DNSDoneInfo) {
h.span.LogKV(log.String("event", "DNS done"))
}

We register two callbacks for DNSStart and DNSDone events implemented by a private struct clientTrace that holds a reference to the tracing Span. In the callback methods we use the Span’s key-value logging API to record information about the events, along with the timestamp that is captured implicitly by the Span itself.

Didn’t you say something about the UI?

Just in time. The way OpenTracing API works is once the Finish() method on the tracing Span is called, the data captured by the span is sent to the tracing system backend, usually asynchronously in the background. Then we can use the tracing system UI to find the trace and visualize it on the timeline. However, the example above was only meant to illustrate the principle of using OpenTracing with httptrace. For the real working example we will use an existing library https://github.com/opentracing-contrib/go-stdlib, from Dominik Honnef, that does most of the instrumentation for us. Using this library our client code does not need to worry about tracing the actual HTTP call. However, we still want to create a top-level tracing Span to represent the overall execution of the client application and record any errors to it.

package main

import (
"fmt"
"io/ioutil"
"log"
"net/http"

"github.com/opentracing-contrib/go-stdlib/nethttp"
"github.com/opentracing/opentracing-go"
"github.com/opentracing/opentracing-go/ext"
otlog "github.com/opentracing/opentracing-go/log"
"golang.org/x/net/context"
)

func runClient(tracer opentracing.Tracer) {
// nethttp.Transport from go-stdlib will do the tracing
c := &http.Client{Transport: &nethttp.Transport{}}

// create a top-level span to represent full work of the client
span := tracer.StartSpan(client)
span.SetTag(string(ext.Component), client)
defer span.Finish()
ctx := opentracing.ContextWithSpan(context.Background(), span)

req, err := http.NewRequest(
"GET",
fmt.Sprintf("http://localhost:%s/", *serverPort),
nil,
)
if err != nil {
onError(span, err)
return
}

req = req.WithContext(ctx)
// wrap the request in nethttp.TraceRequest
req, ht := nethttp.TraceRequest(tracer, req)
defer ht.Finish()

res, err := c.Do(req)
if err != nil {
onError(span, err)
return
}
defer res.Body.Close()
body, err := ioutil.ReadAll(res.Body)
if err != nil {
onError(span, err)
return
}
fmt.Printf("Received result: %s\n", string(body))
}

func onError(span opentracing.Span, err error) {
// handle errors by recording them in the span
span.SetTag(string(ext.Error), true)
span.LogKV(otlog.Error(err))
log.Print(err)
}

The client code above calls a local server. Let’s implement it, too.

package main

import (
"fmt"
"io"
"log"
"net/http"
"time"

"github.com/opentracing-contrib/go-stdlib/nethttp"
"github.com/opentracing/opentracing-go"
)

func getTime(w http.ResponseWriter, r *http.Request) {
log.Print("Received getTime request")
t := time.Now()
ts := t.Format("Mon Jan _2 15:04:05 2006")
io.WriteString(w, fmt.Sprintf("The time is %s", ts))
}

func redirect(w http.ResponseWriter, r *http.Request) {
http.Redirect(w, r,
fmt.Sprintf("http://localhost:%s/gettime", *serverPort), 301)
}

func runServer(tracer opentracing.Tracer) {
http.HandleFunc("/gettime", getTime)
http.HandleFunc("/", redirect)
log.Printf("Starting server on port %s", *serverPort)
http.ListenAndServe(
fmt.Sprintf(":%s", *serverPort),
// use nethttp.Middleware to enable OpenTracing for server
nethttp.Middleware(tracer, http.DefaultServeMux))
}

Note that the client makes a request to the root endpoint “/”, but the server redirects it to the “/gettime” endpoint. Doing so allows us to better illustrate how the trace will be captured in the tracing system.

Let’s run the example

All source code used in this article is available on Github. I assume you have a local installation of Go 1.7, as well as a running Docker that we will use to run the Zipkin server.

The demo project uses glide for dependency management, please install it first. On Mac OS, for example, you can do:

$ brew install glide

Then get the demo code, build and run it:

$ go get github.com/yurishkuro/opentracing-go-nethttp-demo
$ cd $GOPATH/src/github.com/yurishkuro/opentracing-go-nethttp-demo
$ glide install
# ... lots of output ...
$ go build .
$ ./opentracing-go-nethttp-demo
2016/10/06 00:10:54 Starting server on port 8000

Now in another terminal, let’s start the Zipkin server

$ docker run -d -p 9410-9411:9410-9411 openzipkin/zipkin:1.12.0
Unable to find image 'openzipkin/zipkin:1.12.0' locally
1.12.0: Pulling from openzipkin/zipkin
4d06f2521e4f: Already exists
93bf0c6c4f8d: Already exists
a3ed95caeb02: Pull complete
3db054dce565: Pull complete
9cc214bea7a6: Pull complete
Digest: sha256:bf60e4b0ba064b3fe08951d5476bf08f38553322d6f640d657b1f798b6b87c40
Status: Downloaded newer image for openzipkin/zipkin:1.12.0
da9353ac890e0c0b492ff4f52ff13a0dd12826a0b861a67cb044f5764195e005

If you do not have Docker, another way to run Zipkin server is directly from the jar:

$ wget -O zipkin.jar 'https://search.maven.org/remote_content?g=io.zipkin.java&a=zipkin-server&v=LATEST&c=exec'
$ java -jar zipkin.jar

Open the UI:

open http://localhost:9411/

You should see something like the screenshot below. Note that the Service Name dropdown is empty since we did not generate any traces yet.

UI home pag

Now open another terminal and run:

$ cd $GOPATH/src/github.com/yurishkuro/opentracing-go-nethttp-demo
$ ./opentracing-go-nethttp-demo -actor client
Received result: The time is Thu Oct 6 00:23:22 2016

If you reload the UI page, you should see “client” appear in the first dropdown.

UI home page after a trace was generated.

Click the Find Traces button, and you should see a single trace.

Search results showing one trace.

Click on the trace.

Trace timeline view.

Here we see the following spans:

  1. The top-level (root) span called “client” generated by the service also called “client”, which spans the whole timeline for 3.957ms.
  2. The next-level (child) span called “http client”, also generated by the service “client”. This span was automatically created by the go-stdlib library and spans the overall HTTP session.
  3. The two spans called “http get” generated by the service called “server”. This is a bit misleading as each of those spans actually internally consists of two parts, the data submitted by the client, and the data submitted by the server. Zipkin UI always picks the name of the receiving service to display on the left side. These two spans represent the first request to “/” endpoint, and after receiving the redirect response the second request to “/gettime” endpoint.

Notice also that the last two spans display white dots on the timeline. If we mouse-over one of those dots we will see that they are actually the events captured by the ClientTrace, such as DNSStart:

Mouse-over on the white dots displays events reported by httptrace

You can also click on each span to find more details, including the timestamped logs and key-value tags. For example, clicking on the first “http get” span shows the following popup:

3rd span details, part 1

Here we see two types of events. The overall start/end events from the point of view of client and server: Client Send (request), Server Receive (request), Server Send (response), Client Receive (response). And between them we see other events logged to the span by the go-stdlib instrumentation as they were reported by httptrace, such as the DNS lookup that started at 0.16ms and finished at 2.222ms, establishing the connection, and sending/receiving the request/response data.

3rd span details, part 2

This is the continuation of the same popup showing the span’s key/value tags. The tags are not associated with any timestamp and simply provide metadata about the span. Here we can see at which URL the request was made, the 301 response code (redirect) that was received, the hostname where the client was run (masked), and some information about the tracer implementation, like the client library version “Go-1.6”.

The details of the 4th span are similar. A couple of things to note is that the 4th span was much shorter because there was no DNS lookup delay, and it was against the /gettime endpoint with status code 200

4th span details

Did you forget about the tracer variable?

I almost did. The tracer is the actual implementation of the OpenTracing API. In my example I used https://github.com/uber/jaeger-client-go, which is an OpenTracing-compatible client library from Jaeger, Uber’s distributed tracing system. Jaeger’s client libraries are open sourced, and we plan to open source the backend and the UI by the end of 2016. The main program in this example looks like this:

package main

import (
"flag"
"log"

"github.com/uber/jaeger-client-go"
"github.com/uber/jaeger-client-go/transport/zipkin"
)

var (
zipkinURL = flag.String("url",
"http://localhost:9411/api/v1/spans", "Zipkin server URL")
serverPort = flag.String("port", "8000", "server port")
actorKind = flag.String("actor", "server", "server or client")
)

const (
server = "server"
client
= "client"
)

func main() {
flag.Parse()

if *actorKind != server && *actorKind != client {
log.Fatal("Please specify '-actor server' or '-actor client'")
}

// Jaeger tracer can be initialized with a transport that will
// report tracing Spans to a Zipkin backend
transport, err := zipkin.NewHTTPTransport(
*zipkinURL,
zipkin.HTTPBatchSize(1),
zipkin.HTTPLogger(jaeger.StdLogger),
)
if err != nil {
log.Fatalf("Cannot initialize HTTP transport: %v", err)
}
// create Jaeger tracer
tracer, closer := jaeger.NewTracer(
*actorKind,
jaeger.NewConstSampler(true), // sample all traces
jaeger.NewRemoteReporter(transport, nil),
)

if *actorKind == server {
runServer(tracer)
return
}

runClient(tracer)

// Close the tracer to guarantee that all spans that could
// be still buffered in memory are sent to the tracing backend
closer.Close()
}

The Jaeger tracing system itself is different from Zipkin, but we are currently using a Zipkin-compatible Thrift data model, which allows me to use the Zipkin server to show the trace.

Conclusion

We used a combination of several components to illustrate a simple tracing of an HTTP request, both across the process boundaries and with a deep dive into the actual HTTP request execution:

  1. Standard net/http/httptrace package,
  2. https://github.com/opentracing-contrib/go-stdlib instrumentation library,
  3. Jaeger Go client https://github.com/uber/jaeger-client-go that implements OpenTracing API, and
  4. Zipkin server and UI from https://github.com/openzipkin/zipkin.

The source code of the demo is available here: https://github.com/yurishkuro/opentracing-go-nethttp-demo.