Distributed Tracing and Monitoring using OpenCensus

Monitoring and troubleshooting are among the hardest aspects of operationalizing microservice-based applications, especially those built from off-the-shelf (OTS) components and third-party services. Proper instrumentation can make a big difference in your operations.

Have you ever thought about instrumenting your application to troubleshoot a problem, or to monitor the health of the micro-services that make it up? If you haven’t, you are not alone. I often see inertia when it comes to instrumenting an app for better observability. When unfamiliar with the available libraries, one wonders: which library do I use? Which backend / APM do I use? The examples are not in the language I’m interested in, and so on.

For those of you who attended Google Cloud Next ’18, you may have seen the “Hipster Shop” demo used in many different talks. I have retrofitted the demo, which consists of 10 micro-services written in multiple languages, to show how easily you can instrument your app with OpenCensus (OC). I’m hoping this blog will help some of you get started and will serve as a reference example.

Before I dive into the demo app let me provide a brief introduction to OpenCensus.

OpenCensus

OpenCensus (OC) is a single distribution of open-source libraries that provides observability, via distributed tracing and monitoring, for micro-services and monoliths alike. OC is an open-source project backed by Google and Microsoft and has a growing community behind it. The community works actively to make sure your favorite frameworks, libraries, and modules are automatically instrumented without you having to do anything. The OC library is supported in the following languages:

Java, Go, Python, C++, Node.js, Erlang, Ruby, PHP, C#

OpenCensus lets you collect metrics and traces once and then export them to a variety of backends such as Prometheus, Stackdriver Tracing and Monitoring, Datadog, Graphite, Zipkin, Jaeger, and Microsoft Azure Application Insights.

OpenCensus Tracing

In a distributed system it is important to know how a request flows from one service to another and how long it takes to perform a task in each service. OpenCensus Tracing provides a mechanism to collect data across multiple spans and later join these spans into a single trace for a request.

These traces can be annotated for additional information that can be useful for troubleshooting.

OC tracing is integrated with gRPC (Java and Go) and HTTP (Go, Python), which makes it very easy to enable tracing in your distributed application. This is demonstrated in this blog using a distributed retail application called Hipster Shop.

OpenCensus Stats

Stats/metrics are very useful in determining the health of the overall application, and of individual micro-services, from various perspectives: for example, the mean and 99th-percentile request latency, or the number of cache hits vs. misses. OC Stats provides a mechanism to collect these stats and aggregate them. It also supports tagging, which can be used to group and filter them.

Again, the OC integration with gRPC (Java and Go) and HTTP (Go) makes it very easy to enable the common metrics, and you can certainly add your own custom metrics. The Hipster Shop demo application will demonstrate how easy it is to start collecting stats/metrics.

Demo Application: Hipster Shop

Hipster Shop is a demo application composed of 10 micro-services written in multiple languages (Go, Java, Python, Node.js, and C#). In this blog we will show how the Hipster Shop application uses OpenCensus to enable tracing and stats.

Click here for instructions on how to download and run the Hipster Shop application.

Enabling Tracing

HTTP Go

Collecting traces with the HTTP Go integration requires only two steps. See the code excerpt below from the Frontend micro-service.

Step 1: Initialize the net/http handler

Initialize the handler with the OpenCensus Go HTTP plugin (ochttp).

import (
	...
	"go.opencensus.io/plugin/ochttp"
	"go.opencensus.io/plugin/ochttp/propagation/b3"
	...
)

func main() {
	...
	var handler http.Handler = r
	handler = &logHandler{log: log, next: handler}
	handler = ensureSessionID(handler)
	// add opencensus instrumentation
	handler = &ochttp.Handler{
		Handler:     handler,
		Propagation: &b3.HTTPFormat{}}

	log.Infof("starting server on " + addr + ":" + srvPort)
	log.Fatal(http.ListenAndServe(addr+":"+srvPort, handler))
}

Step 2: Register your exporter

Register an exporter to export traces to the backend of your choice. For this demo, I have used Jaeger.

func initJaegerTracing(log logrus.FieldLogger) {
	// Register the Jaeger exporter to be able to retrieve
	// the collected spans.
	exporter, err := jaeger.NewExporter(jaeger.Options{
		Endpoint: "http://jaeger:14268",
		Process: jaeger.Process{
			ServiceName: "frontend",
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	trace.RegisterExporter(exporter)
}


func initTracing(log logrus.FieldLogger) {
	// This is a demo app with low QPS. trace.AlwaysSample() is used here
	// to make sure traces are available for observation and analysis.
	// In a production environment or high QPS setup please use
	// trace.ProbabilitySampler set at the desired probability.
	trace.ApplyConfig(trace.Config{
		DefaultSampler: trace.AlwaysSample(),
	})
	initJaegerTracing(log)
	...
}

NOTE: trace.AlwaysSample() is used to sample all traces for demo purposes only. In a production setup, use the default sampler or an appropriate sampler for your needs.
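A production-oriented configuration might instead look like this (the 1% rate is an arbitrary example; pick a probability that matches your traffic volume):

```go
// Sample roughly 1% of requests; at high QPS this still yields plenty of
// traces while keeping export overhead and backend storage manageable.
trace.ApplyConfig(trace.Config{
	DefaultSampler: trace.ProbabilitySampler(0.01),
})
```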

gRPC Go

Tracing is enabled by default in gRPC Go, so the gRPC Go integration only requires registering an exporter to start collecting traces. The Product Catalog micro-service provides the code snippet.

Step 1: Register exporter

Registering an exporter works the same as it does for HTTP Go.

func initJaegerTracing() {
	// Register the Jaeger exporter to be able to retrieve
	// the collected spans.
	exporter, err := jaeger.NewExporter(jaeger.Options{
		Endpoint: "http://jaeger:14268",
		Process: jaeger.Process{
			ServiceName: "productcatalogservice",
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	trace.RegisterExporter(exporter)
}


func initTracing() {
	// This is a demo app with low QPS. trace.AlwaysSample() is used here
	// to make sure traces are available for observation and analysis.
	// In a production environment or high QPS setup please use
	// trace.ProbabilitySampler set at the desired probability.
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
	initJaegerTracing()
}


gRPC Java

Similar to gRPC Go, gRPC Java also has tracing enabled by default; simply register the exporter. The AdService micro-service provides the code excerpt demonstrating OpenCensus for Java.

Step 1: Register exporter

Register exporter to start exporting traces.

import io.opencensus.exporter.trace.jaeger.JaegerTraceExporter;

/** Main launches the server from the command line. */
public static void main(String[] args) throws IOException, InterruptedException {
	...
	// Register Jaeger Tracing.
	JaegerTraceExporter.createAndRegister("http://jaeger:14268/api/traces", "adservice");
	...
	final AdService service = AdService.getInstance();
	service.start();
	service.blockUntilShutdown();
}

NOTE: trace.AlwaysSample() is not used here because the Frontend service has already made the sampling decision, and the sampling option is propagated with the request.

With the simple steps above, you can get distributed traces of your requests. Here is a sample trace from a Hipster Shop GET /product request.

The trace above for ‘PlaceOrder’ shows the latency contribution of each span to the overall latency of the request. The ProductCatalogService.GetProduct span is the most expensive in terms of latency.

Enabling Stats

HTTP and gRPC Go

Enabling stats is very similar to enabling tracing. The Frontend micro-service again provides the excerpt: it receives HTTP requests and invokes other micro-services over gRPC.

Step 1: Initialize http handler (Incoming Requests)

Initialize the handler with the OpenCensus Go HTTP plugin (ochttp).

import (
	"go.opencensus.io/plugin/ochttp"
	"go.opencensus.io/plugin/ochttp/propagation/b3"
)

func main() {
	var handler http.Handler = r
	handler = &logHandler{log: log, next: handler}
	handler = ensureSessionID(handler)
	handler = &ochttp.Handler{
		Handler:     handler,
		Propagation: &b3.HTTPFormat{}}

	log.Infof("starting server on " + addr + ":" + srvPort)
	log.Fatal(http.ListenAndServe(addr+":"+srvPort, handler))
}

Step 2: Initialize grpc handler (Outgoing Requests to backend services)

The Frontend initiates requests over gRPC (as a gRPC client). Initialize the connection with the OpenCensus gRPC plugin (ocgrpc).

import (
"go.opencensus.io/plugin/ocgrpc"
)

func mustConnGRPC(ctx context.Context, addr string) *grpc.ClientConn {
	conn, err := grpc.DialContext(ctx, addr,
		grpc.WithInsecure(),
		grpc.WithStatsHandler(&ocgrpc.ClientHandler{}))
	if err != nil {
		panic(errors.Wrapf(err, "grpc: failed to connect %s", addr))
	}
	return conn
}

Step 3: Register exporter and enable HTTP and gRPC plugin views

We’ll again register a stats exporter to export metrics to our backend of choice; for demo purposes, I am using Prometheus. Then enable the default ochttp and ocgrpc views.

func initPrometheusStatsExporter(log logrus.FieldLogger) *prometheus.Exporter {
	exporter, err := prometheus.NewExporter(prometheus.Options{})
	if err != nil {
		log.Fatal("error registering prometheus exporter")
		return nil
	}
	view.RegisterExporter(exporter)
	return exporter
}

func startPrometheusExporter(log logrus.FieldLogger, exporter *prometheus.Exporter) {
	addr := ":9090"
	log.Infof("starting prometheus server at %s", addr)
	http.Handle("/metrics", exporter)
	log.Fatal(http.ListenAndServe(addr, nil))
}

func initStats(log logrus.FieldLogger) {
	// Start prometheus exporter
	exporter := initPrometheusStatsExporter(log)
	go startPrometheusExporter(log, exporter)

	if err := view.Register(ochttp.DefaultServerViews...); err != nil {
		log.Fatal("error registering default http server views")
	}
	if err := view.Register(ocgrpc.DefaultClientViews...); err != nil {
		log.Fatal("error registering default grpc client views")
	}
}

gRPC Java

The AdService micro-service is excerpted again to show OpenCensus stats in Java.

Step 1: Enable the gRPC views

/** Main launches the server from the command line. */
public static void main(String[] args) throws IOException, InterruptedException {
	...
	// Registers all RPC views.
	RpcViews.registerAllViews();
	...
}

Step 2: Register a stats exporter

Register the exporter to start exporting stats.

import io.opencensus.exporter.stats.prometheus.PrometheusStatsCollector;

public static void main(String[] args) throws IOException, InterruptedException {
	...
	// Register the Prometheus exporter and export metrics on a Prometheus HTTPServer.
	PrometheusStatsCollector.createAndRegister();
	HTTPServer prometheusServer = new HTTPServer(9090, true);
	...
	final AdService service = AdService.getInstance();
	service.start();
	service.blockUntilShutdown();
}

Now you can get stats for your services. Here are a few sample charts created from the data exported by OpenCensus.

Overall Latency Chart from HTTP Server Views

The 99th-percentile overall latency is ~2–4 seconds. Breaking this latency down by micro-service can help identify which service is causing the high latency.

Micro-services Latency Chart from gRPC Client Views

The PlaceOrder request is the largest contributor to the overall latency.

The tracing chart from the Jaeger dashboard depicts what contributes to the high latency of the PlaceOrder request: it is, in fact, multiple calls to ProductCatalogService.GetProduct, each taking about 90–100 ms.

The following change to the GetProduct method reduces the latency by calling parseCatalog() once instead of on every loop iteration.

func (p *productCatalog) GetProduct(ctx context.Context, req *pb.GetProductRequest) (*pb.Product, error) {
	var found *pb.Product
-	for i := 0; i < len(parseCatalog()); i++ {
-		if req.Id == parseCatalog()[i].Id {
-			found = parseCatalog()[i]
+	products := parseCatalog()
+	for i := 0; i < len(products); i++ {
+		if req.Id == products[i].Id {
+			found = products[i]
		}
	...

After the above change, both the overall latency and the PlaceOrder request latency have dropped (see the charts below).

You can also monitor the bandwidth consumed by your service, and the request/response rates, broken down by HTTP method and HTTP status respectively.

Bandwidth Chart from HTTP Server Views

HTTP Response Rate by Status

Conclusion

With OpenCensus and a few lines of instrumentation, you can improve your observability significantly. Why wait? Download the demo and get started today.