Sanket Gawande
Aug 2

In the world of distributed architectures and services, debuggability and observability become critical. Without these fundamental capabilities, services break down at scale, and finding the exact problem is like searching for a needle in a haystack. At ShareChat, we hit peak scale daily and need to make sure we always have both high-level and machine-level data at our fingertips. We have done a lot of work on these topics in the recent past and will present it over multiple blog posts.

In this blog, we will focus on Distributed Tracing in particular: our build vs. buy analysis, the choices we made, the rationale behind them, and finally a preview of our implemented solution in action.

Overview

It is no secret that online experiences are built out of responses from a multitude of services. Some people call them micro-services to make them sound cool. Each service’s response can potentially dictate the end user’s experience, depending on whether it fails open or not. The deeper the tech stack, the more difficult it is to trace an error back to the _exact_ service causing it. Thus, it is necessary to track the lifetime of a request to understand system bottlenecks and potential problems, and to narrow down to the right service while debugging.

The “3:00am Problem”

The story goes that before Distributed Tracing was implemented, the Engineering team was facing an interesting conundrum. When production issues occurred in the wee hours of the morning, the first team to pick up the call would be the internal Platform team, since their alerts would be the first to go off. After the initial debugging, the Platform team would wake up the owners of the next layer of services where the errors were showing up. That team, after their debugging, would wake up the next team. This continued until most of the company was awake, all sitting for multiple hours until the root cause was identified.

At ShareChat, we call this the classic “3:00am problem” (3AMP).

If you have troubleshot distributed systems, you will know that the bulk of the time is spent trying to find out where the problem emanates from. Once the system at fault has been identified, the fix itself is usually straightforward. Distributed Tracing helps narrow the problem down to the “where” (i.e. the source of the problem) so that the “what” (i.e. the cause of the problem) can be taken up more quickly.

Production issues will always continue to happen. Although the ideal solution to the 3AMP is to not have it happen at all, things will always go wrong. At ShareChat, our practical answer to the 3AMP is that even if we are woken up at 3:00am, we should not need to wake up the entire company; we should be able to reach the team responsible for the service causing the error directly. This minimizes churn, improves employee morale and ends up benefiting the end users who we really care about.

Here is what we will cover in the rest of this post:

  • Distributed tracing and its benefits
  • Stackdriver versus Jaeger
  • How to enable Stackdriver tracing in any Node.js project
  • Tagging a request with custom labels
  • Distributed tracing at ShareChat

Distributed Tracing in a nutshell

Everyone understands what a standalone application’s stack trace is. Distributed Tracing extends the concept over the network and into the next application layer. This is not to be confused with network tracing using traceroute or ping. It is better understood as being able to seamlessly continue debugging a distributed application as if it were a single monolithic application! Here, we can track the entire lifecycle of a request along with metadata like the latency between each hop, error conditions, etc., and this information can later be reassembled to get a complete picture of the distributed application’s behaviour at runtime.
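Purely as an illustration (the structure and field names below are ours, not Stackdriver’s wire format), a reassembled trace for a single request can be pictured like this:

// Purely illustrative data structure (not Stackdriver's wire format): a trace
// reassembled from the spans reported by each service a request passed through.
const exampleTrace = {
  traceId: 'f067046cd6b44895',               // shared by every hop of the request
  spans: [
    { spanId: 1, parentSpanId: 0, service: 'api-gateway',  name: 'GET /feed', durationMs: 182 },
    { spanId: 2, parentSpanId: 1, service: 'feed-service', name: 'getFeed',   durationMs: 160 },
    { spanId: 3, parentSpanId: 2, service: 'user-service', name: 'getUser',   durationMs: 35,
      labels: { error: 'upstream timeout' } } // metadata like errors travels with the span
  ]
};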

OpenTracing and OpenCensus

Long before there were systems, there were Standards! OpenTracing and OpenCensus were the two primary standards created for distributed tracing. Each got its boost from differing levels of support across platforms and scales.

OpenTracing focuses on establishing an open API and specification, not on open implementations for each language and tracing system. OpenCensus, on the other hand, takes a more holistic, all-inclusive approach: it provides not only the specification but also the language implementations and the wire protocol. It also goes beyond tracing by including additional metrics that are normally outside the scope of distributed tracing systems. As such, there are talks of the two coming together to unify and hence help boost a singular platform instead of dividing the overall focus.

At ShareChat, we don’t reinvent the wheel unless we find that the wheel needs too much greasing. With that idea in mind, we looked at our options to build versus buy. Two tools bubbled up to the top: Stackdriver by Google and Jaeger by Uber.

Jaeger versus Stackdriver

Jaeger is a distributed tracing tool open-sourced by Uber that the CNCF has adopted as an incubating project. It supports Elasticsearch and Cassandra for backend storage and is based on the OpenTracing specification. Using Jaeger involves an overhead in maintaining and scaling the backend infrastructure, since it expects the storage systems (Elasticsearch and Cassandra) to be set up outside of the actual application changes.

Google Stackdriver (“GS”), on the other hand, is an entirely managed distributed tracing solution provided by Google and is based on OpenCensus. All of the Stackdriver setup and scaling is handled by Google’s managed services, which requires almost negligible setup and teardown on the application side.

As such, each system stood up well in most of the aspects we were interested in: ease of integration, scale, community/company support, and UI and API support. The differentiating factor was setup cost and maintenance. We believe that offloading the cost of setup helps us focus on the things that matter more for us at this time, so we decided to go ahead with Stackdriver.

Stackdriver

Stackdriver Trace is a distributed tracing system that collects latency data from your applications, displays it in the Google Cloud Platform Console and provides near real-time performance insights. It also analyzes all of your application’s traces to generate in-depth latency reports that surface performance degradation, and it can capture traces from all of your VMs, containers, or Google App Engine projects.

Basic Terminology and Configuration

Trace — The entire lifecycle of one particular request is called a trace.

Span — The primary building block of a distributed trace, representing an individual unit of work done in a distributed system.

Label — Key-value pairs associated with traces.
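To make these terms concrete, here is a minimal sketch using the trace agent’s custom span API (the function and names are hypothetical); this is optional, and the automatic instrumentation described next does not require creating spans by hand:

// A minimal, self-contained sketch with hypothetical names. Starting the agent
// returns the tracer; inside a traced request, child spans and labels can also
// be added manually if needed.
const tracer = require('@google-cloud/trace-agent').start();

function fetchProfile(userId) {
  // A span: one unit of work inside the current trace
  const span = tracer.createChildSpan({ name: 'fetch-profile' });
  span.addLabel('userId', userId); // a label: key-value pair attached to the span
  // ... the actual work would happen here ...
  span.endSpan(); // ends the span and records its latency
}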

In order to avoid instrumenting each and every API for tracing, we used the Node.js trace-agent provided by Google. This agent uses async hooks to monkey-patch instrumentation into our code at runtime. Enabling Stackdriver Trace with the trace-agent instrumentation library is as simple as importing the module in your file.

  1. Install the npm module:

npm install --save @google-cloud/trace-agent

  2. Require the module and set up the config (the config object must be defined before it is passed to start()):

const config = {
  projectId: 'GoogleProjectID',
  keyFilename: 'credentials.json',
  logLevel: 1,
  ignoreUrls: ['^\/$'],
  ignoreMethods: ['OPTIONS'],
  enhancedDatabaseReporting: true,
  samplingRate: 1,
  serviceContext: {
    service: 'ServiceName',
    version: '1.0'
  }
};

const tracer = require('@google-cloud/trace-agent').start(config);

Here is what each of the parameters means:

  • projectId: The ID of the Google Cloud Platform project with which the traces should be associated.
  • keyFilename: A path to the credentials key file, relative to the current working directory.
  • logLevel: 0=disabled, 1=error, 2=warn, 3=info, 4=debug.
  • ignoreUrls: URLs matching these regular expressions will not be traced. Health checker probe URLs (/_ah/health) are ignored by default.
  • ignoreMethods: Requests whose method matches one of these names will not be traced.
  • samplingRate: 0 indicates that sampling is disabled and all incoming requests will be traced; a negative value indicates that none of the requests will be traced.
  • enhancedDatabaseReporting: If true, additional information about query parameters and results will be attached as labels to the request.

The path to the credentials file (keyFilename) can be specified either within the config object or via an environment variable named GOOGLE_APPLICATION_CREDENTIALS. Likewise, the projectId can also be specified via an environment variable called GCLOUD_PROJECT.

That’s it! Once the agent is started with the appropriate configuration, traces will start getting recorded in the Stackdriver backend and will be available on the GCP console.
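For context, here is a minimal sketch of how this typically sits at the top of a service’s entry point; the service, route and port below are hypothetical. The agent is started before anything else is required so that those modules can be instrumented:

// server.js (hypothetical): the trace agent is started before anything else is
// required so that express, http clients, db drivers, etc. get instrumented.
// projectId and keyFilename are omitted here, assuming GCLOUD_PROJECT and
// GOOGLE_APPLICATION_CREDENTIALS are set in the environment.
const tracer = require('@google-cloud/trace-agent').start({
  samplingRate: 1,
  serviceContext: { service: 'ServiceName', version: '1.0' }
});

const express = require('express');
const app = express();

app.get('/ping', (req, res) => {
  // Each request handled here shows up as a trace in the Stackdriver console
  res.send('pong');
});

app.listen(8080);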

In Practice

Now that we have a basic understanding of distributed tracing and Stackdriver, let’s take a look at how we set it up at ShareChat.

Library

Our goal was to make tracing seamless and transparent to developers. To achieve that, we abstracted the configuration and the generic modifications to be applied to each request into a module that simply needs to be imported in the client code; this also gives us the ability to add generic modifications to a request (like adding labels). We additionally wrote a proxy service in Go that communicates with the Stackdriver backend APIs to fetch trace details. A stripped-down sketch of the shared module follows the list below.

The module includes the following operations:

  • Importing Google’s Trace Agent
  • All the configuration related to tracing
  • Any custom labels or modifications to be applied to a trace
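For illustration, here is a hedged sketch of what such a module can look like (the file name, environment variables and defaults are assumptions, not our actual library):

// tracing.js (hypothetical name): the shared module imported by client code.
const tracer = require('@google-cloud/trace-agent').start({
  logLevel: 1,
  ignoreMethods: ['OPTIONS'],
  enhancedDatabaseReporting: true,
  samplingRate: 1,
  serviceContext: {
    service: process.env.SERVICE_NAME || 'unknown-service',   // assumed env vars
    version: process.env.SERVICE_VERSION || '1.0'
  }
});

// Generic modification applied from middleware: attach a custom label to the
// trace of the request currently being handled.
function labelCurrentRequest(key, value) {
  tracer.getCurrentRootSpan().addLabel(key, value);
}

module.exports = { tracer, labelCurrentRequest };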

Custom Fields and Force Tracing

Standard tracing gives all the intended results, but it is still not quite ready for practical use. A debug trace usually gets looked up by various domain-specific keys, e.g. userID. To add support for these domain-specific keys, we intercept the request in a middleware and add them using the custom labels support provided by Stackdriver.

const tracer = require('@google-cloud/trace-agent').start();
tracer.getCurrentRootSpan().addLabel('userId', userId);
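Putting that together, here is a hedged sketch of an Express-style middleware that does this; the header used to pull out the userId and the './tracing' module are assumptions for illustration:

// Hypothetical middleware: pull the userId out of the request (header name is
// an assumption) and attach it as a label to the current root span.
const { tracer } = require('./tracing'); // the shared module sketched earlier

function traceLabels(req, res, next) {
  const userId = req.headers['x-user-id'];
  if (userId) {
    tracer.getCurrentRootSpan().addLabel('userId', userId);
  }
  next();
}

// app.use(traceLabels);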

There are also some critical scenarios that require us to trace a specific request, e.g. a specific user facing an issue that is not reproducible elsewhere. In other words, we want to force certain requests to be traced, irrespective of the sampling rate. For such use cases, we make use of a header known as x-cloud-trace-context. It contains a combination of a traceID, a spanID and a flag that tells Stackdriver whether this particular request should be traced or not.

x-cloud-trace-context: "traceID/spanID;o=1"

The trace agent checks for this header before sending a request to the Stackdriver backend. If the header exists, it checks the options flag (o=1 or o=0): it sends the request for tracing if the flag is set and discards it otherwise. If the header does not exist, it generates a traceId and spanId and sets the flag according to the sampling rate.

The easiest way to do this is at the fringe, typically an API gateway through which all calls are funnelled. The API layer has a list of keys to check and forces the trace if needed…simple enough!
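As a rough sketch (the key list, header name and ID generation below are ours for illustration, not the exact gateway code), forcing a trace from the fringe can look like this:

const crypto = require('crypto');

// Hypothetical list of keys (e.g. userIds) whose requests should always be traced
const FORCE_TRACE_USERS = new Set(['user-42']);

function forceTrace(req, res, next) {
  const userId = req.headers['x-user-id']; // hypothetical header carrying the userId
  if (userId && FORCE_TRACE_USERS.has(userId) && !req.headers['x-cloud-trace-context']) {
    const traceId = crypto.randomBytes(16).toString('hex');             // 32-char hex trace ID
    const spanId = crypto.randomBytes(8).readBigUInt64BE(0).toString(); // decimal span ID
    // o=1 tells the downstream trace agents to trace this request regardless of sampling
    req.headers['x-cloud-trace-context'] = `${traceId}/${spanId};o=1`;
  }
  next();
}

// In the gateway: app.use(forceTrace);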

A Wrapper…to wrap it all!

While the GCP console for traces has a reasonable UI, we needed it to be accessible to anyone and everyone in the company. A tool this ubiquitous should be available to all, and with that intent we wrote a proxy service that retrieves the same data via an API call to Stackdriver. This also helps us build custom flows and actions, such as filtering on various labels like HTTP status code, HTTP method, custom labels, etc. Ultimately, this custom UI will become part of the master list of dashboards we use within the company.
