Everything about the ‘CCTV of the monitoring world’ aka distributed traces.

monitorjain
Intelligent Observer
8 min read · Dec 14, 2018

This article is an advanced read on all things distributed tracing: the must-dos and the must-nots. The idea behind it is not to push the New Relic envelope; it is to educate the masses (techies) about distributed tracing.

Photo: Application agents are exactly like a CCTV (Pixabay, CC0)

Who is this for?

  • For any developer, DevOps practitioner, CTO, app owner, SRE or business leader who wishes to provide a 5-star user experience to their customers and wants to learn about the secret sauce that can deliver it.

Who am I?

I am a digital intelligence evangelist for a major monitoring vendor and an open-source systems enthusiast. I belong to the church of monitoring data and graphs. Essentially, I earned my stripes and the right to speak on this subject by enabling hundreds of companies over the last 10 years, democratising monitoring for a larger group of people, and training scores of support, developer and DevOps teams. I strongly believe that a best-in-class monitoring platform should narrate a story rather than simply display monitoring outputs.

At a recent developers’ conference in Chennai and Bangalore, where I was a panelist and a speaker, my counterparts at Red Hat (IBM) who develop and enhance the OpenShift platform shared with me that the no. 1 pitfall for developers is tool-choice paralysis, which calls for abstraction and black-boxing of the current DevOps toolchain (configuration – Terraform, CloudFormation; versioning – GitHub or GitLab; packaging – Docker; service orchestration – ECS, K8s, Lambdas; service mesh – Istio; deploy systems – Jenkins or TeamCity; and monitoring systems – New Relic, Sumo Logic, Zipkin, Jaeger, etc.). Imagine being a developer today: somebody probably suggested you pick this profession because all you needed to know was how to ‘write code’. Not anymore! This need of the hour gave birth to the ‘develop for the developers’ momentum in the world’s biggest software factory – India’s Silicon Valley – and elsewhere in the world where software factories proliferate: Canada, USA, Germany, Japan, China, Australia, etc. Following the same principles, I decided to write an article to support SREs, DevOps teams and developers on their distributed tracing journeys. IMHO, the biggest pitfall in the uptake of cloud, microservice and DevOps trends is performance and service-reliability issues. Just think of your monolith as a big ball of problems which microservices now distribute into small chunks of problems all over the place, in turn making it harder than before to find and remediate issues.

Outside of my tech evangelist role, I enjoy watching TV series in the criminal-science, thriller and psycho-thriller genres. I simply think of a distributed trace as critical auto-forensic evidence that, analogically speaking, solves the murder mystery (user-experience accidents). In a typical crime scene, you have various clues around you, such as broken glass, blood stains on the wall and footprints on the floor. If someone inadvertently removes this evidence, you would need a miracle and time on your side to find the culprit(s). Sounds familiar? App problems in the microservices world are very similar to a crime scene, and on-calls where we use disjointed, siloed tools work like manual forensic analysis to find the culprit. In such a circumstance, your distributed traces work like a hidden CCTV within your apps that records each transaction and hands you automatic forensic evidence. Plus, when you add anomaly detection to this club, you shrink the fault-finding (MTTI – mean time to identification – and MTTR – mean time to repair) that typically takes hours down to just a few minutes. Anomaly detection powered by applied intelligence works like ‘an SRE in a button’ for several customers of mine across APAC.

Enough analogies, let’s dive straight in.

In my own understanding, a distributed trace is, in simple words, an ‘effective and easy mechanism to collect transaction recordings or stack traces that display cross-microservice spans or hops in a time-based waterfall breakdown (Gantt chart or stack trace)’. A key attribute of distributed traces is the time component, or latency, of each span or hop, plus the span-level details. It is arguably the hottest monitoring output trending right now.
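To make the ‘span’ and ‘waterfall’ vocabulary concrete, here is a minimal, vendor-neutral sketch of the data a tracer typically records per span. The field names are illustrative assumptions for explanation only, not any particular tracer’s wire format.

```java
// Illustrative only: a vendor-neutral sketch of what a tracer records per span.
// Field names are assumptions, not any specific vendor's schema.
public final class Span {
    public final String traceId;      // shared by every span in one end-to-end request
    public final String spanId;       // unique per hop (service call, DB query, etc.)
    public final String parentSpanId; // links the span into the waterfall / Gantt chart
    public final String name;         // e.g. "GET /checkout" or "INSERT orders"
    public final long startMicros;    // when the hop began
    public final long durationMicros; // the latency component of the span

    public Span(String traceId, String spanId, String parentSpanId,
                String name, long startMicros, long durationMicros) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.name = name;
        this.startMicros = startMicros;
        this.durationMicros = durationMicros;
    }
}
```

Rendering all spans that share a traceId, ordered by start time and indented by parent/child links, is exactly the time-based waterfall described above.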

There are several ways of collecting distributed traces: 1) log-based (out-of-band), 2) agent-based (out-of-band), 3) agent-based (in-band) and 4) API-based (out-of-band and in-band).

IMHO, there are no right or wrong techniques, but if you follow DevOps principles, you want to use the simplest, quickest and most automated technique.

If you think alike or concur, then go with agent-based (out-of-band) collection, simply because out-of-band agents efficiently cater for asynchronous sub-transaction milestones while auto-instrumenting your code. In-band collection is almost the same, but it doesn’t necessarily track span creation for callback-loop milestones within a customer’s transaction journey.

Netflix and Google run millions of containers in their end-user-serving app ecosystems. Even if that number is in the hundreds for you, the permutations and combinations of a request flow can result in a deep stack trace or Gantt chart, for which you need traces that provide a CCTV service for user-transaction accidents (murder-mystery solving). Transaction accidents mean request/response failures and transactions that are slow or sluggish in nature.

Step 1

Before attempting to enable the distributed tracing monitoring output, take a step back and make an effort to understand the beast, aka your app ecosystem.

Step 2

Specifically, list any legacy components and judge whether a particular legacy system is in the critical path. For any legacy service that is a black hole and sits on the critical path, use your vendor-provided distributed tracing APIs to wrap the request calls that go through it. By default, always use commonly supported runtime binaries and libraries.
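As a sketch of what wrapping a legacy hop can look like, assuming the New Relic Java agent API as it stood around the time of writing (other vendors’ APIs, and newer agent versions, will differ). The class, method names and the legacy billing call are all hypothetical stand-ins:

```java
import com.newrelic.api.agent.DistributedTracePayload;
import com.newrelic.api.agent.NewRelic;
import com.newrelic.api.agent.Trace;

public class LegacyBillingClient {

    // Hypothetical wrapper around a black-box legacy system in the critical path.
    // @Trace makes this hop show up as its own span in the transaction's trace.
    @Trace
    public String chargeCustomer(String orderId) {
        // Create a trace payload so the trace can be stitched back together on
        // the far side of the legacy system, if you control that side too.
        DistributedTracePayload payload =
                NewRelic.getAgent().getTransaction().createDistributedTracePayload();

        // callLegacyBilling(...) is a stand-in for your real legacy integration
        // (SOAP, raw TCP, message queue, etc.); pass the payload as metadata
        // only if the protocol allows it.
        return callLegacyBilling(orderId, payload.text());
    }

    private String callLegacyBilling(String orderId, String tracePayload) {
        // ... existing black-box integration code ...
        return "OK";
    }
}
```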

Step 3

Choose auto-instrumentation over the code drop-ins required by open-source tracing libraries; code drop-ins cause instrumentation vendor lock-in. Most users (SREs, DevOps engineers and developers) assume that you need to start by instrumenting your code. NOT TRUE.

Step 4

You should start by deploying your language agents, which help trace RPC or HTTP calls traversing your various microservices or hybrid services. Only use the tracing API provided by your vendor when you need custom spans that are not automatically traced, or when you wish to annotate payload-level key-values like username, app_owner, business_error count, queue size or a total user count that will enhance your fault-finding should you need to leverage the transaction recordings (aka distributed traces) for quick RCA.
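Here is a hedged sketch of what ‘only reach for the API when you need it’ looks like, again assuming the New Relic Java agent API: @Trace adds a custom span for a method the agent doesn’t instrument on its own, and addCustomParameter attaches the kind of key-values mentioned above. The method and attribute values are hypothetical.

```java
import com.newrelic.api.agent.NewRelic;
import com.newrelic.api.agent.Trace;

public class ReportGenerator {

    // Not auto-instrumented by the agent, so we ask for a custom span explicitly.
    @Trace
    public void buildNightlyReport(String username, int queueSize) {
        // Annotate the current transaction/trace with business context that
        // speeds up root-cause analysis later. Keys mirror the examples above.
        NewRelic.addCustomParameter("username", username);
        NewRelic.addCustomParameter("app_owner", "reporting-team"); // hypothetical value
        NewRelic.addCustomParameter("queue_size", queueSize);

        // ... expensive report-building work the trace should account for ...
    }
}
```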

A cool annotation-scheme example

Let’s suppose you’re an enterprise with both log and agent monitoring tools at your disposal. If you want to correlate the distributed trace inspection UI with your log context for the same application, you can easily pass in your log UI’s URI as one of the keys, with a dynamic value for each application for which trace collection is enabled. This way, your distributed trace UI (hopefully New Relic) will show you your Splunk/Sumo/ELK contextual URI segments within the span attributes (as a subset of the payload attributes).
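A minimal sketch of that annotation scheme, assuming the New Relic Java agent API once more; the Kibana-style URL template and the parameter names are purely illustrative, so substitute whatever deep-link format your Splunk/Sumo/ELK deployment actually uses.

```java
import com.newrelic.api.agent.NewRelic;

public class LogLinkAnnotator {

    // Attach a deep link into your log UI as a trace attribute so anyone
    // inspecting the trace can jump straight to the matching log context.
    // The URL template below is a made-up Kibana-style example.
    public static void annotateWithLogLink(String serviceName, String requestId) {
        String logsUri = "https://logs.example.com/app/kibana#/discover"
                + "?query=service:" + serviceName + "%20AND%20request_id:" + requestId;
        NewRelic.addCustomParameter("logs_ui_uri", logsUri);
    }
}
```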

Voilà, very cool indeed!

Example of a custom span collection:

Suppose you have OTS or custom Java/PHP code that uses an unconventional framework which your agent does not support out of the box. In this scenario, your distributed traces will break by default, but fret not, we can easily fix this. Collect a custom span and bridge the breakage by simply calling the distributed tracing APIs around the request-processing classes and methods within your custom libraries.
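A hedged sketch of that fix, assuming the New Relic Java agent: dispatcher = true asks the agent to start a transaction at the unsupported framework’s entry point, so the downstream calls it already knows how to trace are stitched back into one recording. The handler, its methods and the IncomingRequest type are hypothetical.

```java
import com.newrelic.api.agent.Trace;

public class CustomFrameworkHandler {

    // The agent doesn't recognise this framework's entry point, so traces break here.
    // dispatcher = true tells the agent to start a transaction at this method;
    // everything it calls (HTTP clients, JDBC, etc.) is stitched into that trace.
    @Trace(dispatcher = true)
    public void handleRequest(IncomingRequest request) {
        validate(request);
        process(request);
    }

    @Trace // nested custom span inside the same transaction
    private void process(IncomingRequest request) {
        // ... request-processing logic in your custom library ...
    }

    private void validate(IncomingRequest request) { /* ... */ }

    // Hypothetical request type, just to keep the sketch self-contained.
    public static class IncomingRequest { }
}
```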

The no-nonsense value proposition of distributed traces, aside from optimizing transaction time by several orders of magnitude:

  1. Identify unnecessary serial requests happening within a subsystem that someone else wrote.
  2. Verify database access and processing correctness.
  3. If you’re a new member of a team tasked with full-stack development or support, you can understand how the system talks to its dependencies.
  4. Understand rare transaction-flow processing that gets low throughput from end users.
  5. Use it in the dev and test phases to identify issues early so that they never make it to prod and impact end users. Recently, I saw a post by a prospective Harvey Norman customer complaining that the company apparently didn’t want his money: no one served him in store, so he went online, where the furniture-specific add-to-cart link was broken. It was hilarious but alarming to see the ‘revenue leakage’ caused by lack of testing and code problems that could have been caught by using distributed tracing in a test phase.
  6. Identify third-party integrations or dependencies that fail to meet their SLAs so that you can pick up the phone and register a support request or complaint.

It’s super easy! Try it!

Finally, let’s bust some common myths surrounding distributed tracing:

  1. You need all traces! WRONG… Traces have an overhead cost; collecting every trace can add double-digit percentage latency to both response time and infra resource consumption. Adaptive sampling is more effective and imparts a really small (negligible) overhead on app resources; see the head-based sampling sketch after this list.
  2. Tail-based sampling is superior to adaptive head-based sampling… WRONG… Both approaches have their own pros and cons. Happy to write a separate article if there is community demand.
  3. You only need distributed traces… BOGUS… Traces are just one monitoring output. You still need request-flow maps (service maps) and app-infra correlation (for first-level isolation).
  4. Distributed traces assist in app-infra correlation. PARTIALLY… You need a full view of infra performance KPIs to rule out infrastructure exhaustion as the culprit. As a monitoring evangelist, I recommend using app-infra correlation maps to rule out infra saturation, and only if app code is the suspect should you use distributed traces.
  5. Distributed traces with flow maps are sufficient! NOT REALLY… If you think of time as an asset, you need anomaly detection (powered by machine-intelligence algorithms) to automatically pinpoint the anomalous spans, which, when clicked, showcase full span-level details like average throughput, response time and a histogram (this-host-vs-all-hosts comparison) to rule out load-balancing or rogue-host issues.
  6. It’s a must to have logs monitoring capability for distributed tracing! NOT AT ALL… Distributed tracing is a monitoring output; log- and agent-based monitoring are collection techniques. Essentially, app agents are better geared for collecting app code-level performance output, and distributed traces are just one such output or monitoring asset.
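Since head-based versus tail-based sampling comes up in myths 1 and 2, here is a minimal, vendor-neutral sketch of head-based probabilistic sampling: the keep/drop decision is made up front from the trace ID, so every service in the request agrees without coordination. Real adaptive samplers adjust the rate per service and per throughput; this fixed-rate version only illustrates the mechanics and is not any vendor’s implementation.

```java
// Vendor-neutral sketch of head-based sampling: the decision is made once, at
// the first service, from the trace ID, so every downstream hop makes the same
// call without coordination. Adaptive samplers vary sampleRate dynamically
// (e.g. to target N traces per minute); this fixed-rate version is illustrative.
public final class HeadBasedSampler {
    private final double sampleRate; // e.g. 0.01 keeps roughly 1% of traces

    public HeadBasedSampler(double sampleRate) {
        this.sampleRate = sampleRate;
    }

    public boolean shouldSample(String traceId) {
        // Hash the trace ID into [0, 1) and compare with the rate. Using the
        // trace ID (not Math.random()) keeps the decision consistent across hops.
        double bucket = (traceId.hashCode() & 0x7fffffff) / (double) Integer.MAX_VALUE;
        return bucket < sampleRate;
    }
}
```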

Happy distributed tracing!

Feel free to reach out to me if you wish to clarify your doubts or learn more about modern app stack monitoring.

I’m a modern app-stack monitoring evangelist at New Relic with deep experience in OSS and proprietary monitoring solutions under my belt. Happy to be of service to you. Simply reach out to me on LinkedIn or Twitter.

Twitter handle: @monitorjain

GitHub: https://github.com/njain1985

LinkedIn: nikmjain
