OpenTelemetry: beyond getting started

Published in OpenTelemetry, Jan 15, 2020 (12 min read)

This blog post explores real-world scenarios for telemetry collection and enrichment using OpenTelemetry. It is based on the KubeCon NA talk Beyond Getting Started: Using OpenTelemetry to Its Full Potential, given by Morgan McLean and me. You can watch the recording on YouTube or keep reading.

Capturing telemetry from applications is a big challenge for developers who want better observability into their applications. For this to work, you need integrations with every language, web framework, RPC system, storage client, etc.

OpenTelemetry is a complete solution that solves the problem of telemetry collection. Simply add Jaeger, Prometheus, Zipkin, or any other open source backend and you get insights into your app via distributed traces and metrics. Use Azure Monitor, Google Stackdriver, or another telemetry vendor and you get a managed, turnkey telemetry solution.

Getting started

The goal of OpenTelemetry is to make robust, portable telemetry a built-in feature of cloud-native software. In other words, we want every platform and library to be pre-instrumented with OpenTelemetry, and we’re committed to making this as easy as possible. Even though we are still working on pre-instrumenting software, for many platforms and libraries enabling monitoring is as easy as adding a couple of lines of code.

For example, in an ASP.NET Core application, you can add these few lines of code to capture metrics and traces — call AddOpenTelemetry and pass in arguments of what to collect and where to send telemetry.

services.AddOpenTelemetry(b =>
  b.UseZipkin(o => o.Endpoint = new Uri("http://zipkin/api/v2/spans"))
    .AddRequestCollector()
    .AddDependencyCollector());

The results show up in Zipkin. In most of the examples in this post, ClientApp calls into FrontEndApp, which in turn queries BackEndApp for data, as the distributed trace in Zipkin shows. Note that information about all incoming requests and outgoing calls is collected automatically, without any explicit configuration.

OpenTelemetry architecture and extensibility

When a library or integration calls into OpenTelemetry, this is what the telemetry delivery pipeline looks like:

The API is used to instrument a library.
The SDK sends telemetry into the processing and exporting pipeline.
An out-of-process Collector is used for data filtering, aggregation, and batching.
An Exporter sends telemetry to the backend of your choice.
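The layering above can be sketched in a few lines. This is a minimal sketch, not the OpenTelemetry SDK: it assumes .NET 5 or later and uses only System.Diagnostics from the base class library, where ActivitySource stands in for the API, ActivityListener for the SDK, and a plain list for the exporter. The names PipelineSketch, "Sketch.Library", and "GetForecast" are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public static class PipelineSketch
{
    // "API" layer: the instrumented library only talks to a named source.
    private static readonly ActivitySource Source = new ActivitySource("Sketch.Library");

    // Runs one instrumented operation and returns what reached the "exporter".
    public static List<string> Run()
    {
        var exported = new List<string>();

        // "SDK" layer: decides what is recorded and hands finished spans on.
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Sketch.Library",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
            // "Exporter" stand-in: a real one would send the span to a
            // collector or directly to a backend.
            ActivityStopped = activity => exported.Add(activity.OperationName),
        };
        ActivitySource.AddActivityListener(listener);

        using (Source.StartActivity("GetForecast"))
        {
            // library work happens here
        }

        return exported;
    }
}
```

Calling PipelineSketch.Run() should return a list with the single entry "GetForecast". The library code never references the listener, which is the property that lets the processing and exporting side be swapped without touching instrumented code.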

While OpenTelemetry is a complete solution, it still leaves lots of room for innovation. Our main goal is to make telemetry built into software, not to lock you into a specific solution.

OpenTelemetry has a lot of extensibility points.

When we think of extensibility, we try to strike a balance between being flexible and useful out of the box. We believe that most telemetry needs can be provided via small adjustments, not through a major rewrite of the entire system.

OpenTelemetry layers and extensibility points

Let’s look at the extensibility points from right to left:

The Collector can communicate with various backends via exporters. This is a great extensibility point, as it is out of process and doesn’t require app redeployment. You can configure the backend after deployment or even swap it at runtime.
Configuration controls aggregation, batching, and processing in the Collector.
In-process exporters are easily replaceable to work with different backends. While the default exporter sends telemetry to the Collector, in some deployments you cannot host a separate Collector process, so you need to send telemetry directly to the backend from inside the process.
The SDK allows various extensions: sampling, filtering, enrichment. It is often harder to configure this logic in-process than in the Collector, but sometimes it’s needed for performance reasons or because of data availability.
Finally, the most aggressive extensibility point is the ability to completely replace the OpenTelemetry SDK with an alternative implementation.

Now let’s talk about some real-life scenarios you can encounter while using OpenTelemetry and how they are addressed using OpenTelemetry extensibility points. The scenarios are split into two big parts: scenarios for app owners, and techniques that may be used by client library developers. These scenarios show some important components and core concepts of OpenTelemetry. Note that the APIs may change, as OpenTelemetry is still in alpha, but the foundational concepts will stay.

Application Monitoring scenarios

Long-running tasks

With automatic data collection you’d typically get all RPC libraries instrumented. This is why you get a full chain of ClientApp calling FrontEndApp and BackEndApp visualized after a very simple onboarding. However, OpenTelemetry will not capture long-running tasks automatically. So if BackEndApp initiates a long-running task (synchronous or asynchronous) to return data, you need to call it out explicitly.

It is very easy to track a long-running task: just start a new Span using the Tracer API.

_tracer = tracerFactory.GetTracer("custom");
using (_tracer.StartActiveSpan("LongRunningOperation", out var span))
{
  // stand-in for the actual long-running work
  Thread.Sleep(TimeSpan.FromMilliseconds(400));
  span.Status = Status.Ok.WithDescription("COMPLETE");
}

In many talks and articles, manually starting a span is demonstrated as a getting started experience. It is our aspiration that even this simple code will be recognized as a beyond getting started experience.

Basic sampling

Another common use case is synthetic traffic filtering. Telemetry from synthetic traffic may mask real user problems. Use a custom Sampler to filter out synthetic traffic, such as calls to a “/health” endpoint.

Every call to the “/health” endpoint produces a distributed trace with one span that doesn’t contain any meaningful information beyond the fact that it was called. And if this endpoint is used to check an app’s availability, it may be called very often.

A Sampler implementation returns its sampling decision from the ShouldSample method. In this example, based on the span name, it either returns the decision to not sample or calls into another sampler that makes the decision for real (non-synthetic) traffic.

public class HealthRequestsSampler : ISampler
{
  private ISampler _sampler;

  public HealthRequestsSampler(ISampler chainedSampler)
  {
    _sampler = chainedSampler;
  }

  public string Description { get; } = "HealthRequestSampler";

  public Decision ShouldSample(
                        SpanContext parentContext,
                        ActivityTraceId traceId,
                        ActivitySpanId spanId,
                        string name,
                        IEnumerable<Link> links)
  {
    // synthetic health checks are never sampled
    if (name == "/health")
    {
      return new Decision(false);
    }
    // everything else is decided by the chained sampler
    return _sampler.ShouldSample(
                       parentContext, traceId, spanId, name, links);
  }
}

Here is how to configure sampling in your telemetry pipeline:

services.AddOpenTelemetry(b => {
  . . .
  b.SetSampler(new HealthRequestsSampler(Samplers.AlwaysSample));
});

After enabling the sampler, spans for the “/health” endpoint will no longer be collected.
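The same name-based decision can be tried out without the alpha SDK. The following is a minimal sketch under stated assumptions: it uses System.Diagnostics.ActivityListener from .NET 5 or later in place of the ISampler interface shown above, and the names HealthSamplingSketch and "Sketch.FrontEnd" are hypothetical.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public static class HealthSamplingSketch
{
    private static readonly ActivitySource Source = new ActivitySource("Sketch.FrontEnd");

    // Starts one span per name and returns the names that were actually recorded.
    public static List<string> Run(params string[] spanNames)
    {
        var recorded = new List<string>();

        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Sketch.FrontEnd",
            // Sampling decision by span name, mirroring HealthRequestsSampler:
            // drop "/health", keep everything else.
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                options.Name == "/health"
                    ? ActivitySamplingResult.None
                    : ActivitySamplingResult.AllData,
            ActivityStopped = activity => recorded.Add(activity.OperationName),
        };
        ActivitySource.AddActivityListener(listener);

        foreach (var name in spanNames)
        {
            // StartActivity returns null when the sampling decision is None,
            // so no span object is even allocated for health checks.
            using var activity = Source.StartActivity(name, ActivityKind.Server);
        }

        return recorded;
    }
}
```

With this sketch, HealthSamplingSketch.Run("/health", "/weatherforecast") should return only "/weatherforecast". Making the decision before the span is created is what keeps frequent health probes cheap.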

Custom attributes

Adding custom attributes to make it easier to differentiate or query telemetry data is another common scenario. Custom attributes, structured as key:value pairs, support the addition of relevant dimensions, for example:

Business details, such as a productID, price, or logical operation name
User session attributes, such as login status, free tier/paid customer, anonymous user id
Infrastructure details, such as the raw value of an HTTP header

While adding an attribute to the manually tracked telemetry is fairly straightforward, sometimes you need to add attributes to the automatically tracked Spans. In most languages it’s easy to do from inside the scope of a Span execution. In C# you simply need to get the “Current” Span.

In the controller class WeatherForecastController.cs:

[HttpGet]
public IEnumerable<WeatherForecast> Get()
{
  // this is how you can set a custom attribute
  _tracer.CurrentSpan.SetAttribute("forecastSource", "random");
  . . .
}

You can see the new attribute appearing in the UI and you can use it for things like querying or troubleshooting.
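The "current span" pattern can be illustrated in isolation. This is a hedged sketch rather than the OpenTelemetry API used above: it assumes .NET 6 or later and uses System.Diagnostics.Activity, whose static Current property plays the role of the current Span. The class name and the manually started request activity are stand-ins for what a web framework would do.

```csharp
using System.Diagnostics;

public static class CurrentSpanAttributeSketch
{
    // Stand-in for application code that runs inside a span started by
    // automatic instrumentation: it only knows about "the current span".
    public static void GetForecast()
    {
        // Activity.Current is null when nothing is being traced, hence "?.".
        Activity.Current?.SetTag("forecastSource", "random");
    }

    // Returns the attribute value as seen on the request span.
    public static string Run()
    {
        // Stand-in for the span a web framework would start for a request.
        using var request = new Activity("GET /weatherforecast").Start();
        GetForecast();
        return request.GetTagItem("forecastSource") as string;
    }
}
```

CurrentSpanAttributeSketch.Run() should return "random": the attribute set deep in application code lands on the span that the framework owns.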

Resource API

Some attributes have special semantics and use cases. For example, if your app is deployed in multiple environments, the environment name becomes a very important custom attribute that must be applied globally, as it will be used to segment all the telemetry.

The resource API is used to define global attributes like this. Examples of attributes may be:

Deployment name and location
App name and version
Hosting environment

Since resource API attributes are global, they are typically configured on app initialization. In Startup.cs:

services.AddOpenTelemetry(b => {
  . . .
  // sets resource
  b.SetResource(new Resource(new Dictionary<string, string>() {
    { "service.name", "MyBackEndApp" },
    { "deploymentTenantId", "kubecon-demo-surface" }
  }));
});

Note that standard attributes like “service.name” are used to define the name of the application. How other resource attributes are used and exposed depends on the backend system being used. For instance, in Zipkin most would simply be listed among regular span attributes.

Attributes on Span set using Resource API

Custom attributes in code scopes

When an app uses A/B testing and feature “flights”, telemetry reported from a certain scope (the execution of either flight A or flight B) should be attributed with the corresponding FlightID.

This is more than simple telemetry attribution. Attributes like this are typically also used as a metric dimension, and may be used to adjust the sampling logic, since you may want every flight to be represented in telemetry even if it is only configured for a small percentage of traffic.

OpenTelemetry has a concept of Context. You can wrap any logic in a context like this:

// this will set a context for all telemetry in the using block
using (DistributedContext.SetCurrent(
         new DistributedContext(new DistributedContextEntry[] {
           new DistributedContextEntry("FlightID", "A") })))
{
  // some code and calls to external services are here
}

Now you need every Span produced in this scope to be automatically attributed with this FlightID. A SpanProcessor configured on the telemetry pipeline is called at the start of every Span. You can configure one to read all context entries and set them as attributes of the Span.

public class FlightIDProperties : SpanProcessor
{
  public override void OnStart(Span span)
  {
    // copy every entry of the current context onto the span
    foreach (var entry in DistributedContext.Current.Entries)
    {
      span.SetAttribute(entry.Key, entry.Value);
    }
  }

  public override void OnEnd(Span span)
  {
  }

  public override Task ShutdownAsync(CancellationToken token)
  {
    return Task.CompletedTask;
  }
}

Now simply add this SpanProcessor to the telemetry pipeline like this:

services.AddOpenTelemetry(b => {
  . . .

  // set the FlightID from the distributed context
  b.AddProcessorPipeline(pipelineBuilder =>
       pipelineBuilder.AddProcessor(_ => new FlightIDProperties()));
});

And every Span started in a scope of this context will be attributed with the “FlightID” with the value “A”.

Propagation of context attributes

Furthermore, in the case of some scope attributes like FlightID, you want to add the same value of FlightID to the telemetry from all downstream services.

Propagation of attributes between components

With OpenTelemetry you can change the example above a little bit by specifying that the context entry must be propagated downstream:

using (DistributedContext.SetCurrent(new DistributedContext(
        new DistributedContextEntry[] { new DistributedContextEntry(
          "FlightID", "A", EntryMetadata.UnlimitedPropagationEntry) })))
{
  // some code and calls to external services are here
}

The aspiration of OpenTelemetry is to make telemetry a built-in feature of all software. In .NET the feature of context propagation is built into the framework. You can set attributes that will be automatically associated with the scope of a current request and will also be propagated downstream via special headers.

Let’s follow the example from this blog post. ClientApp sets the FlightID property using the Activity API. Note that this is a different API; as a glimpse into the future, we are working on merging these APIs.

var activity = new Activity("Call").AddBaggage("FlightID", "red");
activity.Start();
try
{
  _ = GetWeatherForecast();
}
finally
{
  activity.Stop();
}

The property is then associated with every Span in the distributed trace:

public class FlightIDProperties : SpanProcessor
{
  public override void OnStart(Span span)
  {
    // Activity.Current is null when no Activity is active
    var activity = Activity.Current;
    if (activity == null)
    {
      return;
    }

    foreach (var b in activity.Baggage)
    {
      span.SetAttribute(b.Key, b.Value);
    }
  }

  public override void OnEnd(Span span)
  {
  }

  public override Task ShutdownAsync(CancellationToken ct)
  {
    return Task.CompletedTask;
  }
}

Note that in this example ClientApp calls into FrontEndApp, which in turn calls BackEndApp, and the only application that has OpenTelemetry enabled is BackEndApp. However, all the context variables are still passed from ClientApp all the way to BackEndApp.

As we mentioned, this API and the OpenTelemetry APIs are on the path to unification. So going forward, context propagation will be built into all RPCs and OpenTelemetry will simply use this context.
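The in-process half of this flow can be reproduced with the base class library alone. The following is a minimal sketch, assuming .NET 6 or later: an ActivityListener callback stands in for the SpanProcessor, and the names BaggageToAttributesSketch, "Sketch.BackEnd", and "QueryBackEnd" are hypothetical. It does not cover propagation over the wire between processes.

```csharp
using System.Diagnostics;

public static class BaggageToAttributesSketch
{
    private static readonly ActivitySource Source = new ActivitySource("Sketch.BackEnd");

    // Returns the FlightID attribute found on a span started inside the flight scope.
    public static string Run()
    {
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Sketch.BackEnd",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
            // Plays the role of FlightIDProperties.OnStart: copy every baggage
            // entry visible to the new span onto the span itself.
            ActivityStarted = activity =>
            {
                foreach (var entry in activity.Baggage)
                {
                    activity.SetTag(entry.Key, entry.Value);
                }
            },
        };
        ActivitySource.AddActivityListener(listener);

        // The scope owner sets baggage once, as ClientApp does above.
        using var scope = new Activity("Call").AddBaggage("FlightID", "red").Start();

        // A span created deeper in the call chain sees the parent's baggage.
        using var child = Source.StartActivity("QueryBackEnd");
        return child?.GetTagItem("FlightID") as string;
    }
}
```

BaggageToAttributesSketch.Run() should return "red". The child span never mentions FlightID itself; the value arrives through the parent's baggage and is stamped on at span start.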

Context propagation

Another important use case is the ability to propagate context when a custom RPC is used.

OpenTelemetry helps with context propagation on standard RPC, but for custom protocols, and especially for messaging, it must be implemented using the propagation API. Let’s say communication between caller and callee happens using the Message class, and the Message class supports a collection of string pairs as metadata. Then it is very easy to serialize and deserialize the context to and from the Message class:

// injecting context into the custom message
var message = new Message();
_tracer.TextFormat.Inject<Message>(
  _tracer.CurrentSpan.Context,
  message,
  (m, k, v) => m.Metadata.Add(k, v));
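To show both directions of the round trip, here is a self-contained sketch. It does not use the OpenTelemetry propagation API; instead it assumes System.Diagnostics.Activity with the W3C id format and a hypothetical Message class whose only known property is a string-to-string Metadata collection. The "traceparent" key follows the W3C Trace Context header name.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical message type: assumed only to carry string pairs as metadata.
public sealed class Message
{
    public Dictionary<string, string> Metadata { get; } = new Dictionary<string, string>();
}

public static class MessagePropagationSketch
{
    // Caller side: serialize the current span's identity into the message.
    public static void Inject(Activity current, Message message)
    {
        // With the W3C id format, Activity.Id is a "traceparent" value.
        message.Metadata["traceparent"] = current.Id;
    }

    // Callee side: start a span that continues the caller's trace.
    public static Activity Extract(Message message, string operationName)
    {
        var activity = new Activity(operationName);
        if (message.Metadata.TryGetValue("traceparent", out var parentId))
        {
            activity.SetParentId(parentId);
        }
        return activity.Start();
    }

    // Round trip: returns true when the callee span joined the caller's trace.
    public static bool Run()
    {
        // Explicitly ask for W3C ids; this is already the default on .NET 5+.
        Activity.DefaultIdFormat = ActivityIdFormat.W3C;

        var message = new Message();
        string callerTraceId;
        using (var caller = new Activity("Send").Start())
        {
            callerTraceId = caller.TraceId.ToString();
            Inject(caller, message);
        }

        using var callee = Extract(message, "Receive");
        return callee.TraceId.ToString() == callerTraceId;
    }
}
```

MessagePropagationSketch.Run() should return true: the "Receive" span gets its own span id but shares the trace id of "Send", which is what lets a backend stitch both sides into one distributed trace.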

However, our hope and aspiration is that you will never need to use this logic, and that all RPC and messaging libraries will handle context propagation and telemetry reporting as a built-in feature. The next section describes how to approach instrumenting a client library.

Instrumenting libraries

Whenever you develop a client library to access your back-end service or messaging protocol, it is a good idea to allow users to get visibility into the library behavior.

You have two choices:

Build an OpenTelemetry integration that hooks into callbacks or performance APIs provided by the client.
Instrument the shared code with OpenTelemetry APIs.

Whenever there is a choice like this, option two is preferred. It’s more performant and doesn’t break when clients are updated. OpenTelemetry exposes callbacks and APIs that make library instrumentation easy and straightforward without compromising on performance.

IsRecording?

If the SDK is not enabled, nothing needs to be captured. OpenTelemetry lets you check this through the IsRecording flag. It is good practice, however, for RPC libraries to always propagate the context, as recording may be enabled on upstream and downstream components.

In this example, the attribute “state” is only added to the span when recording of the span is enabled. However, the span context is propagated in any case.

using (_tracer.StartActiveSpan(
                         "Execute", SpanKind.Client, out var span))
{
  // compute the attribute only when the span is being recorded
  if (span.IsRecording)
  {
    span.SetAttribute("state", this.CalculateState());
  }

  // always propagate the context
  _tracer.TextFormat.Inject(span.Context, restObj,
                        (r, k, v) => r.Metadata[k] = v);
  restObj.Execute();
}
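The "skip the work, keep the context" rule can be exercised with a small sketch. It is written against System.Diagnostics (assuming .NET 6 or later), where Activity.IsAllDataRequested is used here as the counterpart of IsRecording; that mapping, the class name, and CalculateState are assumptions for illustration.

```csharp
using System.Diagnostics;

public static class RecordingCheckSketch
{
    private static readonly ActivitySource Source = new ActivitySource("Sketch.RestClient");

    // Returns whether the "state" attribute was attached, and the id that
    // would be propagated downstream, for the given sampling decision.
    public static (bool StateAttached, string Traceparent) Execute(ActivitySamplingResult sampling)
    {
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Sketch.RestClient",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) => sampling,
        };
        ActivitySource.AddActivityListener(listener);

        using var activity = Source.StartActivity("Execute", ActivityKind.Client);

        // Only pay for the expensive attribute when the span is being recorded.
        if (activity != null && activity.IsAllDataRequested)
        {
            activity.SetTag("state", CalculateState());
        }

        // Always propagate the context, recorded or not.
        string traceparent = activity?.Id;
        bool attached = activity?.GetTagItem("state") != null;
        return (attached, traceparent);
    }

    // Hypothetical stand-in for an expensive computation.
    private static string CalculateState() => "ready";
}
```

With ActivitySamplingResult.PropagationData the sketch should report no "state" attribute but still yield an id to send downstream; with ActivitySamplingResult.AllData the attribute is attached as well.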

Named tracers

OpenTelemetry uses named tracers for many reasons:

improve data visualization and analysis
save costs by disabling specific tracers
simplify troubleshooting of missing or questionable telemetry reporting

Always use descriptive and unique tracer names:

private readonly ITracer _tracer;
public MyClientLibrary()
{
  _tracer = TracerFactoryBase.Default
                             .GetTracer("MyClientLibrary", version);
}
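Why the name matters becomes visible once two libraries report side by side. This is a sketch under assumptions, not the TracerFactoryBase API shown above: it uses System.Diagnostics.ActivitySource (.NET 5 or later) as the named tracer, and "NoisyLibrary", the version strings, and the span names are invented for the example.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public static class NamedTracerSketch
{
    private static readonly ActivitySource ClientLibrary =
        new ActivitySource("MyClientLibrary", "1.0.0");
    private static readonly ActivitySource NoisyLibrary =
        new ActivitySource("NoisyLibrary", "1.0.0");

    // Returns "source-name/span-name" for every span that was collected.
    public static List<string> Run()
    {
        var collected = new List<string>();

        using var listener = new ActivityListener
        {
            // Telemetry is enabled per tracer name; NoisyLibrary stays off.
            ShouldListenTo = source => source.Name == "MyClientLibrary",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
            ActivityStopped = activity =>
                collected.Add(activity.Source.Name + "/" + activity.OperationName),
        };
        ActivitySource.AddActivityListener(listener);

        using (ClientLibrary.StartActivity("Execute")) { }
        // Nobody listens to this source, so StartActivity returns null here.
        using (NoisyLibrary.StartActivity("Poll")) { }

        return collected;
    }
}
```

NamedTracerSketch.Run() should return only "MyClientLibrary/Execute". A unique, descriptive name is what makes this kind of per-library switch, and later attribution of odd telemetry, possible.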

Metrics

Metrics and distributed traces are coming together. Use of metrics has many benefits:

metrics are not affected by sampling
metrics are lightweight, as their semantics are simpler
aggregation dimensions can be decided on later

var meter = MeterFactoryBase.Default.GetMeter(
                                        "MyClientLibrary", version);
var reqCount = meter.CreateLongCounter("requests count");
reqCount.Add(
  DistributedContext.Current,
  1,
  meter.GetLabelSet(new Dictionary<string, string>() { { "success", "true" } }));
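A runnable counterpart of the counter above, for readers who want to see the label flow end to end. It is a sketch under assumptions: it uses System.Diagnostics.Metrics from .NET 6 or later rather than the MeterFactoryBase API, the instrument name "requests_count" and the class name are made up, and the in-memory dictionary stands in for whatever aggregation an SDK would do.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class RequestCounterSketch
{
    private static readonly Meter LibraryMeter = new Meter("MyClientLibrary", "1.0.0");
    private static readonly Counter<long> Requests =
        LibraryMeter.CreateCounter<long>("requests_count");

    // Records a few requests and returns the totals per "success" label value.
    public static Dictionary<string, long> Run()
    {
        var totals = new Dictionary<string, long>();

        using var listener = new MeterListener();
        // Subscribe only to instruments that belong to this library's meter.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == "MyClientLibrary")
            {
                l.EnableMeasurementEvents(instrument);
            }
        };
        // Aggregate by label here; a real SDK would export the aggregate.
        listener.SetMeasurementEventCallback<long>((instrument, measurement, tags, state) =>
        {
            foreach (var tag in tags)
            {
                if (tag.Key == "success")
                {
                    var key = tag.Value?.ToString() ?? "";
                    totals[key] = totals.TryGetValue(key, out var current)
                        ? current + measurement
                        : measurement;
                }
            }
        });
        listener.Start();

        Requests.Add(1, new KeyValuePair<string, object?>("success", "true"));
        Requests.Add(1, new KeyValuePair<string, object?>("success", "true"));
        Requests.Add(1, new KeyValuePair<string, object?>("success", "false"));

        return totals;
    }
}
```

RequestCounterSketch.Run() should report 2 for "true" and 1 for "false". Every measurement is counted, which is the sense in which metrics are not affected by trace sampling.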

Performance best practices

The art of instrumenting for telemetry: just enough telemetry for the price of an overhead in CPU and latency. There are so many details you may want to expose, while performance overhead of telemetry needs to be kept very low.

The art of instrumentation: just enough telemetry for the price of an overhead

There are a few rules that can help you decide how to instrument your library:

Only create spans for longer-running tasks that are worth tracking
Don’t create spans for every function call!
Use a timed event, rather than a child span, to indicate that something occurred
Use smart defaults and let users opt in to collecting additional details
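The timed-event rule can be shown in a few lines. This is a minimal sketch assuming .NET 5 or later and System.Diagnostics.Activity, where ActivityEvent is used as the counterpart of a timed event on a Span; the event names are invented.

```csharp
using System.Diagnostics;
using System.Linq;

public static class TimedEventSketch
{
    // Returns the names of the events attached to a single "Execute" span.
    public static string[] Run()
    {
        using var activity = new Activity("Execute").Start();

        // A cheap, timestamped marker instead of a whole child span.
        activity.AddEvent(new ActivityEvent("cache-miss"));
        activity.AddEvent(new ActivityEvent("retry"));

        return activity.Events.Select(e => e.Name).ToArray();
    }
}
```

TimedEventSketch.Run() should return "cache-miss" followed by "retry". Both moments are recorded with timestamps on one span, without the id generation, context switching, and export cost of two extra spans.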

Library instrumentation is a rich topic with many gotchas and interesting details. We talked about a few practices and APIs, but this topic deserves a separate blog post.

Tell us about your scenarios

We want to know more about our users! OpenTelemetry doesn’t report analytics back to us, so we only know about your experience if you tell us.

While developing OpenTelemetry, we can only guess how you use it:

What environments do you run your applications in?
Which features and extensibility points do you like the most?
What’s missing?

We value feedback and comments, so please reach out and tell us about your scenarios. There are many ways to reach us:

Gitter: https://gitter.im/open-telemetry/community
GitHub: https://github.com/open-telemetry/community
Email: cncf-opentelemetry-community@lists.cncf.io
SIG and community meetings: calendar

Thank you Amelia Mango, Austin Parker, Morgan McLean, Yuri Shkuro for review and edits of this article!