Serverless OpenTelemetry at scale: generating traces

Luc van Donkersgoed
PostNL Engineering
16 min read · Nov 16, 2023

Welcome to the second post in the Serverless OpenTelemetry at scale series. In the first part we covered the need for observability and the arguments for choosing OpenTelemetry. In this article we will look at the way we generate and propagate traces for optimal visibility into our systems.

The table below will link to the other parts when they are published.

Serverless event brokerage

In part 1 we introduced the Event Broker e-Commerce (EBE), PostNL’s serverless integration platform. This system is responsible for receiving events from many applications, validating these events against their schemas, and forwarding them to subscribed consumers. At the highest level, the architecture looks like this.

The architecture delivers a push-based flow of events: producers send events to the ingress service, which forwards them to the transport service, which delivers them to the egress service. The ingress service is primarily responsible for validating events and rejecting them if they don’t match their schemas. The egress service is primarily responsible for delivering events to subscribed consumers.

The diagram above only shows a blue and a green producer, with three consumers subscribed to their events. In reality, we have dozens of producers and hundreds of subscribers. Our observability goal is to gain deep insight and understanding into these flows.

If we zoom in a little we see the serverless internals of the Event Broker e-Commerce. The ingress service is powered by Lambda, the transport service is based on EventBridge, and the egress service is centered around SQS and Lambda.

In the diagram above we also see how every single event processed by the EBE can be visualized as a trace. A distributed trace is one of the core observability signals. It combines multiple spans, each of which describes an isolated unit of work. Traces allow us to investigate and understand how an event is moving through a distributed landscape. In the image above, the green block represents the trace, while the purple Validate Event and the orange Forward Event blocks represent spans. The screenshot below shows an actual snapshot of this data in our observability backend.

The Event Broker e-Commerce currently receives and forwards about 6 million events per day. Each event generates at least two spans, leading to about 13 million observability events per day.

The rest of this article will detail how we generate this data.

Manual instrumentation

To make an application observable, it needs to emit telemetry (traces, metrics, or logs). The act of making an application emit telemetry is called instrumenting. There are two types of instrumentation:

  • Automatic instrumentation
  • Manual instrumentation

Automatic instrumentation is available in major frameworks, libraries, and platforms. For example, if an application uses the Python requests library, the additional opentelemetry-instrumentation-requests package can be used to auto-instrument every outbound HTTP call. Automatic instrumentation can be a very powerful tool to quickly gain observability insights into existing applications. However, paradoxically, it emits both too much and too little data for most use cases.
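As a concrete illustration of the requests example above (a minimal sketch; the endpoint is made up):

import requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Instrument once at startup; every outbound HTTP call now emits a client span.
RequestsInstrumentor().instrument()

# This request is traced automatically, without any further code changes.
response = requests.get("https://api.example.com/orders")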

Automatic instrumentation emits too much, because it creates a span for every operation the instrumented library performs, leading to an overload of data. This is expensive at scale (most observability backends charge on data ingested), and much of the data may be irrelevant.

Automatic instrumentation also emits too little, because the data emitted will be generic; the requests example will tell you exactly which HTTP requests were performed, but nothing about the business context. For example, it will not tell you which user triggered the request, or if the request was part of a batch process.

The solution to both problems is to use manual instrumentation. With manual instrumentation developers specify exactly what telemetry they want to emit. For example, they can start a span anywhere in the code, set its attributes, start a child span, and add span events to their liking. If done right, this will tell them exactly how the system is behaving at any given time, including its full, unique business context.
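With the OpenTelemetry Python SDK, that could look something like this (a minimal sketch; span names, attribute keys, and values are illustrative, not our production conventions):

from opentelemetry import trace

tracer = trace.get_tracer("ebe.ingress")  # illustrative instrumentation scope name

with tracer.start_as_current_span("Validate Event") as span:
    span.set_attribute("app.event.producer.application", "webshop")  # business context
    span.add_event("Schema loaded from cache")                       # point-in-time detail

    # Child spans describe smaller units of work within the same trace.
    with tracer.start_as_current_span("Evaluate JSON schema") as child:
        child.set_attribute("app.event.producer.event_type", "order_created")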

Of course writing manual instrumentation is more labor intensive. Like so many other things, it boils down to a trade-off:

  • Automatic instrumentation offers a fast implementation and will generate a large amount of non-differentiating data.
  • Manual instrumentation requires more effort but will deliver exactly the right amount of valuable business context.

The EBE went down the path of manual instrumentation. It allows us to control the volume of the telemetry being generated and to include the many gritty details we require to understand our event flows in depth.

Root span

As shown in the screenshots and diagrams at the beginning of this post, we can visualize the flow of data as spans combined into a trace. Spans describe isolated processes which literally span a certain amount of time.

Any span can be a root span or a child span. Of course, a child span is any span which has a parent span. The root span is the only span without a parent, and will always start a new trace.

The diagram above shows the relationship between the root Validate Event span and the child Forward Event span. When events are delivered to the EBE, the event is first validated against its schema. This process creates a new root span and trace. When the schema validation succeeds the event is sent to EventBridge and picked up by the egress service, which continues the trace started by the Validate Event root span.

Context propagation

In the OpenTelemetry protocol, a span is considered a root span when no parent is defined. This is an example (taken from the OpenTelemetry docs) of a root span in JSON format:

{
    "name": "hello",
    "context": {
        "trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df2",
        "span_id": "0x051581bf3cb55c13"
    },
    "parent_id": null,
    "start_time": "2022-04-29T18:52:58.114201Z",
    "end_time": "2022-04-29T18:52:58.114687Z"
}

This is an example of a child span of the root span above:

{
    "name": "hello-greetings",
    "context": {
        "trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df2",
        "span_id": "0x5fb397be34d26b51"
    },
    "parent_id": "0x051581bf3cb55c13",
    "start_time": "2022-04-29T18:52:58.114304Z",
    "end_time": "2022-04-29T22:52:58.114561Z"
}

Note that the parent_id field is empty in the first span, while in the second span it contains the span_id of the first span. Also note that the trace_id of the two spans is the same. This implies that the second span somehow needs to know the trace_id and span_id of the first span. Forwarding these identifiers so a trace can be continued by another process is called context propagation.

Let’s take another look at this diagram. The ingress Lambda function creates the root Validate Event span with a unique trace_id and span_id. To continue this trace, the Forward Event span needs to know these values (in other words, the context needs to be propagated from the first span to the second). Between the two functions we have an EventBridge event bus and an SQS queue. So how do we forward these values?

The answer is by enriching the JSON event itself. When producer applications deliver an event at the EBE, we enforce the use of a JSON envelope. The envelope looks like this:

{
    "metadata": {
        < some mandatory fields, like timestamps >
    },
    "data": {
        < the original producer event >
    }
}

When the inbound event has been successfully validated, the ingress Lambda function adds the OpenTelemetry trace data to the metadata section, like this:

{
    "metadata": {
        < some mandatory fields, like timestamps >
        "otel": {
            "traceparent": {
                "trace_id": "0x1234...",
                "span_id": "0x5678..."
            }
        }
    },
    "data": {
        < the original producer event >
    }
}

The enriched event is published onto the EventBridge event bus, which allows the downstream Lambda function to continue the trace with the provided IDs. This solution is technology-agnostic. In our case we use EventBridge and SQS, but it might as well be Kafka, HTTPS, or even carrier pigeons. As long as the JSON event is successfully delivered, the trace can be continued.
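A minimal sketch of what this could look like in Python, assuming the envelope layout shown above (the helper names are ours, for illustration only):

from opentelemetry import trace
from opentelemetry.trace import NonRecordingSpan, SpanContext, TraceFlags


def inject_trace_context(envelope: dict) -> dict:
    """Ingress side: write the current trace/span IDs into the metadata section."""
    ctx = trace.get_current_span().get_span_context()
    envelope["metadata"]["otel"] = {
        "traceparent": {
            "trace_id": f"{ctx.trace_id:#034x}",  # 0x-prefixed 128-bit hex
            "span_id": f"{ctx.span_id:#018x}",    # 0x-prefixed 64-bit hex
        }
    }
    return envelope


def extract_parent_context(envelope: dict):
    """Egress side: rebuild the upstream span context so the trace continues."""
    traceparent = envelope["metadata"]["otel"]["traceparent"]
    parent = SpanContext(
        trace_id=int(traceparent["trace_id"], 16),
        span_id=int(traceparent["span_id"], 16),
        is_remote=True,
        trace_flags=TraceFlags(TraceFlags.SAMPLED),
    )
    return trace.set_span_in_context(NonRecordingSpan(parent))


# In the egress Lambda handler, the trace is continued with the propagated IDs:
tracer = trace.get_tracer("ebe.egress")  # illustrative scope name

def forward_event(envelope: dict) -> None:
    with tracer.start_as_current_span(
        "Forward Event", context=extract_parent_context(envelope)
    ):
        ...  # deliver the event to the subscribed consumer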

Span enrichment

In the previous sections we’ve covered the use of traces and spans. These give us high level insights into the behavior of our systems, but they do not yet allow us to “understand and explain any state our system can get into” (see part 1). For example, when the Validate Event process fails, simple spans will not tell us the error details, the source of the event, the size of the event, or the protocol used. To add these details (and more importantly, to group, query, and filter them), we need to enrich our spans.

Span attributes

Span attributes are key-value pairs applied to the entire span. They can be freely added by the engineers instrumenting the code, and are used to add technical and business context to a span. Example technical attributes could include:

  • function-as-a-service cold start: boolean
  • function-as-a-service memory size: int
  • function triggered by: sqs | api gateway | lambda function url

Example business context can include:

  • customer id: string
  • subscription tier: free | basic | premium
  • shopping cart item count: int

These values will be incredibly valuable when trying to understand system behavior patterns, and when investigating incidents. With the right attributes you might find that an incident is isolated to Lambda functions with a limited amount of memory or a certain version of a library. Or you might find that an incident only affects paying users, which might bump the severity. The possibilities are endless, and you will find that your telemetry becomes exponentially more valuable as you continue to add more attributes.
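Adding such attributes is a one-liner per attribute with the OpenTelemetry SDK; a small sketch with made-up keys and values:

from opentelemetry import trace

span = trace.get_current_span()
span.set_attributes(
    {
        "faas.coldstart": True,                       # technical context
        "faas.memory_limit_in_mb": 1024,
        "app.customer.subscription_tier": "premium",  # business context
        "app.cart.item_count": 3,
    }
)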

Span events

Where span attributes apply to the entire span time range, span events add point-in-time details to a span. This is especially valuable for visualizing the complex internal behavior of a process — first it took action A, then B, then an exception occurred, which was handled by action C. As we will cover in part 4, span events are a great help when developing new features.

Interestingly, span events can have attributes of their own. Just as span attributes add context to spans, span event attributes add distinguishing context to span events. For example, a “Message sent to SQS queue” event could have distinguishing sqs_queue_name and sqs_batch_size attributes. These details will help engineers better understand the behavior of their applications.
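For example, the SQS span event mentioned above could be emitted like this (a sketch; the queue name and batch size are made up):

from opentelemetry import trace

span = trace.get_current_span()
span.add_event(
    "Message sent to SQS queue",
    attributes={
        "sqs_queue_name": "ebe-egress-queue",  # illustrative value
        "sqs_batch_size": 10,
    },
)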

We think of span events mostly like the centralized logs we’ve been writing for decades. Span events and centralized logs are both point-in-time based, can contain attributes, can be searched, filtered and queried — but span events offer a standardized, vastly richer ecosystem, and have a strong relation to their parent trace and span contexts.

Choosing span attributes or span events

There are a few important differences between span attributes and span events:

  1. Span attributes can be used to query and filter spans. For example: show me all the spans where faas.coldstart = true. Span events cannot be used the same way. For example, the following query is not possible: show me all the spans which have an event named “Message sent to SQS queue”.
  2. Span attributes apply to a span of time, while span events reference a point in time.
  3. Span attributes are free on most modern observability backends, while span events are counted as individual units. This means that a single “wide” span with six span attributes costs one-seventh as much as a span with six span events (one billed unit versus seven).

Because of these properties, we advise using span attributes liberally and span events sparingly. However, there are two nuances:

  1. Always emit exceptions as span events (see the sketch after this list). This is a standard OpenTelemetry practice, natively supported by all OpenTelemetry SDKs. The format of exception events has been standardized, and having this data available when an incident occurs will deliver outsized benefits.
  2. Consider instrumenting your applications with many span events, but use them for development purposes only. This can be achieved by disabling or stripping (non-exception) span events in production environments.
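Recording an exception as a span event is a one-liner with the OpenTelemetry SDK; a minimal sketch (deliver_event and event are hypothetical):

from opentelemetry import trace
from opentelemetry.trace import StatusCode

span = trace.get_current_span()
try:
    deliver_event(event)  # hypothetical delivery call
except Exception as exc:
    # record_exception adds a standardized "exception" span event carrying
    # exception.type, exception.message, and exception.stacktrace attributes.
    span.record_exception(exc)
    span.set_status(StatusCode.ERROR)
    raise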

Span enrichment in the EBE

The EBE applies both span attributes and span events, and unsurprisingly we follow the guidelines set out above: we add span attributes liberally, add span events sparingly, and strip production span events before forwarding the telemetry to our observability backend. This process will be covered in more detail in part 3: processing and sampling.

Our most important technical span attributes are:

  • service.name: the service that emits the telemetry.
  • service.version: the version of the service that emits the telemetry.
  • faas.name: the name of the Lambda function.
  • faas.version: the version of the Lambda function.
  • faas.memory_limit_in_mb: the Lambda function memory size.
  • faas.execution: the unique ID of the Lambda invocation.
  • faas.coldstart: boolean indicating a Lambda cold start.
  • app.event.eventbroker_latency_ms: the time from initially receiving the event from a producer until delivery of the event at a consumer.
  • app.event.consumer.attempt_number: the attempt number of event delivery at a consumer.

The most important business context is added by the following attributes:

  • app.ebe.ingress.original_payload_size_bytes: the size of the event as the EBE receives it from a producer.
  • app.ebe.ingress.eventbridge_payload_size_bytes: the size of the event as the EBE forwards it to EventBridge.
  • app.event.producer.application: the application delivering the event.
  • app.event.producer.event_type: the event being delivered.
  • app.event.producer.event_type_version: the version of the event being delivered.
  • app.event.producer.ingress_type: the protocol used to deliver the event.
  • app.event.consumer.application: the application the EBE is delivering the event to.
  • app.event.consumer.subscription_id: the ID of the consumer the EBE is delivering to.
  • app.event.consumer.egress_type: the protocol used to connect to a consumer.
  • app.event.consumer.target_endpoint: the endpoint the EBE is delivering the event at.

These attributes allow us to group, query, and filter the spans in many dimensions. For example, we could ask for all Validate Event spans where the producer application is “iot”, and the event size is over 1 KB. Or we could ask which consumers failed to receive a message at first attempt, and what the average size of the affected events is.

We also add span events for significant junctions in the event flow, as shown in the screenshot below. We will cover the usage and value of these events in more detail in parts 3 and 4 of this series.

Semantic conventions

In the previous section we have introduced a number of common EBE span attributes, like faas.coldstart and app.event.producer.event_type. Theoretically we could hardcode these keys in every span, like this:

self._tracer.start_span(
    name=span_name,
    attributes={
        "faas.coldstart": is_coldstart,
        "faas.execution": context.execution_id,
    },
)
self._tracer.egress_span.set_attributes(
    {
        "app.event.consumer.egress_type": self._config.egress_type,
        "app.event.consumer.target_endpoint": self._config.invocation_endpoint,
        "app.ebe.egress.trigger_type": self._source_type.name,
    }
)

However, this approach is error-prone. If two developers are working on two different services, one might use faas.coldstart, while the other chooses faas.cold_start. Or one might use app.event.consumer.egress_type while the other uses egress_protocol. Inconsistency in these attributes leads to unreliable or difficult querying. Strong consistency, on the other hand, makes queries predictable and can lead to unexpected new insights.

To guarantee consistency, the EBE uses semantic conventions, which is just a fancy term for centrally defined attribute names. In our case, the semantic conventions are defined in a private library named ebe-opentelemetry-lib. All Lambda functions import this library, and reference its constants when adding attributes to spans and span events. The updated example looks like this:

from ebe_opentelemetry.semconv.definitions import (
    ResourceAttributes as ORA,
    SpanAttributes as OSA,
)

self._tracer.start_span(
    name=span_name,
    attributes={
        ORA.FAAS_COLDSTART: is_coldstart,
        ORA.FAAS_EXECUTION: context.execution_id,
    },
)
self._tracer.egress_span.set_attributes(
    {
        OSA.APP_EVENT_CONSUMER_EGRESS_TYPE: self._config.egress_type,
        OSA.APP_EVENT_CONSUMER_TARGET_ENDPOINT: self._config.invocation_endpoint,
        OSA.APP_EBE_EGRESS_TRIGGER_TYPE: self._source_type.name,
    }
)

The semantic conventions in the ebe-opentelemetry-lib itself are very simple. They are just a list of constants like this:

class ResourceAttributes:
    """Semantic conventions for OTel resource attributes."""

    APP_EBE_STACK_PREFIX = "app.ebe.stack_prefix"
    FAAS_COLDSTART = "faas.coldstart"
    FAAS_EXECUTION = "faas.execution"
    FAAS_MEMORY_LIMIT_IN_MB = "faas.memory_limit_in_mb"
    FAAS_NAME = "faas.name"
    FAAS_VERSION = "faas.version"
    SERVICE_SUBCOMPONENT_NAME = "service.subcomponent.name"
    SERVICE_VERSION = "service.version"
    SERVICE_NAME = "service.name"
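The companion SpanAttributes class follows the same pattern; a partial sketch based on the attribute names listed earlier (not the complete list):

class SpanAttributes:
    """Semantic conventions for OTel span attributes."""

    APP_EBE_EGRESS_TRIGGER_TYPE = "app.ebe.egress.trigger_type"
    APP_EVENT_CONSUMER_EGRESS_TYPE = "app.event.consumer.egress_type"
    APP_EVENT_CONSUMER_TARGET_ENDPOINT = "app.event.consumer.target_endpoint"
    APP_EVENT_PRODUCER_APPLICATION = "app.event.producer.application"
    APP_EVENT_PRODUCER_EVENT_TYPE = "app.event.producer.event_type"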

Using semantic conventions guarantees the strong consistency we’re aiming for. All spans use the same attributes, so all queries become predictable and uniform. For example, a simple query can quickly uncover the number of cold starts across all our spans, including their total count and p90 durations.

Baggage

Generally a downstream process does not know the context of an upstream process. For example, the EBE egress service does not know which producer generated the event it is processing, the protocol over which it was delivered, or the time at which it was delivered. OpenTelemetry is designed for scale, and does not support “joining” downstream telemetry with upstream attributes. However, the upstream details might be valuable to a downstream process. Having access to the ingress producer in the egress spans, for example, will allow us to query all Forward Event spans for a given producer. Likewise, knowing when an event was received by the ingress service will allow the egress service to calculate the start-to-end latency of the entire EBE process. The mechanism to forward data between spans in a trace is called baggage.

Baggage has a lot of similarities with context propagation. Like context propagation, baggage is generated by an upstream process, forwarded to the downstream system, and extracted to enhance the downstream telemetry. As with context propagation, we use the metadata section in the JSON envelope to transport baggage. In the EBE, it looks like this:

{
    "metadata": {
        < some mandatory fields, like timestamps >
        "otel": {
            "traceparent": {
                "trace_id": "0x1234...",
                "span_id": "0x5678..."
            },
            "baggage": {
                "app.event.producer.application": "iot"
                < more key-value baggage >
            }
        }
    },
    "data": {
        < the original producer event >
    }
}

It is important to note that including baggage in the data will not automatically include it in the span attributes of the downstream telemetry. This needs to be manually handled, for example like this:

self._tracer.egress_span.set_attributes(
    {
        OSA.APP_EVENT_PRODUCER_APPLICATION: <baggage from event>
    }
)

The observant reader will realize this is another great use case for semantic conventions: it guarantees both spans use exactly the same key to represent the same data.
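Putting it together, the egress function might copy baggage into span attributes like this (a sketch; the helper name and the defensive lookups are ours):

from ebe_opentelemetry.semconv.definitions import SpanAttributes as OSA


def apply_baggage(span, envelope: dict) -> None:
    """Copy selected upstream baggage entries onto the downstream span."""
    baggage = envelope.get("metadata", {}).get("otel", {}).get("baggage", {})
    if OSA.APP_EVENT_PRODUCER_APPLICATION in baggage:
        span.set_attribute(
            OSA.APP_EVENT_PRODUCER_APPLICATION,
            baggage[OSA.APP_EVENT_PRODUCER_APPLICATION],
        )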

Exporting telemetry from Lambda functions

The final section of this article will focus on getting the telemetry data out of the Lambda functions and into a telemetry processing pipeline. OpenTelemetry has a protocol based on protobuf (the OpenTelemetry Protocol or OTLP). This efficient binary format allows producers, collectors, processors and backends to exchange OpenTelemetry data in a standardized way. There are a few methods to emit OTLP from a Lambda function, but not all are suitable for our use cases.

OpenTelemetry Collector

The first mechanism is to run an OpenTelemetry Collector in your environment. This is a container-based application which listens on the OTLP network port, receives data, processes it, and forwards it to its configured destinations. It allows telemetry producers, like our Lambda functions, to quickly offload their telemetry data and efficiently batch data before sending it to an observability backend. Unfortunately, the Collector is not a viable solution in our serverless environment. We run no container-based solutions, so we would need to configure and maintain an ECS cluster just for the Collector. Additionally, most of our Lambda functions do not run in a customer-owned VPC, so they have no network access to a VPC-based application. We would have to migrate all of our Lambda functions to a VPC just to interact with a Collector, which adds complexity and goes against our serverless principles.

AWS Distro for OpenTelemetry

The second option is the AWS Distro for OpenTelemetry (ADOT) for Lambda. This takes the same Collector as above, but uses a Lambda layer to run it in the Lambda execution environment. This removes the need for containers and private network connectivity, but introduces a number of new problems. Remember that we process millions of events per day, leading to high concurrency. Each of these concurrent Lambda execution environments boots and runs its own Collector, which takes up resources and significantly increases cold start durations. In a latency-sensitive environment like ours, this is unacceptable. This mechanism also removes the opportunity to efficiently batch telemetry before sending it off to an observability backend. Instead, every Lambda function forwards its own telemetry to the backend immediately, leading to high additional execution time. In some experiments ADOT increased our Lambda execution times from 10ms to 240ms, which would result in thousands of additional dollars in monthly Lambda costs.

Custom Kinesis Exporter

In the end, we created our own natively serverless solution which resolved all the issues described above. We wrote a custom Kinesis exporter by implementing the SpanExporter interface. The KinesisDataStreamsSpanExporter receives a batch of spans generated in a Lambda execution and forwards them to a Kinesis Data Stream (KDS) as OTLP. The OTLP format is natively binary, which results in very small data transfers. Kinesis accepts the binary format and allows consumers to receive and process the telemetry. We will cover this process extensively in part 3.
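A minimal sketch of the idea could look like this (assuming boto3 and the OTLP protobuf encoder from the opentelemetry-exporter-otlp-proto-common package; batching limits, retries, and error handling omitted):

from typing import Sequence

import boto3
from botocore.config import Config
from opentelemetry.exporter.otlp.proto.common.trace_encoder import encode_spans
from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult


class KinesisDataStreamsSpanExporter(SpanExporter):
    """Ship OTLP-encoded span batches to a Kinesis Data Stream."""

    def __init__(self, stream_name: str, connect_timeout: float = 0.2, read_timeout: float = 0.2):
        self._stream_name = stream_name
        self._client = boto3.client(
            "kinesis",
            config=Config(connect_timeout=connect_timeout, read_timeout=read_timeout),
        )

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        if not spans:
            return SpanExportResult.SUCCESS
        # Encode the batch as an OTLP ExportTraceServiceRequest protobuf message.
        payload = encode_spans(spans).SerializeToString()
        self._client.put_record(
            StreamName=self._stream_name,
            Data=payload,
            PartitionKey=f"{spans[0].context.trace_id:032x}",
        )
        return SpanExportResult.SUCCESS

    def shutdown(self) -> None:
        pass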

We have included our custom Kinesis exporter in the ebe-opentelemetry-lib we discussed before. Since all Lambda functions already import this library for semantic conventions, they all have the KinesisDataStreamsSpanExporter available too. Configuring a Lambda function to use the Kinesis exporter is as simple as this:

from ebe_opentelemetry.otel_kinesis_exporter import KinesisDataStreamsSpanExporter

...

self._tracer_provider.add_span_processor(
    BatchSpanProcessor(
        KinesisDataStreamsSpanExporter(
            self._otel_kinesis_stream_name,
            connect_timeout=0.2,
            read_timeout=0.2,
        )
    )
)

The KDS batch offload process is very quick, measured in single-digit milliseconds. This is an acceptable duration, without significant impact on our Lambda bill. It is also very resource efficient, not affecting our memory usage or cold start times in a measurable way.
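One caveat that applies to any exporter running inside Lambda: the BatchSpanProcessor buffers spans in a background thread, and Lambda freezes the execution environment as soon as the handler returns, so the buffer needs to be flushed before that happens. A minimal sketch of the pattern (handler and business logic names are hypothetical):

def handler(event, context):
    try:
        process_event(event)  # hypothetical business logic
    finally:
        # Export any spans still buffered in the BatchSpanProcessor before
        # the execution environment is frozen between invocations.
        tracer_provider.force_flush()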

Conclusion

In part 2 of the Serverless OpenTelemetry at scale series we have extensively covered the various ways we have implemented OpenTelemetry in the EBE Lambda functions. OpenTelemetry has quite a lot of details to be aware of, and more moving parts than you might expect at first glance. We hope that the detailed description of our usage will help you make the right decisions for your own applications.

In the next part we will cover OpenTelemetry processing: fan-out, data manipulation, span event stripping, and most importantly: sampling. See you there!

Originally published at http://lucvandonkersgoed.com on November 16, 2023.
