Expanding Experimentation at Walmart using W3C Trace Context and Baggage

Yanardy Sanchez
Walmart Global Tech Blog
Dec 22, 2020

At Jet.com — which Walmart purchased in 2016 — we had an in-house built A/B testing solution, Phaser, which made use of a telemetry header that all services were required to propagate. Phaser injected experimentation metadata into the header and as a result all services enjoyed out-of-the-box experimentation for free.

At Walmart, the A/B testing system, Expo, doesn’t currently enjoy that same luxury and therefore primarily serves front-end services. To bring experimentation to a wider service audience at Walmart we’ll have to integrate Expo into a distributed tracing system in the same way Phaser was.

Rather than integrating into an existing tracing implementation at Walmart, we’re working with the Walmart telemetry team to implement one based on the W3C Trace Context and Baggage specs, which address the need for a standardized approach to propagating a trace and its related information.

In this post you’ll learn about Trace Context, Baggage, and how we’re using them to move experimentation at Walmart forward.

Primer on Distributed Tracing

Before introducing the specs, I’d like to give you a brief overview of distributed tracing using an analogy: Dot-to-Dots, the web of numbered black dots you could find on the back of a cereal box, or in a kids’ activity book, that you would trace with a pencil. The black dots are the services in your microservice architecture, and the pencil trace is a unique request from start to finish. Distributed tracing gives you a clear picture of the services involved in a request flow and helps link together the information needed to identify failures and bottlenecks.

Dot-to-Dots worksheet of a Lion cub via https://www.allkidsnetwork.com/

Tracing is implemented by passing data inside of tracing headers that later get pieced together to help create a visualization of the path taken by a particular request. Traditionally, such headers have been designed by tracing vendors, leading to a lack of inter-operability in systems deploying multiple tracing vendors. This is one of the problems Trace Context and Baggage address.

Trace Context Simplified

What it’s for: Trace Context is a proposal of two headers that get updated and propagated as a request makes its way from one service to another. The headers are used to analyze the path a request takes through a system.

What it consists of: traceparent and tracestate:

  1. traceparent (required)

The traceparent header is the heart and soul of a trace. It contains four pieces of information: version, trace id, parent span id, and flags. Trace id and parent span id are important in helping identify the bottlenecks and failures mentioned above. Here’s an example of what traceparent looks like:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

where:
version = 00
trace ID = 0af7651916cd43dd8448eb211c80319c
parent span ID = b7ad6b7169203331
flags = 01

The example shows an instance of the traceparent header as seen by a receiving service; let’s call it Service B. When Service B makes a request to Service C, it follows the Trace Context spec by replacing the parent span ID with its own span ID and leaving the trace ID intact. The header received by Service C could be:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-149ad35ecefe7f0c-01

where:
version = 00
trace ID = 0af7651916cd43dd8448eb211c80319c
parent span ID = 149ad35ecefe7f0c <- only change
flags = 01

Notice that the trace ID remained the same and only the parent span ID changed. The trace ID identifies the entire request and the parent span ID identifies the particular code in Service B making the outgoing request.
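The hand-off from Service B to Service C can be sketched in a few lines of Python. This is a minimal illustration of the version 00 header layout described above, not the implementation used at Walmart; the function names parse_traceparent and outgoing_traceparent are hypothetical, and flag handling is simplified to copying the incoming value.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: str) -> TraceParent:
    """Split a traceparent header value into its four fields."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    parsed = TraceParent(*match.groups())
    # All-zero IDs are treated as invalid.
    if set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        raise ValueError("trace ID and parent span ID must not be all zeros")
    return parsed


def outgoing_traceparent(incoming: str, span_id: Optional[str] = None) -> str:
    """Build the header a service sends downstream.

    The trace ID is kept; the parent span ID becomes the caller's own span ID.
    A random 8-byte span ID is generated when none is supplied.
    """
    parent = parse_traceparent(incoming)
    span_id = span_id or secrets.token_hex(8)
    return f"{parent.version}-{parent.trace_id}-{span_id}-{parent.flags}"


# Service B receives the first header and forwards the second to Service C.
received = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
forwarded = outgoing_traceparent(received, span_id="149ad35ecefe7f0c")
```

Passing the span ID explicitly reproduces the second example header; in practice the span ID would come from whatever tracing library creates the span.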

  2. tracestate (optional)

The tracestate header is used in environments that run more than one tracing system (e.g., one system using Rojo and another using Congo). tracestate is a place for those systems to pass along platform-specific information so they can correlate traceparent headers with the data they collect. Here’s an example of the header:

tracestate: rojo=00f067aa0ba902b7,congo=t61rcWkgMzE

Both Rojo and Congo are specifying information they’d like to have attached to the trace. This piece of information is used in their respective systems.
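As a rough sketch of how a vendor might read and update this header, the Python below parses the comma-separated list and moves a modified entry to the front of the list, which is what the Trace Context spec asks of a vendor updating its own entry. The helper names parse_tracestate and update_tracestate are hypothetical, and validation of key and value characters is omitted.

```python
from typing import List, Tuple


def parse_tracestate(header: str) -> List[Tuple[str, str]]:
    """Return the (vendor key, value) entries in their original order."""
    entries = []
    for member in header.split(","):
        member = member.strip()
        if not member:
            continue  # skip empty list members
        key, sep, value = member.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed tracestate member: {member!r}")
        entries.append((key, value))
    return entries


def update_tracestate(header: str, key: str, value: str) -> str:
    """Set one vendor's entry and move it to the front of the list."""
    entries = [(k, v) for k, v in parse_tracestate(header) if k != key]
    entries.insert(0, (key, value))
    return ",".join(f"{k}={v}" for k, v in entries)


header = "rojo=00f067aa0ba902b7,congo=t61rcWkgMzE"
updated = update_tracestate(header, "congo", "ucfJifl5GOE")
```

Here Congo rewrites its own value and ends up first in the list, while Rojo’s entry is passed along untouched.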

How Trace Context helps move experimentation forward: These two headers themselves don’t enable experimentation — that’s where the Baggage spec comes in — but they help promote a standard approach for tracing, making it easier to evangelize distributed tracing across the tech organization.

Another argument for adopting Trace Context is the growing number of open source SDK implementations driven by the OpenTelemetry community, with support for popular languages such as Java, JavaScript, Python, and Go. At Walmart, we’ve built our own implementations, which will be made W3C Trace Context compatible for existing users, but we will also welcome development teams using mature, community-built libraries that are W3C Trace Context compatible.

Baggage Simplified

What it’s for: Baggage is a proposal of a single header — baggage — that can be used for adding user-defined properties to a request. Here are a few example use cases from the spec:

  1. If all of your data needs to be sent to a single node, you could propagate a property indicating that: baggage: serverNode=DF:28
  2. If you need to log the original user ID when making transactions arbitrarily deep into a trace: baggage: userId=alice
  3. If you have non-production requests that flow through the same services as production requests: baggage: isProduction=false

At Walmart we’re using the Baggage header to propagate experimentation units. Our header might be used in the following manner:

baggage: someID=someUUID

This would enable experimentation support across all services at Walmart.

What it consists of: baggage

The baggage header is a comma-separated list of key-value pairs, each of which can carry optional semicolon-delimited properties, much like the attributes on a Set-Cookie header.
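To make the format concrete, here is a minimal Python sketch of a parser for the baggage header. It assumes values may be percent-encoded and that properties follow the pair after a semicolon; the function name parse_baggage and the ttl property in the sample header are illustrative, not taken from a particular library.

```python
from typing import Dict, List, Tuple
from urllib.parse import unquote


def parse_baggage(header: str) -> Dict[str, Tuple[str, List[str]]]:
    """Map each baggage key to its decoded value and its list of properties."""
    result: Dict[str, Tuple[str, List[str]]] = {}
    for member in header.split(","):
        member = member.strip()
        if not member:
            continue
        # The first segment is key=value; anything after ';' is a property.
        pair, *properties = member.split(";")
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed baggage member: {member!r}")
        result[key.strip()] = (
            unquote(value.strip()),
            [prop.strip() for prop in properties],
        )
    return result


baggage = parse_baggage("userId=alice,serverNode=DF%2028;ttl=5, isProduction=false")
```

Each entry keeps its properties alongside the value, so a service can read the value it cares about without losing metadata attached by others.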

How Baggage helps move experimentation forward:

In the intro I mentioned how experimentation worked at Jet with Phaser. Out-of-the-box experimentation is key to a successful experimentation culture and having our data propagated to all services is a good first step in achieving that.

The Trace Context and Baggage headers at Walmart would be part of a system utilizing Envoy. With that in mind, Expo would also implement an Envoy WebAssembly filter that parses the ID from the baggage header and then assigns the request to a set of experiments. This approach would move us away from assigning users at the network edge, so we’d no longer have to worry about our header exceeding a specified size threshold at Walmart, because assignments would occur at the service level.
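The idea of assigning at the service level can be illustrated with a short Python sketch. This is not Expo’s algorithm or an Envoy filter: the hash-based bucketing, the function names, the experiment name, and the variant labels are all assumptions made for illustration, and someID is the placeholder key from the example header earlier in the post.

```python
import hashlib
from typing import List, Optional


def unit_id_from_baggage(header: str, key: str = "someID") -> Optional[str]:
    """Pull the experimentation unit out of a baggage header, if present."""
    for member in header.split(","):
        pair = member.split(";", 1)[0]  # ignore any properties
        name, sep, value = pair.partition("=")
        if sep and name.strip() == key:
            return value.strip()
    return None


def assign_variant(unit_id: str, experiment: str, variants: List[str]) -> str:
    """Deterministically map a unit to one variant of an experiment.

    Hashing the experiment name together with the unit ID means every service
    that sees the same baggage value computes the same assignment, without
    the assignment itself having to travel in the header.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


unit = unit_id_from_baggage("userId=alice,someID=someUUID")
variant = assign_variant(unit, "hypothetical-checkout-test", ["control", "treatment"])
```

Because only the ID is propagated, the header stays small no matter how many experiments a request takes part in.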

Implementing Baggage Limits

At the time of writing, the Baggage spec states the following limits:

  • Maximum number of name-value pairs: 180.
  • Number of bytes per single name-value pair: 4096.
  • Maximum total length of all name-value pairs: 8192.

However, the spec doesn’t define how modifications should be handled once those limits are reached; instead, it will work to define non-normative recommendations. This could lead to application developers blindly dropping keys when the header is full. Because our use case requires a randomization unit to be present across all services propagating this header, it’s important that we define those guidelines up front. Some questions for consideration:

  • Should the list of keys be ordered?
  • Should the keys be modifiable?
  • Can specific keys be deleted?
  • If the list is full, should old keys be removed (assumes an ordered list)? Or should it not be possible to add new keys?

Since the spec is leaving this up to its users, we’re working on promoting a set of guidelines to restrict modifications to certain keys.
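One possible shape for such a guideline is sketched below in Python. It is only one answer to the questions above, chosen for illustration: protected keys cannot be overwritten once set, and a write that would exceed the limits is rejected rather than evicting older keys. The limits are the ones listed in this post, and the function name, the protected key set, and the error types are hypothetical.

```python
from typing import FrozenSet

# Limits as listed above; the protected key is the placeholder from this post.
MAX_PAIRS = 180
MAX_PAIR_BYTES = 4096
MAX_TOTAL_BYTES = 8192
PROTECTED_KEYS: FrozenSet[str] = frozenset({"someID"})


def set_baggage_member(
    header: str,
    key: str,
    value: str,
    protected: FrozenSet[str] = PROTECTED_KEYS,
    max_pairs: int = MAX_PAIRS,
) -> str:
    """Add or update one key while enforcing the limits and protected keys."""
    new_member = f"{key}={value}"
    if len(new_member.encode("utf-8")) > MAX_PAIR_BYTES:
        raise ValueError(f"member for {key!r} exceeds {MAX_PAIR_BYTES} bytes")

    members = [m.strip() for m in header.split(",") if m.strip()]
    names = [m.split(";", 1)[0].split("=", 1)[0].strip() for m in members]

    if key in names:
        # Assumed policy: a protected key may be set once, never overwritten.
        if key in protected:
            raise PermissionError(f"{key!r} is protected and cannot be modified")
        members[names.index(key)] = new_member  # update in place, keep order
    else:
        members.append(new_member)

    total = len(",".join(members).encode("utf-8"))
    if len(members) > max_pairs or total > MAX_TOTAL_BYTES:
        # Assumed policy: reject the write instead of dropping existing keys.
        raise ValueError("baggage is full; refusing to drop existing keys")
    return ",".join(members)


header = set_baggage_member("someID=someUUID", "userId", "alice")
```

Rejecting the write keeps the randomization unit intact at the cost of pushing the failure back to the caller; evicting the oldest unprotected key would be the alternative trade-off.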

Conclusion

Enabling experimentation for all services has been a key milestone for the Expo team, and the adoption of Trace Context and Baggage at Walmart brings it a step closer. What sets Trace Context and Baggage apart is that, as W3C specs backed by open source libraries, they are an easier sell across the organization. This matters for a company as big as Walmart, where shifts towards modern tech can take longer to push through than they would at a small company. Experimentation is important at Walmart, and Trace Context and Baggage will play a crucial part in its continued growth and evolution.
