Photo credit: Claire05

Lessons Learned Developing an A/B Experimentation Tool at Walmart Labs

A Metric Challenge

This year Expo, the A/B experimentation platform built for Walmart apps and websites, reached five years of maturity. With this milestone have come many challenges in designing a platform to support such a massive scale of operations and number of teams. In the early design of our tool we missed out on building in some basic good design principles.

With one feature in particular, we didn’t look ahead to what it might become and didn’t start with a design flexible enough to evolve in that direction. As a result, this feature became one of our larger challenges and forced us through a number of small- and large-scale redesigns over the years. This feature was metrics.

To understand whether we could have avoided, or at least minimized, the large-scale redesign of metrics, we first need to understand the software environment of Walmart at the time. We’ll get into that in the section “The way things were” as we take a self-reflective look back at the early design of this feature. But before diving into what we did wrong, I need to explain what a metric actually is at Walmart.

One Metric from Expo

Metrics

I am the front-end tech lead of Expo. My team provides the interface for internal Walmart users to create, run and monitor experiments on the various Walmart apps and websites.

Metrics are one of the core foundations of our tool and an integral part of how things work. With A/B testing you need to be able to track when a user visits a page or when a user has been exposed to a different variation of a page. We call this tracking a metric, and we use it to report on how a modified version of a page (a variation) is performing against the default version of that page (the control).

Metric vs. Measurement Point vs. Beacon

The word metric has caused much confusion in the past, especially within the Expo team. This is because a metric goes by different names depending on which phase of the experimentation lifecycle you are in. Let me start by clarifying some of the terminology we use at Walmart and giving an overview of the phases these terms fall into.

Definitions

  • Beacon is an object of data sent by different components from Walmart.com fired at various page loads or user actions.
  • Measurement point is used by Expo to track and report on how an experiment is performing. Beacon is one type of data that is evaluated by a measurement point.
  • Metric is typically some numerical count or percentage that is displayed in a reporting graph or chart that provides insight on the performance of the experiment. Metrics are based on measurement points.

A/B experiments have two primary metrics that are needed: Assigned and Qualified.

  • Assigned are users that visit the Walmart app or site and who get assigned into a running experiment (either control or variation).
  • Qualified are users who have been assigned into a running experiment AND who have visited or seen the page or element containing the different variant or treatment.
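To make the distinction between Assigned and Qualified concrete, here is a minimal sketch of how the two counts could be derived from a stream of events. The `Event` type, its field names, and the "assigned"/"exposed" event kinds are assumptions made for illustration; they are not Expo's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: str
    kind: str       # hypothetical event kinds: "assigned" or "exposed"
    variation: str  # "control" or "variation"


def count_assigned_and_qualified(events):
    """Return {variation: (assigned_count, qualified_count)} over unique users."""
    assigned = {}  # variation -> set of user ids assigned to it
    exposed = {}   # variation -> set of user ids who saw the treatment
    for event in events:
        bucket = assigned if event.kind == "assigned" else exposed
        bucket.setdefault(event.variation, set()).add(event.user_id)

    result = {}
    for variation, users in assigned.items():
        # Qualified = assigned AND exposed to the page or element under test.
        qualified = users & exposed.get(variation, set())
        result[variation] = (len(users), len(qualified))
    return result


events = [
    Event("u1", "assigned", "control"),
    Event("u2", "assigned", "variation"),
    Event("u3", "assigned", "variation"),
    Event("u3", "exposed", "variation"),
    Event("u4", "exposed", "variation"),  # never assigned, so not qualified
]
counts = count_assigned_and_qualified(events)
```

In this sketch a user only counts as qualified if they appear in both sets, which mirrors the "assigned AND seen" wording of the definition above.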

Now let’s take a high-level look at the flow of these terms in the different phases of experimentation.

Expo Metric Flow

In phase 1 an experiment is set up and configured with measurement points. Once an experiment is ready to go we start it, meaning that it is ready for live traffic (e.g. customers visiting Walmart.com) to get assigned. In phase 2 customers are assigned to the experiment and the user’s experience (e.g. click-through data) is captured via beacons. In phase 3 the user data is analyzed and reported back to Expo in the form of metrics!

Action & Context

A measurement point, in the context of Walmart.com, is something that maps to attributes of a beacon. A beacon has many attributes associated with it, one of which is action. The action attribute relates to the page which fired the beacon. Meaning there was an event, in this case a page load event, that triggered a call to a beaconing service API. For example, the homepage has an action of “homepage”.

In some cases we may need a little more context about this event. The beacon object also has an attribute called context. The context attribute on the homepage is simply Homepage. In other cases the context attribute may describe the event which fired the beacon rather than correlate directly with the action.

In the early design phase of Expo these two attributes, primarily the action attribute, made up the foundation for how the tool understood measurement points and metrics.

The way things were

When I joined the Expo team four years ago our definition of measurement point was very simplistic. App teams developing various pages and components would define the action and context for their beacons. Our Expo team would update our tool to support the new measurement points and metrics. The world was simple and well defined.

When the UI needed to read reporting data from time series databases it used the defined metrics with action and context. When the UI needed to populate a measurement point dropdown field the values were action and context. Throughout our UI these two attributes were the foundation for much of the code surrounding measurement points and metrics. Because we adopted this simplistic definition of a metric so early, much of the early design around metrics in the UI and service was hard coded around the action and context attributes.

A Hard Coded Problem

Like anything hard coded in a software developer’s world, changes mean new builds. This reality was acceptable when there was only an occasional request every few months for new metrics. Doing a new build wasn’t too much of a hassle. But when we started getting handfuls of requests for new metrics every sprint, it went from an annoying task to one that ate into our velocity.

Below is a snapshot of our service and UI code from a few years ago. The service code below defined the actions, what Expo used to map measurement points with beacons, as type-safe enumerations.

Service code: Action enumerations

And here is how the UI defined a metric to be displayed in the various dashboards and real-time monitors.

UI code: hard coded Metric

You can see from this code that when a requirement came in to add a new measurement point we had to make code changes in both the service and UI, publish new builds for each, and deploy. (You’re thinking, maybe selfishly they built job security into their tool?)

Job security = every developer’s dream?
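As a rough illustration of the coupling described in this section, the sketch below shows the same measurement points hard coded twice: once as a service-side enumeration and once as a UI-side metric list. It is written in Python for brevity and is not the original Expo code; the names `Action`, `UI_METRICS`, `parse_action`, and `dropdown_labels` are hypothetical, and the values reuse examples from this post.

```python
from enum import Enum


class Action(Enum):
    # Service side (hypothetical): every supported action is an enum member,
    # so adding one means editing this class, rebuilding, and redeploying.
    HOMEPAGE = "homepage"
    PAGEVIEW = "pageview"


# UI side (hypothetical): each metric is spelled out again in code,
# so the same new measurement point needs a second build and deploy.
UI_METRICS = [
    {"label": "Homepage", "action": "homepage", "context": "Homepage"},
    {"label": "TV View", "action": "pageview", "context": "tvpage"},
]


def parse_action(raw):
    """Reject any action the enum does not already know about."""
    try:
        return Action(raw)
    except ValueError:
        raise ValueError(f"unsupported action: {raw!r}") from None


def dropdown_labels():
    """Labels shown in the measurement point dropdown."""
    return [metric["label"] for metric in UI_METRICS]
```

With this shape, a request for a new action such as a hypothetical "cartview" fails in `parse_action` until both the enum and the UI list are edited and shipped.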

Configuration Based Solution

A few years ago I came up with a design to address the front-end UI portion of our hard coded problem (yes, we partially fixed this problem years ago; sorry, job security). At the time, because external systems call our service, we didn’t have the option (or a high enough priority) to redesign the service enumeration code. Part of the solution I developed was a configuration-based metric definition model. This model was a single multipurpose JSON structure that was used by the UI in the following ways:

  • In an experiment setup form to populate a multi-select dropdown of metrics to be reported on (see the definition of Assigned)
  • In an experiment setup form to populate a multi-select dropdown of metrics to be qualified on (see the definition of Qualified)
  • For fetching the analytics data generated by these metrics from three different reporting sources: a time series database (kairos), an indexed, map-reduced NoSQL database (Couchbase), and an older in-house stream processing system (mupd8)
  • Real-Time dashboard display of metrics from different reporting sources
  • 24hr monitor display of metric data with ability to display data from different reporting sources
  • Experiment results report with ability to display data from different reporting sources

A core component of this solution was what I called a “metric definition” model. This model is what allowed us to define a metric and this same object could be used in all of the areas listed above in the UI.

Below is an example of one of the metrics defined in this model. This metric is for a made-up page called TV View, and it reports the count of people who visited the Walmart site, were assigned to an experiment AND navigated to this TV page. In the model we have attributes that describe how to fetch data for this metric from different reporting sources and how to display data for this metric in different reporting tables.

Metric Definition model (JSON)

You can see from the example that there are a number of duplicate attributes. This model was intentionally designed as a Frankenstein object, a JSON object merged together from many other objects. The goal when designing this model was to present a single configuration point for adding a new metric while leaving much of the existing code that uses these objects intact. The alternative was to refactor many different and complex pieces of the code that use this data. In designing this feature we weighed the time and effort of refactoring and decided the cost was much greater than going with this metric definition solution.

This solution worked great for the past couple of years. Because the metric definition was configuration-based, adding a metric didn’t require code changes or deployments for the UI. It was easy enough for the development team to add new configurations when we got requirements for new metrics or when a new tenant wanted to use our tool, and requests for new metrics were not frequent enough to interrupt our daily work.
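For readers without the original example to hand, the following is a hypothetical sketch of what a single multipurpose metric definition could look like and how different parts of a UI might read from it. Every field name here (`id`, `label`, `timeSeries`, `resultsReport`, and so on) is an assumption for illustration and is not taken from Expo’s real model; the duplicated action and context values are deliberate, to mirror the merged-object idea.

```python
import json

# Hypothetical configuration: one entry per metric, merged from the shapes
# that several consumers expect, hence the repeated action/context values.
METRIC_DEFINITIONS_JSON = """
[
  {
    "id": "tv_view",
    "label": "TV View",
    "action": "pageview",
    "context": "tvpage",
    "timeSeries": {"metricName": "tv_view", "action": "pageview", "context": "tvpage"},
    "resultsReport": {"column": "TV View", "action": "pageview", "context": "tvpage"}
  }
]
"""

definitions = json.loads(METRIC_DEFINITIONS_JSON)


def dropdown_options(defs):
    """(value, label) pairs for a setup-form multi-select."""
    return [(d["id"], d["label"]) for d in defs]


def time_series_query(defs, metric_id):
    """The sub-object a time series fetcher would consume, unchanged."""
    for d in defs:
        if d["id"] == metric_id:
            return d["timeSeries"]
    raise KeyError(metric_id)


def report_columns(defs):
    """Column headers for a results report table."""
    return [d["resultsReport"]["column"] for d in defs]
```

Adding a metric in this sketch means appending one JSON entry; the three reader functions stay untouched, which is the trade the duplication buys.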

As things go, things evolve

As things evolved, so did the need for a more refined method of identifying a metric. A simple two-attribute model (action and context) was no longer enough for more in-depth reporting on events. Soon came the requirement to let the experimenter dig into the more complex attributes within the beacon object so they could be qualified and reported on. Rather than maintaining an ever-growing static set of measurement points, we found a way for these metrics to be defined dynamically.

The way things are

A couple of years after developing the configuration-based solution, our Expo tool looks very different. This past year our team designed a “smart measurement point” for which we have a patent pending. This design no longer has fixed enumerations for measurement points in the service. Measurement points are identified using a dynamic query expression language (the details of which I will leave for a future post by one of my more capable teammates). For now I will just give a high-level overview.

Static to Dynamic

Most experiments are still set up in Expo with measurement points configured on those two original attributes: action (a) and context (ctx). This is still supported and can be written as a query expression as follows.

query_map['a'] == 'pageview' and query_map['ctx'] == 'tvpage'

However, the real power of this query expression language is that it allows us to dive into that complex beacon object and match on more refined attributes. Below is a partial beacon from Walmart.com from a page with TVs. I say partial because I removed about 90% of the beacon data to showcase a simple example.

https://beaconserver/beacon?bla={"bla1":{"i":"123","j":"TV"}}&a=pageview&ctx=tvpage

The corresponding query expression that would match on the above beacon would look something like this.

query_map['ctx'] == 'tvpage' and query_map['a'] == 'pageview' and query_map['bla'].indexOf('bla1')>0 and query_map['bla1'].indexOf('i')>0 and (c:extract(query_map['i'],'bla1.i') == '123' or c:extract(query_map['j'],'bla1.j') == 'TV')

You can see the URL parameter “bla” has a value which is an object. Within this object we want to match on a specific set of IDs, in the above example “123” or “TV”.
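The intent of that match can be sketched in plain Python. This is not the Expo query expression language or its evaluator, only an approximation of the same conditions applied to the sample beacon URL; the function names `parse_beacon` and `matches_tv_measurement_point` are hypothetical.

```python
import json
from urllib.parse import parse_qs, urlsplit

# The partial beacon shown above.
BEACON_URL = (
    'https://beaconserver/beacon'
    '?bla={"bla1":{"i":"123","j":"TV"}}&a=pageview&ctx=tvpage'
)


def parse_beacon(url):
    """Flatten the beacon's URL parameters into a simple query map."""
    params = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in params.items()}


def matches_tv_measurement_point(query_map):
    """Approximate the conditions of the query expression above."""
    # The two original attributes: action (a) and context (ctx).
    if query_map.get("a") != "pageview" or query_map.get("ctx") != "tvpage":
        return False
    # "bla" carries a JSON object; a missing or malformed value is no match.
    try:
        bla = json.loads(query_map.get("bla", ""))
    except json.JSONDecodeError:
        return False
    inner = bla.get("bla1") if isinstance(bla, dict) else None
    if not isinstance(inner, dict):
        return False
    # Match on either of the IDs of interest.
    return inner.get("i") == "123" or inner.get("j") == "TV"


is_match = matches_tv_measurement_point(parse_beacon(BEACON_URL))
```

A beacon with the same action and context but no "bla" parameter would match the simple two-attribute expression yet fail this more refined one.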

As you can see, the query expression for digging into the beacon object can get pretty hairy: searching for index matches, extracting data by keys, combining logical expressions, and so on. I assure you the underlying code to get this working is just as complex and carefully thought out. The team did an amazing job building out this powerful feature and it was no small task.

Reflections and Takeaways

If we had put all this effort into the early stages of building out Expo, there is still a chance that we might have gotten things wrong. Trying to predict future behavior, especially software behavior, five years out is probably a bad idea (as bad as trying to predict future human behavior). Additionally, all that upfront design work would have been over-engineering for 99% of the use cases of the past five years. That is not to say there would not have been some level of early adoption of smart measurement points.

Look Ahead

A good practice during the design phase is to look ahead to the future and be aware of the cost and impact if you were to need a major redesign. But also be careful not to over-design upfront. We have had plenty of features that we over-engineered with capabilities we never ended up having a use case for.

Design the tricycle that can become a self-driving car

Flexible Design

Make sure your initial design is flexible enough for evolution to occur; for example, be wary any time your design includes an enumeration. We had good reason at the time to build enumerations into our service, primarily to enforce restrictions on how our APIs were used by external systems. Looking back, we could have enforced these restrictions in ways that didn’t require a code change to add a new action. Would such a design have helped us evolve to smart measurement points with much less pain? The same could be said for the early design of our UI, where we could have considered a configuration-based design rather than hard coding metrics.
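One way to keep such restrictions without an enumeration is to validate against an allow-list loaded from configuration. The sketch below assumes a hypothetical JSON config with an `allowedActions` key; it is only one possible shape and is not what Expo actually built.

```python
import json


def load_allowed_actions(config_text):
    """Read the set of permitted actions from configuration, not from code."""
    return frozenset(json.loads(config_text)["allowedActions"])


def validate_action(raw, allowed):
    """Still enforce the restriction on callers, using data instead of an enum."""
    if raw not in allowed:
        raise ValueError(f"unsupported action: {raw!r}")
    return raw


# Adding an action is a config edit; the validation code does not change.
before = load_allowed_actions('{"allowedActions": ["homepage"]}')
after = load_allowed_actions('{"allowedActions": ["homepage", "pageview"]}')
```

External callers are still rejected when they send an unknown action, but supporting a new one no longer requires a build of the service.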

Considerations

When designing a major feature that impacts foundational elements of your application you should consider the following:

  1. As the platform evolves what will be the biggest pain points of the current design?
  2. What are the costs and trade-offs of doing a more future-proof design upfront?
  3. What is the estimated cost and effort of doing a major redesign in the future?

Conclusion

Expo has come a long way in the past five years. It has evolved a ton, each time becoming more mature as a platform for Walmart. It has helped launch some of Walmart.com’s largest changes, such as the latest redesign of the website launched in May 2018. Requirements will continue to evolve (hopefully, for us, metrics are done evolving for now). There is no masterful future-proof design that will curtail evolution, especially in software. Your best tools are to be thoughtful and aware in your design upfront and to minimize the ways you may be hard coding yourself into a box.