Building the Belfry

How we developed a customizable A/B testing framework at Oscar

How Belfry came to be

At Oscar, we pride ourselves on using simple language to help our members navigate a health care system that’s filled with complex jargon at every juncture. Our product team puts extra care into making sure the text on our web and mobile apps addresses our members like ordinary people, not medical data processors. After working at a health tech company for a while, though, it’s easy to lose some perspective. (Doesn’t everyone know what an Explanation of Benefits is? Nope!) We needed a way to ensure our calls-to-action and explanations were clear enough for our members to make informed choices and take the right actions for their health.

A critical need of any data-driven company is the ability to validate product decisions through A/B testing and experimentation. Those of us who get caught up in product debates certainly appreciate a more objective way to resolve them! At Oscar, we wanted to make experimentation within our apps easy and painless (for both our tech teams and our members). This effort matured into an in-house A/B testing system we call Belfry.

Initially, we used feature flagging to conduct our A/B testing. (Hence the evolution of the name Belfry — a belfry is the part of a bell tower that contains the actual bells, and is often topped with a flag. The Belfry is also an excellent bar in New York City near our office.)

Belfry was built to provide a set of features that we found ourselves rewriting in various places in our codebase over and over again. The repeated business logic usually covered:

  1. How to map each user to a variant in the experiment.
  2. Collection, storage, and retrieval of data from the experiment.

We considered using an external vendor, but there are real benefits to building an A/B testing system in-house. In health tech, it’s pivotal to use HIPAA-compliant vendors to protect our members’ privacy, which takes a fair bit of research and due diligence on our end. Beyond that industry-specific constraint, building the system ourselves was a worthy investment because we needed to fully integrate it with our multi-application infrastructure in a way that was customizable and easy to iterate on. Additionally, Belfry has applications beyond A/B testing, like updating static configurations that need to change often.

How it works

Belfry describes the overall system of tools we use to conduct A/B testing. It consists of a database that stores experiment results, an internal Oscar app that allows users to set up and run experiments, and an API and service in our codebase that Oscar’s different apps can talk to.

We use an internal app to set up experiments in Belfry. Any Oscar employee is able to configure an experiment that they wish to run. The tool allows us to define a list of JSON blobs, called configurations, that represent the variants in the experiment. For instance, if we wanted to test the copy on a call-to-action button, we might define the following two configurations:

Configuration A — {"text": "Click Me"}

Configuration B — {"text": "CLICK ME NOW!"}

Then, we might decide to expose 90% of our users to configuration A and 10% of our users to configuration B (clearly the more aggressive option). Through an API, an engineer can hook the button component up to Belfry, letting Belfry control the text it displays. The same API allows us to report the actions we wish to measure for the experiment — in this case, how many times the button is seen and clicked.
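To make this concrete, here is a minimal in-memory sketch of what such a client might look like. The class name, method signatures, and the percent-weighting scheme are our own assumptions for illustration, not Oscar's actual API; the bucketing detail is covered in the user bucketing section later in this post.

```python
import hashlib

class BelfryClient:
    """Illustrative sketch of a Belfry-style client.

    Names, signatures, and the weighting scheme are assumptions
    made for this post, not Oscar's real API.
    """

    def __init__(self, salt, weighted_configs):
        # weighted_configs: (weight_percent, config) pairs summing to 100.
        self._salt = salt
        self._weighted = weighted_configs
        self.events = []  # (user_id, action); the real system reports to a server

    def get_config(self, user_id):
        # Deterministically map the user to a point in [0, 100), then walk
        # the cumulative weights to pick their variant.
        digest = hashlib.sha256((user_id + self._salt).encode()).hexdigest()
        point = int(digest, 16) % 100
        cumulative = 0
        for weight, config in self._weighted:
            cumulative += weight
            if point < cumulative:
                return config
        raise ValueError("weights must sum to 100")

    def report(self, user_id, action):
        self.events.append((user_id, action))


client = BelfryClient(
    salt="a5fb0342d7c98",
    weighted_configs=[(90, {"text": "Click Me"}),
                      (10, {"text": "CLICK ME NOW!"})],
)
button_text = client.get_config("1234567")["text"]  # variant this user sees
client.report("1234567", "cta_seen")
client.report("1234567", "cta_clicked")
```

Because assignment is a pure function of the user ID and the experiment's salt, the same user always sees the same variant, with no per-user state to store.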

As the experiment runs, Belfry automatically collects and buckets the data. It keeps track of who has seen the experiment, as well as any actions that those users have taken. At any point, we can request an aggregated summary of how each variant is performing. We can even make live changes to the experiment and have them take effect instantly.

Belfry is set up so that anyone at Oscar can configure an experiment.

Using Belfry, we can now set up a new experiment in less than an hour, versus what used to be a few days’ worth of coding, code review, and deployment.

Now that we’ve gone over the high-level details, let’s do a deep dive into one of Belfry’s most important features: user bucketing.

User bucketing

Belfry has the built-in ability to bucket a pool of users across the variants of an experiment. Before Belfry was created, most of our experiments simply bucketed users based on whether their user ID was even or odd. While easy to implement, this is a suboptimal approach when it comes to collecting high-quality data.

Why? Running any experiment causes the population to become inherently biased. Let’s take an extreme case. Suppose we’re running an experiment for our web app. Variant A is the control variant: people see everything on the website normally. Variant B changes the website into a terrible experience: we make text barely visible and super small, and add animations when you search for a doctor that obscure the actual result.

If we happen to reuse the same approach of bucketing users for another experiment (let’s call the new variants A’ and B’), the users in B’ are already upset before the experiment has even begun! Even if B’ introduces an amazing new feature, it’s likely that most of the users have already decided not to come back to the website.

What we would want to do, instead, is somehow distribute the new experiment’s population such that an equal number of people from each old variant get distributed into each new variant. That is, within each new variant, we want a good mix of normal people and upset people:

This lets us measure the results of the new experiment with a balanced population in both A’ and B’. The fact that there are some upset people in A’ is counterbalanced by the equal proportion of upset people in B’, keeping the overall experiment unbiased.

Sometimes, ideas that sound great on paper end up being terrible in practice. A/B testing surfaces those issues early.

Within Belfry, this problem is handled by assigning each experiment a randomly generated string, or salt. To determine which variant a user belongs to, we append the salt to their user ID, hash the combined string into a hexadecimal value, and interpret that value as an integer modulo the number of variants.

user_id: "1234567" + salt: "a5fb0342d7c98" -> "1234567a5fb0342d7c98"
hash("1234567a5fb0342d7c98") -> "60ae9648aedfde1675d4a9816f94"
"60ae9648aedfde1675d4a9816f94" % 2 -> 0 -> variant A

Here, with two variants, we took the hash modulo 2: a result of 0 corresponds to variant A and a result of 1 to variant B.
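The steps above fit in a few lines of Python. SHA-256 here stands in for whichever hash function Belfry actually uses; the function name is ours.

```python
import hashlib

def bucket_user(user_id: str, salt: str, num_variants: int = 2) -> int:
    """Deterministically assign a user to a variant.

    Appends the experiment's salt to the user ID, hashes the result,
    and takes the hash (as an integer) modulo the number of variants.
    """
    digest = hashlib.sha256((user_id + salt).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_variants

variant = bucket_user("1234567", "a5fb0342d7c98")  # 0 -> variant A, 1 -> variant B
```

A fresh salt per experiment means the same user ID lands in an effectively independent bucket each time, which is what breaks the correlation between experiments.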

Even with the same pool of user IDs, this approach introduces sufficient randomness to the bucketing that we more or less achieve the unbiased user pools idealized above. This concept can be extended to an arbitrarily large number of concurrent experiments, or restricted to running across a small percentage of your user population, provided that the user pool is large enough.
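The same salted-hash trick can restrict an experiment to a slice of the population: hash each user to a fine-grained point in [0, 100) and include only those below a threshold. A sketch, with the function name and two-decimal granularity as our own choices:

```python
import hashlib

def in_rollout(user_id: str, salt: str, percent: float) -> bool:
    """Return True if this user falls inside a `percent`% rollout.

    Hypothetical helper: hashes the user to a point in [0, 100) with two
    decimal places of granularity; users below the threshold are included.
    """
    digest = hashlib.sha256((user_id + salt).encode("utf-8")).hexdigest()
    point = (int(digest, 16) % 10_000) / 100.0
    return point < percent
```

A nice property of the threshold form is that raising `percent` only ever adds users — everyone already included stays included — which makes gradual ramp-ups safe.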

Belfry in action

Our first test with the Belfry system involved a very simple call-to-action (CTA) test on the homepage of our Provider web app — an app that Oscar doctors, nurses, and caregivers use to treat their Oscar patients. This test modified the copy of a link from “Resources” to “Provider Tools”.

After about four weeks of testing and 40,000 cumulative impressions, we found that our variant, “Provider Tools”, successfully drove 34% more clicks with 99% statistical significance. We were then able to ship the new copy with full confidence that it was the superior option.
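As a sanity check on what "99% statistical significance" means here, a two-proportion z-test does the job. The post doesn't give exact click counts, so the figures below are hypothetical, chosen only to match the roughly 40,000 impressions and ~34% lift:

```python
import math

def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference in click-through rates."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    # Pooled rate under the null hypothesis that both variants are equal.
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF, via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical 50/50 split of ~40,000 impressions; 5.0% vs 6.7% CTR (~34% lift).
z, p = two_proportion_z(1000, 20000, 1340, 20000)
```

With samples this large, even a modest absolute difference in click-through rate produces a p-value far below 0.01, which is the kind of margin that lets a team ship with confidence.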

Resources (left) drew 34% fewer click throughs than Provider Tools (right). Sometimes, two words are better than one.

One of our main goals when building Belfry was to lower the barrier to entry for A/B testing. We firmly believe in the importance of spreading this experimentation mentality and ingraining it in the culture of our teams. As a company dedicated to providing the best possible user experience, we want to constantly refine our understanding of exactly what that means. A configurable in-house A/B testing platform lets us understand our users better than ever before, with minimal effort.

This post was co-authored with Alessandro Cetera.