Expresso: a framework for multivariant testing & feature gating

OfferUp · Feb 14, 2018 · 4 min read

At OfferUp, we use many modern development techniques, such as:

  • Microservices. The OfferUp backend consists of a collection of services (a service mesh).
  • Fast release cycles. We strive to release our microservices on demand, sometimes as frequently as after every major code merge.

Instead of long-running branches, we use feature gates to manage releases of new features. New code is merged directly into the master branch after it's code-reviewed and tested. On average, each code change takes about 2–3 days to integrate. That said, the development of a new feature can take a few weeks, or even a few months, depending on its complexity. Expresso allows merging code changes while making sure that a feature under development is only accessible to the dev team. As development progresses, the feature is opened up to more engineers and a select group of OfferUp employees to test. Once the feature is stable, it's released to beta testers and then to the rest of our user base.

Multivariant testing is another aspect of releasing features: because OfferUp is extremely data-driven, we A/B test multiple flavors of each new feature.

Design

A simplistic feature gating “framework” is nothing more than just a few lines of code:

if (user_id % 100 < 10) { // the feature is released to 10% of users
    // run the new feature
} else {
    // run the old version of the code
}

However, there are a number of problems with the code above…

  • It is not configurable: the 10% rollout for the feature is hardcoded.
  • Ideally, there should be a UI that shows the release status of all projects within a company.
  • The code doesn't support multivariate testing; there are only two variants, feature on and feature off.
  • There is no proper support for A/B testing (an experiment treatment vs. a control treatment); more on this later.
  • If multiple features are released using this code, the same users will see all new features.
  • It's far too easy for a third party to reverse-engineer who will see the feature: just look at the last two digits of the user_id.
  • The % operator in Java can return negative values, which can throw off all the math (see the sketch below).
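To illustrate that last point: in Java, % keeps the sign of the dividend, so a negative ID or hash value falls outside the expected bucket range. Math.floorMod is one way to keep buckets non-negative:

public class ModuloDemo {
    public static void main(String[] args) {
        // Java's % keeps the sign of the dividend, so a negative value
        // falls outside the expected 0..99 bucket range:
        System.out.println(-7 % 100);                // prints -7, not 93
        // Math.floorMod always returns a non-negative bucket:
        System.out.println(Math.floorMod(-7, 100));  // prints 93
    }
}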

The main unit of Expresso is a feature, and each feature has a fixed, exclusive set of treatments. A user is assigned to a particular treatment via assignments. An assignment consists of a predicate and an allocation. A predicate can potentially be based on any property of a user; right now, we support geo-location, employee role, and a fixed set of IDs. An allocation is nothing more than a percentage-wise distribution of users across treatments. If a user ends up not being assigned to any treatment, they end up in the default treatment.
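As a rough sketch of these concepts (the type names below are our illustration, not Expresso's actual API), the model might look like this:

import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// The user properties that predicates can inspect.
record User(long id, String role, String geo) {}

// A treatment is one variant of a feature ("experiment", "control", ...).
record Treatment(String name) {}

// An allocation is a percentage-wise split of matching users across treatments.
record Allocation(Map<Treatment, Integer> percentByTreatment) {}

// An assignment pairs a predicate over user properties with an allocation.
record Assignment(Predicate<User> predicate, Allocation allocation) {}

// A feature owns a fixed, exclusive set of treatments, an ordered list of
// assignments, and a default treatment for users no assignment captures.
record Feature(String name,
               List<Treatment> treatments,
               List<Assignment> assignments,
               Treatment defaultTreatment) {}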

Users are assigned to treatments based on a hash of their user_id and a salt. Hashing the user_id matters for a number of reasons, such as making it impossible for third parties to trivially predict which users belong to which treatment. It's also important to provide a distinct salt for each feature or assignment. This way, the same user hashes to a different value for each experiment, so users don't end up in the same buckets across experiments.
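Here is a minimal sketch of salted bucketing, assuming an MD5-based hash (the exact hash function and byte layout are our assumption, not necessarily what Expresso uses):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class Bucketer {
    // Maps a user into one of `buckets` slots for a given feature's salt.
    public static int bucket(long userId, String salt, int buckets) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] d = md5.digest((salt + ":" + userId).getBytes(StandardCharsets.UTF_8));
            // Fold the first four bytes of the digest into an int.
            int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                  | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
            // floorMod avoids the negative-remainder pitfall noted above.
            return Math.floorMod(h, buckets);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    public static void main(String[] args) {
        // Distinct salts decorrelate buckets: the same user lands in
        // unrelated buckets for different features.
        System.out.println(bucket(42L, "feature_a_v1", 100));
        System.out.println(bucket(42L, "feature_b_v1", 100));
    }
}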

Expresso loads its configuration from a Datasource; currently, S3 is one of the datasources supported by the library.
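To make the loaded configuration concrete, here is a hypothetical feature definition expressed with the illustrative types sketched above (the predicates and percentages are made up for this example):

import java.util.List;
import java.util.Map;

public class ExampleConfig {
    // Employees always see the new feature; 5% of US users get it, with a
    // matching 5% held out as the control group. Everyone else is default.
    static final Feature CHECKOUT = new Feature(
        "new_checkout_flow",
        List.of(new Treatment("experiment"), new Treatment("control")),
        List.of(
            new Assignment(u -> "employee".equals(u.role()),
                           new Allocation(Map.of(new Treatment("experiment"), 100))),
            new Assignment(u -> "US".equals(u.geo()),
                           new Allocation(Map.of(new Treatment("experiment"), 5,
                                                 new Treatment("control"), 5)))),
        new Treatment("default"));
}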

Another benefit of Expresso is routing traffic to different fleets of machines. We modified a Linkerd router to use Expresso for routing decisions, which allowed us to support canary releases.

Learnings

Early on, we learned that any experiment should have an "experiment" and a "control" group of the same size and the same type of users. The original thinking was that we would be able to compare an "experiment" treatment with the default "no experiment" bucket, even though these two groups were designed to be of vastly different sizes. As it turned out, some metrics are meaningless when compared between groups of different sizes. We recommend the following A/B testing strategy: first, launch the new feature at a 5%/5% experiment/control split, then ramp to 10/10, then 25/25, and so on, up to a maximum of 50/50. At that point, a decision should be made on whether to roll out to 100%.

Another interesting discovery was around what happens when an experiment is rolled out to a wider audience. It's important to make sure that your "control" users don't become "experiment" users during the course of the rollout. The best strategy is to set up your experiment in such a way that every time the experiment (and therefore the control) is widened, users are moved only from the default treatment into the experiment and control groups.
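One simple way to get this property (our illustration, not necessarily Expresso's implementation) is to carve the experiment and control groups from opposite ends of a fixed bucket range, so that widening the split only consumes default users:

public class RampDemo {
    // bucket is in [0, 100); p is the per-group percentage, at most 50.
    // Experiment takes buckets [0, p), control takes [50, 50 + p); the rest
    // stay in default. Growing p from 5 to 10 to 25 only moves users out of
    // default; no control user ever flips into the experiment group.
    static String treatmentFor(int bucket, int p) {
        if (bucket < p) return "experiment";
        if (bucket >= 50 && bucket < 50 + p) return "control";
        return "default";
    }

    public static void main(String[] args) {
        int bucket = 58;                              // a user hashed to bucket 58
        System.out.println(treatmentFor(bucket, 5));  // default at 5/5
        System.out.println(treatmentFor(bucket, 10)); // control at 10/10
        System.out.println(treatmentFor(bucket, 25)); // still control at 25/25
    }
}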

Future Work

As with many other projects at OfferUp, Expresso includes only the bare minimum functionality. At the moment, we don't have the luxury of spending cycles on non-core features, but soon we will add support for user roles, a UI, and more. We will be open sourcing Expresso in the coming months.

There are frameworks similar to Expresso out there, both commercial and open source. At this point, Expresso has worked best for us for the library-level integration we were looking for.

Leo Liang, Pradyumna Reddy, Marius Popa, Jiezhiong Mo and many other OfferUp engineers contributed to the Expresso framework.

Authored by: Alexey Maykov, Leo Liang
