How Flo Conducts Experiments

Konstantin Grabar
Published in Flo Health UK
Aug 25, 2020 · 10 min read

Introduction

We should start by noting that Flo has long been committed to a data-informed culture. We operate using OKRs and prefer to gather the necessary information and rely on accurate figures rather than gut feelings. This approach requires effort, skill, and the ability to overcome all kinds of obstacles.

Difficulties can be categorized into those related to the product, analytics, and technology. My post touches on the technical side of things.

You need to experiment any time you launch new functionality. In addition to product A/B tests (also known as “bucket tests” or “split-run tests”), technical updates also need to be tested, since they, too, can lead to a decrease or increase in certain metrics. That’s why our motto is “everything is an experiment.”

Why We Need Experiments

If you aren’t familiar with analytics but want to learn more about why experiments are necessary and how to conduct them, you can read this article.

Basically, in terms of mathematical statistics, it just isn’t enough to conduct an A/B test and compare funnel metrics (CTR, CPA, etc.) or some “average” values (LTV, session length, etc.), because you can’t be entirely sure whether the results are reliable. To assess their reliability, you need to run the experiment under specific rules and analyze the results using the methods of mathematical statistics.

Analysts usually handle this part, while developers tend to run the test itself. As such, bottlenecks form in the experimentation process, increasing the experiments’ duration and decreasing their frequency. However, there are testing tools that can help resolve the situation. Some companies use third-party services, while others create their own (we’re in the latter group).

What Are Feature Toggling and Feature Rollout?

Feature toggling is a mechanism for enabling and disabling new features. On the server side, toggles are controlled through the Dynamic Config service (which I will talk about in more detail below). On the client side, everything looks extremely simple in most cases:
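The clients are mobile apps, but a language-neutral Python sketch conveys the idea; the function and the shape of the config below are illustrative, not our actual API.

```python
# A minimal sketch of a client-side feature toggle check.
# The field names and the shape of the config are hypothetical.

def is_feature_enabled(config: dict, feature_name: str) -> bool:
    """Return True if the server-delivered config enables the feature."""
    return config.get("features", {}).get(feature_name, {}).get("enabled", False)

# The dict below stands in for a payload received from Dynamic Config.
config = {"features": {"new_promo_screen": {"enabled": True}}}

if is_feature_enabled(config, "new_promo_screen"):
    show_new_promo_screen = True   # run the new code path
else:
    show_new_promo_screen = False  # fall back to the existing behavior
```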

The mechanism is simple, yet reliable and convenient.

As I said, any new product idea in the form of a new or changed feature must first undergo experimental testing. If we decide to move the feature toward release, we start the rollout, gradually expanding the audience, and only send the feature to full release if it proves to be successful.

The primary purposes of rolling out a feature are to:

a. Prevent rolling out a feature that causes crashes

b. Prevent metrics degradation due to unforeseen effects. As I said, everything is an experiment.

The difficulty with feature rollout is that you can’t increase the rollout percentage during the experiment. Users start moving between test groups and spoil everything. Simply put, situations arise in which the experiment is already underway when a user that was initially in the control group ends up in the treatment group. The user is, therefore, able to participate in both experimental groups. How do you measure such a user’s behavior and impact on metrics? That’s a good question.

How We Conduct Experiments Today

Our experimentation process looks like this:

● Product managers and analysts formulate a hypothesis and then determine what kind of experiment is needed and for which audience.

● They select metrics to record the uplift. If the required metric isn’t available, it has to be added to ETL to calculate the results of the experiment (see below for details).

● The analysts then calculate the experiment’s required sample size and duration (a sketch of this kind of calculation follows the list below). We don’t conduct sequential testing yet, and we don’t use Bayesian multi-armed bandits either. In other words, we only follow the classic fixed-horizon approach. But we’re working on these areas, so stay tuned. :)

● After that, a task is assigned to backend developers in Jira to prepare the experiment.
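For illustration, here is roughly what a fixed-horizon sample-size calculation for a conversion metric might look like. The baseline rate, expected uplift, alpha, and power below are made-up numbers; the real calculation is done by our analysts with their own tooling.

```python
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # current conversion rate (hypothetical)
expected = 0.11   # rate we want to be able to detect (a 10% relative uplift)

effect_size = proportion_effectsize(expected, baseline)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,               # significance level
    power=0.8,                # probability of detecting the uplift if it exists
    alternative="two-sided",
)
print(f"~{math.ceil(n_per_group)} users needed in each group")
```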

What Developers Do:

● Meanwhile, the backend developers prepare the necessary configs for the experiment, which are sent from the server to the client via the Dynamic Config service.

● Then they check the validity of the predicates (see below for more details) and look for possible problems or overlaps with other experiments as needed.

● Client-side developers prepare client-side code with feature toggling to ensure that the experiment works correctly.

The general experimentation flow, then, combines these steps: the hypothesis, metrics, sample size, and duration come from product managers and analysts; the configs and predicates come from backend developers; and the feature toggling comes from client-side developers.

Things to Keep an Eye on:

● Sometimes a user is assigned to the right group and the right experiment, but the key event for the experiment doesn’t occur (for example, the push notification doesn’t reach the user). Therefore, it’s not enough to assign the user to the desired experiment; it’s vital to ensure that the desired event actually occurs. To monitor this, we often measure the so-called “fill rate”: the share of targeted events that actually happened to users, for example, the share of sent push notifications that were actually displayed (a small sketch follows this list).

● Various novelty and network effects make it difficult to assess the outcome of the experiment clearly (for example, working offline or users sharing information about the experiment with each other).

● When working with subscriptions, it becomes necessary to implement the so-called “waterfall configs” approach. This involves building several blocks with conditions, one after another, before ultimately applying the block with the settings that the user falls under.

● We don’t know anything about users who only install the application and start using it without registering. Nevertheless, we need to conduct experiments right away (especially on prices and subscription formats). This somewhat complicates the usual procedure.
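As an example of the fill-rate check mentioned above, here is a minimal sketch; the user IDs are invented, and in practice this is computed over our event data rather than in-memory sets.

```python
# Users we targeted with a push vs. users whose devices actually displayed it.
targeted = {"u1", "u2", "u3", "u4", "u5"}
displayed = {"u1", "u2", "u4"}

fill_rate = len(displayed & targeted) / len(targeted)
print(f"fill rate: {fill_rate:.0%}")  # 60% here; a low value signals delivery problems
```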

Dividing Users into Groups

Users are divided into groups using the internal User Profile service and the so-called “predicate mechanism”. In terms of our internal analytics, we can assign any number of parameters to the user that will subsequently help “reach” him or her — for example, a parameter indicating a woman’s age or the fact that she’s pregnant.

Parameters can be chained together using AND, OR, and comparison operators. It looks something like this:
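The exact syntax isn’t important here, but chained predicates look roughly like the expressions below. The syntax is simplified, and the parameter names other than age and pregnancy status are made up.

```python
# Illustrative predicate expressions; the syntax is simplified and most
# parameter names are hypothetical.
predicates = [
    "age <= 18 AND is_pregnant == false",
    "is_pregnant == true OR age >= 35",
    "age <= 18 AND experiment(new_feature, 30)",  # plus a random 30% slice
]
```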

A special service validates this set of parameters. If its validity is confirmed, we can determine on the server-side whether the user falls under the desired experiment. For instance:

The predicate “age<=18 AND experiment(new_feature, 30)” returns true for a random 30% of female users aged 18 or younger.

We also distinguish between “sticky” and “non-sticky” experiments. The former assumes that the user is included in the experiment and falls into a certain group once and for all, i.e., if the same experiment is repeated, they will not find themselves in a different group (control/treatment).

This can be achieved in two ways: using a hash function or storing the user’s assignment. Randomization itself is essential: the division into groups must be as close to truly random as possible. A hash function is fast and returns the same value for a given user every time, effectively “remembering” them. The main requirement is that the hash function can split users into groups in the desired proportions (50/50, 20/80, etc.) and that users fall into these groups independently of any other factors.

We’ve implemented the following procedure:

● We take the md5 hash based on the experiment name and user ID.

● We then take 8 low-order bytes, take the remainder after dividing by 1,000, and divide the result by 10. This gives us a percentage accurate to one decimal place. This percentage determines the group the user belongs to.

● The function returns a number between 0.0 and 100.0. When calling an experiment in an expression such as “experiment(test,40,85.5),” the function will return “true” if the user falls between 40.0 and 85.5.
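A minimal Python sketch of this bucketing, assuming the experiment name and user ID are simply concatenated before hashing (the exact combination and byte order in our implementation may differ):

```python
import hashlib

def bucket(experiment_name: str, user_id: str) -> float:
    """Deterministic bucket in [0.0, 100.0) for a given user and experiment."""
    digest = hashlib.md5(f"{experiment_name}:{user_id}".encode("utf-8")).digest()
    low_bytes = int.from_bytes(digest[-8:], byteorder="big")  # 8 low-order bytes
    return (low_bytes % 1000) / 10  # a percentage accurate to one decimal place

def experiment(name: str, lower: float, upper: float, user_id: str) -> bool:
    """experiment("test", 40, 85.5) is true if the user's bucket is in [40.0, 85.5]."""
    return lower <= bucket(name, user_id) <= upper
```

Because the bucket depends only on the experiment name and user ID, the assignment is “sticky” without storing anything, and different experiments split users independently of one another.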

Transferring Configurations to the Client

After we create the required features and experiments, we need to send this information to the client somehow. To do this, we use a special service called Dynamic Config, which transfers the necessary configurations to the required devices. For instance:
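The wire format isn’t shown here, but a single config entry might conceptually look like the following; the field names are hypothetical.

```python
# Hypothetical shape of a Dynamic Config entry; field names are illustrative,
# not the actual format.
dynamic_config_entry = {
    "feature": "new_promo_screen",
    "predicate": "age <= 18 AND experiment(new_promo_screen, 0, 30)",
    "payload": {
        "enabled": True,
        "variant": "treatment",
    },
}
```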

Waterfall Configurations

At the very beginning, when the user has just installed the application, we still don’t know anything about them. However, we learn more and more about the user throughout onboarding. The decision to send them to a certain experiment (for example, to a new promo screen) or not must be made quickly. There isn’t time to wait for the server to send a new config (there may be internet connection issues, or the server may “change its mind”). We need to strike while the iron’s hot, as the saying goes.

To do so, we came up with so-called “waterfall configurations,” which differ from the usual kind in that the conditions for entering an experiment follow one after another. For instance:
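Conceptually, a waterfall config is an ordered list of condition blocks in which the first matching block wins. The sketch below assumes a hypothetical predicate evaluator and made-up experiment names.

```python
# An ordered list of condition blocks: the first block whose predicate
# matches the user determines the experiment they enter. Everything here
# is illustrative.
waterfall = [
    {"if": "is_pregnant == true", "then": {"experiment": "promo_pregnancy"}},
    {"if": "age <= 18",           "then": {"experiment": "promo_teen"}},
    {"if": "true",                "then": {"experiment": "promo_default"}},  # fallback
]

def resolve(waterfall: list, evaluate) -> dict | None:
    """Return the payload of the first block whose condition evaluates to True."""
    for block in waterfall:
        if evaluate(block["if"]):
            return block["then"]
    return None
```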

As we move through the interface, we learn more about the user and try to apply new rules to him/her each time. If one of the rules is triggered as the “waterfall” progresses, the user ends up in one of the specified experiments.

Experiment Dashboard

The Experiment Dashboard is recalculated and updated regularly throughout the day. The calculation uses a proprietary ETL that runs on an hourly basis. The dashboard shows a list of running and stopped experiments, tracks metric uplifts and statistical significance indicators (p-values), and displays each experiment’s sample size and group breakdown.

Each experiment also has a separate page showing how the p-value changes based on the cumulative calculation.

Calculating the Results of the Experiment

The ETL itself is based on Spark (Scala). Scala and Spark are generally the de facto standard for Flo when developing such solutions. For some ETLs, it’s easier to use Python, Pandas, and PySpark since analysts use Python in their work.

Airflow is a clear, convenient, and reliable tool that we use to run these tasks on a schedule.
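As a rough illustration, an hourly Airflow schedule for such an ETL might look like the sketch below; the DAG id, the task, and the spark-submit command are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal DAG that runs the experiment ETL every hour; all names and the
# command itself are made up for illustration.
with DAG(
    dag_id="experiments_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="run_spark_etl",
        bash_command="spark-submit --class flo.experiments.Etl experiments-etl.jar",
    )
```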

Our ETL works as follows: all metrics are described as SQL queries that calculate a value (for example, CTR or some other conversion indicator). It looks like this:
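A hedged sketch of what such a metric definition might look like; the table, column, and event names are invented, and the real SQL is written for Presto over our event data.

```python
# A metric described as SQL (Presto dialect); all names here are illustrative.
metrics = {
    "promo_screen_ctr": """
        SELECT
            experiment_group,
            CAST(count_if(event = 'promo_click') AS double)
                / count_if(event = 'promo_shown') AS ctr
        FROM events
        WHERE experiment_name = 'new_promo_screen'
        GROUP BY experiment_group
    """,
}
```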

We use Presto on top of data from AWS S3 as our main database. This is a powerful and robust distributed SQL query engine that runs on top of terabytes of our own raw data.

The p-value is then calculated using a parametric or nonparametric test. By default, this is a chi-square test, and the final dashboard calculates significance with this criterion.
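For illustration, a chi-square check on conversion counts might look like this (the counts are made-up numbers; our actual calculation runs inside the ETL):

```python
from scipy.stats import chi2_contingency

# Rows: control and treatment; columns: converted and not converted.
observed = [
    [1200, 8800],   # control
    [1320, 8680],   # treatment
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"p-value: {p_value:.4f}")  # compared against the chosen significance level
```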

If this is insufficient, we can use the bootstrap (each analyst decides for themselves). We also have two environments deployed for Data Science: JupyterHub and Amazon SageMaker Studio. These allow us to load the required dataset from the experiment for advanced analysis. Of course, a single system with a single tool would be better. We’ll do this when we have enough feedback about these tools.

What We Think Can Be Improved

First, we want to save our analysts from rote work as much as possible and learn how to conduct experiments with minimal involvement from them. Since not a single change or feature is rolled out without experimental testing, this takes a lot of time, resources, and effort.

Our ETL has already outlived its usefulness as a tool for calculating experiments: experiments take too long to calculate, and adding new metrics is extremely difficult. We need to make this process more flexible, for example, by allowing arbitrary metrics to be added as SQL queries that are then executed using Presto.

Looker lacks the functionality and flexibility to display experiment analyses. Developing our own UI will allow us to independently visualize the results of experiment analyses and run the experiments themselves.

We use the fixed-horizon approach to calculate the duration of experiments and the size of audiences. While this is the traditional approach to conducting experiments, it’s becoming outdated: more advanced (but also more complex) methods exist, such as the sequential approach.

Why We Don’t Use Third-Party Services

● We take great care of our users’ personal data and never provide such data to third parties. And since we had already implemented the feature-toggling mechanism (as well as internal analytical storage), conducting experiments ourselves was the next logical step.

● They lack flexibility and transparency. To validate the results of an experiment, we need access to raw data and methods of calculation. We can’t rely on data that we can’t trust.

● Third-party services don’t offer much benefit since we have our own talented analysts and data engineers. The biggest challenge in conducting experiments generally involves the following two tasks: you need to correctly divide users into groups and correctly assess the significance of the experiment (taking into account all of its specific characteristics). If these two tasks are essentially solved, a third-party service would just be another possible implementation of what’s already working.

Conclusion

Conducting experiments and developing your own service to do so is a difficult and intense endeavor. The aforementioned approach to conducting experiments and performing the corresponding calculations is neither an exclusive one nor the only correct one. However, the examples given above should be of use to anyone who wants to create their own system for experimentation.
