Online Experiments 101: The Basics

Anton Martsen
Flo Health UK
Aug 6, 2020 · 6 min read

At Flo, we aim to improve the health and wellbeing of every woman in the world. To achieve this, we continuously improve the user experience by supporting women’s goals and needs through new product features and technologies.

One way we track progress toward this mission is by testing each app update in online controlled experiments (OCEs). For the sake of brevity, we’ll just call them experiments from here on out.

For the last two years, we’ve developed our internal tech tools and organizational procedures for running and analyzing product experiments. The work never ends. At this point, Flo’s product analytics team has tested 322 releases (as of June 2020), and the number continues to grow.

We learn a lot and want to share our findings with readers who work in product teams and are starting to shape their own understanding of data-informed development. We’ve prepared a series of articles that describe our internal processes for improving our product through experiments.

But let’s start with the basics…

What’s an experiment?

Let’s say that we want to increase the share of users who enter their email and register in the app. There are plenty of possible solutions on the product side. For example, we could develop a pop-up screen that explains the benefits of registering in the app, such as the ability to quickly save and restore personal data when switching to a new smartphone.

Experiments allow us to measure the pop-up’s impact on the target metric. Any experiment involves a multi-step process in which we:

  1. Choose the target metric.
  2. Define the product hypothesis, which we then translate into a statistical hypothesis.
  3. Randomly split the app audience into several groups.
  4. Reserve one group where no changes are applied (the control group).
  5. Send the update to other groups. In our example, there is only one “test” (sometimes called “treatment”) group.
  6. Wait the required amount of time and run a statistical analysis of the target metric (a simple sketch of this step follows the list).
  7. Finally, decide whether or not we roll out the update to production.
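
To make step 6 more concrete, here’s a minimal sketch of what the statistical analysis might look like for a registration-conversion metric. It uses a standard two-proportion z-test and invented numbers; it isn’t a description of Flo’s internal tooling.

```python
from math import erf, sqrt

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided p-value via normal CDF
    return z, p_value

# Hypothetical results: registration conversion in the control vs. test group
z, p = two_proportion_ztest(conv_a=4_810, n_a=50_000,   # control: 9.62% registered
                            conv_b=5_120, n_b=50_000)   # test:   10.24% registered
print(f"z = {z:.2f}, p-value = {p:.4f}")
if p < 0.05:
    print("Statistically significant difference; consider rolling out the pop-up.")
else:
    print("No significant difference detected; keep collecting data or reject the change.")
```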

Our experience shows that people often confuse experiments with a step-by-step rollout. Remember, they’re not the same thing at all. A classic experiment assumes that the percentage split between groups stays constant over time.

If, for example, you start an experiment today in which the test group receives 50 percent of users, it should stay at 50 percent until the end of that experiment.

A rollout works the exact opposite way. It involves a gradual increase in the percentage of users who receive the new app version.

This behavior leads to an unbalanced “flow” of users from the control group into the test group. The test group can then no longer be easily and quickly compared to the control group.
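
If it helps, here’s a tiny simulation of the difference. The ramp-up schedule and numbers are invented purely for illustration: the point is only that under a gradual rollout, users end up with very different exposure times to the new version, while a fixed split keeps exposure comparable between groups.

```python
import random

random.seed(42)
EXPERIMENT_DAYS = 14
N_USERS = 100_000

# Hypothetical ramp-up: share of the audience on the new version at the start of each day.
rollout_share = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60,
                 0.70, 0.80, 0.90, 1.00, 1.00, 1.00, 1.00]

def exposure_days_fixed_split():
    """Classic experiment: the test group is fixed on day 0 and exposed the whole time."""
    return [EXPERIMENT_DAYS] * (N_USERS // 2)

def exposure_days_rollout():
    """Gradual rollout: each user 'flows' onto the new version on some later day."""
    days = []
    for _ in range(N_USERS):
        u = random.random()  # the user's position in the rollout order
        switch_day = next((d for d, share in enumerate(rollout_share) if u < share), None)
        if switch_day is not None:
            days.append(EXPERIMENT_DAYS - switch_day)  # days spent on the new version
    return days

fixed = exposure_days_fixed_split()
ramped = exposure_days_rollout()
print(f"Fixed 50/50 split: every test user is exposed for {fixed[0]} days")
print(f"Gradual rollout:   mean exposure {sum(ramped) / len(ramped):.1f} days "
      f"(min {min(ramped)}, max {max(ramped)})")
```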

Why use experiments?

Why not just release a new version of the app with the changes and compare the new version to the old one? There are several reasons.

Change the user experience with minimal business risk

Sometimes we’re afraid to roll out specific new functionality to a broad audience. There may be several reasons for that.

One reason, for example, may be that we’re not entirely confident in the feature’s technical stability and are afraid of increasing the number of app crashes.

There’s also the risk that the changes could “break” one of the app’s ‘health’ metrics, e.g., retention.

Experiments allow us to roll out changes to the minimum necessary audience to collect user behavior statistics without changing the experience for a large number of our users.

Iterative product improvements

Here’s an example. Flo’s product team started testing a new onboarding flow. At the data collection stage, they found a significant user drop-off on one screen.

The team decided that this screen was too complicated for users. They then concluded that it could be excluded from onboarding, and the necessary data could be collected later.

This change reduced the drop-off and allowed more women to start using Flo.

Highlight the impact of each product change

Flo’s product team runs an average of 10 parallel changes in each new version of the app.

Obviously, we want to measure the impact of each particular change, not the sum of all changes. One change may actually worsen the user experience and break metrics.

Let’s say that we run a new promo and a new onboarding in parallel. Both releases could change monetization and retention metrics. In this case, an experiment is the best way to understand how different product changes affect the metrics.

Faster and easier analysis

In our practice, analyzing the impact of new releases without conducting an experiment takes analysts about three times longer.

On top of that, the analysis of simple product changes can be delegated to engineers and product managers, which speeds up the feedback loop. It’s a win-win situation.

When and when not to run experiments

Flo’s product teams use different research tools to gather user feedback. It’s crucial to understand how experiments differ from other methods.

We treat experiments as a very precise but costly tool from a technical point of view. To run an experiment, you need to have already built the product, delivered it to users, and waited a while to collect the necessary amount of data.

That’s fine when you’re optimizing existing solutions because engineering teams can quickly deliver updates to production. But experiments are hard to use if you’re only at the beginning of the new product development lifecycle. For cases like these, we use other methods and approaches.

How experiments run at Flo (in a nutshell)

The diagram below shows a simplified representation of an A/B test.

We have an audience for the Flo app that we randomly divide into two equal halves. We show the updated functionality to the first half (the new functionality is hidden behind a special “feature switcher”) but not to the second half. The group that sees the new functionality is called the test group. The second group, the one that sees the app without any changes, is called the control group.

In this example, the behavior changes for test group users — for example, the registration conversion increases.

The primary condition for a fair experiment is fair randomization, i.e., each user must be randomly assigned to a group.

Randomization is the most reliable way to create similar groups that can be compared with each other and to minimize possible bias in the evaluation.

At Flo, we use our own service for running and analyzing experiments, which assigns groups by hashing the user ID together with the experiment name. This guarantees that a user will only ever land in one particular group of a given experiment.
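
As an illustration, here’s a minimal sketch of deterministic hash-based assignment. The hash function, bucket count, and group split below are assumptions made for the example, not a description of Flo’s actual service.

```python
import hashlib

N_BUCKETS = 100  # assumed number of hash buckets

def assign_group(user_id, experiment_name, test_share=0.5):
    """Deterministically assign a user to 'control' or 'test' for a given experiment."""
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % N_BUCKETS
    # The same user + experiment always maps to the same bucket,
    # so the user stays in exactly one group for the whole experiment.
    return "test" if bucket < test_share * N_BUCKETS else "control"

# Usage: the assignment is stable across calls and independent between experiments.
print(assign_group("user-42", "registration_popup_2020_06"))  # same result every time
print(assign_group("user-42", "registration_popup_2020_06"))
print(assign_group("user-42", "new_onboarding_2020_07"))      # a different experiment hashes independently
```

Because the assignment is a pure function of the user ID and the experiment name, it can be recomputed anywhere without having to store and sync group membership.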

Conclusions & Next Steps

In this part, we described what experiments are, their benefits, and possible use cases.

In our next article, we’ll talk about the experiment design process, which accounts for product, technical, and statistical nuances.

And last but not least, a follow-up post will provide a detailed description of the possible ways to analyze experimental data and how to avoid common pitfalls.

Authors: Anton Martsen, Senior Product Analyst at Flo Health; Dmitry Zolotukhin, Head of Analytics at Flo Health
