Experimentation 2.0: Navigating the upgrades at Flo Health

Konstantin Grabar
Published in Flo Health UK
15 min read · Jan 4, 2024

At Flo Health, we truly believe that all decisions should be supported by data. When you make changes to a product, you usually expect your hypothesis to be true. In product development, you first test your changes on a limited number of users, and once you’re sure they’re good to go, you roll out your new feature or other changes to the full audience. But how do you ensure that it’s safe? The most advanced approach in product management and analytics is to run all of your changes through A/B testing (or experimentation).

It’s been a while since we shared insights about our experimentation tools and process; you can read my previous article here. Our experimentation approach has become much more mature since then, and I’m happy to share the insights and findings we’ve acquired.

The process

First, let’s start with our key metrics and goals. We begin by evaluating the efficiency of our experimentation. But how do we measure that? There are several variables. To start with, we need to be sure that we can launch enough experiments to succeed with our plans, so we measure the total number of launched experiments. We also track the ratio of successful to failed experiments (the success rate), but we don’t use it as a KPI.

As an example of the market standard, we can look at the experience of MAANG companies. For instance, in the 2020 blog post How insights from people around the world make Google Search better, Google reported that only around a fifth of its experiments were successful.

“In 2019, we ran more than 17,000 live traffic experiments to test out new features and improvements to Search. If you compare that with how many launches actually happened (around 3,600, remember?), you can see that only the best and most useful improvements make it into Search.”

We could explore additional sources, but the picture would most likely remain largely the same. At the very least, we can draw a few key lessons and insights:

  • If failure is inevitable, then we can do our best to fail fast and to fail inexpensively (without spending a lot of resources and time on the experiment).
  • Product managers should work hard on the quality of their hypotheses. Of course, we can’t guarantee the success of an experiment, but it’s better to increase our chances and avoid wasting resources on ideas that have no supporting evidence.

Even if the success rate is a key metric for the company, the experimentation platform can’t make your experiment successful. But it can help you run your experiments as fast as possible without additional trouble. That’s why, for the experiment service, the north star is “the total number of experiments” or its derivative, “the total number of treatments.” You can read more about this in an amazing article written by Ron Kohavi and Lukas Vermeer.

At the same time, we are experimenting quite a lot for a company of our size. Our statistics show that the company’s growth correlates with the number of experiments we launch and, obviously, with the number of teams launching them. Nowadays, we run from 100 up to 200 experiments simultaneously, and by the end of 2023 we expect to have launched about 2,000 experiments this year. Millions of our users (almost all of them) participate in these experiments every day.

But how can this experimentation platform help us? We believe that fast and efficient experimentation is the key to success. Having a unified and common experimentation approach, including specific tooling and algorithms for all teams, leads the company toward a bright future and helps us be the #1 women’s health application.

Of course, besides the volume of the experimentation, we have to measure the pace and quality:

  • Average experiment time to market
  • Average experiment running time
  • Number of invalid experiments (detected by health checks and manual audit)

Check the article I mentioned above and dig deeper into the articles about experimentation at Booking.com. You’ll find that it’s essential to keep the quality of your A/B tests high instead of blindly increasing their volume. But how do we achieve that?

Managing the experimentation workflow

To make this happen, we try to unify the whole company around the same workflow and methods. The experiment service manages one part of it, and the other part is managed in Jira. The thing is that the life cycle of an experiment starts long before it appears in the experimentation system. When someone gets an idea to experiment on something, they first create an experimentation Jira ticket. Then, if it’s decided to check the hypothesis with an A/B test, the process goes further. Every experiment can be synchronized with the original Jira ticket, and the statuses are then updated automatically.

What we’re trying to achieve is to track the entire life cycle and manage the process in a unified way. The efficiency and quality of experiments come down to how well you control the chaos and keep standards high while growing. So we believe that unifying the A/B testing flow is the key to success here.

It also gives managers and analysts access to historical experimentation data. Through this approach, we’re trying to keep everything systematized: the A/B test specification in Confluence, mock-ups in Miro, technical details in Jira, and the relevant statistics in the experiment service are all connected. Easy access to these assets lets our experimentation move faster and more smoothly.

Experiments impact

How do experiments actually impact company metrics, and what is the quality of your experimentation system? We mentioned that we track different statuses of our experiments, but the overall goal is to launch more experiments that uplift our primary metrics. The idea is simple, but the implementation isn’t so straightforward. There are a few things you need to think about.

In practice, this means that most of our experiments have a list of default metrics (specific to each product team). Usually, the primary metrics directly or indirectly impact revenue, retention, or both. When you work with subscriptions, you can’t always wait a few months for your results. That’s why we use proxy metrics, such as subscription retention: it saves time and allows us to run experiments faster. But it’s not enough.

Be aware of the mistakes and flaws people make during experiment design and analysis, such as SRM (sample ratio mismatch), the multiple comparisons problem, etc. We run several health checks to detect and highlight these problems, because if you make a mistake, you usually have no idea you made one until it’s eventually detected. Such health checks are very common in any experimentation platform, but you would be surprised how many people simply forget or ignore these warnings. So you need to be very careful and focused on keeping these dangerous mistakes and side effects under control.
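To give a feel for what such a health check can look like, here is a minimal SRM check sketched as a chi-squared goodness-of-fit test on the observed group sizes versus the configured traffic split. The function name and the alert threshold are illustrative assumptions, not our production implementation.

```python
from scipy.stats import chisquare

def srm_check(observed_counts, expected_ratios, alpha=0.001):
    """Flag a sample ratio mismatch (SRM) between observed and configured splits.

    observed_counts: users actually assigned to each group, e.g. [101_200, 98_300]
    expected_ratios: configured traffic split, e.g. [0.5, 0.5]
    A very small p-value means the observed split deviates from the configured one,
    so the assignment (or logging) is likely broken and results can't be trusted.
    """
    total = sum(observed_counts)
    expected_counts = [total * ratio for ratio in expected_ratios]
    _, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)
    return {"p_value": p_value, "srm_detected": p_value < alpha}

# A 50/50 split that drifted by ~1.5% of users -- this should raise a flag.
print(srm_check([101_200, 98_300], [0.5, 0.5]))
```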

It’s also good to have a so-called “holdout” group of users who don’t participate in experiments and rollouts at all for some time. This is the only way we know to properly track the actual effect of our experiments on users. Otherwise, how can you measure the real impact? There is no guarantee that after the full feature rollout you’ll see the same metric uplift as in the isolated experiment. There can be different reasons for that, but holdouts let you control the long-term effects of your experiments instead of optimizing for short-term wins.
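To make the idea concrete, here is a small sketch of how a holdout could be carved out with deterministic hashing before any experiment assignment happens. The salt, the percentage, and the function are hypothetical, just to illustrate the mechanics.

```python
import hashlib

HOLDOUT_PERCENT = 5  # assumption: reserve 5% of users as a global holdout

def in_holdout(user_id: str, salt: str = "global-holdout") -> bool:
    """Deterministically map a user to [0, 100) and reserve the first slice as holdout.

    Holdout users are excluded from experiment and rollout targeting, so their
    metrics reflect the untouched baseline you can compare everything else against.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < HOLDOUT_PERCENT

# Only non-holdout users proceed to experiment assignment.
eligible_users = [u for u in ("user-1", "user-2", "user-3") if not in_holdout(u)]
```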

How do we calculate statistical significance?

Maybe the key part of experimentation is answering the question: “Is the difference in metrics between the control and variant groups statistically significant?” One way to answer it is the frequentist approach: you calculate the p-value and, based on the parameters of the experiment, decide whether to accept or reject the variant group (we prefer the more academic terms, “null and alternative hypotheses”). There are different statistical tests for this, each with its pros and cons. It’s worth mentioning that all of these operations are performed automatically by our metrics calculation engine.

For instance, the bootstrap is one of the most universal and powerful approaches, especially because it doesn’t care about the distribution of your data. However, it requires serious computational resources to run on millions of users. This is why we prefer to assess the distribution of our metrics and, based on their type and parameters, choose the most appropriate statistical test according to our experience. But there is a good magic trick to reduce the load caused by bootstrapping: a process called “bucketing.” This technique groups a bunch of users into a so-called “bucket” and then bootstraps the distribution of bucket-level values instead of working with every individual user. It seriously reduces the volume of data without a significant loss in the accuracy or quality of the experiment. And besides that, it makes the distribution closer to normal.
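Here is a simplified sketch of the idea (the number of buckets, function names, and the use of Python’s built-in hash are illustrative; a production system would use a stable hash and run this on Spark): users are mapped into a fixed number of buckets, each bucket is collapsed into one mean, and the bootstrap then resamples those bucket means.

```python
import numpy as np

def bucketize(user_ids, values, n_buckets=1000):
    """Collapse user-level metric values into bucket-level means.

    Each user is deterministically mapped to one of n_buckets buckets, so the
    bootstrap later resamples ~1,000 bucket means instead of millions of users.
    """
    buckets = np.array([hash(uid) % n_buckets for uid in user_ids])
    values = np.asarray(values, dtype=float)
    sums = np.bincount(buckets, weights=values, minlength=n_buckets)
    counts = np.bincount(buckets, minlength=n_buckets)
    return sums[counts > 0] / counts[counts > 0]

def bootstrap_diff_ci(control, treatment, n_iter=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference of means on bucketed values."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        c = rng.choice(control, size=len(control), replace=True)
        t = rng.choice(treatment, size=len(treatment), replace=True)
        diffs[i] = t.mean() - c.mean()
    low, high = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return low, high  # the difference is significant if the CI excludes 0
```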

If the metric is binary, bootstrapping may not be worth it, so we use a chi-squared test. If the metric is numeric, we check its distribution with the Kolmogorov-Smirnov test or the Shapiro-Wilk test. Based on the results of these tests, we use either Welch’s t-test or bootstrapping on bucketed values.
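Put together, the selection logic can be sketched roughly like this. It reuses bucketize and bootstrap_diff_ci from the sketch above; the normality check (Shapiro-Wilk on a sample) and the exact decision rules are simplified assumptions rather than our production logic.

```python
from scipy import stats

def analyze_metric(control, treatment, control_ids, treatment_ids,
                   is_binary, alpha=0.05):
    """Pick a statistical test based on the metric type and its distribution."""
    if is_binary:
        # Binary metric (converted / not converted): chi-squared test on a 2x2 table.
        table = [[sum(control), len(control) - sum(control)],
                 [sum(treatment), len(treatment) - sum(treatment)]]
        _, p_value, _, _ = stats.chi2_contingency(table)
        return {"test": "chi-squared", "significant": p_value < alpha}

    # Numeric metric: check how close the distribution is to normal
    # (Shapiro-Wilk on a sample here; Kolmogorov-Smirnov works as well).
    looks_normal = (stats.shapiro(control[:5000])[1] > alpha
                    and stats.shapiro(treatment[:5000])[1] > alpha)
    if looks_normal:
        p_value = stats.ttest_ind(treatment, control, equal_var=False).pvalue
        return {"test": "welch-t", "significant": p_value < alpha}

    # Heavily non-normal metric: bootstrap on bucketed values.
    low, high = bootstrap_diff_ci(bucketize(control_ids, control),
                                  bucketize(treatment_ids, treatment))
    return {"test": "bucketed-bootstrap", "significant": not (low <= 0 <= high)}
```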

Pros of this approach:

  • All methods besides bootstrapping are easy to parallelize and well suited for implementation on Spark.
  • This approach adapts to any experiment group size or metric type and gives a predictable calculation time.
  • The bucketing feature won’t let the job exceed reasonable memory limits (which also improves stability).

Cons:

  • The algorithm still doesn’t allow users to see intermediate experiment results and draw conclusions based on them without violation of the basic A/B testing rules (the “peeking problem”).
  • We need to be careful with choosing N and M values because the overall quality of the “bootstrap with bucketing” approximation depends on them (but honestly, we have never had problems with this in practice yet).
  • The many conditional branches can be tricky to implement, so the logic needs to be tested well.

Dealing with the peeking problem

One of the most challenging topics in experimentation is known as the “peeking problem.” Long story short, it happens when you accept the alternative hypothesis before the planned end of an experiment. For instance, the success metric of our variant group finally shows a statistically significant result in the second week of an A/B test, but the planned duration of the experiment was three weeks. By stopping there, you dramatically inflate the chance of a type I error. To avoid that, you need to make the decision about the experiment only at a specific point, which you calculate in advance with a statistical significance calculator (not before and not after that point!). This is sometimes known as the “fixed horizon” approach.
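For reference, the “horizon” is the sample size computed up front from the baseline value of the metric, the minimum detectable effect, and the desired error rates. A standard two-proportion version of that calculation looks roughly like this (the numbers in the example are made up):

```python
from scipy.stats import norm

def sample_size_per_group(baseline_rate, mde, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect an absolute uplift of `mde`
    in a conversion-style metric with a two-sided test at the given alpha/power."""
    p1, p2 = baseline_rate, baseline_rate + mde
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / mde ** 2) + 1

# e.g. detecting a +0.5 p.p. uplift on a 10% baseline conversion rate
print(sample_size_per_group(0.10, 0.005))  # ≈ 58,000 users per group
```

Make the decision when (and only when) both groups reach that size.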

You can see an example from one of our real experiments in the screenshot below. The p-value crosses the threshold multiple times; if you make a decision at any of those points, you will quickly get into trouble.

[Screenshot: the p-value of a real experiment crossing the significance threshold several times over the course of the test]

At first, we tried to teach everyone how to use the calculator and how to make the right decision based on the pre-calculated “horizon.” But the problem here lies in human nature rather than in the complexity of statistical methods. It’s usually tough for people to resist the temptation to accept the alternative hypothesis sooner when they see a low p-value, or to keep waiting until they have enough evidence to prove their point.

The second problem is that for guardrail metrics (or support metrics, as we usually call them), you also need to wait for the same horizon; otherwise, it’s the same peeking as with the success metric. If you stop an experiment early because you see a guardrail metric going down, you can also accidentally make a false-positive or false-negative call.

So, besides having the classic fixed horizon approach in place, we applied a method that allows us to monitor experiment results continuously and eliminates the peeking problem. This kind of method belongs to so-called sequential analysis.

How does sequential analysis work?

We implemented one of the variants of the sequential testing approach, called the “always valid p-value” (AVPV). It’s not easy to explain how this approach works, so I prefer to leave the explanation to the creators of the method; anyone interested in more details can follow the link and read the white paper. But these are the most critical points of the method (with a small illustrative sketch after the list):

  • We define the “decision border,” which becomes smaller and smaller as we get more users in our experiment.
  • If the difference of means between control and test variants exceeds this border, we accept the alternative hypothesis.
  • P-value, in that case, can only go down. If we accept the success of our variant once, it’s our final decision.
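As an illustration only, here is a minimal sketch of a mixture-SPRT-style always valid p-value applied to a difference of means, following the white paper. The mixing parameter tau and the plugged-in means and variances are assumptions for the example; a real implementation handles estimation details, multiple metrics, and guardrails.

```python
import math

def always_valid_p_value(mean_c, var_c, n_c, mean_t, var_t, n_t,
                         tau=0.01, prev_p=1.0):
    """One update of an always-valid p-value for the difference of means.

    V is the estimated variance of the difference-of-means estimate and tau^2 is
    the mixing variance over effect sizes; 1/Lambda acts as the decision border,
    which shrinks as more users enter the experiment.
    """
    theta = mean_t - mean_c                      # observed difference of means
    V = var_c / n_c + var_t / n_t                # variance of that estimate
    log_lambda = (0.5 * math.log(V / (V + tau ** 2))
                  + theta ** 2 * tau ** 2 / (2 * V * (V + tau ** 2)))
    p_now = min(1.0, math.exp(-log_lambda))
    return min(prev_p, p_now)                    # the p-value can only go down

# Feed running summary statistics batch by batch (the numbers are made up);
# reject the null hypothesis the first time the p-value drops below alpha.
p = 1.0
for n in (10_000, 20_000, 40_000):
    p = always_valid_p_value(mean_c=0.200, var_c=0.16, n_c=n,
                             mean_t=0.210, var_t=0.16, n_t=n, prev_p=p)
    print(n, round(p, 3))   # the p-value only decreases as evidence accumulates
```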

But in statistics, nothing comes without a trade-off. What’s very important to understand here is that sequential methods (especially AVPV) should be slower than the fixed horizon approach if you estimate the minimum detectable effect (MDE) perfectly, which happens very rarely in real life. In practice, it usually works faster, because it’s almost impossible to estimate the MDE precisely, and people tend to ignore the limitations of the fixed horizon approach and peek at the p-value one way or another. So even when we pay with some experimentation speed, at scale it’s worth it. According to our simulations and the data of other companies, the ability to stop early gives us up to a 20% increase in experimentation speed. That’s how we can rely on the results of our experiments, and it allows our product managers and other people without deep statistical skills to work with experiments without additional complications.

Still, analysts can turn on a “debug mode” to see additional information or check the calculations made by the fixed horizon method in the system. It’s not an ideal situation, because jumping from one algorithm to another during experiment analysis is a form of peeking that violates the conditions of the experimentation process. But it’s an option if the analyst is ready to take that risk, and the experimentation platform saves them time by providing automation and extended analysis of the data.

This year, Spotify and Booking.com published excellent articles comparing different methods of sequential testing. If you’re interested in this topic, I encourage you to read them thoroughly.

Controlling metrics calculation chaos

Another huge topic is metrics calculation. In the beginning, our metrics calculation system was simple. As you can read in the previous article, we used to have a Looker dashboard for all metrics calculations. It was pretty convenient to have all metrics in one place, but it had more cons than pros. There was no proper life cycle for experiments in the dashboard: metrics kept being calculated for all of them, sometimes including experiments that had already been stopped. That caused an enormous waste of our computational resources, and one day the dashboard simply stopped working under the ever-growing load.

Instead of scaling the existing system, we decided to rethink it completely. First, we introduced the “Metrics Repository”: the place where all metrics are listed, properly described, and reviewed. Every metric has an owner, so we know whom to go to if something goes wrong or if we want to bump a metric’s version.

Under the hood, each metric is a Spark SQL query with adjustable parameters (including macros). During experiment creation, you can choose, for instance, which particular day’s retention rate you want to calculate (1, 3, 7, 14, 30, etc.), and you don’t need several separate metrics for that. Sadly, it’s not possible to have just a plain “retention rate” without choosing a specific day, so you need to decide which days you’re going to analyze. But at any time during the experiment, you can add more metrics on the fly and launch their calculation, which is a very powerful option, especially in combination with sequential testing.
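As a rough illustration of what such a parameterized metric can look like (the table and column names below are made up, not our actual schema):

```python
# A metric in the repository is essentially a Spark SQL template plus parameters.
RETENTION_METRIC_SQL = """
SELECT a.user_id,
       MAX(CASE WHEN DATEDIFF(s.event_date, a.assignment_date) = {retention_day}
                THEN 1 ELSE 0 END) AS metric_value
FROM   experiment_assignments a
LEFT JOIN app_sessions s ON s.user_id = a.user_id
WHERE  a.experiment_id = '{experiment_id}'
GROUP BY a.user_id
"""

# At experiment-creation time, the analyst only picks the parameter values:
query = RETENTION_METRIC_SQL.format(retention_day=7, experiment_id="exp-123")
# spark.sql(query)  # executed later by the metrics calculation engine
```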

The experiment service allows users to calculate dozens of metrics simultaneously and to launch a recalculation any time they need it. All changes go through a review process within the experiment service interface, and we also communicate them publicly in a separate Slack channel so that our analysts and data engineers can work together. It’s easy and transparent and gives us a sufficient level of control. What I love the most is that now we know for sure how much money we spend on it: since we have a separate Spark app and cluster for metrics calculation, we can distinguish this spending from other SQL queries and workloads. Besides that, we track the most popular metrics in our experiments, the most time-consuming ones, etc.

The reason this is essential is not only that everything in experimentation is connected to metrics. When dozens of millions of users are in A/B tests daily, this operation becomes very expensive and generates a severe load on the cluster. We implemented extensive monitoring, which allows us to identify the most problematic, poorly performing metrics and manage different parts of the metrics calculation system, including the statistical significance calculation phase.

Managing feature configurations

The last important topic is managing feature configurations. Often, this part is completely forgotten. But features and experiments are profoundly entangled and, in many cases, can’t be separated.

The life cycle of a feature is much longer than the life cycle of an experiment. To continuously experiment with your hypotheses, it’s better to avoid depending on the experiment name in your code and to depend on the feature instead. Moreover, using feature configurations in your A/B tests allows you to experiment continuously while managing everything remotely.

Feature configuration systems control the entire behavior of modern mobile and web apps. When you want to launch an A/B test, it very frequently requires changes to the feature configuration for the control and test groups, and if your experimentation system doesn’t support this, it becomes a problem. At Flo, we deeply integrated the feature configuration services we use with our experimentation system. In most cases, it’s very inexpensive and easy to configure any type of user behavior we need in any part of the application (both mobile and web). My experience tells me that investing in your feature configuration system and its connection with the experimentation system is one of the key ways to make your experimentation faster.
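To show what this integration means in practice, here is a hypothetical sketch (the config keys, experiment names, and structure are purely illustrative): the client asks for its feature config, and the experimentation layer overrides values for users assigned to a test group, so the app never references the experiment by name.

```python
# Hypothetical illustration of experiment-aware feature configuration.
DEFAULT_CONFIG = {"paywall_layout": "classic", "show_trial_banner": False}

EXPERIMENT_OVERRIDES = {
    # experiment_id -> {variant -> config overrides}
    "exp-paywall-layout": {
        "control": {},
        "variant_a": {"paywall_layout": "cards", "show_trial_banner": True},
    },
}

def resolve_config(user_assignments: dict) -> dict:
    """Merge default feature config with overrides from the user's experiment variants.

    user_assignments: {experiment_id: variant} produced by the assignment service.
    """
    config = dict(DEFAULT_CONFIG)
    for experiment_id, variant in user_assignments.items():
        config.update(EXPERIMENT_OVERRIDES.get(experiment_id, {}).get(variant, {}))
    return config

# The app reads only the resolved config and stays unaware of experiment names.
print(resolve_config({"exp-paywall-layout": "variant_a"}))
```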

Moving forward with lessons learned

First of all, we’ll work on more integrations and automation and on increasing our teams’ productivity. We believe that further growth is only possible if we reduce the number of mistakes in A/B tests caused by different problems, including bugs, incorrect experiment setup, and faulty feature configurations. To improve these areas, we’ll keep investing in better experimentation management and in optimizing our entire A/B testing flow. For instance, some teams have an interesting practice of marking which experiments potentially require attention from analysts and which can be run without spending their time. Saving time for our analysts is one of the priorities, as it allows us to increase velocity.

Sharing knowledge between the teams is also a challenging topic. I’ve realized that one team regularly spends time resolving issues that another team has already solved, so I see this as the next point of improvement. The experimentation service UI also reflects the process and highlights preferred configurations and tips for end users. But in practice, it’s not enough, and you need to invest more time in collecting best practices and guides for your users.

Experimentation is a huge topic with many ways to succeed and fail. We constantly evolve and try to make our users’ experience better. Thanks for reading this! Please leave your comments and questions.
