Mindful Experimentation: Evaluate Recommendation System Performance using A/B Testing at Headspace
Author: Rohan Singh Rajput. Rohan is a Senior Data Scientist at Headspace. He combines his passion for Machine Learning with Causal Inference to improve the mindfulness and meditation practices of Headspace users.
“If you can’t measure it, you can’t improve it.” — Peter Drucker.
Content Customizer is Headspace’s personalized recommendation system. Content Customizer uses historical data to train its machine learning model and provide personalized recommendations to the users, helping them discover more relevant content. There are various components involved in building a recommendation system, and the ML model is only one of them. Therefore, it is essential to evaluate the effectiveness of the recommendation system as a whole before deploying it to production. Online Controlled Experiments help us to assess our system’s impact with statistical evidence.
Online Controlled Experiments, a.k.a. A/B tests, are the gold standard for establishing causality with high probability. In a data-driven decision-making culture, they let us quantify the uncertainty of a measurement and decide, based on experimental data, whether to reject the null hypothesis. Furthermore, randomly assigning users to control and treatment groups allows us to safely ignore unobserved factors and model the parameters as random variables (1).
The following components are required to design the experiment.
Recommendation Surface Area: We have a total of three surface areas for this experiment:
- First, in the Today tab, we will use the last three slots to display ML recommendations.
- Second is the Hero module, which is the top banner area of the Meditate/Sleep/Focus/Move tabs.
- Lastly, we will be using the recommended sub-tabs that also have three slots each for ML-powered content.
- In total, we have 19 places available for the experiment.
The platform for Recommendation: Headspace delivers its content on multiple platforms, including iOS, Android, and Web. Depending on technical and business requirements, we may select different platforms for experiments. For this experiment, we are choosing Android and iOS.
Audience Selection Criteria: Machine Learning models are trained on historical data. Keeping the distribution of characteristics in the training set similar to that seen during inference is vital. We define the intended group of users for the model and the experiment using several criteria, based on the ML model objective and business requirements:
- Type of users: B2C, B2B
- Subscription type: Paid, Free trials
- Language: English, German, French
- Region: North America, International
Driver Metrics and Business KPIs: To evaluate the experiment’s effectiveness, we need a driver metric that helps us quantify the effect. There are multiple metrics we can use to measure the impact of the ML model. Some of these metrics are provided below:
- Content Start
- Content Complete
- Unique Content Start
- Unique Content Complete
- Content Complete to Content Start ratio
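As a minimal sketch of how these driver metrics could be derived per user, the snippet below aggregates a small event log. The tuple schema `(user_id, content_id, event_type)` and the event names `start` and `complete` are assumptions made for illustration, not Headspace's actual event format.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, content_id, event_type).
# The schema and values are illustrative assumptions only.
events = [
    ("u1", "c1", "start"),
    ("u1", "c1", "complete"),
    ("u1", "c1", "start"),
    ("u1", "c2", "start"),
    ("u2", "c1", "start"),
    ("u2", "c1", "complete"),
]


def driver_metrics(events):
    """Aggregate the five driver metrics per user from event tuples."""
    counts = defaultdict(lambda: {"start": 0, "complete": 0})
    unique = defaultdict(lambda: {"start": set(), "complete": set()})
    for user_id, content_id, event_type in events:
        if event_type not in ("start", "complete"):
            continue  # ignore events that are not driver metrics
        counts[user_id][event_type] += 1
        unique[user_id][event_type].add(content_id)

    result = {}
    for user_id, c in counts.items():
        result[user_id] = {
            "content_start": c["start"],
            "content_complete": c["complete"],
            "unique_content_start": len(unique[user_id]["start"]),
            "unique_content_complete": len(unique[user_id]["complete"]),
            # Guard against users who completed nothing they started.
            "complete_to_start_ratio": (
                c["complete"] / c["start"] if c["start"] else 0.0
            ),
        }
    return result


metrics = driver_metrics(events)
```

Averaging any of these per-user values within the control and treatment groups gives the quantity that the experiment compares.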
It is essential to consider the business KPIs that the driver metrics will impact. These KPIs help us estimate the business impact by establishing a coupling with driver metrics. A few examples of such KPIs are:
- Revenue per employee
- Daily/Weekly/Monthly active users
- Free trial to paid conversion rate
- Long Term Value
- Daily/Weekly/Monthly churn rate
We should note that our driver metrics should be as close as possible to business KPIs. They are collectively known as Overall Evaluation Criteria.
Sample Size and Experiment Duration: There are various ways to calculate the sample size required for statistical significance. It usually depends on the type of experiment we are performing, e.g., A/B or A/B/n. Generally, the required sample size increases with the following factors:
- An increasing number of variants in the experiment
- An increasing number of KPIs and Driver Metrics
- A smaller effect size to detect (a more sensitive experiment)
- Increasing the desired statistical power
- Increasing the desired confidence level
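The relationship between these factors and the sample size can be sketched with the standard normal-approximation formula for comparing two proportions. This is a generic textbook calculation, not necessarily the method used for this experiment; the function name, the baseline rate, and the use of a Bonferroni division for multiple comparisons are assumptions for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift,
                            alpha=0.05, power=0.8, comparisons=1):
    """Approximate users needed per variant for a two-sided test on rates.

    comparisons: number of variant/metric comparisons; alpha is divided
    by it (Bonferroni), which is one simple, conservative choice.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    adjusted_alpha = alpha / comparisons
    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)  # confidence level
    z_beta = NormalDist().inv_cdf(power)                    # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical inputs: 20% baseline conversion, 10% relative lift.
n_base = sample_size_per_variant(0.20, 0.10)
n_smaller_effect = sample_size_per_variant(0.20, 0.05)
n_more_power = sample_size_per_variant(0.20, 0.10, power=0.9)
n_more_confidence = sample_size_per_variant(0.20, 0.10, alpha=0.01)
n_more_comparisons = sample_size_per_variant(0.20, 0.10, comparisons=3)
```

Each of the four variations requires more users than `n_base`, matching the list above. Dividing the required sample size by the expected number of eligible users per day gives a rough lower bound on the experiment duration.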
Once we have finalized the given parameters, we randomly bucket the users into two groups: the control group and the treatment group. The control group gets static editorial content in all of the slots. In contrast, the treatment group sees personalized content from the machine learning model in the 4th, 5th, and 6th slots of the Today tab and within the Hero and Recommended modules of the other four tabs.
Experimentation requires robust infrastructure to scale to a larger user base. From an infrastructure point of view, we need a platform for appropriately administering experiments according to the various needs of our client app and the media we serve. The Content Customizer model uses the architecture below for experiments. Its components are as follows:
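One common way to implement random but stable bucketing is to hash the user ID together with an experiment key. The sketch below illustrates that general idea only; it is not Optimizely's actual bucketing algorithm, and the function name, the experiment key, and the 50/50 split are assumptions.

```python
import hashlib


def assign_variant(user_id, experiment_key, treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing the experiment key with the user ID keeps a user's bucket
    stable across sessions and independent across experiments.
    """
    payload = f"{experiment_key}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# Hypothetical experiment key and user IDs.
assignments = [
    assign_variant(f"user-{i}", "content-customizer-test")
    for i in range(10_000)
]
treatment_count = assignments.count("treatment")
```

Because the assignment depends only on the inputs, the same user always lands in the same group, which keeps the experience consistent for the duration of the experiment.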
- Content Customizer Model: A Machine Learning model that generates personalized recommendations for every eligible user. It sends its prediction to the prediction service using offline inference.
- Prediction Service: A microservice that sends predictions to the layout service asynchronously. It is required to update the layout service after post-processing of the content.
- Layout Service: A bundle of services that creates a layout for the client for respective tabs. It receives requests from the client and serves appropriate content to the layout.
- Optimizely: Optimizely is an online A/B testing platform that helps to administer the online controlled experiment. It randomly buckets the eligible users into control and treatment. Optimizely uses feature flags to assign users into different buckets and maintain several statistical assumptions for the experimentation.
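To make the interaction between these components concrete, here is a rough sketch of how a layout for the Today tab could be assembled from the variant assignment and the stored predictions. The function name, the content IDs, and the fallback to editorial content when no predictions exist are assumptions for illustration; the actual services are not described at this level of detail.

```python
def build_today_layout(variant, editorial, predictions, ml_slots=(3, 4, 5)):
    """Return the content IDs for the Today tab.

    variant:     'control' or 'treatment', as decided by the A/B platform.
    editorial:   static editorial content IDs, one per slot.
    predictions: ML-recommended content IDs for this user (may be empty).
    ml_slots:    zero-based indexes of the 4th, 5th, and 6th slots.
    """
    layout = list(editorial)
    # Control users, and treatment users without predictions, keep the
    # editorial layout (the fallback is an assumed behavior).
    if variant != "treatment" or not predictions:
        return layout
    for slot, content_id in zip(ml_slots, predictions):
        layout[slot] = content_id
    return layout


# Hypothetical content IDs.
editorial = ["e1", "e2", "e3", "e4", "e5", "e6"]
predictions = ["m1", "m2", "m3"]

control_layout = build_today_layout("control", editorial, predictions)
treatment_layout = build_today_layout("treatment", editorial, predictions)
```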
The evaluation of A/B testing is divided into two parts:
1. Statistical Significance: Once we reach the calculated sample size and experiment duration, we can evaluate the test with the following information:
   - Effect Size: We need the effect size of every variant to evaluate the test. Generally, this value is the difference between a consumption average (average content completes/starts/plays) or the conversion rate (number of successes per trial). We sometimes also use relative differences in terms of percentage to measure the effect, for example, a 10% lift/drop in content start/complete/play events.
   - Number of KPIs or Metrics: It is highly recommended to keep this number as low as possible. We usually use only one KPI or driver metric (average content start/complete/play) for the evaluation; this metric should be very close to the business metric (retention/engagement/churn). We can increase the number of KPIs for an experiment. However, we must adjust the statistical estimates (p-value and confidence interval) accordingly to reduce the type I error probability.
   - P-value: The p-value is the probability of observing an effect at least as extreme as the one measured, assuming the null hypothesis is true. It helps us decide whether to reject the null hypothesis. In the standard A/B test, we use 0.05 as the threshold to reject or fail to reject the null hypothesis. However, this value rests on conventional assumptions; if we have more sensitive tests, multiple variants, or multiple KPIs under consideration, we have to adjust it using a p-value correction technique. The most common p-value correction method is the Bonferroni correction. The following table lists the p-values for each experimentation tab.
   - Confidence Interval: A confidence interval expresses the uncertainty in the estimated treatment effect. It is a range constructed so that, over repeated experiments, a stated share of such intervals (for example, 95%) would contain the true treatment effect. The measured p-value along with the CI helps us decide whether to reject the null hypothesis. A narrower confidence interval indicates a more precise estimate than a wider one. It also allows us to understand the different segments of the problem. The following example provides a confidence interval for every tab. Any interval that does not cross the blue vertical line at 0 and lies entirely on the positive side (Meditate, Focus, All Tabs) can be used to reject the null hypothesis.
2. Practical Significance (2): In a business setting, it is crucial to set some guidelines about the success criteria. There are various business goals we can incorporate when we formulate the experiment. Practical significance boundaries help us to encode such constraints in the experiment. Many experiments reach statistical significance with conventional type I and type II error settings. However, not every experiment has enough impact to get a green light for implementation. Factors like cost, effort, and long-term value are involved in decision-making, and the impact has to justify these tradeoffs. Agreeing on a practical significance threshold with every project stakeholder, including engineers, product managers, and the leadership team, is highly recommended. A practical significance threshold helps us to select the most feasible model for production implementation or rollout. We can apply the practical significance threshold to the confidence interval as well as to the p-value. One example of practical significance is given below; here, we have set a minimum threshold of 2.5% (shown with the green line) to evaluate the experiment's feasibility from an engineering and business point of view.
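The two parts of the evaluation can be sketched together in a few lines. The example below uses a two-sample z-test on group means, a Bonferroni-adjusted threshold, and a practical significance check on the lower bound of the confidence interval. The summary statistics are made up, the function name is hypothetical, and converting the 2.5% relative threshold into an absolute one by multiplying with the control mean is a simplification that ignores the uncertainty in that mean.

```python
import math
from statistics import NormalDist


def evaluate_ab(mean_c, var_c, n_c, mean_t, var_t, n_t,
                alpha=0.05, num_metrics=1, min_relative_lift=0.025):
    """Evaluate one metric of an A/B test from summary statistics."""
    norm = NormalDist()
    diff = mean_t - mean_c                       # absolute effect size
    se = math.sqrt(var_c / n_c + var_t / n_t)    # standard error of diff
    z = diff / se
    p_value = 2 * (1 - norm.cdf(abs(z)))         # two-sided p-value

    adjusted_alpha = alpha / num_metrics         # Bonferroni correction
    z_crit = norm.inv_cdf(1 - adjusted_alpha / 2)
    ci_low, ci_high = diff - z_crit * se, diff + z_crit * se

    # Simplified practical threshold in absolute units (assumption).
    practical_threshold = min_relative_lift * mean_c
    return {
        "absolute_effect": diff,
        "relative_effect": diff / mean_c,
        "p_value": p_value,
        "confidence_interval": (ci_low, ci_high),
        "statistically_significant": p_value < adjusted_alpha,
        "practically_significant": ci_low > practical_threshold,
    }


# Hypothetical data: average content starts per user, 10,000 users each.
strong = evaluate_ab(2.0, 4.0, 10_000, 2.2, 4.0, 10_000)
weak = evaluate_ab(2.0, 4.0, 10_000, 2.06, 4.0, 10_000)
weak_three_metrics = evaluate_ab(2.0, 4.0, 10_000, 2.06, 4.0, 10_000,
                                 num_metrics=3)
```

With these made-up numbers, `strong` clears both bars, `weak` is statistically but not practically significant, and `weak_three_metrics` is no longer statistically significant once the threshold is corrected for three metrics.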
The final stage of the experimentation is decision-making; it is also the most challenging one. A business experiment is very different from a science experiment (1). In business, we usually care about an iterative approach where a solution might remain adequate for anywhere from a few weeks to a decade. In contrast, scientific experiments, such as clinical drug trials, are designed for the much longer term. Four main factors are involved in the decision-making for an experiment.
1. Risk: It is common to experience failure in an experiment. In reality, most experiments fail due to multiple external factors associated with the process, which makes an experiment very difficult to get right every time. If an experiment might interfere negatively with another running experiment, assessing that risk before launching is imperative. Even when we get satisfactory results, there is a chance that the effect is a false positive. It could lead to the deployment of an erroneous model, impacting the business. There are several strategies to mitigate such risk. Some of them are listed below.
   - We should always define some guardrail metrics for monitoring and alerting; we cannot compromise these metrics' performance during the experiment. Examples of such metrics are daily active members, the number of subscribers, and renewal rate. If we see any guardrail metric degrade, we should immediately halt the experiment and investigate.
   - To validate the infrastructure's reliability, we can also run an A/A test before launching an experiment. An A/A test buckets users into two variants that both receive the control experience, so we should fail to reject the null hypothesis.
   - Sequential tests provide the flexibility to stop a successful or futile experiment early and cut the sample size; however, they come with their own complexity of interpretation and implementation.
2. Reward: Every successful experiment has some rewards. Whenever we make a positive impact, it contributes to some improvement of driver and business metrics. The rewards can be divided into two parts: the small incremental value gained while the experiment runs, and the value gained after launching the successful variant to a larger user base.
3. Cost: Experiments come with a price, whether it’s a monetary cost or an engineering effort. This cost has a vital role in decision-making. Many experiments produce statistically significant results, but the effect size is insufficient to justify the cost. Some implementations might involve complicated rollback processes like substantial code change, huge version updates, or signing a contract with a third-party application. We should take care of all these things during the decision-making process. The budget allocation should be flexible enough to handle some fluctuations within expected funds.
4. Benefit: Not only does experimentation come with direct rewards, but it can also indirectly benefit other ecosystems in an organization. Performing rapid experimentation builds muscle memory for a fast evidence-based culture where innovations are supported by reliable and trustworthy methodology. Meta-analysis of experiments reveals important information about different data points, the scope for improvement, and the detailed effect of various features. It also helps us improve the efficiency of numerous systems, understand the limitations of impact, and explore new areas for discovery.
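The A/A test mentioned under Risk can be illustrated with a small simulation: when both variants draw from the same distribution, roughly a share `alpha` of tests should still come out "significant" by chance. The sketch below uses simulated data and a z-test on means; the distribution parameters, the seed, and the function names are assumptions, and a real A/A test would run on production traffic rather than synthetic samples.

```python
import math
import random
from statistics import NormalDist


def mean_and_variance(sample):
    """Sample mean and unbiased sample variance."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, var


def two_sided_p_value(sample_a, sample_b):
    """Two-sample z-test p-value for a difference in means."""
    mean_a, var_a = mean_and_variance(sample_a)
    mean_b, var_b = mean_and_variance(sample_b)
    se = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    z = (mean_b - mean_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(num_tests=1000, users_per_variant=200,
                           alpha=0.05, seed=7):
    """Share of simulated A/A tests that reject the null hypothesis."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(num_tests):
        # Both variants receive the same (control) experience.
        a = [rng.gauss(3.0, 1.5) for _ in range(users_per_variant)]
        b = [rng.gauss(3.0, 1.5) for _ in range(users_per_variant)]
        if two_sided_p_value(a, b) < alpha:
            false_positives += 1
    return false_positives / num_tests


rate = aa_false_positive_rate()
```

A rate far above `alpha` in a real A/A test would point to a problem in bucketing, logging, or the analysis itself rather than to a genuine effect.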
Various challenges come with an experiment, and these fall into several categories. Naturally, each component in a system has its limitations, and these create distinct challenges. Some of these are easier to solve than others.
- Infrastructure: We need many reliable components to administer the experiment. Sometimes we need end-to-end engineering support to make system-wide changes to implement the experimentation logic. Because ownership of services and features is spread across teams, coordination and communication are essential to a successful implementation. There are cases where we interact with third-party applications and multiple APIs, which makes this task even more challenging. It is essential to have these various stakeholders on board when making these extensive changes.
- Platforms and Tooling: Another challenge in experimentation is choosing the right platform for the solution. There are multiple platforms available in the market to conduct and analyze an experiment. However, each platform has pros and cons. It is essential to evaluate the problem statement before choosing the right platform. Many of the current tools are limited in how well they can be configured to fit the ecosystem. Some tools have a black-boxed evaluation engine, which could result in misinterpretation of results. Building an in-house solution for experimentation is another option, but it comes with its own trade-offs in effort, skillset, resources, and timelines. We should consider all these points while choosing the right platform for the problem. It is also essential to have proper monitoring and alerting tools to watch the guardrail metrics. Sometimes a negative impact is far more concerning than a small positive lift, so a reliable monitoring and alerting system is crucial to prevent such incidents.
- Statistical Misinterpretation: This is one of the most common experimentation issues and is difficult to detect. It is easy to falsely reject a null hypothesis by committing a type I error, concluding that the treatment has an effect when it does not. It is also easy to observe correlation, but it is tough to establish causality. Surprising results deserve extra scrutiny, because most extreme results are more likely to come from instrumentation errors (e.g., logging), loss or duplication of data, or a computational error (2). There are multiple ways to hack or misinterpret the p-value (3). Other statistical misinterpretations result from lack of statistical power, peeking at p-values, multiple comparisons, violation of the stable unit treatment value assumption (SUTVA), survivorship bias, sample ratio mismatch, primacy effect, novelty effect, etc. (3). Therefore, we should always make sure that we perform a thorough meta-analysis to validate our statistical assumptions and prevent any obvious errors during our analysis.
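One of the pitfalls listed above, sample ratio mismatch, is straightforward to check automatically. The sketch below applies a chi-square goodness-of-fit test with one degree of freedom to the observed group sizes; the user counts are made up, the function name is hypothetical, and the strict 0.001 alert threshold is an assumed choice rather than a rule stated in this article.

```python
import math


def srm_p_value(control_users, treatment_users, expected_treatment_share=0.5):
    """P-value of a chi-square test for sample ratio mismatch."""
    total = control_users + treatment_users
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((control_users - expected_c) ** 2 / expected_c
            + (treatment_users - expected_t) ** 2 / expected_t)
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


SRM_ALERT_THRESHOLD = 0.001  # assumed strict threshold for alerting

# Hypothetical user counts for a planned 50/50 split.
p_balanced = srm_p_value(5_000, 5_050)
p_mismatch = srm_p_value(50_000, 51_500)
has_srm = p_mismatch < SRM_ALERT_THRESHOLD
```

A very small p-value means the observed split is unlikely under the planned allocation, so the bucketing or logging should be investigated before trusting any metric comparison.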
Learnings and Future Roadmap
Online Controlled Experimentation plays a pivotal role in evidence-based, data-driven, and trustworthy decision-making. It is a battle-tested scientific tool for incremental improvement of the product. Although many challenges and investments are required to establish an experimentation culture, it pays back in the long term. Machine Learning is an essential part of data-driven innovation, and we can evaluate its impact by integrating experimentation into the process. Experimentation is not restricted only to evaluating Machine Learning models, but in general, we can use it for most impact assessments in an online setting. The process of experimentation requires help from multiple teams, so communication is also critical. We also need to be very careful during the result assessment to save ourselves from false discovery and unfruitful investment.
At Headspace, experimentation is a part of our innovation process. Our rapid experimentation framework allows us to perform quick prototyping of member-centric projects. In the future, we would like to continue to scale our experimentation framework to support all members’ needs and care. Experimentation will help us to provide mindfulness and mental wellness to our members by increasing the velocity of the model evaluation process.
1. Georgi Zdravkov Georgiev, Statistical Methods in Online A/B Testing: Statistics for data-driven business decisions and risk management in e-commerce
2. Ron Kohavi, Diane Tang, Ya Xu, Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing
3. Steven Goodman, A dirty dozen: twelve p-value misconceptions