Experimentation: tools and techniques

The BBC Experimentation and Optimisation Team discuss the scripting tools they have developed to assist products throughout the experimentation timeline

Frank Hopkins
BBC Data Science
10 min read · Mar 24, 2020


At the BBC, the Experimentation and Optimisation (E & O) Team operate as a team of experts sitting within the central Data and Insights function. Although their primary role is to help any of the organisation's digital products run experiments, the team have also been productionising methods to harness more granular insight from testing data. Scripting tools have been developed in R to support more detailed post-hoc analysis and to synthesise information from tests run across the organisation.

The E & O Team assist products throughout the entire experimentation timeline (Figure 1), from helping teams define hypotheses to building, QA'ing and launching tests and performing post-hoc analysis. This article covers stages 1 and 4, the bespoke tooling used to understand experiment success.

Figure 1. The experimentation timeline at the BBC, including the pre- and post-hoc analysis phases assisted by the Experimentation and Optimisation Team

Sample size, statistical power and experiment duration

The most frequently asked questions the E & O Team get are:

“How big of a sample do I need to achieve significance?”

“How long do I need to run my experiment for?”

Although our AB testing vendor uses a sequential model and statistical significance is determined in real time, it is important to answer these questions before conducting an experiment in order to assess both its efficacy and feasibility. The team have developed some simple code that does this: the user simply passes in some baseline numbers to determine the required sample size and experiment duration.

Luckily, given a few simple pieces of information, the pwr package in R can answer these two questions easily. pwr helps you perform power analysis before conducting an experiment, which enables you to determine how big your sample size should be per experimental condition.

The four quantities involved in a power analysis are intimately related, and we can compute any one of them if we have the other three:

1. sample size (n)

2. effect size

3. significance level (alpha) = P(Type I error) = probability of finding an effect that is not there

4. power = 1 - P(Type II error) = probability of finding an effect that is there

As your significance level (3) and power (4) are typically fixed by convention, as long as you can supply the effect size (2), derived from the conversion rates of your control and variant, you can solve for the required sample size (1).
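As a minimal illustration of that relationship (the conversion rates and traffic figures below are made up, not taken from a real BBC test), leaving one argument out of pwr.p.test() tells it which quantity to solve for; here the sample size is fixed and the function solves for power instead:

library(pwr)

# Effect size (Cohen's h) for two illustrative conversion rates (3.4% vs 3.6%)
effect <- ES.h(0.034, 0.036)

# Fix the sample size per condition and solve for power instead of n
pwr.p.test(h = effect, n = 50000, sig.level = 0.05)$power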

Thankfully, the ES.h() function in the pwr package computes the effect size for us to pass into the power analysis. We will typically know the current conversion rate of our control condition, but the effect of the variant is, almost by definition, unknown. However, we can calculate an expected effect size given a desired uplift. Once this effect size is computed it is passed into the pwr.p.test() function, which will return our required sample size provided n is left unspecified. To make this sort of analysis user friendly, these steps are wrapped into a new function called sample_size_calculator().

Furthermore, as we will use this information to calculate the number of days the experiment needs to run, a days_calculator() function has also been created, which uses the output of the sample size calculation:

library(pwr)

sample_size_calculator <- function(control, uplift){
  # Expected variant conversion rate given the desired relative uplift
  variant <- (uplift + 1) * control
  # Effect size (Cohen's h) between the two proportions
  effect_size <- ES.h(control, variant)
  # Leave n unspecified so pwr.p.test() solves for the sample size per condition
  sample_size_output <- pwr.p.test(h = effect_size,
                                   n = NULL,
                                   sig.level = 0.05,
                                   power = 0.8)
  if (variant >= 0) {
    return(sample_size_output)
  } else {
    paste("N/A")
  }
}

days_calculator <- function(sample_size_output, average_daily_traffic){
  # Two conditions, so double the per-variant sample size before dividing by daily traffic
  days_required <- (sample_size_output * 2) / average_daily_traffic
  if (days_required >= 0) {
    paste("It will take this many days to reach significance with your current traffic:",
          round(days_required, digits = 0))
  } else {
    paste("N/A")
  }
}

If you are using this tool, you simply specify your control conversion rate and desired uplift:

control <- 0.034567
uplift <- 0.01

And run the sample_size_calculator() function:

sample_size_output <- sample_size_calculator(control, uplift)$n
sample_size_output

You will then get your required sample size output given these values (remember this sample size requirement is per variant):

[1] 230345

Now we have this information, we can determine how long the experiment needs to run for. All you need to input is your average daily traffic:

average_daily_traffic <- 42000

Run the days_calculator() function:

days_calculator(sample_size_output, average_daily_traffic)

And you will get the following output:

[1] "It will take this many days to reach significance with your current traffic: 36"

Although this code only applies to an AB design (i.e. one with just two experimental conditions), the functions can be amended to calculate the required sample size for multiple experimental conditions by using the pwr.anova.test() function within sample_size_calculator() in place of pwr.p.test().
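A rough sketch of that amendment (the number of conditions and the effect size below are illustrative assumptions, not BBC figures): pwr.anova.test() takes the number of groups k and an effect size f (Cohen's f), and with n left unspecified it returns the required sample size per group.

library(pwr)

# Three conditions (control plus two variants), small illustrative effect size
pwr.anova.test(k = 3, f = 0.02, sig.level = 0.05, power = 0.8)$n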

Power analysis is an important aspect of any experiment design. It allows analysts to determine the sample size required to detect a statistically significant effect of a given size with a given degree of confidence, or conversely the size of effect that can be detected with a given level of confidence under sample-size constraints. If the probability of detecting your target effect is low, it may be advisable to alter the design of your experiment or to revisit the inputs to your power analysis.

The required sample size and experiment duration can be incredibly useful information to provide to stakeholders: it helps them plan their experimentation roadmaps efficiently, and it can also be used to judge whether a given experiment is feasible or whether the desired uplifts are unrealistic.

Post-hoc analysis

When an experiment is paused we may need to answer some additional questions. Perhaps a certain metric wasn’t tracked in our AB vendor or we want to see how a certain metric performed for a different demographic, age group or geographical region.

As we have an integration with a third-party analytics provider, we can obtain this information with the correct experimentation metadata, answer the questions above and determine whether effects are statistically significant. Two types of performance metric are typically assessed at the BBC: conversion rates (binary metrics taken as an average for an experimental condition) and per-browser metrics (which represent how many times, on average, a user fired an event). Because these two metric types require different statistical treatment to assess the change between variants, the E & O Team have developed two different scripts, which are discussed in detail below.

Conversion Rate Metrics

A conversion rate represents the proportion of users that fired an event: the percentage of users who clicked a button, completed a registration funnel or viewed a certain page, for example. The script below emulates the logic and statistical tests used in the online AB testing calculators you may be familiar with, but uses packages in R to compute uplifts and statistical significance.

Firstly, install the necessary packages required for analysis:

install.packages("pwr")
install.packages("scales")
library(pwr)
library(scales)

The following ab_calculator() function wraps a two-sample proportion test (prop.test): given the number of unique visitors in both experimental conditions and the number that fired the event in question, it computes the conversion rates, the uplift (or lack thereof) between variants and whether the difference is statistically significant:

ab_calculator <- function(control_uv, variant_uv, control_events, variant_events){
  # Two-sample test of proportions (events out of unique visitors)
  ab_output <- prop.test(c(control_events, variant_events), c(control_uv, variant_uv))
  conversion_rate_control <- control_events / control_uv
  conversion_rate_variant <- variant_events / variant_uv
  # Relative uplift of the variant over the control
  uplift <- (conversion_rate_variant / conversion_rate_control) - 1
  uplift_formatted <- percent(uplift)
  if (ab_output$p.value < 0.05 && uplift > 0){
    sprintf("There is a statistically significant difference between the control and the variant. During the test, the variant performed better than the default by %s. The mean of the variant is %f and the mean of the control is %f", uplift_formatted, conversion_rate_variant, conversion_rate_control)
  } else if (ab_output$p.value < 0.05 && uplift < 0){
    sprintf("There is a statistically significant difference between the control and the variant. During the test, the variant performed worse than the default by %s. The mean of the variant is %f and the mean of the control is %f", uplift_formatted, conversion_rate_variant, conversion_rate_control)
  } else {
    sprintf("No significant difference detected between the control and the variant. The mean of the variant is %f and the mean of the control is %f", conversion_rate_variant, conversion_rate_control)
  }
}

All you need to do now is input the numbers of unique visitors and events for each condition and run the function:

control_uv <- 302412
variant_uv <- 301951
control_events <- 10000
variant_events <- 12502

ab_calculator(control_uv, variant_uv, control_events, variant_events)

And you will get the following output:

[1] "There is a statistically significant difference between the control and the variant. During the test, the variant performed better than the default by 25%. The mean of the variant is 0.041404 and the mean of the control is 0.033067"

This code handles the AB format, so if you have multiple experimental conditions you will need to compare each of them independently to the control variant, as you would with any online calculator that uses a proportion test.
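As a hedged sketch of what that looks like (the variant names and figures below are made up for illustration), each variant can be compared to the control in turn by calling ab_calculator() inside a loop; with several comparisons you may also want to apply a multiple-comparison correction to the significance threshold:

# Illustrative ABn comparison: each variant vs the control
variants <- list(
  B = list(uv = 301951, events = 12502),
  C = list(uv = 298870, events = 11890)
)

for (name in names(variants)) {
  v <- variants[[name]]
  cat(name, ":", ab_calculator(control_uv, v$uv, control_events, v$events), "\n")
}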

Per Browser Metrics

If you want to measure page views, sign-ins, page interactions or content consumption per browser, a different method is used to determine statistical significance. You will need to download a separate data file for each of your experimental variants, and the R code below will concatenate the data and format it appropriately for analysis.

Firstly, save your files on your local machine and set that folder as your working directory in RStudio.
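For example (the path below is purely illustrative; point it at wherever your exported files live):

# Set the working directory to the folder containing the exported variant files
setwd("~/experiment-data")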

Install the necessary packages required for analysis:

install.packages("devtools")
install.packages("ggstatsplot")
install.packages("robust")
install.packages("ggplot2")
install.packages("ggjoy")
install.packages("ggpubr")
install.packages("readxl")
install.packages("nloptr")

library(ggstatsplot)
library(robust)
library(ggplot2)
library(ggjoy)
library(ggpubr)
library(readxl)
library(devtools)
library(nloptr)

devtools::install_github("r-lib/rlang", build_vignettes = TRUE)

Read in your data files and concatenate into one experiment data-frame:

Control <- read_excel('control.xlsx', col_names = c('Visitor_ID', 'Metric1', 'Metric2', 'Metric3'))
Variant <- read_excel('variant.xlsx', col_names = c('Visitor_ID', 'Metric1', 'Metric2', 'Metric3'))
Variant1 <- read_excel('variant1.xlsx', col_names = c('Visitor_ID', 'Metric1', 'Metric2', 'Metric3'))

Variant1$Variant <- paste("Variant1")
Variant$Variant <- paste("Variant")
Control$Variant <- paste("Control")
Experiment <- rbind(Control, Variant, Variant1)
Experiment <- as.data.frame(Experiment)
View(Experiment)

Create the remove_outliers() function, which will omit any data points more than three standard deviations above the mean:

remove_outliers <- function(x) {
  # Base the threshold on non-zero values only
  y <- x[x > 0]
  outlier_threshold <- 3 * sd(y) + mean(y)
  filtered <- x[x < outlier_threshold]
  valsremaining <- length(filtered) / length(x)
  if (valsremaining < 0.95) {
    stop("This function will remove more than 5% of your data. You need to remove outliers manually.")
  } else if (valsremaining < 0.99) {
    warning("This calculation has removed between 1% and 5% of your data.")
    filtered
  } else {
    filtered
  }
}
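The article does not show the function being applied, so here is a hedged usage sketch (assuming Metric1 is the metric of interest): because the filtered vectors can differ in length per variant, they are recombined into a new long data frame rather than written back into Experiment, and that data frame can then be passed to the plotting code below in place of Experiment.

# Apply remove_outliers() to Metric1 within each experimental condition
Filtered <- do.call(rbind, lapply(split(Experiment, Experiment$Variant), function(d) {
  data.frame(Variant = d$Variant[1], Metric1 = remove_outliers(d$Metric1))
}))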

The following code uses the ggstatsplot package, which is "an extension of ggplot2 package for creating graphics with details from statistical tests included in the information-rich plots themselves" and is widely used in academic hypothesis testing. The selected method is a between-groups analysis of variance (ANOVA), as we are treating each experimental variant as an independent group:

ggstatsplot::ggbetweenstats(
data = Experiment,
x = Variant,
y = Metric1,
mean.label.size = 2.5,
type = "parametric",
k = 3,
pairwise.comparisons = TRUE,
pairwise.annotation = "p.value",
p.adjust.method = "bonferroni",
title = "AB/N Test",
messages = TRUE)
Figure 2. ggstatsplot output using between groups ANOVA to determine statistical significance

The above output tells us the overall result of the experiment, with both the F and p values noted, and we can see that this test (using simulated data) has achieved statistical significance (p < 0.05). As well as the overall effect, we can see the differences between the individual variants, so we can determine whether there is any benefit in implementing C over B, as well as B over A; this is computed using Bonferroni-corrected post-hoc comparisons.
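If you want the pairwise comparisons without the plot, a quick cross-check (not part of the original workflow, and assuming the same Experiment data frame and Metric1 column) can be run with base R:

# Bonferroni-adjusted pairwise comparisons between the experimental conditions
pairwise.t.test(Experiment$Metric1, Experiment$Variant, p.adjust.method = "bonferroni")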

If you wish to display the above in a more stakeholder-friendly fashion, use the following code and you will get a horizontal distribution plot for each variant:

ggplot(Experiment, aes(x = Metric1, y = Variant, fill = Variant)) + 
geom_joy() +
xlab("Metric1")+
ylab("Variant")+
ggtitle("AB Test")+
theme_classic()
Figure 3. ggplot output presenting the distribution of number of events per browser for each variant

Although it is always advisable to track metrics in your AB testing platform and to have them defined before launching an experiment, there will be times when stakeholders have ad-hoc queries, and these tools can be put to good use. There will also be times when it is not possible to track certain changes in your testing platform and you will need to query the data or obtain it through your analytics provider; this is often the case with demographic and/or geographically segmented data. In these cases the code above offers a good way of ensuring that any changes you report are assessed for statistical significance.

Summary

The overall mission of the BBC E & O Team is to up-skill embedded analysts and data teams so they are self-serving in their ability to perform pre- and post-hoc analysis for their experiments.

Although there will always be a degree of consultation between products and central data teams, user-friendly tooling such as that created by our experimentation experts is an extremely useful way to speed up experimentation operations within products. The E & O Team have delivered training to help analysts across the organisation understand the individual scripts and when to use them, which in turn frees the team up to focus on wider experimentation projects and roadmaps.

This article gives an insight into some of the work the E & O Team have been doing and the scripts they have developed for teams across the company. Feel free to use any of the shared code should you wish to emulate this work!
