Using R to synthesise long-term findings with Meta-analysis at the BBC

Frank Hopkins
BBC Data Science
Nov 29, 2019 · 7 min read

Introduction

AB testing at the BBC isn’t just a Product priority; it’s also something Editorial regularly uses to improve the audience experience. Teams across the BBC use a variety of tools to improve our key performance metrics, and experimentation is a key driver in determining the best way to present content.

Working in the Experimentation & Optimisation Team, we find the question Editorial asks us most often is: what do we learn from this?

For example, we may know (using our proprietary headline testing tool!) on a headline-by-headline basis that:

An island that never stops apologising > World’s most apologetic nation?

What we fail to understand about the 1% > How much do the super rich actually own?

How Donald Trump got to power > Why did Americans come out to vote for Trump?

But these individual results don’t give us generalisable, useful rules of thumb for writing headlines. What we really want to determine from the above is which type of headline performs better in general: questions or statements? And can we quantify by how much?

In order to help teams derive long-term findings from experiments, we have been working on some bespoke tools to aggregate results across a series of similar experimental themes. Initially we used a one-way analysis of variance (ANOVA) to look at a series of experiments using the same variants. However, because ANOVA is built around sample means, the design is not particularly robust and is sensitive to outliers.

Any sizeable outlier can pull the sample mean away from the true mean and introduce substantial variation. Given that the editorial experiments we look to aggregate are often conducted weeks or even months apart, conversion rates and performance often vary significantly between experiments. Generating long-term learnings from such experiments at the macro level is therefore very difficult.
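To see why this matters, here is a quick illustration with made-up conversion rates (not BBC data): a single outlying experiment is enough to drag the sample mean away from typical performance and inflate the variance.

# Hypothetical conversion rates from five similar experiments;
# the final experiment happened to run during an unusually busy week
rates <- c(0.042, 0.045, 0.040, 0.043, 0.120)

mean(rates[1:4])  # 0.0425 - typical performance
mean(rates)       # 0.058  - the outlier drags the mean upwards

var(rates[1:4])   # ~4.3e-06
var(rates)        # ~1.2e-03 - the single outlier inflates the variance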

This is where Meta-analysis comes in.

Meta-analysis

Meta-analysis pools multiple experiments and calculates their combined effect: whether an applied treatment outperforms a control condition across a series of different assessments. It is most commonly used in the medical sciences.

Whereas in the medical sciences Meta-analysis assesses the combined effect of an experimental condition against a placebo (control condition), its use-case varies when applied to data at the BBC, but it will typically assess the performance of an experimental condition (variant) relative to a no-change condition (control).

There are two recognised models used in meta-analysis: the fixed effect model and the random effects model. The two make different assumptions about the nature of the experiments pooled for analysis, and these assumptions lead to different definitions of the combined effect and different mechanisms for assigning weights.

Under the fixed effect model we assume that there is one true effect size which is shared by all the included studies. It follows that the combined effect is our estimate of this common effect size.

By contrast, under the random effects model we allow that the true effect could vary from study to study. For example, the effect size might be a little higher if the subjects are older, or more educated, or healthier, and so on. The studies included in the meta-analysis are assumed to be a random sample of the relevant distribution of effects, and the combined effect estimates the mean effect in this distribution.

Because, for the majority of the data we handle at the BBC, we cannot guarantee the demographic of our sample, the volume of unique visitors belonging to a certain demographic, or the manner in which seasonality impacts findings, the scripts developed by the Experimentation & Optimisation team use a random effects model to compute the Meta-analysis.
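To make the distinction concrete, here is a minimal sketch (using hypothetical variances, not figures from our experiments) of how the two models weight studies: the fixed effect model weights each study by the inverse of its within-study variance, while the random effects model adds an estimate of the between-study variance (tau²) to every study first, which spreads the weights more evenly.

# Hypothetical within-study variances for three experiments
within_var <- c(0.0002, 0.0009, 0.0040)

# Hypothetical between-study variance (tau^2)
tau2 <- 0.0010

w_fixed  <- 1 / within_var            # fixed effect: inverse-variance weights
w_random <- 1 / (within_var + tau2)   # random effects: tau^2 added to every study

round(w_fixed  / sum(w_fixed),  2)    # 0.79 0.17 0.04 - the most precise study dominates
round(w_random / sum(w_random), 2)    # 0.53 0.34 0.13 - weights are spread more evenly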

Our code

Below is some example code, using a series of experiments conducted by the CBBC editorial team testing main character imagery vs. secondary character imagery on the CBBC homepage, across 9 individual experiments. All that is required of whoever is performing the analysis is that they input the findings from each experiment and save the relevant workbook in their local directory. This code uses both the meta and scales packages in R: meta computes the statistical output and forest plot of your Meta-analysis, while scales formats the relative uplift as a percentage.

For headline testing, a Python version of this code currently exists which uses the API of our AB testing vendor to pull in all data related to the theme that is being tested. Below is the R syntax used for both push notification and editorial image testing:
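The layout of the workbook itself isn’t reproduced here, but judging from the column names used in the metabin() call below, experiment_data is assumed to hold one row per experiment with the events and sample sizes for the variant and control. A hypothetical example of the structure:

# Hypothetical illustration of the expected experiment_data structure
# (column names are taken from the metabin() call below; the values are invented)
experiment_data <- data.frame(
  Author = c("CBBC imagery test 1", "CBBC imagery test 2", "CBBC imagery test 3"),
  Ee = c(1210, 980, 1504),      # events (e.g. clicks) recorded for the variant
  Ne = c(45210, 38990, 51875),  # sample size of the variant
  Ec = c(1890, 1410, 2102),     # events recorded for the control
  Nc = c(45180, 39050, 51940)   # sample size of the control
)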

#### PLEASE FILL OUT THE "EXPERIMENT_DATA.XLSX" WORKBOOK AND SAVE IT IN YOUR LOCAL DIRECTORY ####

# Load the packages required for the Meta-analysis

library(meta)     # metabin() and forest() for the analysis and plot
library(ggplot2)
library(stats)
library(ggpubr)
library(scales)   # percent() formatting for the uplift calculator
library(readxl)   # assumed here for reading the workbook with read_excel()

# Read the completed workbook from your local directory and force the
# naming convention/new dataframe name (read_excel() is an assumption;
# import the workbook however you prefer)

experiment_data <- read_excel("experiment_data.xlsx")
madata <- experiment_data

# Creates new column to change "author" to "experiment"

madata$Experiment <- madata$Author

# View the new dataframe

View(madata)

# Optional: subset the rows to include in the analysis, e.g.
# madata <- madata[2:8, ]

# View the new structure of your dataframe

str(madata)

# Run the Meta-analysis

m.bin <- metabin(Ee, Ne, Ec, Nc,            # events and sample sizes for variant (e) and control (c)
                 data = madata,
                 studlab = paste(Author),   # label each study by its author/experiment
                 method = "i",              # "i" = inverse-variance pooling
                 hakn = TRUE,               # Hartung-Knapp adjustment for the random effects CI
                 comb.fixed = FALSE,
                 comb.random = TRUE,        # report the random effects model only
                 incr = 0.1,                # continuity correction for zero-event cells
                 sm = "ASD")                # arcsine difference as the summary measure
m.bin

# Averages for control and variant and the overall uplift

madata$conversion_control <- madata$Ec / madata$Nc
madata$conversion_variant <- madata$Ee / madata$Ne
control <- madata$conversion_control
variant <- madata$conversion_variant

uplift_calculator <- function(control, variant) {
  test <- mean(variant) / mean(control) - 1
  if (test >= 0) {
    paste("The relative increase between the control and variant was", percent(test))
  } else {
    paste("The relative decrease between the control and variant was", percent(test))
  }
}

uplift_calculator(control, variant)
paste(mean(control), mean(variant))


# Forest plot

forest(m.bin,
       rightlabs = c("Improvement", "95% CI", "weight"),
       leftlabs = c("Experiment", "N", "Mean", "SD", "N", "Mean", "SD"),
       lab.e = "Variant",
       pooled.totals = FALSE,
       smlab = "Overall Result",
       text.random = "Statistical Output (for Experimentation Team)",
       print.tau2 = FALSE,
       col.diamond = "purple",
       col.diamond.lines = "purple",
       col.predict = "purple",
       print.I2.ci = TRUE,
       digits.sd = 2)

Output

Executing the above syntax provides you with a number of useful outputs. The first is the statistical output of your Meta-analysis. Here you can see it has analysed the combined effect of all 9 experiments (k = 9), using the random effects model. This output also tells us whether the overall statistic for the Meta-analysis is statistically significant, which in this instance it is (p < 0.05):

Number of studies combined: k = 9

ASD 95%-CI t p-value

Random effects model -0.0565 [-0.0771; -0.0360] -6.34 0.0002

Quantifying heterogeneity:

tau² = 0.0005; H = 1.98 [1.42; 2.76]; I² = 74.5% [50.5%; 86.8%]

The heterogeneity statistics (I² = 74.5%) indicate substantial variation in effect sizes between experiments, which supports the choice of a random effects model. The overall relative increase/decrease between the control and variant, and the individual mean performance of each condition, are then presented:

> uplift_calculator(control, variant)

[1] "The relative decrease between the control and variant was -45.6%"

> paste(mean(control), mean(variant))

[1] "0.0441836083715 0.024015088989696"

This is then neatly tied into the forest plot, which depicts the individual performance of both experimental conditions in each experiment, along with the respective 95% confidence intervals. The overall effect of the Meta-analysis is depicted as the purple, diamond-shaped point estimate (the width of which represents the 95% confidence interval of the overall effect). As can be seen, the overall effect lies entirely to one side of the line of no change, which is supported by the statistically significant output above. The forest plot tells us that the control (main character imagery) was a clear winner. Given that the main characters of our CBBC shows are often the most popular, this may seem intuitive, but there is great utility in being able to conclude it with 95% confidence. The individual weighting of each experiment is also presented.

Conclusion

Editorial teams across the BBC are running experiments in order to determine what our audiences engage with. Although these experiments are not difficult to implement or interpret on an individual basis, synthesising long-term findings from these experiments and the implications they have for our audiences can be problematic.

Although the average performance of a variant can be calculated and even presented graphically, determining what this means statistically is key. Enter Meta-analysis.

Meta-analysis is a powerful tool for assessing the combined effect of multiple individual experiments and can help teams across the BBC derive generalisable learnings from experiments. This is essential in editorial testing, where it is often hard to draw long-term findings from individual tests. Whether we are determining the optimal length of headlines or which characters our audiences respond to best, Meta-analysis helps us draw definitive conclusions about our experimental themes.

Selecting the correct model for your Meta-analysis dictates the manner in which the individual weighting of experiments is handled, and the random effects model is extremely useful where the effect size varies between experiments pooled in the analysis.

The code in this article provides a useful means to assess the combined effect of multiple tests to get to more generalised, aggregate conclusions.
