A/B Test — Understanding causal relationships in experiments

Shubham Goyal
9 min read · Apr 27, 2023


In this article, we’ll go through A/B tests in depth to understand how they are implemented and how they can help us understand causal relationships in experiments.


Table of Contents

  1. Introduction
  2. Setting up A/B test
  3. Conducting A/B test
  4. Practical application w/ Example
  5. Conclusion

Introduction

I remember the last time I conducted an A/B test at my job. I wasn’t fully aware of the nitty-gritty of how an A/B test needs to be conducted, but I had a basic construct in mind: there is some bifurcation between the test and control populations, and we then compare their performance using multiple metrics. The unaddressed question at the time (which I now presume is very important) was whether the difference in the metrics was significant enough to change the strategies in place. We compared the metrics and calculated additional metrics based on profit/loss to the company, but the analysis lacked a statistical significance test that would have acted as proof for conclusions otherwise based solely on business logic.

In this article, we’ll go through in depth how A/B tests are conducted and the statistical inference involved. As a data scientist, you will often come across problems for which you don’t have much background knowledge; in such situations, statistical tests come in handy to support your analysis and act as a proof of concept.

Setting up A/B test

So what is an A/B test, and why is it important? Say you’re working as an analyst on a product team and your manager hands you an analysis of how users react to a new feature on the website. Essentially, you need to check whether the new feature had an impact on user interaction and whether it’s beneficial to bring that change into production. In such a case, what comes to mind first? I hope it is comparing some users who see the old feature with some who see the new one and analyzing their behavior!

That’s it, that’s an A/B test for you! My job is done, ciao!

Conducting A/B Test

Just kidding, but on a high level, that’s what an A/B test is. We have two populations —

1) Test — the population exposed to the new feature
2) Control — the population restricted to the old feature

An important thing to note here is that we assume only the feature we are introducing affects the measured outcome, independent of any external interference. The experiment has to be conducted in a controlled environment, which basically means no interference from external factors.

Once we have the populations defined, we deploy the test and gather user behavior on the change. The timeframe of the test is at our discretion and depends on business logic; however, some standard time frames are 3 months, 6 months, etc.

As we start getting the user behavior data, our main goal is to test whether the change in the feature affected user behavior and, if it did, whether the effect was significant. This requires formally stating a hypothesis, testing it, and drawing inferences from the results.
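For intuition, here is a minimal sketch of how users could be split randomly into the two populations before the test goes live. The user_id values and the 50/50 split are made up for illustration; the dataset we use below already comes with this assignment.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the assignment is reproducible

# Hypothetical pool of users eligible for the experiment
users = pd.DataFrame({'user_id': range(1, 10001)})

# Randomly assign each user to control or treatment with equal probability
users['group'] = rng.choice(['control', 'treatment'], size=len(users), p=[0.5, 0.5])

print(users['group'].value_counts())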

Practical Application w/ Example

We’ll be using this data set available on Kaggle for our practical application.

Defining the problem

Suppose you’re working as an analyst at an e-commerce company and the product manager (PM) tells you that the current conversion rate (the share of users who buy the product out of those who view the web page) is about 13% on average throughout the year, and that the team would be happy with an increase of 2%, meaning the new UI design will be considered a success if it raises the conversion rate to 15%. So let’s dive in.

import pandas as pd

# Importing the data; it can be downloaded from the link provided above
df = pd.read_csv('ab_data.csv')
df.head()

# Make sure everyone in the control group sees the old page and everyone in the treatment group sees the new page
pd.crosstab(df['group'], df['landing_page'])
Crosstab for Test/Control groups

There are 294,478 rows in the data frame, each representing a user session. Before we go ahead and sample the data to get our subset, we’ll make sure there are no users that appear multiple times, and do some basic sanity checks on the data.

# Count sessions per user and flag users that appear more than once
session_counts = df['user_id'].value_counts(ascending=False)
multi_users = session_counts[session_counts > 1].count()

print(f'There are {multi_users} users that appear multiple times in the dataset')

# Drop users with multiple sessions so each user contributes a single observation
users_to_drop = session_counts[session_counts > 1].index

df = df[~df['user_id'].isin(users_to_drop)]
print(f'The updated dataset now has {df.shape[0]} entries')

Formulating the Hypothesis

First things first, we want to make sure we formulate a hypothesis at the start of our project. This will make sure our interpretation of the results is correct as well as rigorous.

Given we don’t know whether the new design will perform better, worse, or the same as our current design, we’ll choose a two-tailed test (assuming a 95% confidence level). To assess whether we have statistical evidence that the two pages’ conversion rates truly differ, we perform a hypothesis test:

- The null hypothesis that we want to test is that the two pages’ conversion rates are equal.

- The alternative is that they differ (one is higher than the other).

Or, put another way, the null hypothesis says that the factors page version and outcome are statistically independent of each other: knowing which page someone is sent to tells you nothing about the chance that they will convert. Now that we know what hypothesis test we’re interested in, we’ll have to derive the appropriate test statistic.
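Written out, with p_old and p_new denoting the true conversion rates of the old and new page:

Hₒ: p_old = p_new (the page version has no effect on conversion)
Hₐ: p_old ≠ p_new (the conversion rates differ)

with a two-tailed significance level of α = 0.05.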

Classical Frequentist Statistics

In statistics, an effect size is a number measuring the strength of the relationship between two variables in a population, or a sample-based estimate of that quantity.

Examples of effect sizes include the correlation between two variables, the mean difference, or the risk of a particular event happening.

We know that the current conversion rate is about 13% on average throughout the year and that the PM will be happy with 15%. The effect size can be determined from these two numbers; we just need a standard way to convert them into a single, unit-free quantity.

import statsmodels.stats.api as sms   # statsmodels stats API for effect-size and power calculations

effect_size = sms.proportion_effectsize(0.13, 0.15)    # Calculating effect size based on our expected rates (13% vs. 15%)
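For reference, sms.proportion_effectsize returns Cohen’s h, the difference of the arcsine-transformed proportions. A minimal sketch to verify the value by hand; the manual numpy computation is only for illustration:

import numpy as np
import statsmodels.stats.api as sms

# Cohen's h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2))
h_manual = 2 * np.arcsin(np.sqrt(0.13)) - 2 * np.arcsin(np.sqrt(0.15))

print(round(h_manual, 4))                                # ≈ -0.0576
print(round(sms.proportion_effectsize(0.13, 0.15), 4))   # matches the library value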

Sample Size for Test

Now, what’s the right sample size based on our significance level (alpha = 5%), effect size (above), and power (80%)?

We specify the three key components of power analysis:

  • A decision rule of when to reject the null hypothesis. We reject the null when the p-value is less than 5%
  • Our tolerance for committing type 2 error (1 − 80% = 20%)
  • The detectable difference, i.e. the level of impact we want to be able to detect with our test

Statistical power is the probability of rejecting the null hypothesis when it is false.

Hence for us to calculate the power, we need to define what false means to us in the context of the study. In other words, how much impact, i.e., the difference between test and control, do we need to observe in order to reject the null hypothesis and conclude that the action worked?

If we think that an event rate reduction of, say, 10⁻¹⁰ % is enough to reject the null hypothesis, then we need a huge sample size to get a power of 80%!

That is because if the difference in event rates between the experimental group and the control group is such a small number, the null and alternative probability distributions will be nearly indistinguishable. Hence we will need to increase the sample size in order to move the alternative distribution to the right and gain power.

Conversely, if we only require a reduction of 2% in order to claim success, we can make do with a much smaller sample size.

from math import ceil

# Calculating the sample size needed per group for alpha = 0.05 and power = 0.8
required_n = sms.NormalIndPower().solve_power(
    effect_size,
    power=0.8,
    alpha=0.05,
    ratio=1
)

required_n = ceil(required_n)  # Rounding up to the next whole number
required_n
# gives a result of 4720

The result is 4,720, which means we need at least 4,720 observations for each group.

Having set the power parameter to 0.8 means that if there exists an actual difference in conversion rate between our designs, assuming the difference is the one we estimated (13% vs. 15%), we have about an 80% chance to detect it as statistically significant in our test with the calculated sample size.
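To make the earlier point about detectable differences concrete, here is a small sketch showing how the required sample size per group grows as the lift we want to detect shrinks. The target rates of 13.5%, 14%, and 15% are illustrative choices, not taken from the dataset:

import statsmodels.stats.api as sms
from math import ceil

baseline = 0.13
for target in [0.135, 0.14, 0.15]:
    es = sms.proportion_effectsize(baseline, target)
    n = sms.NormalIndPower().solve_power(es, power=0.8, alpha=0.05, ratio=1)
    print(f'{baseline:.1%} -> {target:.1%}: ~{ceil(n)} users per group')

# The smaller the detectable lift, the (much) larger the sample required.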

Now that our data frame is nice and clean, we can proceed and sample n = 4,720 entries for each of the groups.

We can use the pandas DataFrame.sample() method to do this, which will perform Simple Random Sampling for us.

# Draw a simple random sample of the required size from each group (random_state fixed for reproducibility)
control_sample = df[df['group'] == 'control'].sample(n=required_n, random_state=22)
treatment_sample = df[df['group'] == 'treatment'].sample(n=required_n, random_state=22)

# Stack the two samples into a single analysis dataframe
ab_test = pd.concat([control_sample, treatment_sample], axis=0)
ab_test.reset_index(drop=True, inplace=True)
ab_test

Visualizing the results

Judging by the stats below, it does look like our two designs performed very similarly, with our new design performing slightly better: approximately 12.3% vs. 12.6% conversion rate.

import numpy as np
from scipy import stats

conversion_rates = ab_test.groupby('group')['converted']

std_p = lambda x: np.std(x, ddof=0)     # Std. deviation of the proportion
se_p = lambda x: stats.sem(x, ddof=0)   # Std. error of the proportion (std / sqrt(n))

conversion_rates = conversion_rates.agg([np.mean, std_p, se_p])
conversion_rates.columns = ['conversion_rate', 'std_deviation', 'std_error']

conversion_rates.style.format('{:.3f}')
Results of the A/B Test

We’ll also plot the results —

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 6))

sns.barplot(x=ab_test['group'], y=ab_test['converted'], ci=False)

plt.ylim(0, 0.17)
plt.title('Conversion rate by group', pad=20)
plt.xlabel('Group', labelpad=15)
plt.ylabel('Converted (proportion)', labelpad=15);
Visualization of our A/B Test Results

The conversion rates for our groups are indeed very close. Also, note that the conversion rate of the control group is lower than what we would have expected given what we knew about our avg. conversion rate (12.3% vs. 13%). This goes to show that there is some variation in results when sampling from a population.

So… the treatment group’s value is higher. Is this difference actually statistically significant?

Testing the Hypothesis

The last step of our analysis is testing our hypothesis. Since we have a very large sample, we can use the normal (Gaussian) approximation to calculate our p-value, i.e., we use a z-test.

If we had assumed a Student’s t statistic (as is common with small samples), we would perform a t-test instead.
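For two proportions, the z-statistic takes the familiar form

z = (p̂_treatment − p̂_control) / √( p̂ (1 − p̂) (1/n_treatment + 1/n_control) )

where p̂ is the pooled conversion rate (total conversions divided by total users across both groups); by default, statsmodels uses this pooled estimate for the variance.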

We use the statsmodels.stats.proportion module to get the p-value and confidence intervals:

from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# Conversion outcomes (0/1) for each group
control_results = ab_test[ab_test['group'] == 'control']['converted']
treatment_results = ab_test[ab_test['group'] == 'treatment']['converted']

n_con = control_results.count()
n_treat = treatment_results.count()
successes = [control_results.sum(), treatment_results.sum()]
nobs = [n_con, n_treat]

# Two-proportion z-test and 95% confidence intervals for each group
z_stat, pval = proportions_ztest(successes, nobs=nobs)
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes, nobs=nobs, alpha=0.05)

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')
print(f'ci 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'ci 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')

# z statistic: -0.34
# p-value: 0.732
# ci 95% for control group: [0.114, 0.133]
# ci 95% for treatment group: [0.116, 0.135]

Since our p-value of 0.732 is way above our α = 0.05 threshold, we cannot reject the null hypothesis Hₒ, which means that our new design did not perform significantly differently (let alone better) than our old one.

Additionally, if we look at the confidence interval for the treatment group ([0.116, 0.135], or 11.6–13.5%) we notice that:

  • It includes our baseline value of a 13% conversion rate
  • It does not include our target value of 15% (the 2% uplift we were aiming for)

What this means is that it is more likely that the true conversion rate of the new design is similar to our baseline, rather than the 15% target we had hoped for. This is further proof that our new design is not likely to be an improvement on our old design, and that, unfortunately, we are back to the drawing board!

Conclusion

This concludes our walkthrough of A/B testing and its practical application to a website UI data set.
The complete code in a colab notebook can be found here.

Ideas for future work

A comparative analysis between A/B tests and multi-armed bandits based on Thompson sampling would be a natural next step, as one approach is frequentist while the other is Bayesian. One important differentiator is the reduced number of samples and trials required by the Bayesian approach, which makes it interesting; we’ll go through this in detail in a coming article.

I hope you liked it and learned something; please show your support with a clap or feedback in the comments. If possible, help me fund my Medium membership by gifting one here; I would greatly appreciate the help.
