A/B testing with practical examples

Rakeshbobbati
Mar 17, 2024

Experiment design and post-experiment result analysis, with practical examples.

A/B testing is a methodology for comparing two versions of a product against each other to determine which one performs better. It is essentially an experiment where two or more variants of a product are shown to users at random, and statistical analysis is used to determine which variant performs better for a given conversion goal.

A/B testing helps a data scientist answer whether a change (X) in the product causes any improvement in a business metric (Y).

General approach for A/B testing

Assume that we decided to run an experiment on our blogging website, where users can subscribe to read our blogs. We changed the position of the subscribe button from the bottom right to the top right. Now we want to test whether changing the button position has improved our subscriptions or not.

Example 1: Assume that the landing page of our blogging website has 160,000 monthly visitors, among whom 800 click the subscribe button (CTA = call to action) and 500 later convert to paying accounts. We would like to roll out the experiment to 30% of the users who visit our page monthly. How can this experiment potentially influence Monthly Recurring Revenue (MRR) if the Average Revenue Per Account (ARPA) is 50 rupees?

  1. ARPA = MRR / (number of paying accounts), so MRR = ARPA × paying accounts = 50 × 500 = 25,000 rupees.

2. Experiment sample size: as we decided to roll out the experiment to 30% of the population, the sample size will be 48,000 out of the total 160,000. We show the old landing page, where the button is at the bottom, to half of this sample (the Control group), and the new landing page, where the button is at the top right, to the other half (the Test group).
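To make the numbers concrete, here is a minimal Python sketch of the baseline funnel and the 30% rollout (the variable names are my own; the figures are the ones from the example):

visitors = 160_000             # monthly landing-page visitors
cta_clicks = 800               # visitors who click the subscribe button
paying_accounts = 500          # visitors who later convert to paying accounts
arpa = 50                      # Average Revenue Per Account, in rupees

mrr = paying_accounts * arpa               # 25,000 rupees
cr = paying_accounts / visitors            # landing -> paying conversion, 0.3125%

rollout = 0.30
experiment_users = int(rollout * visitors)     # 48,000 users in the experiment
control = test = experiment_users // 2         # 24,000 users per variant
print(mrr, round(cr * 100, 4), experiment_users, control)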

3. Statistical Specifications

  • Null hypothesis: There is no true difference in the conversion rate (CR) of the Control group & the Test variant.
  • Alternate Hypothesis: Test variant will convert better than the Control group.
  • p-value: Usually set at 0.05, i.e. a confidence level of 95%, which caps the Type I error (alpha) at 5%. Alpha is the probability of a false positive, i.e. rejecting the null hypothesis when it is actually true. This lets us say with 95% confidence that the Test variant actually converts better than the Control group and that the result is not due to chance.
  • Power: Usually set at 80% with alpha = 0.05, which caps the Type II error (beta) at 20%. Power is the ability to detect a true positive; there is still a 20% probability of missing a real uplift, i.e. of concluding that the Test group does not convert better than the Control group when it actually does. The corresponding z-scores can be derived from alpha and power, as shown in the snippet after this list.
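A small sketch of looking up those z-scores with scipy (I am assuming the usual two-sided critical value for alpha, since 1.96 is used later in the formula):

from scipy.stats import norm

alpha = 0.05          # significance level, caps the Type I error
power = 0.80          # 1 - beta, i.e. Type II error capped at 20%

z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96, used as Z1 in the sample size formula below
z_beta = norm.ppf(power)            # ~0.84, used as Z2 below
print(z_alpha, z_beta)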

4. Scenarios

  • Here, we will be creating scenarios assuming a certain minimum detectable effect (MDE) in the landing to paying account conversion rate.
  • We will be calculating the minimum experiment duration for these different scenarios.
  • We are assuming that the data is normally distributed as we have a significant number of conversions.

Scenario 1: MDE = 10%.

Std dev (σ) = SQRT(CR × (1 − CR))

Sample Size (N) = 2 × (number of variations) × (Z1 + Z2)² × (σ / (CR × Uplift))²

Z1 = z-score for the Type I error (alpha = 0.05). For 5% significance, Z1 = 1.96

Z2 = z-score for power. For 80% power, Z2 = 0.84
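Here is a rough sketch of how this formula can be coded up (the output depends on exactly which baseline conversion rate and z-scores are plugged in, so it may not match the figures quoted below to the digit):

from scipy.stats import norm

visitors_per_month = 160_000
paying_accounts = 500
cr = paying_accounts / visitors_per_month             # baseline conversion rate, ~0.3125%
experiment_traffic = int(0.30 * visitors_per_month)   # 48,000 users per month

z1 = norm.ppf(1 - 0.05 / 2)    # ~1.96 for 5% significance
z2 = norm.ppf(0.80)            # ~0.84 for 80% power
variations = 2                 # Control + Test
std = (cr * (1 - cr)) ** 0.5   # per-user standard deviation of a Bernoulli conversion

for mde in (0.10, 0.25):       # relative uplift we want to be able to detect
    n = 2 * variations * (z1 + z2) ** 2 * (std / (cr * mde)) ** 2
    months = n / experiment_traffic
    print(f"MDE={mde:.0%}: sample size ~{n:,.0f}, ~{months:.1f} months of traffic")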

Now this is an unrealistic scenario. To be able to detect a minimum 10% uplift, we would need to run the experiment for at least 4 months (208414/48000).

This is because the business-as-usual (BAU) conversion rate is low. The experiment will require less time if we increase the MDE.

Scenario 2: MDE = 25%

This scenario is realistic. To be able to detect a minimum uplift of 25%, we would need to run the experiment for at least 1 month.

5. Conclusions

  • We need a minimum sample size of 33,347 users visiting the website in order to be able to detect a minimum uplift of 25% in the conversion rate at a 95% confidence level.
  • To achieve this sample size requirement, we need to run the experiment for at least 1 month
  • This experiment has the potential to increase MRR by rupees 6,250 i.e. 25% if rolled out to 100% users.
  • This translates to an incremental 125 paying accounts over the current run-rate of 500
  • MRR = paying accounts × ARPA: current MRR = 500 × 50 = 25,000 rupees; potential MRR = 625 × 50 = 31,250 rupees, since a 25% uplift takes the conversion rate from 0.3125% to roughly 0.39%. The arithmetic is spelled out in the snippet after this list.
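The revenue arithmetic behind these conclusions, as a quick sketch:

arpa = 50
current_accounts = 500
uplift = 0.25

current_mrr = current_accounts * arpa                      # 25,000 rupees
new_accounts = current_accounts * (1 + uplift)             # 625 paying accounts
incremental_accounts = new_accounts - current_accounts     # 125 accounts
potential_mrr = new_accounts * arpa                        # 31,250 rupees
print(current_mrr, potential_mrr, potential_mrr - current_mrr, incremental_accounts)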

I would launch the experiment, as a 25% increase in the conversion rate seems achievable.

Note that if we make multiple changes at once, for example changing the colour or shape of the button along with its position, we will know that the B variant increased the CR, but we will not know which change contributed to the increase. We need to keep the other factors constant to attribute the increase in CR to a single change.

We can adjust parameters like the target audience to reduce the MDE to suit our needs. We can also adjust the p-value threshold, but bear in mind that for a fixed sample size there is a trade-off between alpha and beta: tightening alpha increases beta. Remember, alpha and beta represent the probabilities of Type I and Type II errors, respectively, and power = 1 − beta. The snippet below illustrates this trade-off.
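An illustrative sketch of the trade-off (the design numbers are taken from the example above, and I am using a one-sided two-proportion z-approximation purely for illustration):

from scipy.stats import norm

p, uplift, n = 0.003125, 0.25, 24_000      # baseline CR, relative uplift, users per variant
delta = p * uplift                         # absolute difference we want to detect
se = (2 * p * (1 - p) / n) ** 0.5          # standard error of the difference in proportions

for alpha in (0.05, 0.01):
    z_alpha = norm.ppf(1 - alpha)          # one-sided critical value
    power = norm.cdf(delta / se - z_alpha)
    beta = 1 - power
    print(f"alpha={alpha}: power={power:.2f}, beta={beta:.2f}")   # smaller alpha -> larger beta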

Example 2: Hypothesis testing with Python code.

Let us assume that we have two car pricing models, each of which gives prices for the same 10 used cars. Now we want to test which model gives higher prices.

from numpy.random import normal

car_price_model_A = normal(100, 16, 10) # mean = 100 ,std = 16 ,count = 10
car_price_model_B = normal(125, 16, 10) # mean = 125 ,std = 16 ,count = 10

# Randomly generated values for this run were
# A = [113.0874069904511, 95.6701406433734, 123.40198276194985, 79.38701926616444,
#      123.64545796310011, 122.12755280250579, 122.57846952020384, 110.48966803028092,
#      110.37107839196443, 87.92792809901299]

# B = [111.5163868976799, 116.1807367689398, 151.49173056069063, 153.03420602326733,
#      130.70767579533174, 153.26516127915156, 118.5982270794405, 141.76294271181928,
#      149.5273122296966, 156.26657689305398]

From the above data we can see that model B has a higher average price. Let us run a t-test on this data.

import pandas as pd
from scipy import stats

# Build a long-format dataframe with one price per row and a group label
data = {"car_price": list(car_price_model_A) + list(car_price_model_B),
        "group": ["A"] * 10 + ["B"] * 10}
df = pd.DataFrame(data)

# Two-sample t-test, H1: mean price of A is less than mean price of B
t_stat, p_value = stats.ttest_ind(df[df["group"] == "A"].car_price,
                                  df[df["group"] == "B"].car_price,
                                  axis=0,
                                  equal_var=True,
                                  nan_policy='propagate',
                                  permutations=None,
                                  random_state=None,
                                  alternative='less',
                                  trim=0)
print(t_stat, p_value)

# Output is: t_stat = -3.9296390209612126
# p-value = 0.0004911735271013565
# The p-value is < 0.05, so we reject the null hypothesis: the prices from
# model A are most likely lower than the prices from model B.
# Note: alternative='less' tests H1: mean(A) < mean(B)
# alternative='greater' tests H1: mean(A) > mean(B)

The test result is in line with our observation, i.e. model B's prices are higher than model A's.

This t-test applies when the distributions of both A and B are approximately normal with similar standard deviations. What if the standard deviations of A and B are different? Then we perform Welch's t-test.

from scipy import stats

# Welch's t-test: equal_var=False means A and B are allowed to have different variances
t_stat, p_value = stats.ttest_ind(df[df["group"] == "A"].car_price,
                                  df[df["group"] == "B"].car_price,
                                  axis=0,
                                  equal_var=False,
                                  nan_policy='propagate',
                                  permutations=None,
                                  random_state=None,
                                  alternative='less',
                                  trim=0)

The Shapiro-Wilk test is a test of normality: it determines whether a given sample comes from a normal distribution or not.

from numpy.random import poisson
from scipy.stats import shapiro

data = poisson(5, 200) # 200 samples from a Poisson distribution (not normal)

# conduct the Shapiro-Wilk Test
shapiro(data)
# ShapiroResult(statistic=0.9628055095672607, pvalue=4.027883187518455e-05)

Since the test's p-value is less than alpha (0.05), we reject the null hypothesis, i.e. we have sufficient evidence to say that the sample does not come from a normal distribution.

from numpy.random import normal
from scipy.stats import shapiro

data = normal(125, 16, 10) # 10 samples from a normal distribution

# conduct the Shapiro-Wilk Test
shapiro(data)

# ShapiroResult(statistic=0.9537200331687927, pvalue=0.7125772833824158)

The p-value is greater than the threshold alpha (0.05), so we fail to reject the null hypothesis, i.e. we do not have sufficient evidence to say that the sample does not come from a normal distribution.

What if A and B do not follow a normal distribution? Then we perform the Mann-Whitney U test.

from scipy.stats import mannwhitneyu

# Take A and B data as
A = [3, 4, 2, 6, 2, 5]
B = [9, 7, 5, 10, 8, 6]

# perform the Mann-Whitney U test
stat, p_value = mannwhitneyu(A, B)

# level of significance
alpha = 0.05

# conclusion
if p_value < alpha:
    print('Reject Null Hypothesis')
else:
    print('Do not Reject Null Hypothesis')

# stat=2.00, p_value=0.01
# Reject Null Hypothesis

If you find this article helpful, please follow to inspire me to publish more blogs on data science and statistics.

Thanks for reading, and feel free to leave comments.
