Steps to A/B Testing using MS Excel
An A/B test is an experiment with two groups, A (control) and B (treatment), typically run with the hypothesis that the treatment is better than the control. A/B tests are common in web design and marketing; for example, publishers like the NYT A/B test their web headlines to determine which produces more clicks.
Switching back to the process: ideally, subjects are randomly assigned (randomized) to the two groups, so that any difference between A and B can be attributed either to the effect of the treatment or to the luck of the draw, where naturally better-performing subjects happened to land in A or B.
An interesting question to ask is: why have a control group at all, why not compare the results with prior experience? The reason is that a control group keeps "everything else" the same while the two outcomes are compared. When comparing with prior experience, "other things" might differ. For example, comparing the number of clicks on a website for an experiment that ran on a Monday with the previous day's data could lead to wrong interpretations, because assuming "everything else" is the same on a weekday and a weekend is not reasonable.
Why not A/B/C/D? Experiments can certainly run with more than two groups, and doing so can save time because several variants are compared at once. When, in addition, subjects are assigned to the groups adaptively, with traffic shifting toward the better-performing variants as results come in, such a multi-group experiment is called a multi-armed bandit.
Choosing a test statistic
A test statistic, or metric, is used to compare group A to group B. Perhaps the most common metric in data science is a binary variable; in the NYT case, it would be click or no click. After conducting the experiment, we have a summary of clicks vs. no clicks for headline A and headline B. Note that this metric is discrete rather than continuous: the number of clicks can only be a whole number (250 clicks, never 250.5 clicks).
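To make the metric concrete, here is a minimal sketch with made-up click data (the 0/1 lists are purely illustrative, not NYT numbers):

```python
# Each visitor's outcome is binary: 1 = click, 0 = no click (illustrative data)
headline_a = [1, 0, 0, 1, 1, 0, 1, 0]
headline_b = [1, 1, 0, 1, 0, 1, 1, 0]

clicks_a = sum(headline_a)          # total clicks for A, always a whole number
ctr_a = clicks_a / len(headline_a)  # click-through rate, the comparison metric

clicks_b = sum(headline_b)
ctr_b = clicks_b / len(headline_b)
```

The click-through rates, not the raw counts, are what the two groups are compared on.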
Choosing the test statistic before conducting the experiment is recommended to avoid researcher bias, and ideally you should commit to just one metric among the many candidates.
Hypothesis Test or Significance Test
A/B testing is generally done with a hypothesis in mind. To understand why simply eyeballing the numbers to decide whether A is better than B is flawed, we need to understand the scope of natural random variation. The point of a hypothesis test is to prevent researchers from misinterpreting random behavior as a pattern of some significance, that is, to avoid getting fooled by randomness.
So the hypothesis test is a further analysis of an A/B test that assesses whether randomness is a reasonable explanation for the observed difference between A and B. If we can rule randomness out, the only explanation left is that the difference is due to the treatment, and we can conclude the experiment.
Formulating Null & Alternate Hypothesis
- Rule one: taken together, the null and alternative hypotheses must account for all possibilities.
- The null hypothesis is the baseline assumption that all treatments are equivalent and that any observed difference is due to chance. We typically hope the data will allow us to reject the null hypothesis.
One way Vs Two way Hypothesis Test
Imagine your website has blue click icons as the established default, but you want to run an experiment to check whether green icons give a better click-through rate (CTR). You will stick with the blue icon unless green is proven better. In this case, you would use a one-way (one-tailed) hypothesis test, which protects you from being fooled by randomness only in the direction of the green icon.
If, however, you want protection from being fooled by randomness in either direction (blue or green), the alternative hypothesis is bidirectional and you use a two-way (two-tailed) hypothesis test.
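The two versions differ only in how the p-value is computed from the same test statistic. A minimal sketch using Python's standard library (the z-score of 1.96 is an arbitrary illustrative value, not taken from any experiment in this article):

```python
from statistics import NormalDist

z = 1.96               # illustrative z-score for an observed difference
std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# One-tailed: only a difference in the hypothesized direction counts as evidence.
p_one_tailed = 1 - std_normal.cdf(z)

# Two-tailed: a difference in either direction counts as evidence.
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
```

Note that the same z-score can be significant one-tailed (p ≈ 0.025 < 0.05) yet sit right on the boundary two-tailed (p ≈ 0.05), which is why the direction of the test must be fixed before looking at the data.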
Even after following all the essential steps, your test results can be skewed by errors that creep into the process unnoticed. Known as Type I and Type II errors, these lead to incorrect conclusions and the erroneous declaration of a winner and a loser.
The probability of making a Type I error is denoted by α and is defined as the probability of rejecting Ho when it is true. Type I error is tied to the confidence level at which you conclude your test: if you conclude at a 95% confidence level, you are accepting a 5% probability that your result is a false positive. One way to reduce this kind of error is to run the test long enough to collect sufficient samples before concluding. Another is to raise your confidence level, though this tends to increase Type II errors instead, i.e., failing to reject Ho even though it is false. Type II errors are also known as beta (β) errors or false negatives, where β denotes the probability of making a Type II error.
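The trade-off between α and β can be sketched numerically. Assuming a standard-normal test statistic under Ho and a hypothetical true effect that shifts its mean to 2.0 (both numbers are illustrative, not from this article's experiment):

```python
from statistics import NormalDist

h0 = NormalDist(0, 1)    # distribution of the test statistic if Ho is true
h1 = NormalDist(2.0, 1)  # hypothetical distribution if the treatment truly works

alpha = 0.05
critical = h0.inv_cdf(1 - alpha)  # reject Ho when the statistic exceeds this (one-tailed)

beta = h1.cdf(critical)   # Type II error: the statistic falls below the cutoff anyway
power = 1 - beta          # probability of correctly detecting the effect

# Tightening alpha to 1% raises the cutoff and therefore raises beta.
beta_strict = h1.cdf(h0.inv_cdf(0.99))
```

Lowering α makes the rejection threshold harder to clear, so β goes up: protection against false positives is bought at the price of more false negatives.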
Let's look at DoorDash, a food delivery startup in the Bay Area. They work with various restaurants and have a sales team with a process for converting potential leads into DoorDash clientele. They ran an A/B test on their B2B sales process: half of the salespeople used a modified process while the other half kept the existing one. After six weeks, they collected the sales data for all of their sales employees and wished to measure the impact of the changes and decide whether to adopt or ditch them.
In week 1, DoorDash's sales team with no changes in the sales process (control group) converted 761 of 1,062 potential leads, and so on for the other weeks.
Now let's look at the treatment group's numbers. In the same week, week 1, the team with changes in the sales process (treatment group) converted 442 of 587 leads into clients, a conversion rate of about 75.3% versus the control's 71.7%. That looks like an improvement.
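Those counts translate into rates with plain arithmetic, mirroring what a converted/leads formula in an Excel cell would show:

```python
control_converted, control_leads = 761, 1062
treatment_converted, treatment_leads = 442, 587

p_control = control_converted / control_leads        # ~0.7166, i.e. ~71.7%
p_treatment = treatment_converted / treatment_leads  # ~0.7530, i.e. ~75.3%

lift = p_treatment - p_control  # ~0.036, about 3.6 percentage points
```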
However, since the difference in conversion rates is small, we need to check whether it could be due to randomness alone. To rule out randomness, let's do a hypothesis test at a confidence level of 95%. The metric for our test is the conversion rate of the treatment group.
Let’s establish both our hypotheses:
Ho: Pc = Pt (The success rate of the Control group is the same as the Test Group)
Ha: Pc < Pt (The success rate of the Control group is less than that of the Test Group)
The test works by asking how the treatment group's conversion count would behave if the null hypothesis were true. Under Ho, each of the treatment group's leads converts with the control group's success rate, so the count of conversions follows a binomial distribution, which for a sample this large is well approximated by a normal distribution. We then check where the observed count falls within that null distribution.
If the resulting p-value is less than the predefined significance level, we reject the null hypothesis and conclude that the treatment's success rate genuinely differs from the control's. If the p-value is greater than the significance level, we cannot reject the null hypothesis.
Note that the normal approximation is reasonable only when the sample is large enough; a common rule of thumb is that N·Po and N·(1 − Po) should both be at least 10, which is comfortably satisfied here. Conversely, with very large samples even a tiny, practically irrelevant difference can produce a statistically significant p-value, so always examine the effect size and use your judgment rather than relying solely on the hypothesis test.
Let’s design the experiment in MS Excel for Week 1 to rule out that possibility.
To be able to test our Ho, we need to build the null distribution, a normal distribution, for each of the six weekly samples. For that we need essentially two things: the mean and the standard deviation. Since both are likely to differ from week to week, we will rebuild the distribution six times, once for each of the 6 weeks.
To do this in Excel, define the sample size (N) as that of the treatment group, which is 587, and pick the baseline conversion rate (Po) from our control group, which is 71.6573%. Using these two numbers, we calculate mu, the mean of the null distribution, which is then used to calculate the variance and finally the standard deviation.
Mu = N * Po
Variance = N*Po*(1-Po)
Standard Deviation = SQRT(Variance)
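The three Excel formulas above, applied to Week 1's numbers, can be checked with a quick script:

```python
from math import sqrt

N = 587          # treatment group sample size
Po = 761 / 1062  # control group's conversion rate, ~71.6573%

mu = N * Po                   # expected conversions under Ho, ~420.63
variance = N * Po * (1 - Po)  # ~119.22
std_dev = sqrt(variance)      # ~10.92
```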
Build the normal distribution for Week 1 using the NORM.DIST() function. NORM.DIST(x, mean, standard_dev, TRUE) returns the cumulative probability of observing at most x conversions under the null distribution, where the mean and standard deviation are the values we just calculated and x is the treatment group's observed count of 442.
Another important step is choosing alpha, the probability threshold that a purely chance outcome must fall below for the actual outcome to be deemed statistically significant. Let alpha be 5% for this example.
NORM.DIST returns a cumulative probability of roughly 0.975, that is, about 97.5% confidence (a one-tailed p-value of about 0.025, below alpha = 5%). We can conclude that the better conversions in the treatment group are not within the range of normal chance variation. Hence chance is not responsible for the improved numbers, and the difference is statistically significant.
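The whole Week 1 computation, from counts to decision, can be replicated outside Excel with Python's standard library, where `statistics.NormalDist` plays the role of NORM.DIST:

```python
from math import sqrt
from statistics import NormalDist

# Week 1 data
N = 587          # treatment group leads
observed = 442   # treatment group conversions
Po = 761 / 1062  # control group conversion rate, the baseline under Ho

# Null distribution of the treatment group's conversion count
mu = N * Po
std_dev = sqrt(N * Po * (1 - Po))

# Cumulative probability, equivalent to NORM.DIST(442, mu, std_dev, TRUE)
confidence = NormalDist(mu, std_dev).cdf(observed)  # ~0.975
p_value = 1 - confidence                            # ~0.025, one-tailed

alpha = 0.05
reject_null = p_value < alpha  # True: the improvement is statistically significant
```

Repeating this with each week's N and conversion count reproduces the six weekly tests described above.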