Fundamentals to know before analyzing A/B test results

Shraddha Gupta
5 min read · Jan 27, 2022


source: https://unsplash.com/photos/fIq0tET6llw

In my previous post, I discussed the steps to take before starting an A/B test. Now let's go over some key concepts that help in analyzing the results and drawing the right conclusions.

1. Null vs. Alternative Hypothesis

Let’s say a company wants to test the effect of changing the description of one of its products, and it runs an A/B test with the ultimate goal of better user conversion. Let’s state the null and alternative hypotheses for this test. The null hypothesis (represented by H0) is that there is no difference in conversion rate between test and control. The alternative hypothesis (represented by H1 or Ha) is that test has a higher conversion rate than control (for a one-tailed test), or that the conversion rates for test and control are different (for a two-tailed test).

In short, the null hypothesis states that there is no difference between test and control, while the alternative hypothesis states that there is one. This should always be decided before running an A/B test, as it defines the type of test: one-tailed vs. two-tailed.
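To make this concrete, here is a minimal sketch of how the two flavors of the alternative hypothesis translate into code for the conversion example above. The visitor and conversion counts are made up for illustration, and the two-proportion z-test from statsmodels is one common way to compare conversion rates; the `alternative` argument is what encodes one-tailed vs. two-tailed.

```python
# Hypothetical conversion data; a two-proportion z-test is one common
# choice for comparing conversion rates between test and control.
from statsmodels.stats.proportion import proportions_ztest

conversions = [580, 520]      # [test, control] conversions (made up)
visitors = [10_000, 10_000]   # [test, control] visitors (made up)

# Two-tailed: H1 is "test and control conversion rates are different"
z_two, p_two = proportions_ztest(conversions, visitors, alternative="two-sided")

# One-tailed: H1 is "test has a HIGHER conversion rate than control"
z_one, p_one = proportions_ztest(conversions, visitors, alternative="larger")

print(f"two-tailed p-value: {p_two:.4f}")
print(f"one-tailed p-value: {p_one:.4f}")
```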

2. Type I vs. Type II error

source: https://www.pinterest.com/pin/4433299620057703/

Although I briefly covered this in my previous post, let’s do a quick refresher.

Type I error means rejecting the null hypothesis when it’s true, i.e., there wasn’t any difference between test and control, but we conclude that there is one.

Type II error means failing to reject the null hypothesis when it’s false, i.e., there was a difference, but we couldn’t pick it up.
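Both error types can be seen directly in a small simulation. This is a minimal sketch with made-up effect sizes and sample sizes: an A/A comparison (where the null is true) estimates the Type I rate, and an A/B comparison with a real but small effect estimates the Type II rate.

```python
# Estimate Type I and Type II error rates by simulation (made-up numbers).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, n_sims = 0.05, 500, 2_000

type_1 = type_2 = 0
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)   # A/A: same distribution, H0 is true
    if stats.ttest_ind(a, b).pvalue < alpha:
        type_1 += 1               # rejected a true null (Type I)

    c = rng.normal(0.1, 1.0, n)   # A/B: real effect of 0.1, H0 is false
    if stats.ttest_ind(a, c).pvalue >= alpha:
        type_2 += 1               # failed to reject a false null (Type II)

print(f"Type I rate:  {type_1 / n_sims:.3f} (should be close to alpha = {alpha})")
print(f"Type II rate: {type_2 / n_sims:.3f} (i.e. 1 - power for this setup)")
```

Note how the Type I rate hovers around alpha by construction, while the Type II rate depends on the effect size and the sample size.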

3. P-Value

The p-value is not the alpha value, and it also doesn’t tell us the probability of rejecting the null hypothesis. The p-value is the probability of seeing results at least as extreme as the ones we got, assuming the null hypothesis is true.

For example, setting alpha at 0.05 means that out of 100 tests where there is truly no difference, we are OK with about 5 of them showing a difference anyway. And for the same test, a p-value of 3% means that if there were truly no difference, there would be a 3% chance of seeing results at least as extreme as ours.

To put it together: alpha is a fixed threshold we choose upfront (at most about 5 out of 100 truly-null tests flagged as significant), while the p-value is computed from the data, so it will be different for every test run.

Typically, if alpha is set at 0.05, then a p-value < 0.05 means our results are statistically significant.

4. t-test


A t-test is a statistical test used to compare the means of two groups. It assumes that the two samples are drawn from normally distributed populations. A t-test is usually used when the population variance is not known, especially when the sample size is small (<30).

There are 3 types of t-tests (each is illustrated in the sketch after this list) —

a. One-sample — Compare the mean of the sample with a specific value

b. Two-sample or Independent t-test — Compare means of two groups that come from two different populations

c. Paired t-test — Compare means of two groups that come from the same population — measures a before and after effect
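Here is a minimal sketch of the three variants using SciPy, on made-up data; the group means and sizes are arbitrary and only there to make the calls concrete.

```python
# The three t-test variants in SciPy, on hypothetical data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(10.0, 2.0, 50)
test = rng.normal(10.5, 2.0, 50)
before = rng.normal(10.0, 2.0, 50)
after = before + rng.normal(0.3, 1.0, 50)  # same users, measured twice

# a. One-sample: is the mean of `test` different from a specific value (10)?
res_one = stats.ttest_1samp(test, popmean=10.0)

# b. Two-sample / independent: do two groups of different users differ?
res_ind = stats.ttest_ind(test, control)

# c. Paired: did the same users change between `before` and `after`?
res_rel = stats.ttest_rel(after, before)

for name, res in [("one-sample", res_one), ("independent", res_ind), ("paired", res_rel)]:
    print(f"{name}: t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```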

I am not going into the mathematical details of the t-test; you can easily find them online. The output of a t-test gives (the sketch after this list computes each of these by hand) -

a. t-value — Ratio of the variance between the two groups to the variance within the groups. A high t-value means that the groups are different from each other and the variance within each group is low

b. degrees of freedom — The number of independent data points available for estimating variability. For a one-sample test it is the sample size minus 1; for a pooled two-sample test it is n1 + n2 - 2

c. p-value — Probability of getting a t-value at least as large (in absolute value) as the one we got by chance, if the null hypothesis were true

d. Confidence Interval — A range that, across repeated experiments, would contain the true difference in means 95% of the time (or whatever confidence level is set)

e. t-critical value — The cut-off value that the t-value is compared against to decide whether the null hypothesis should be rejected. Typically, if the absolute t-value is greater than the t-critical value, the null hypothesis is rejected
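To see where each of these numbers comes from, here is a sketch that computes all five by hand for a two-sample test with equal variances, on made-up data (scipy.stats.ttest_ind would give the same t and p directly).

```python
# Compute t-value, degrees of freedom, p-value, CI, and t-critical by hand
# for a pooled two-sample t-test (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10.5, 2.0, 40)  # test group (made up)
b = rng.normal(10.0, 2.0, 40)  # control group (made up)

n1, n2 = len(a), len(b)
dof = n1 + n2 - 2  # b. degrees of freedom

# Pooled variance and the standard error of the difference in means
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
diff = a.mean() - b.mean()

t_value = diff / se                            # a. t-value
p_value = 2 * stats.t.sf(abs(t_value), dof)    # c. two-tailed p-value
t_crit = stats.t.ppf(0.975, dof)               # e. t-critical at alpha = 0.05
ci = (diff - t_crit * se, diff + t_crit * se)  # d. 95% confidence interval

print(f"t = {t_value:.3f}, df = {dof}, p = {p_value:.4f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), t-critical = {t_crit:.3f}")
print("reject H0" if abs(t_value) > t_crit else "fail to reject H0")
```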

5. Statistical Significance

When do we say that our results are significant? There is a set of conditions that can be checked:

  1. The P-value should be less than the alpha value
  2. The upper and lower bounds of the confidence interval should have the same sign and not be too far apart. If the upper bound is positive and the lower bound is negative, the interval includes 0, which means the lift for that metric could still fluctuate in either direction
  3. The cumulative trend of lift% for the metrics should be flat for the last couple of days. This is to ensure that our results are stable and won’t change much even if we continue to run the test for another couple of days
  4. The lift% should not be much smaller than the minimum detectable effect (MDE) we care about for our primary metric. If the lift% is much smaller than the MDE, then even if our results are significant based on the other conditions, we may not conclude the test to be successful, since the change is not big enough to justify the launch cost
  5. The daily trend of lift% for the metrics should have the same sign on most days — a sign test (see the sketch after this list)
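As a minimal sketch of two of these checks, here is how the confidence-interval sign check (#2) and the sign test (#5) could look in code. The daily lift% values and the CI bounds are entirely made up, and the sign test uses SciPy's binomtest (available in SciPy >= 1.7).

```python
# Check #2 (CI excludes 0) and check #5 (sign test) on hypothetical data.
from scipy import stats

daily_lift_pct = [0.8, 1.2, -0.3, 0.9, 1.5, 0.7, 1.1, 0.4, 1.3, 0.6]  # made up
ci_lower, ci_upper = 0.2, 1.4  # hypothetical 95% CI for the overall lift%

# Check 2: both CI bounds should share a sign, i.e. the interval excludes 0
ci_excludes_zero = (ci_lower > 0) == (ci_upper > 0)

# Check 5: sign test. If there were no real lift, positive and negative
# days would each occur with probability 0.5.
positive_days = sum(x > 0 for x in daily_lift_pct)
sign_test = stats.binomtest(positive_days, n=len(daily_lift_pct), p=0.5)

print(f"CI excludes 0: {ci_excludes_zero}")
print(f"sign test: {positive_days}/{len(daily_lift_pct)} positive days, "
      f"p = {sign_test.pvalue:.4f}")
```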

Hope this was helpful as a starting point. I haven’t captured the details of these concepts here as each of these could be expanded into separate topics.

References:

  1. Udacity’s Course on AB Testing
  2. https://www.statisticshowto.com/
  3. https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/t.test

