A/B Hypothesis Testing Explained Using R

Samples, Hypotheses, P-values, Significance, Errors, and Tests

Marco Basile
Analytics Vidhya
12 min read · Sep 26, 2020


Prelude

In the last article, I talked about the importance of using descriptive statistics for your data.

If you didn’t read the last article, I highly suggest you do it.

It explained eight well-known descriptive statistics concepts, some of which are taken for granted in this article and won’t be explained again.

Hopefully, you learned a lot just by looking at the practical examples.

If you do it yourself using R, you’ll learn much faster.

Now, let’s dive into the second article.

Ever heard of CRO?

It stands for conversion rate optimization and it’s one of the branches of digital marketing nowadays.

It starts from the hypothesis that your website might perform better, and then takes action on that hypothesis.

CRO consultants tweak buttons, product messaging, copy, design, and more to increase your conversion rate.

If you’re a math geek and you’d like to get a sneak peek into the CRO world, this article is for you.

If you’re a CRO consultant and you’d like to know how to use R for simple A/B testing concepts, this article is for you.

If you’re just curious and you want to learn, this article is for you as well.

We’re gonna explain the fundamental concepts of A/B testing using R:

  • Sampling, sample size, and Population
  • Hypothesis Creation and Null Hypothesis
  • Type I and Type II errors
  • P-Value and Significance Level
  • Confidence Level and Power
  • One Sample T-test and Multiple T-tests
  • Multiple T-tests problems and ANOVA

How Can A/B Hypothesis Testing Help you?

Let’s say that you work for an e-commerce website and you want to learn how the last UX redesign impacted your user engagement or your purchase rate.

You don’t know whether that’s positive or negative yet, but you have lots of data at hand.

That’s where hypothesis testing comes into the picture.

It’ll help you evaluate whether the change is real and durable, or whether it’s just a random fluctuation.

Hypothesis testing provides you with a framework you can repeatedly use to make informed decisions based on data.

The better your test results, the more confident you can be that changing your website was a good choice.

We’ll talk about that later on.

Sampling, sample size, and Population

Before running any tests, we need to know the fundamental concepts.

In statistics, a sample is a portion of the original population.

Why do we do this?

Think about it.

You want to know the average male height in the US.

Would you measure all of us?

You can, but it’s not efficient: it’d take you years.

Instead, analysts take a sample of the original population, through a process called sampling.

Keep in mind that there’s something called sampling error, which can creep in for a few reasons.

Take a look at the chart below.

Sampling Error — Publication Age and Count of Books Published for Top 100 Novel Authors in French

Do you think that the sample we’ve taken will help us make accurate predictions about the given population?

Of course not.

But what if we take a sample like this?

Sample — Publication Age and Count of Books Published for Top 100 Novel Authors in French

You got it.

This sample will help us make accurate predictions on the data, as it closely represents the population.

We can’t talk of perfection in an imperfect world, but in statistics, everything is always a “hypothesis”.

How do you calculate the minimum sample size to take for the experiment?

It all comes down to three factors:

  • How large of a difference you want to detect
  • Confidence level
  • Power and variability

You’re probably familiar with just the first one.

We’ll talk about the other two below. Keep reading!

Hypothesis Creation and Null Hypothesis

Now that we know what a sample and a population are, let’s talk about hypotheses.

I remember how boring it was trying to come up with hypotheses about triangles in high school.

In the real world, however, it’s all much more exciting.

Let’s make an example.

You think that changing your product image on your product page caused a lift in % of visitors adding the product to their carts.

Using common sense, you’d use the “true” hypothesis:

“The new product image causes a lift in conversion rate for the visitors-cart segment.”

In statistics, instead, to make things less confusing, we use the null hypothesis, which is the exact opposite:

“The new product image doesn’t cause any effect on the conversion rate for the visitors-cart segment.”

This helps us quite a bit, especially with errors.

Think about why Type I and Type II errors always involve a false conclusion, and come up with your own null hypothesis.

Write it down and we’ll see if you got it right.

Type I and Type II errors

We’ll now start using R, but let’s design an experiment first.

We want to know whether

“history and chemistry scholars are interested in volleyball at the same rates”.

We invite 100 history majors and 100 chemistry majors to join a volleyball team.

After one week, we check our subscribers.

39% of chemistry majors subscribed, while just 34% of history majors subscribed.

Since we’ve taken samples of our populations, we want to know whether this result is accurate enough to predict the behavior of an entire population, or whether it’s just sampling error.

We analyze our data again and we decide we should keep our null hypothesis since this is a sampling error.

In other words:

“The subscription rate for history majors is the same as for chemistry majors and any difference is due to sampling error.”

Now, a more experienced analyst (you just started learning, after all) works on this experiment and states that you’re wrong.

You kept the null hypothesis but in reality, you should have rejected it.

Well, your error is a false negative:

A false negative happens when you keep the null hypothesis but, in reality, you should have rejected it: the null hypothesis is actually false.

This is called a Type II error.

So now your result is changed:

“The subscription rate for history and chemistry majors is different and we should reject the null hypothesis.”

Now let’s try with another experiment.

We want to know whether people that took a Coursera certificate are more likely to get a pay raise or not.

The population is around 10,000 people, but we take a sample for efficiency.

We call 200 people who have taken the certificate and 200 who have not.

We then find that 25% of the people who have taken the certificate got a pay raise, while just 18% of those who have not taken it did.

We come up with our null hypothesis:

“There’s no noticeable difference in pay raises between people who have taken a Coursera certificate and people who have not.”

You analyze the data and find that there’s a noticeable difference in pay raises between people who have taken the certificate and people who have not.

You formulate your results in plain words:

“There’s a difference in the pay raise between people who have taken a Coursera certificate and people who have not.”

As before, a senior analyst with some 20 years of experience comes along and states that you’re wrong.

Why?

Because he found a Type I Error.

The analyst:

“You rejected the null hypothesis but in reality, you should not have done it. Now we’re in trouble.”

You:

“Let’s go to that night club you really enjoy tonight…”

The analyst:

“Stop it!!”

And he starts crying.

After having lunch, you analyze the data and you find that he’s right.

A Type I Error is often called a false positive.

As the analyst said, it means the null hypothesis should have been kept, but you rejected it.

You then formulate the new results in plain words:

“After accurate analysis, there’s no noticeable difference in the pay raise between people who have taken a Coursera certificate and people who have not.”

In R, these errors can be detected using the intersect() function.

Let’s say that the outcomes of our last experiments are summarized in four vectors of numbers:

  • Actual positive
  • Actual negative
  • Experimental positive
  • Experimental negative

The first two represent the actual (ground-truth) positives and negatives.

The last two were determined by running the experiment.

In R:
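A minimal sketch of how those vectors might look, assuming each person is identified by a numeric ID (the IDs and vector names below are made up for illustration):

    # Ground truth: who actually got a pay raise and who did not
    actual_positive <- c(1, 2, 3, 4, 5, 6)
    actual_negative <- c(7, 8, 9, 10, 11, 12)

    # What our experiment concluded for each person
    experimental_positive <- c(1, 2, 3, 4, 7, 8)
    experimental_negative <- c(5, 6, 9, 10, 11, 12)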

To detect a false positive or Type I Error, we’ll intersect the real negative with the experimental positive.

Why?

Well, you rejected the null hypothesis with your experiment, so you’ve got a positive.

In reality, the result is negative.

That’s why.

In R:
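Assuming the illustrative vectors above:

    # Type I errors: the experiment says "positive", reality says "negative"
    false_positives <- intersect(actual_negative, experimental_positive)
    false_positives  # 7 8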

To detect a false negative or Type II Error, we’ll intersect the real positive with the experimental negative.

Why?

Well, you kept the null hypothesis with your experiment, so you’ve got a negative.

In reality, the result is positive.

That’s why.

In R:
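And, symmetrically, with the same illustrative vectors:

    # Type II errors: the experiment says "negative", reality says "positive"
    false_negatives <- intersect(actual_positive, experimental_negative)
    false_negatives  # 5 6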

The lines of code above return a vector containing the elements in common between the two vectors passed as inputs.

Now, let’s go on with the P-value and Confidence Level.

P-Value and Significance Level

The P-value is the probability of obtaining the difference you saw in your sample if there really isn’t a difference in the whole population.

P-values help determine how confident you can be in rejecting (or keeping) the null hypothesis.

Let’s make an example.

Do you remember the history and chemistry majors?

We got a 39% subscription rate to our volleyball team for chemistry majors and 34% for history majors, a difference of 5 percentage points.

We run a test on the experiment and, among other results, we find a p-value of 4%.

This means that, assuming the null hypothesis is true (that’s where we start), we’d see a difference of at least 5 percentage points only 4 times out of 100 due to sampling error alone.

The significance level is a threshold for your P-value.

Conventionally, it’s been set at 5%, so you’d have a 5% probability of getting a false positive.

When you’re running an A/B test, this is extremely important.

However, sometimes you’ll accept a higher significance level to keep moving quickly.

That’s fine, as long as the stakes aren’t high enough to lose your shirt over.

Keep in mind that your P-value:

  • does not tell you that B > A
  • does not tell you the probability of making a mistake when you select B over A

These are common misconceptions and important to highlight.

Confidence Level and Power

Once you’ve set your significance level, you can compute the confidence level by subtracting it from 100%.

The main difference between the P-value and the confidence level is in timing:

  • The P-value is obtained after you run the test and tells you how likely your observed difference would be if the null hypothesis were true
  • The confidence level is set before running the test and determines the width of the confidence interval.

Now, since we usually want to be able to reject the null hypothesis, we need to understand how “powerful” our test is.

That’s why we need to introduce the concept of statistical power.

Statistical power is:

“the likelihood that a study will detect an effect, when there is an effect to be detected.”

And it’s determined by:

  • size of the effect you want to detect
  • size of the sample used

Outcome #1:

The bigger the effect the easier it is to detect.

Outcome #2:

The bigger the sample size the easier it is to detect.

When your sample size is too small, you’re likely to end up with an underpowered A/B test.

That means you don’t have enough data to determine the result accurately.

Your probability of getting a false negative type II error is higher than it should be.

Actually, you can increase the sample size to overpower an underpowered A/B test and make up for the difference, but not by too much.

Why?

Because by doing that you can actually achieve the opposite outcome, getting a false positive type I error.

So how do we regulate?

Conventionally, you want to keep your statistical power around 80%, which means there’s a 20% probability of getting a type II error for your A/B tests.
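If you want to see how effect size, sample size, and power interact, base R’s power.t.test() can solve for any one of them given the others. The numbers below are purely illustrative, not taken from the article’s experiments:

    # How many observations per group do we need to detect a difference of
    # 2 points (with a standard deviation of 10), at a 5% significance level
    # and 80% power?
    power.t.test(delta = 2, sd = 10, sig.level = 0.05, power = 0.80,
                 type = "two.sample", alternative = "two.sided")
    # n comes out to roughly 394 observations per group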

Now let’s finally do some testing!

One Sample T-test and Multiple T-tests

Suppose you run a blog and you estimate the average age of your readers to be 30.

Yesterday you got 500 visitors, and their average age was 32.

Are the visitors older than expected, or is it just due to sampling error?

First, let’s set a null hypothesis:

“The sample belongs to a population with the target mean.”

Or

“The average age of the sample and the average age of the population are equal.”

In R, you can test this using the t.test() function.

The t.test() function takes as inputs:

  • a vector containing the values of your sample
  • the argument mu, set to the expected mean you want to test against (30 in our case)

Let’s code this:
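Here’s a minimal sketch of the call. Since we don’t have the real visitor ages, the vector below is simulated, so its exact p-value won’t match the figure discussed next:

    set.seed(42)
    # Simulated ages of yesterday's 500 visitors (illustrative only)
    ages <- rnorm(500, mean = 32, sd = 12)

    # One-sample t-test against the expected mean of 30
    t.test(ages, mu = 30)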

The P-value is slightly above the 5% significance threshold usually accepted, so strictly speaking we should keep the null hypothesis; rejecting it means accepting a higher-than-usual risk of a false positive.

However, in the business world, we can sometimes accept a 5–6% significance level.

In that case, we accept the alternative hypothesis and conclude that the true mean is around 32, not 30.

Now, let’s see how t.test() works with two samples.

You want to compare two samples of your traffic:

  • the average age of last week’s orders
  • the average age of this week’s orders

You calculate the means using R:
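With two hypothetical vectors of customer ages (one per week, made up for illustration), the means are just:

    # Illustrative ages, one vector per week
    last_week_ages <- c(26, 31, 29, 35, 28, 33, 30, 27)
    this_week_ages <- c(38, 34, 41, 36, 39, 33, 40, 37)

    mean(last_week_ages)  # 29.875
    mean(this_week_ages)  # 37.25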

And you run the test to find out:
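Passing both samples to t.test() runs a two-sample (Welch) t-test; with the toy vectors above, the exact p-value will of course differ from the one obtained on the article’s real data:

    # Two-sample t-test: is the difference between the two mean ages real,
    # or just sampling error?
    t.test(last_week_ages, this_week_ages)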

With a very low P-value and the right confidence interval, we can confidently state that there’s a difference between the two sample means and that it’s not due to sampling error.

Multiple T-tests and ANOVA

We’re actually a CRO agency and our client is in a hurry.

He wants to speed up the process and asks you to run multiple t-tests across three different samples.

Your client believes that the P-value always stays the same.

However, you know your stuff and you explain to him the exact opposite:

“Running N t-tests means the overall probability of at least one false positive (Type I error) becomes 1 minus the confidence level raised to the power of N, so it grows quickly as N increases.”

If your confidence level is 95% and you run 3 tests between 3 samples, then the probability of getting a false positive Type I error is:
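Plugging the numbers in:

    # Probability of at least one false positive across 3 tests
    # at a 95% confidence level
    1 - 0.95^3  # 0.142625, i.e. roughly 14%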

This error is unacceptable in statistics, and you explain it to the client.

Now, if your client insists, the only way to keep your error probability low is to use ANOVA, or Analysis of Variance.

In the case you’re comparing the means, ANOVA tests the null hypothesis that all of the datasets you are considering have the same mean.

If you reject the null hypothesis using ANOVA, you’re saying that at least one of your samples has a different mean, but the test doesn’t tell you which one.

If you want to know which one it is, you’ll need follow-up analysis, such as pairwise comparisons.

In R, the ANOVA function is aov(); it takes a formula describing the relationship you want to test and a data frame that combines your samples into a single table.

Let’s say that you want to test the scores at a given game for each major in your college.

Data frame for ANOVA

Well, in R you’ll use:
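Here’s a sketch with a small made-up data frame shaped like the one in the figure, one row per student:

    # Illustrative game scores for students from three majors
    scores_df <- data.frame(
      group = factor(rep(c("history", "chemistry", "biology"), each = 5)),
      Score = c(71, 68, 74, 70, 69,   # history
                80, 83, 79, 82, 81,   # chemistry
                72, 75, 70, 73, 74)   # biology
    )

    # One-way ANOVA: does the mean Score differ across majors?
    results <- aov(Score ~ group, data = scores_df)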

Note: Score ~ group describes the relationship you want to analyze, i.e., how the game score varies across majors.

To retrieve the P-value you need, you’ll run the following piece of code:
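Assuming the fitted model from the sketch above is stored in results:

    # The Pr(>F) column of the ANOVA table is the p-value
    summary(results)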

In this case, the null hypothesis is that

“all the majors score the same results at the video game”

If you reject the null hypothesis, you can confidently state that a pair of datasets is significantly different.

As we said though, you don’t know which ones.

You talk with the client and now he finally understands.

You can keep going with simple A/B testing, not multivariate A/B testing.

Feedback

Awesome, this was the last piece of our article.

How did I do?

Did you understand the concepts or not?

Let me know if you have any questions in the comment section below.

For now, enjoy your day!

Marco
