**Hypothesis Testing**

Joey: Good afternoon Chandler.

Chandler: Good afternoon Joe.

Joey: What are we doing?

Chandler: Wasting our lives??

Joey: No, I am asking about lunch.

Chandler: We will try burger today.

Joey: Sounds good. Do you remember the problem which we were discussing on the first day of your job?

Chandler: Yeah, we were trying to answer the question “Does the background colour of our website affect the number of clicks which it receives?”

Number of clicks before changing the colour: 30 per day

Number of clicks after changing the colour: 63.875 per day

Joey: Yeah, you told me that we can’t make a decision just by comparing the sample averages, right? And that we would get a different sample average each time we rerun our experiment. So far you have explained the central limit theorem, the normal distribution and inferential statistics to me. (Please revisit the previous blogs if you want to know more about those topics.) Is there anything else I should know before solving this problem?

Chandler: Yes, you need to know an important topic in inferential statistics, which is hypothesis testing.

Joey: Yeah, I do vaguely remember you mentioning hypothesis testing during our discussion of the central limit theorem.

Chandler: Let me explain what hypothesis testing is. Suppose we are going to decide the colour of our website based on the number of clicks it receives, and the number of clicks is 30 per day for the default colour. If changing the colour increases the number of clicks, we keep the new colour permanently. Otherwise, we stick to the default one.

Here, we need to decide whether to change the colour of the website based on the sample data that we have collected, whose mean is 63.875 per day. We also understand that we can’t make a decision just by comparing two numbers (63.875 per day > 30 per day), because the number of clicks is a random variable. This is where hypothesis testing comes in handy. It accounts for this randomness and uses the sample data to make a decision. In other words, hypothesis testing uses sample data to make an inference about a population parameter.

Let me give you some examples that use hypothesis testing to reach a decision:

- A doctor wanting to know whether children who take vitamin C are less likely to become ill.
- A manufacturer wanting to check whether a product’s quality meets the pre-specified criteria.
- A scientist wanting to know whether teenage boys are more prone to behavioural problems than teenage girls.

In all the above examples, it is not possible for us to analyse the entire population to arrive at a decision. If the doctor wants to know whether children who take vitamin C are less likely to become ill, it would be very costly, and sometimes infeasible, to examine every child in the world. So we always try to make a decision by looking at a sample from the population.

The following procedure is adopted for conducting hypothesis testing:

**Formulate Null and Alternative Hypothesis**: We need to formulate two hypotheses, the null and the alternative. The null hypothesis is usually denoted H0 and the alternative hypothesis H1. The null and the alternative are two mutually exclusive and collectively exhaustive statements about a population parameter. We have to make these statements about the parameter and not about its estimate. The difference is that a parameter characterizes the population, while an estimate characterizes the sample. An estimator is a type of statistic (some function of the samples which we get). Example: the population mean (μ) is a parameter and the sample mean (x̄) is an estimate. An important thing to keep in mind while formulating these hypotheses is that the null hypothesis is a commonly accepted fact (or the default value) and the alternative hypothesis is the statement which a researcher wants to test. In our problem, the null and alternative hypotheses are

H0: Changing the colour of the website doesn’t influence the number of clicks which it receives.

H1: Changing the colour of the website influences the number of clicks which it receives.

If changing the colour of the website has an influence on the number of clicks, then the average number of clicks will move away from the default value (i.e., 30 per day) after the colour change. Mathematically, this is equivalent to saying

H0: μ = 30 per day

H1: μ ≠ 30 per day

Please note the following things in our hypothesis

- The two hypotheses are mutually exclusive (both statements cannot be true simultaneously) and collectively exhaustive (together they cover all possible options).
- The hypotheses are stated for the population parameter (μ) and not for its estimate (x̄).

The above hypothesis test is also called a two-tailed test. There is another way to formulate the null and the alternative hypotheses, called a one-tailed test. It is used when the null hypothesis states that the parameter is at least (or at most) some pre-specified value, so that only a change in one direction counts as evidence against it.

Left-tailed test:

H0: μ ≥ 30 per day

H1: μ < 30 per day

Right-tailed test:

H0: μ ≤ 30 per day

H1: μ > 30 per day

The word ‘null’ in the null hypothesis means that it is a commonly accepted fact that statisticians work to nullify. We can even call it a falsifiable hypothesis. This is one of the reasons why, at the end of a hypothesis test, we say either that we “reject the null hypothesis” or that we “fail to reject the null hypothesis”.

**Calculate test statistic for the sample data:** A test statistic is a function of the sample data that compares it with the value of the population parameter expected under the null hypothesis, which in turn helps us to make a decision in hypothesis testing.

Let me digress a little to explain the mechanism behind the test statistic. Do you remember the CLT (central limit theorem)?

It states that

*“The aggregation of a sufficiently large number of independent random variables results in a random variable which will be approximately a normal distribution”*

Additionally, it tells us the parameters of that normal distribution (i.e., the sampling distribution of the sample mean). With the notation:

μ= Population mean (Original/Parent distribution of the observations)

σ = Population standard deviation

x̄ = Sample mean

- The mean of the sampling distribution of the sample mean is

E[x̄] = μ

where x̄ is the sample mean.

- The standard error (the standard deviation of the sampling distribution of the sample mean, i.e., of the theoretical distribution) is given by

σ/√n

where n is the sample size.

(Please revisit our Sampling distribution of sample mean and central limit theorem blog to understand the intuition behind the theorem)
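The two CLT results above can be checked with a small simulation. This is only an illustrative sketch: the exponential population with mean 30, the sample size of 50 and the number of repetitions are assumptions chosen for the demonstration, not values from our website data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical skewed population: exponential with mean 30.
# For an exponential distribution the standard deviation equals the mean.
POP_MEAN = 30.0
POP_SD = 30.0
n = 50              # observations per sample (assumed)
num_samples = 5000  # how many times we repeat the experiment (assumed)

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean(random.expovariate(1 / POP_MEAN) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = POP_SD / n ** 0.5  # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.2f} (CLT predicts {POP_MEAN})")
print(f"sd of sample means:   {sd_of_means:.2f} (CLT predicts {theoretical_se:.2f})")
```

Even though the individual observations are far from normal, the sample means should centre on μ and spread out by roughly σ/√n.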

The CLT helps us to create a distribution for the sample mean under the null hypothesis, which is called the null distribution. Assume that we know the population standard deviation of the observations (σ). Then the null distribution is given by the CLT as N(μ, σ²/n). In our problem, assume that the population standard deviation turns out to be 25 and the sample size is 8 (we collected 8 observations, Table 1). So the null distribution for our problem (the distribution of the sample means) is N(30, 25²/8). Therefore, the pdf of the null distribution is

f(x̄) = 1 / (SE · √(2π)) · exp(−(x̄ − 30)² / (2 · SE²)), where SE = σ/√n = 25/√8 ≈ 8.84
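As a rough sketch, the null distribution can be built with Python's standard-library `statistics.NormalDist`. The values used here are assumptions taken from this example: a hypothesized mean of 30, n = 8, and σ = 25 (the value used in the recap at the end of this post, and the one consistent with the Z-statistic of 3.83). Note that `NormalDist` takes a standard deviation, not a variance, as its second argument.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 30.0   # hypothesized mean under H0
sigma = 25.0  # population standard deviation (assumed known)
n = 8         # sample size

# Standard deviation of the sampling distribution of the sample mean.
standard_error = sigma / sqrt(n)

# Null distribution of the sample mean: N(mu_0, sigma^2 / n).
null_dist = NormalDist(mu=mu_0, sigma=standard_error)

# Density of the null distribution at the hypothesized mean
# and at the observed sample mean (63.875).
density_at_h0 = null_dist.pdf(30.0)
density_at_sample = null_dist.pdf(63.875)

print(f"standard error       = {standard_error:.2f}")
print(f"pdf at 30            = {density_at_h0:.5f}")
print(f"pdf at 63.875        = {density_at_sample:.7f}")
```

The density at the observed sample mean is far smaller than at the hypothesized mean, which is a first hint that 63.875 sits deep in the tail of the null distribution.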

I think you are now ready to understand how a test statistic compares the sample data with the value of the population parameter expected under the null hypothesis. There are a variety of test statistics, each suited to particular conditions. In our case, we will be using the Z-statistic. It is used when the following conditions hold:

- Performing hypothesis testing on the population mean
- We assume that we know the population standard deviation (σ)

The Z-statistic is defined as

Z = (x̄ − μ) / (σ/√n)

The above formula must be familiar to you: it is just a way to convert a normal distribution into the standard normal distribution Z. The only difference here is that we use σ/√n in the denominator instead of σ. The reason is that the standard deviation of the sampling distribution of the sample mean (the theoretical distribution of the sample mean) is σ/√n. We also know that the standard normal distribution can be seen as a shifted and rescaled version of a normal distribution. The figures below depict this.

To calculate the Z-statistic, we know all the values except the sample mean (x̄). In our example the collected sample was:

The value of the Z-statistic (3.83) tells us how far the sample mean is from the mean under the null hypothesis. (A positive value means the sample mean is higher than the hypothesized mean, and a negative value means the hypothesized mean is higher.)
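The arithmetic behind that value can be sketched as follows. The helper name `z_statistic` is just illustrative, and the inputs are the numbers quoted in this post: a sample mean of 63.875, a hypothesized mean of 30, n = 8, and an assumed known σ of 25.

```python
from math import sqrt


def z_statistic(sample_mean, hypothesized_mean, sigma, n):
    """Distance of the sample mean from the hypothesized mean, in standard errors."""
    standard_error = sigma / sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


z = z_statistic(sample_mean=63.875, hypothesized_mean=30.0, sigma=25.0, n=8)
print(f"z = {z:.2f}")  # prints "z = 3.83"
```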

**Calculate p-value (probability value):** The Z-statistic tells us the distance between the sample mean and the hypothesized mean (H0) in units of the standard error. A Z-score of 3.83 indicates that the sample mean is 3.83 standard errors away from the hypothesized mean. From the normal distribution blog we know that:

- We can calculate the probability for a given Z-score.
- Very high or very low (negative) Z-scores correspond to very small probability values and are found near the tails of a normal distribution.

Once the Z-score is computed, the corresponding probability value is obtained from the Z table (procedure outlined in the normal distribution blog). The higher the probability value, the more plausible it is that the collected sample came from the theoretical distribution N(30, 25²/8). A low probability value indicates that x̄ might have come from some other distribution.

**Decision Making based on the significance level:** So far we have formulated the hypotheses, calculated the test statistic and computed the p-value. As mentioned earlier, the p-value is the probability of obtaining a result at least as extreme as the one observed (the sample), assuming that the null hypothesis is true. Put loosely, it tells us how likely a sample like ours is under the null distribution. To make a binary decision in hypothesis testing, i.e., either rejecting the null hypothesis or failing to reject it, the p-value alone is not enough: we need to fix a threshold in the theoretical distribution. For now, let’s assume that the threshold is 0.1.

- p-value > 0.1 (Decision: Fail to reject the null hypothesis)
- p-value < 0.1 (Decision: Reject the null hypothesis)

The number 0.1 is called the significance level. It represents the level of error we are willing to accept in our decision, namely the probability of rejecting the null hypothesis when it is actually true. Generally, the decisions are made as follows:

- p-value > α (Decision: Fail to reject the null hypothesis)
- p-value < α (Decision: Reject the null hypothesis)

Here α is the significance level. Statisticians and researchers have to decide on a value of α before conducting the hypothesis test. Typical values for α are 0.1, 0.05 and 0.01. Let us calculate the p-value for our problem, which is a two-tailed Z-test:

p-value = P(|Z| ≥ 3.83) = 2 × P(Z ≥ 3.83) ≈ 0.0001

Let us assume an α value of 0.10. Since the p-value is less than α, we reject the null hypothesis. In other words, changing the colour of our website influences the number of clicks.
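A minimal sketch of this step, using the standard normal CDF from Python's standard library in place of a Z table. It assumes the Z-score of 3.83 computed earlier and α = 0.10; the variable names are illustrative.

```python
from statistics import NormalDist

z = 3.83      # Z-statistic from the previous step
alpha = 0.10  # significance level, fixed before looking at the data

# Two-tailed p-value: probability of a Z-score at least as far from zero
# as the observed one, in either direction. By symmetry this is twice the
# area in the lower tail below -|z|.
p_value = 2 * NormalDist().cdf(-abs(z))

decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"p-value  = {p_value:.4f}")
print(f"decision = {decision}")
```

The same code would lead to the same decision at α = 0.05 or α = 0.01, since the p-value here is well below all three typical thresholds.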

Joey: Wow, I didn’t realize the importance of hypothesis testing until this conversation. But I still have a doubt… Why don’t we say “accept the alternative hypothesis” instead of “reject the null hypothesis”, since both mean the same?

Chandler: When our p-value is very low, the only conclusion we can draw is that x̄ is unlikely to have come from the theoretical (null) distribution. We can’t draw any conclusion about the alternative hypothesis. That’s why we always say that we reject the null hypothesis instead of saying that we accept the alternative hypothesis.

Joey: Makes sense. If my understanding of the p-value is correct, it is computed differently for one-tailed and two-tailed tests. Am I right?

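Chandler: Oh yes, you are absolutely correct. The p-value is the probability of obtaining a result **at least as extreme** as the one observed (the sample). Let ‘a’ be the Z-score. Then:

- Left-tailed test: p-value = P(Z ≤ a)
- Right-tailed test: p-value = P(Z ≥ a)
- Two-tailed test: p-value = 2 × P(Z ≥ |a|)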

Joey: Okay, based on our conversation, I understand that the p-value is the probability of observing the collected sample in the theoretical (null) distribution. So for a one-tailed test I should be calculating P(Z = a) and not P(Z ≤ a), right?

Chandler: Let me put it this way. The probability of observing any single exact value under a normal distribution is zero, since it is a continuous distribution. Therefore, we cannot use the probability of observing exactly the sample we collected, as it is always zero. That is the reason why we define the p-value as the probability of obtaining a result **at least as extreme** as the one observed (the sample).
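The tail conventions discussed in this exchange can be sketched as a single helper. The function name `p_value` and the `tail` labels are illustrative choices, not part of any library; the probabilities come from the standard normal CDF in Python's standard library.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def p_value(a, tail):
    """p-value for a Z-score `a` under a left-, right-, or two-tailed Z-test."""
    if tail == "left":    # H1: mu < mu_0, extreme means "as low or lower"
        return STD_NORMAL.cdf(a)
    if tail == "right":   # H1: mu > mu_0, extreme means "as high or higher"
        return STD_NORMAL.cdf(-a)  # equals P(Z >= a) by symmetry
    if tail == "two":     # H1: mu != mu_0, extreme in either direction
        return 2 * STD_NORMAL.cdf(-abs(a))
    raise ValueError("tail must be 'left', 'right' or 'two'")


for tail in ("left", "right", "two"):
    print(f"{tail:>5}-tailed p-value for a = 3.83: {p_value(3.83, tail):.6f}")
```

Note that every branch is an area under the curve (an interval of Z values), never the probability of the single point Z = a, which is zero for a continuous distribution.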

Chandler: Okay, let’s pull the data from our database and run the hypothesis test to check whether a change in the colour of the website has any influence on the number of clicks it receives.

Let red be the default colour and blue be the new colour of the website. Assume we know from historical data that the average number of clicks is 30 per day when the colour is red. We now change the colour to blue and record the number of clicks per day for 8 days (assume σ = 25 and α = 0.05).

No. of clicks per day when the background colour is Blue:

The corresponding p-value is 0.0001, which is less than the threshold (α = 0.05), so we reject the null hypothesis and conclude that changing the colour of the website influences the number of clicks.

*The author of this blog is Balaji P who is pursuing PhD in reinforcement learning at IIT Madras*