The A-Z of Hypothesis Testing

Mehul Gupta
Data Science in your pocket
6 min read · Aug 2, 2019


I have already covered some basic concepts and the different data distributions one must know when beginning with statistics.

Now, time to move on.

You have all heard of Hypothesis Testing! Let's dissect it this time.

TAKE A DEEP BREATH and get started!

I'm quite sure that for many, it's still a black box! Before starting, let me give you a scenario!

Suppose we are given a dataset with months (Jan, Feb, March, etc.) as columns and one row per person, where each cell holds that person's weight for that month. Now, if I ask you whether the season influenced weight gain among these people, how would you approach this problem?

This is a common example of extracting insights from data!

In Data Science, for such a problem, we need Hypothesis Testing.

Now, let's first pen down the steps to follow:

  1. Decide on the null and alternate hypotheses
  2. Set the significance level
  3. Calculate the Z statistic/T statistic (whichever is applicable)
  4. Depending upon the type of test (One or Two-tailed test), calculate the p-Value
  5. If the p-value is less than the significance level, we can reject our null hypothesis.

Feeling Confused? No Need To Worry

Before any explanation, let's pick up a problem.

Now, let us start by stating our Null Hypothesis. Do remember that the null hypothesis supports equality, that is, it assumes that no change or difference has been observed. The Alternate Hypothesis is the opposite of the Null Hypothesis.

Therefore, for this problem our:

Null Hypothesis: There is no difference between the average human temperature and the temperature of a 17 year old.

Alternate Hypothesis: The temperature of a 17 year old is greater than the average human temperature.

Done with the first step!

Choosing the significance level: The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is actually true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a temperature difference exists when in reality there is no difference.

I will keep it simple: 0.05, i.e. 5%.
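To get a feel for what "5% risk" means, here is a small simulation sketch. It assumes a normally distributed population with a known standard deviation (so a Z statistic applies) and a sample size of 25. The function name `rejects_null` and all the numbers are made up for illustration. The null hypothesis is true in every trial, so every rejection is a false alarm.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

ALPHA = 0.05
# Upper-tailed cutoff: the Z value with 5% of the area to its right (about 1.645)
Z_CRITICAL = statistics.NormalDist().inv_cdf(1 - ALPHA)

def rejects_null(n=25, mu=0.0, sigma=1.0):
    """Draw a sample from a population where the null is TRUE and test it."""
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    z = (statistics.fmean(sample) - mu) * n ** 0.5 / sigma
    return z > Z_CRITICAL

trials = 20_000
false_alarm_rate = sum(rejects_null() for _ in range(trials)) / trials
print(round(false_alarm_rate, 3))  # should land close to ALPHA
```

The fraction of wrongly rejected nulls hovers around the chosen alpha, which is exactly the risk we sign up for when we pick 0.05.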

Step 2 done.

Before moving on, let's determine whether the test is One or Two-Tailed.

One Tailed: When the alternate Hypothesis contains a directional inequality (either > or <), the test is one tailed. As our alternate hypothesis states that Temp. for a 17 year old > Avg Temp., it is one tailed.

Note: As Temp. for a 17 year old > Avg Temp., it is an upper tailed test.

Two Tailed: When the alternate Hypothesis contains only "not equal to" (≠, rather than < or >), the test is two tailed. Example: If the question had said that the Temp. of a 17 year old is unequal to the Avg Temp. without saying whether it is greater or less, the test would have been Two Tailed.

Now the big question, What to calculate:

Z Statistic or T Statistic?

Let me make this easy for you!!!

Z Statistic is calculated when the parameters of the entire population are known, in particular the population standard deviation.

Example: For the given problem statement, if the mean (avg Temp) and standard deviation for the entire population of the world (about 7+ billion people) were known, we would be calculating the Z Statistic.

T Statistic is calculated when we only have a sample taken out of the entire population (sample size can be 10, 20, 30, 50, 100 or whatever) and the standard deviation is estimated from that sample.

Example: In our case, we have the mean & standard deviation for 25 people out of the entire population. Hence we would be calculating the T Statistic for our problem.

NOTE: You might often hear people saying that if the sample is greater than 30 or 40 or whatever, we need a Z Stat, else a T Stat. This rule of thumb is misleading! What decides the choice is whether the population standard deviation is known, not the sample size. (For large samples the two statistics give almost the same answer, which is where the rule comes from.)

Now, looking at the formulae for the two cases:

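Z Statistic: (x_alternate - x_null) * √n / population_standard_deviation

T Statistic: (x_alternate - x_null) * √n / sample_standard_deviation

Where:

x_alternate = the observed mean that supports the alternate hypothesis (mean temp of the sampled 17 year olds in our case)

x_null = the mean according to the null hypothesis (avg temperature)

n = sample size (25); for a single observation n = 1, and the Z Statistic reduces to (x_alternate - x_null) / population_standard_deviation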

We would be calculating the T Statistic in our case (as we are working with a sample of 25 people)!!

Now left with the last step!

Depending on what we have calculated (T-Stat or Z-Stat), we would be using a lookup table similar to a log table, i.e. the T Distribution or Z Distribution table.

But first of all, let's understand the p-Value:

The p-value is the value that all of these steps have been building up to.

“The p-value is the probability that the data would be at least as extreme as those observed if the null hypothesis were true.”

It took me some time to get this statement. I am sure that you all are still thinking.

Go through the text below and you will get it.

Now moving on to calculating the p-Value using the T-Statistic:

In the T Distribution table, each row is a Degree of Freedom, which is the sample size minus one (n-1 = 24 in our case), and each column is an alpha, i.e. a significance level. Now, we will try to locate our calculated T Stat in row 24. For the closest value we find, the corresponding alpha (in the column header) is approximately our p-Value. If it's less than the significance level (Step 2), we can reject our null hypothesis.

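Example: If the Degree of Freedom is 10 (11 samples) and the T Stat is 1.9, we go to the 10th row, 6th column, which corresponds to 0.05 alpha (the entry there is about 1.812). The next entry in the row, about 2.228, corresponds to 0.025 alpha. Since 1.9 lies between the two, the one-tailed p-Value is a little below 0.05, so at a 0.05 significance level we can (narrowly) reject the Null Hypothesis.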
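A printed table only brackets the p-Value between two columns. If you want the tail area itself, here is a minimal sketch that integrates the T Distribution's density numerically using nothing but the standard library. The helper names `t_pdf` and `upper_tail_p` are my own, and Simpson's rule is just one simple way to do the integration. In practice a statistics library would do this for you.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail_p(t_stat, df, steps=2000):
    """P(T >= t_stat): Simpson's rule on [0, |t_stat|], then symmetry."""
    b = abs(t_stat)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area between 0 and |t_stat|
    return 0.5 - area if t_stat >= 0 else 0.5 + area

p = upper_tail_p(1.9, df=10)
print(round(p, 3))   # lies between 0.025 and 0.05
print(p < 0.05)      # True
```

For a T Stat of 1.9 with 10 Degrees of Freedom, the one-tailed area comes out between 0.025 and 0.05, i.e. just under a 0.05 significance level.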

For the Z Statistic, we have got a Normal Distribution table:

Here, the rows hold our Z Stat up to one decimal place and the columns hold the 2nd decimal place. The value at the intersection is the area to the left of our Z Stat, from which we get the p-Value.

Example: Let us assume we have our Z Stat as 2.55. For this, we will move to the row with 2.5 and the column with 0.05, and the corresponding value obtained is 0.99461. Remember, for calculating the p-Value from the Z Stat:

* For an upper tailed test (positive Z Stat, as here), the p-Value is 1 minus the value obtained from the table. For a lower tailed test (negative Z Stat), the value obtained from the table is itself the p-Value.

Hence the p-Value would be 1 - 0.99461 = 0.00539 < 0.05. Therefore, we can reject the Null Hypothesis.
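If you don't have a Z table at hand, the same lookup can be sketched with Python's built-in `statistics.NormalDist`, whose `cdf` gives the area to the left of a Z value, just like the table. The variable names are my own.

```python
from statistics import NormalDist

z_stat = 2.55

# What the Z table lists: the area to the left of z_stat
table_value = NormalDist().cdf(z_stat)

# Upper-tailed p-Value: the area to the right of z_stat
p_upper = 1 - table_value

print(round(table_value, 5))  # 0.99461
print(round(p_upper, 5))      # 0.00539
print(p_upper < 0.05)         # True
```

The subtraction is easy to slip on by a decimal place when done by hand, so a two-line check like this is worth the effort.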

This is when we have a One-Tailed Test.

For the Two-Tailed Test, calculate the p-Value as above, then double it before comparing it with the significance level. If we get a one-tailed p-Value of 0.2, it will be 0.4 for a Two-Tailed Test.

Before finishing things off, just remember that Hypothesis Testing can only be used to reject a Null Hypothesis. It never says anything about accepting any sort of Hypothesis. That is, even if the p-Value comes out greater than the significance level, we only fail to reject the Null Hypothesis; we would need to perform some more tests before accepting any Hypothesis!!

I guess this will take some time to digest, so I won't be covering ANOVA this time. But I will be coming out with it soon.
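The one-tailed and two-tailed rules can be put side by side in a short sketch. The function `p_value` and its `tail` argument are hypothetical names of my own, and the sketch covers the Z Stat case only.

```python
from statistics import NormalDist

def p_value(z_stat, tail="two"):
    """p-Value of a Z Stat for an 'upper', 'lower' or 'two' tailed test."""
    lower = NormalDist().cdf(z_stat)  # area to the left of z_stat
    upper = 1 - lower                 # area to the right of z_stat
    if tail == "upper":
        return upper
    if tail == "lower":
        return lower
    # Two-tailed: double the smaller of the two tail areas
    return 2 * min(lower, upper)

one_tailed = p_value(2.55, tail="upper")
two_tailed = p_value(2.55, tail="two")
print(round(two_tailed / one_tailed, 6))  # 2.0
```

Doubling the smaller tail area means a Z Stat of +2.55 and one of -2.55 get the same two-tailed p-Value, which is what we want when the direction of the difference doesn't matter.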
