Hypothesis testing, P-value and some statistical tests

Bharadwaj Narayanam
Published in Analytics Vidhya
11 min read · Sep 10, 2020

We are going to understand hypothesis testing and the P-value in detail, and also conduct a few tests in Python. If you are new to these terms, don’t worry! I’ve got you covered.

Photo by Science in HD on Unsplash

You went on a school trip with all your close friends from school. Guess where? To a really big national wildlife sanctuary. After lunch, you wanted to go for a walk into the sanctuary with your best friend. Though no one is allowed beyond the danger zone, the two of you somehow manage to slip past it. You were walking through the jungle and, alas! YOU SEEM TO BE LOST.

As a brave child, you kept walking, trying to find your way back to your friends and teachers. But all your efforts were in vain.

Photo by Brigitte Tohm on Unsplash

You both halt at a mango tree and decide to rest and fill your tummies with juicy mangoes. Your friend plucks some of the fruit while you relax in the tree’s shade.

As you both start eating, he exclaims, “More than half of the fruit is overripe!”.

“More than half of the fruit is overripe!” is your friend’s claim. In statistical language, we call it a ‘HYPOTHESIS’, or more precisely the ‘NULL HYPOTHESIS’. The ‘ALTERNATE HYPOTHESIS’ claims the exact opposite of the null hypothesis.

What is Hypothesis testing?

You now know what a hypothesis is. Hypothesis testing is the process of evaluating two mutually exclusive statements (both cannot be true at the same time) about a population, using a sample. There are many procedures to test a hypothesis, such as:

One sample proportion test

Chi-square test

T-test

ANOVA

Z test

and many more… You’ll be surprised to know that there are more than 100 hypothesis tests, and one article cannot cover them all. However, we’ll discuss some of the important procedures later in this article.

In brief:

In the above example, your friend already stated a null hypothesis for you. But you won’t be handed a null hypothesis in every case; often you will have to frame it yourself.

So, the first and foremost step in hypothesis testing is identifying a null and an alternative hypothesis.

The main objective of hypothesis testing is to test whether the null hypothesis holds. To do that, we need data, collected and arranged in a convenient form. I’ll explain this with the help of the same example.

You wanted to test the credibility of your friend’s claim. As you are alone in the jungle and have nothing better to do, you pick 5 samples of 10 mangoes each from the tree and find [4,6,2,3,7] overripe mangoes in the respective samples. We’ve collected data on 50 mangoes, so our sample size is 50, and we are going to test whether more than 25 of them are overripe. There you go! You have your data.
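As a quick sanity check before any formal test, a minimal sketch in plain Python tallying the data:

```python
# Overripe counts in each of the five samples of 10 mangoes
samples = [4, 6, 2, 3, 7]

total_overripe = sum(samples)       # 22
total_mangoes = 10 * len(samples)   # 50

# Observed proportion of overripe mangoes in the sample
proportion = total_overripe / total_mangoes
print(total_overripe, "overripe out of", total_mangoes)
print("Observed proportion:", proportion)  # 0.44
```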

Anyway, we are not going to do a statistical test on this for now. But we’ll try to look at those things which we need to do a hypothesis test. Even before that, we need to discuss the errors we might make while testing for a hypothesis. There are two types of errors.

What is Type — I error?

Rejecting the null hypothesis when it is actually true is a type-I error. This may happen for many reasons, such as insufficient data or improper techniques.

What is Type — II error?

This is the opposite of a type-I error. When we fail to reject the null hypothesis even though it is false, that’s a type-II error.

What is the level of significance(alpha)?

The probability of committing a type-I error is called the level of significance. Tests are usually run with an alpha level of 5% or 1%. It defines the rejection region: if the P-value falls below the significance level, we reject the null hypothesis.

This is entirely based on how much risk you want to take. If you want to be 95% confident in your results, your alpha level would be 5%; for 99% confidence, it would be 1%. I’m afraid your test will not give 100% accurate results at any point, barring some infrequent cases.

Here arises a super interesting question. Why would I choose my accuracy to be 95% when I can actually choose it to be 99%? Here’s something we need to understand.

Why is the level of significance typically chosen as 5%?

I’ll repeat the definition for you: “the probability of committing a type-I error is called the level of significance”. We’ve also discussed the type-I error: rejecting the null hypothesis when it is true.

So, when we decrease the level of significance, we decrease the probability of committing a type-I error, and that’s fine. But at the same time, we increase the probability of committing a type-II error. So it is always better to assess the risk in the situation before setting your alpha value.

Now we are going to discuss one-tailed and two-tailed tests.

What is a one-tailed test?

Let us again look at the above mango example. What did your friend say? He said, “More than half of the fruit is overripe!”, and that’s our null hypothesis. So what would be the alternate hypothesis?

It is “Not more than half of the fruit is overripe”. We need to test whether the number of overripe mangoes is greater than 50% of the total fruit.

This kind of test will be the one-tailed test. Let me tell you about the two-tailed test so that you can understand both of them in detail.

What is a two-tailed test?

A government official claims that the dropout rate for local schools is 25%. Last year, 190 out of 603 students dropped out.

Be clear about your null hypothesis. It would be, “Exactly 25% of the students drop out every year”, and the alternative hypothesis would be, “The percentage of students dropping out is not 25%”. This is a two-tailed test because we reject the null hypothesis if the observed percentage is significantly less than or greater than 25%.

But in the mango example, we reject the null hypothesis only if the overripe mangoes are significantly less than 50% of the total mangoes.

There’s another point to note. In two-tailed tests, the level of significance splits into two, i.e., if you choose your alpha level to be 5%, each of the two tails gets 2.5%.
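To see what this split means in practice, here is a small sketch (using the standard normal as an example distribution) comparing the critical values of a one-tailed and a two-tailed test at alpha = 5%:

```python
from scipy.stats import norm

alpha = 0.05

# One-tailed test: the whole 5% rejection region sits in one tail
z_one_tailed = norm.ppf(1 - alpha)      # about 1.645

# Two-tailed test: the 5% splits into 2.5% per tail, pushing the
# cutoff further out
z_two_tailed = norm.ppf(1 - alpha / 2)  # about 1.96

print(round(z_one_tailed, 3), round(z_two_tailed, 3))
```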

Photo by ZSun Fu on Unsplash

What is P-value?

Suppose you flip an unbiased coin twice. Can you tell me what’s the probability of two heads turning up? The total possible outcomes are [HH, HT, TH, TT]. As probability is the number of favourable outcomes divided by the total number of outcomes, the probability would be 1/4 or 0.25.

Now, what’s the P-value for two heads turning up? Usually, P-values are read from tables for each test under given conditions such as a two-tailed test, degrees of freedom, etc. But today, let us learn how these P-values are calculated and used.

P-value is the sum of three parts.

The probability of the event.

The probability of events which are equally rare.

The probability of events which are rarer.

How to calculate P-value?

Consider the above example. Let us calculate the P-value of two heads showing up when we toss a coin twice.

The first part is the probability of the event, which we’ve already calculated: 0.25. The second part is the probability of events which are equally rare, i.e., the probability of two tails showing up, which is also 0.25.

The third part is the probability of events which are even rarer. There are no events rarer than two heads in our case, so the third part is zero. Adding these up, we get a P-value of 0.5.
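The three-part sum can be checked with a brute-force enumeration of the four outcomes (a small sketch; rarity here is ranked by the probability of each outcome's head count):

```python
from itertools import product

# All four outcomes of two fair coin flips, each with probability 0.25
outcomes = [''.join(o) for o in product("HT", repeat=2)]  # HH, HT, TH, TT
prob = {o: 0.25 for o in outcomes}

def rarity(o):
    # Probability of seeing this outcome's head count: HH and TT each
    # occur one way (0.25), HT/TH together occur two ways (0.5)
    return sum(prob[x] for x in outcomes if x.count("H") == o.count("H"))

# P-value for HH = P(HH) + P(equally rare: TT) + P(rarer: none) = 0.5
p_value = sum(prob[x] for x in outcomes if rarity(x) <= rarity("HH"))
print(p_value)  # 0.5
```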

Most people mistake the P-value for plain probability. Hopefully you can see the difference now.

Now you might ask a question. This is a small experiment and there are only 4 total outcomes, so we were able to calculate the P-value easily. What if there are hundreds of observations? Thousands? what would we do then?

We use something called a probability density.

Suppose we are measuring the heights of women in a college. We’ve collected the data and plotted it as a normal distribution.

Most of the values lie between 142 and 169 cm; to be exact, 95% of them do. This indicates a 2.5% probability that a randomly measured woman is taller than 169 cm, and a 2.5% probability that she is shorter than 142 cm.

Now let us get to calculate P-values. What is the P-value for someone whose height is 142 cm?

1st part (probability of the event): As height is continuous and we have hundreds of observations, the probability of someone’s height being exactly 142 cm is negligible.

2nd part (probability of equally rare events): As the first part tends to zero, this also tends to zero.

3rd part (probability of rarer events): We consider the values in the tails, as their probability of occurrence is smaller than that of a height of 142 cm. That is 2.5% + 2.5%, which results in 5%.

So the P-value for someone whose height is 142 cm is 0.05. Now let us have a look at some tests we generally use for hypothesis testing.
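This tail argument can be reproduced with scipy.stats.norm. A sketch under an assumed model: the mean and standard deviation below are back-calculated from the 142–169 cm interval covering 95% of a normal distribution, not measured data:

```python
from scipy.stats import norm

# Assumed parameters: if 95% of heights fall in [142, 169] cm, a
# normal model gives mean = midpoint and sd = width / (2 * 1.96)
mean = (142 + 169) / 2          # 155.5 cm
sd = (169 - 142) / (2 * 1.96)   # about 6.89 cm

# Both tails count as "as rare or rarer" than a height of 142 cm
p_value = 2 * norm.cdf(142, loc=mean, scale=sd)
print(round(p_value, 3))  # 0.05
```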

T-test

There are two types of T-test: the one-sample T-test and the two-sample T-test.

These are exactly what they sound like. A one-sample T-test is used when we are dealing with a single sample, and a two-sample T-test when we want to compare two different samples.

One sample T-test

As mentioned earlier, a one-sample T-test is used when there is one sample (or different samples from the same population). Let’s run the test on the mango example.

Null hypothesis: Overripe fruits > 25 (half of the total 50 fruits).
Alternate hypothesis: Overripe fruits <= 25.

Actual overripe fruits are [4,6,2,3,7]

# Import the one-sample t-test
from scipy.stats import ttest_1samp

# ttest_1samp returns two values: the t statistic and the p-value
statistic, p_value = ttest_1samp([4, 6, 2, 3, 7], 5)
# We compare against 5 because 50% of each sample of 10 is 5; over
# the 5 samples that corresponds to 25 out of 50 mangoes
print("The P-value is:", p_value)
if p_value <= 0.05:
    print("The null hypothesis is rejected")
else:
    print("Your friend's claim is true! Unable to reject null hypothesis")
Output:
The P-value is: 0.5528894339334173
Your friend's claim is true! Unable to reject null hypothesis
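One caveat: ttest_1samp runs a two-sided test by default, while the claim here is directional. Newer SciPy releases (1.6 and later) accept an `alternative` argument for one-sided tests; a sketch assuming such a version:

```python
from scipy.stats import ttest_1samp

samples = [4, 6, 2, 3, 7]

# The null mean here is 5 overripe mangoes per sample (25 of 50 in
# total), so the directional alternative is "less"
statistic, p_value = ttest_1samp(samples, 5, alternative="less")
print("One-sided P-value:", p_value)  # half of the two-sided 0.5529
```

The one-sided P-value is still well above 0.05, so the conclusion is unchanged.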

Two sample T-test

We use this test when we consider two independent samples, i.e., samples belonging to different populations; it is also known as the independent T-test. Consider two samples of tobacco-plant heights from two different fields. Field_1 heights are [69,56,84,63,34,45,73,65], and Field_2 heights are [46,34,23,56,42,54,32,49]. Let us find out whether there is a difference between the yields of the two fields.

Null hypothesis: There is no difference between the yields of the two fields.

Alternate hypothesis: There is a difference between the yields of the two fields.

field_1, field_2 = [69,56,84,63,34,45,73,65], [46,34,23,56,42,54,32,49]
# Import the independent (two-sample) t-test
from scipy.stats import ttest_ind

# ttest_ind returns two values: the t statistic and the p-value
statistic, p_value = ttest_ind(field_1, field_2)
print("P-value is:", p_value)
if p_value <= 0.05:
    print("We reject the null hypothesis. There is a difference between the yields of the two fields")
else:
    print("We fail to reject the null hypothesis. There is no difference between the yields of the two fields")
Output:
P-value is: 0.015464452955301845
We reject the null hypothesis. There is a difference between the yields of the two fields

In the above two-tailed two-sample T-test, the null hypothesis got rejected. Notice, though, that the P-value is greater than 0.01, which means that if we had chosen alpha to be 0.01, we would have failed to reject the null hypothesis. But that choice increases the chance of a type-II error, which we cannot afford here.
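A side note: ttest_ind pools the two sample variances by default, which assumes the fields have equal spread. Welch's T-test drops that assumption; a sketch using the `equal_var` flag:

```python
from scipy.stats import ttest_ind

field_1 = [69, 56, 84, 63, 34, 45, 73, 65]
field_2 = [46, 34, 23, 56, 42, 54, 32, 49]

# equal_var=False runs Welch's t-test, which does not assume the two
# populations have the same variance; it is often the safer choice
statistic, p_value = ttest_ind(field_1, field_2, equal_var=False)
print("Welch's P-value:", p_value)
```

Here the conclusion is the same: the P-value stays below 0.05 and we still reject the null hypothesis.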

Chi-square test

This test is applied when you have two categorical variables from a single population. It determines whether there is a significant association between the two variables.

To perform this test, we are going to load a dataset from the seaborn library.

# Importing essential libraries and loading the dataset.
import scipy.stats as stats
import seaborn as sns
import pandas as pd
import numpy as np
dataset=sns.load_dataset('tips')
print("Dimensions of the dataset:",dataset.shape[0],"rows and",dataset.shape[1],"columns.")
dataset.head()
Output:
Dimensions of the dataset: 244 rows and 7 columns.
First five observations of our dataset

As we can see, there are 7 columns in the dataset. But we are concerned with only two of them, “sex” and “smoker”. We are going to find out whether there’s any association between these two columns.

Null hypothesis: There is no association between the two columns.

Alternate hypothesis: There’s an association between those two columns.

# Making a crosstab for the two columns we want to find the association between.
sex_smoker = pd.crosstab(dataset['sex'],dataset['smoker'])
val = stats.chi2_contingency(sex_smoker)
val
Output:
(0.008763290531773594, 0.925417020494423, 1, array([[59.84016393, 97.15983607],
[33.15983607, 53.84016393]]))

Let us see what the chi2_contingency function returned. The first value is the chi-square test statistic, the second is the P-value, the third is the degrees of freedom, and the last one, an array, holds the expected frequencies.

degrees_of_freedom,expected_values = val[2],val[3]
print("Degrees of freedom:",degrees_of_freedom , "\nExpected values:",expected_values)
# Storing the crosstab
observed_values = sex_smoker.values
Output:
Degrees of freedom: 1
Expected values: [[59.84016393 97.15983607]
[33.15983607 53.84016393]]

Performing the chi-square test

from scipy.stats import chi2
# Sum (observed - expected)^2 / expected over every cell of the table
chi_square = sum([(o - e) ** 2.0 / e for o, e in zip(observed_values, expected_values)])
chi_square_statistic = chi_square[0] + chi_square[1]
print('Chi-square test statistic is:', chi_square_statistic)
Output:
Chi-square test statistic is: 0.001934818536627623

The code above implements the chi-square test statistic formula.
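For reference, the statistic the loop computes is

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```

where the O_i are the observed cell counts and the E_i are the expected counts under independence.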

If you don’t understand the code, don’t worry about it. We’ll be discussing these tests in detail theoretically in the future.

# Calculating the critical value. In statistical tables it is given to
# you; while coding, we compute it ourselves.
alpha = 0.05
critical_value = chi2.ppf(q=1-alpha, df=degrees_of_freedom)
print('critical_value:', critical_value)
# ppf is the percent point function, the inverse of the cumulative
# distribution function.
Output:
critical_value: 3.841458820694124

Just to make the point clear: if the chi-square statistic is greater than the critical value, we reject the null hypothesis.

if chi_square_statistic >= critical_value:
    print("Reject null hypothesis: there is a relationship between the 2 categorical variables")
else:
    print("Retain null hypothesis: there is no relationship between the 2 categorical variables")
Output:
Retain null hypothesis: there is no relationship between the 2 categorical variables
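Equivalently, we can convert the statistic to a P-value with the chi-square survival function and compare it against alpha. A self-contained sketch using the numbers from above; note the result differs slightly from the 0.925 returned by chi2_contingency, because that function applies Yates' continuity correction when the degrees of freedom equal 1:

```python
from scipy.stats import chi2

# Values computed above, hardcoded so this sketch is self-contained
chi_square_statistic = 0.001934818536627623
degrees_of_freedom = 1

# The survival function (1 - CDF) turns the statistic into a P-value
p_value = chi2.sf(chi_square_statistic, df=degrees_of_freedom)
print("P-value:", round(p_value, 4))

if p_value <= 0.05:
    print("Reject null hypothesis")
else:
    print("Retain null hypothesis")
```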

