Hypothesis Testing (Part 1)
Hypothesis testing is one of the most important procedures in statistics. It answers a simple question: is a claim about a population supported by the evidence in a sample drawn from that population?
It is a statistical process of either rejecting or retaining a claim or belief related to a business, product, service etc.
Let’s take an example to understand it a bit better. Suppose your friend claims that his long-term batting average is 75, and to verify his claim you go to watch him play. However, he manages to score only 40 runs in 3 matches. Now you have to make a decision about his claim, and you have the following 3 choices:
- Are you going to believe your friend’s claim?
- Are you going to reject his claim?
- Is there not enough evidence to make a decision?
We can also use hypothesis testing to test claims made by various organizations, such as:
- Children who drink Health Drinks are likely to grow taller
- Hand wash can kill up to 99% of germs
To come to a decision backed by statistics, we use hypothesis testing. I hope it is now clear why we need hypothesis testing: it helps us check the claims that are made.
The following are the prerequisites before we jump into hypothesis testing:
- Sample and Population
- Standard deviation
- Normal distribution
- Central limit theorem
In this article we are going to cover:
- How to define a hypothesis
- One-tailed and two-tailed tests
- Type I and Type II errors
- What the types of test are and when to use them
- One-tailed and two-tailed tests of means
Defining a Hypothesis : Hypothesis testing consists of two complementary statements, the null hypothesis (H0) and the alternative hypothesis (H1). Together, these two cover all the possible outcomes of the study.
Null Hypothesis (H0): It states that nothing new is happening, i.e. the old standard is correct.
Alternative Hypothesis (H1): It states that the new theory (the claim made) is true, i.e. there is a new standard.
Suppose a broker claims that the average price of an apartment in a metropolitan city is more than 50 Lakhs. The null and alternative hypotheses in this scenario will be:
H0: μ ≤ 50, H1: μ > 50
In this example, the broker made a claim regarding the average price of an apartment in a metropolitan city.
The null hypothesis states that nothing new is happening (H0: μ ≤ 50).
The alternative hypothesis states that the new theory, i.e. the claim made, is true (H1: μ > 50).
In this manner we can define the null and alternative hypotheses. One thing to keep in mind is that the ‘=’ sign always goes in the null hypothesis.
One Tailed Tests: Hypotheses are written in such a way that they produce either a one-tailed test or a two-tailed test. Consider the following example:
Suppose it is claimed that the average salary of a Data Scientist is more than 18 Lakhs per annum. The null and alternative hypotheses will be as follows:
H0: μ ≤ 18 , H1: μ > 18
In this example, the rejection region for the null hypothesis lies on the right side of the distribution, so this is known as a Right-Tailed Test.
Similarly, suppose the claim is that the average time to get a passport in India is less than 30 days. The null and alternative hypotheses in this case will be:
H0: μ ≥ 30 , H1: μ < 30
In this example, the rejection region for the null hypothesis lies on the left side of the distribution, so this is known as a Left-Tailed Test.
In a one-tailed test, the alternative hypothesis uses either the greater than (>) or the less than (<) sign.
Two-Tailed Test: It always uses the equals (=) sign in the null hypothesis and the not-equals (≠) sign in the alternative hypothesis, e.g. a manufacturer wants to check whether a machine is filling 500 ml of beverage or not.
The null and alternative hypotheses in this case are going to be:
H0: μ = 500 , H1: μ ≠ 500
In this example the rejection region can be on either side of the distribution.
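To make the three kinds of rejection region concrete, here is a minimal sketch that computes the z critical values for a right-tailed, left-tailed and two-tailed test. It assumes a 5% level of significance and a standard normal test statistic; the function name `critical_values` is made up for this illustration.

```python
from statistics import NormalDist


def critical_values(alpha: float, tail: str) -> tuple[float, ...]:
    """Return the z critical value(s) that bound the rejection region."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "right":  # H1: mu > mu0, reject for large z
        return (std_normal.inv_cdf(1 - alpha),)
    if tail == "left":  # H1: mu < mu0, reject for small z
        return (std_normal.inv_cdf(alpha),)
    if tail == "two":  # H1: mu != mu0, alpha is split across both tails
        upper = std_normal.inv_cdf(1 - alpha / 2)
        return (-upper, upper)
    raise ValueError("tail must be 'right', 'left' or 'two'")


for tail in ("right", "left", "two"):
    print(tail, [round(z, 3) for z in critical_values(0.05, tail)])
```

At the 5% level this gives a cut-off of about 1.645 for the right-tailed test, about -1.645 for the left-tailed test, and about ±1.96 for the two-tailed test, because the 5% is shared between the two tails.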
Type I and Type II error : While testing hypotheses, it is possible to make an error. There are two types of error that can be made, Type I and Type II, as given below.
Rejecting a true null hypothesis is called a Type I error.
Retaining a false null hypothesis is called a Type II error.
P(Type I error) = α = P(Rejecting the null hypothesis | H0 is true)
P(Type II error) = β = P(Retaining the null hypothesis | H0 is false)
α and β are the probabilities of making a Type I and a Type II error, respectively.
1-β is known as the power of the hypothesis test. It is defined as the probability that a false null hypothesis will be detected, i.e. rejected.
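As a rough illustration of β and power, the sketch below computes them for a right-tailed z-test. All the numbers (a hypothesized mean of 500, a true mean of 502, a population standard deviation of 5 and a sample size of 25) are hypothetical and chosen only for this example, and the function name `right_tailed_power` is not from any library.

```python
from math import sqrt
from statistics import NormalDist


def right_tailed_power(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of a right-tailed z-test (H0: mu <= mu0) when the true mean is mu_true."""
    se = sigma / sqrt(n)  # standard error of the sample mean
    # Smallest sample mean that leads to rejecting H0 at level alpha.
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # beta: chance that the sample mean stays below the cut-off although H0 is false.
    beta = NormalDist(mu_true, se).cdf(cutoff)
    return 1 - beta


# Hypothetical values, not taken from the article.
power = right_tailed_power(mu0=500, mu_true=502, sigma=5, n=25)
print(f"beta = {1 - power:.3f}, power = {power:.3f}")
```

With these made-up numbers the power comes out at roughly 0.64, so a false null hypothesis would be missed about 36% of the time. If the true mean equals the hypothesized mean, the same function returns α, which is the Type I error rate.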
What are the types of test: There are two types of test, mentioned below:
- Parametric Test: It relies on the assumption that the given data follows a known distribution, typically the normal distribution. The examples mentioned above come under parametric tests, as they assume a normal distribution. It is used to analyze group means, e.g. z-test, t-test.
- Non-Parametric Test: It doesn’t require that the given data follows a normal distribution, and it is typically used to analyze group medians.
The examples that we just discussed are tested with a z-test or a t-test.
What are z-test and t-test: We can use the z-test and t-test to analyze whether a sample mean is consistent with the hypothesized population mean.
We can calculate a z-score, which represents how many standard deviations a particular value is away from the mean.
Similarly, the t-test uses a t-distribution, which is also bell-shaped but has heavier tails than the normal distribution. In a t-test we calculate a t-score, which shows how far the sample mean is from the hypothesized mean, measured in estimated standard errors.
When to use z-test and t-test : Before testing the hypotheses we need to choose whether to use the normal (z) distribution or the t-distribution.
Following are the prerequisites for choosing the z-distribution (either one is sufficient):
1. Population standard deviation is known and the population is normal
2. Population standard deviation is known and the sample size is at least 30
Formula for calculating a z-score is : z = (x - μ) / (σ / √n)
x = Sample Mean
μ = Population Mean
σ = Population Standard deviation
n = Sample Size
After calculating the z-score we can use a z-table to get the corresponding probability, also known as the p-value. If we are checking a claim at the 5% (0.05) level of significance and the p-value for the z-score is greater than 0.05, we retain (fail to reject) the null hypothesis; otherwise we reject it.
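The z-score formula and the p-value lookup can be sketched in a few lines of Python, with the standard library's `NormalDist` standing in for the z-table. The helper name `z_test` and the numbers in the usage example are assumptions made for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, tail="right"):
    """Return (z, p_value) for a one-sample z-test with known population sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    cdf = NormalDist().cdf(z)  # area to the left of z under the standard normal
    if tail == "right":
        p_value = 1 - cdf
    elif tail == "left":
        p_value = cdf
    elif tail == "two":
        p_value = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError("tail must be 'right', 'left' or 'two'")
    return z, p_value


# Hypothetical numbers: sample mean 52, hypothesized mean 50, sigma 8, n 64.
z, p = z_test(sample_mean=52, mu0=50, sigma=8, n=64, tail="right")
decision = "reject H0" if p < 0.05 else "retain H0"
print(f"z = {z:.2f}, p-value = {p:.4f}, decision: {decision}")
```

Here the z-score is 2, and the p-value falls below 0.05, so the null hypothesis is rejected for this made-up data.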
t-test: We use the t-test when the population is normal and the standard deviation of the population is unknown but the standard deviation of the sample is known. The t-score is given by t = (x - μ) / (S / √n)
x = Sample Mean
μ = Population Mean
S = Sample Standard Deviation
n = Sample Size
After getting the t-score we can use a t-table with n-1 degrees of freedom to get the p-value. If it is greater than the specified level of significance we retain (fail to reject) the null hypothesis; otherwise we reject it.
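Here is a minimal sketch of the t-score calculation on a small sample. The data are hypothetical (days taken to issue 10 passports) and are not from any real survey. Python's standard library has no t-distribution, so the sketch compares the t-score with a critical value read from a standard t-table instead of computing a p-value; both approaches lead to the same decision.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample: days taken to issue 10 passports (not real data).
days = [28, 31, 27, 29, 30, 26, 32, 28, 27, 29]
mu0 = 30  # H0: mu >= 30, H1: mu < 30 (left-tailed test)

n = len(days)
x_bar = mean(days)
s = stdev(days)  # sample standard deviation (n - 1 in the denominator)
t = (x_bar - mu0) / (s / sqrt(n))

# One-tailed critical value for alpha = 0.05 and n - 1 = 9 degrees of freedom,
# read from a standard t-table.
t_critical = -1.833
reject_h0 = t < t_critical

print(f"sample mean = {x_bar}, s = {s:.3f}, t = {t:.3f}")
print("reject H0" if reject_h0 else "retain H0")
```

For this made-up sample the t-score is about -2.18, which lies beyond the critical value of -1.833, so the null hypothesis would be rejected at the 5% level.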
If you have some doubts, don’t worry: in the next section we are going to work through an example of a hypothesis test and decide whether to retain or reject the null hypothesis.
One tailed test of means : In this example the hypothesis test is carried out with one sample, so it is also known as a one-sample test.
Question: An agency claims that the average annual income of families is more than 3,200 (INR). In a sample of 30,000 families, the average annual income was found to be 3,250 (INR). Assuming that the population standard deviation is 2,200 (INR), check the validity of the claim made by the agency at the 5% level of significance.
Solution: There are 3 sequential steps that need to be followed.
STEP 1: Establish the hypotheses
STEP 2: Identify and perform the type of test
STEP 3: Statistical conclusion
Let’s start implementing the steps mentioned above.
STEP 1: H0: μ ≤ 3200 , H1: μ > 3200 (Defining the hypotheses)
STEP 2: From the question above, we have the following parameters:
x = 3250 , σ = 2200, n = 30,000, μ = 3200
As we know the population standard deviation and n is greater than 30, we can use the z-test at the 0.05 level of significance.
After identifying the test we can calculate the z-score by substituting the values into the formula:
z = (x - μ) / (σ / √n)
z = (3250 - 3200) / (2200 / √30000)
z = 3.93
So the value of the z-score is 3.93. Now refer to a z-table to find the probability of getting a z-score at least this extreme when the null hypothesis is true, which is 0.000042. We can also use the NORM.S.DIST function in Excel to calculate the p-value, e.g. 1 - NORM.S.DIST(3.93, TRUE) for the right tail.
Since the rejection region is on the right side of the distribution, this is a Right-Tailed Test.
Statistical Conclusion: As the p-value is less than the level of significance, we reject the null hypothesis: p-value (0.000042) < α (0.05).
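The same calculation can be checked with a short Python sketch, using the values from the question (x = 3250, μ = 3200, σ = 2200, n = 30,000) and the standard library's `NormalDist` in place of the z-table. Because nothing is rounded along the way, the z-score and p-value may differ slightly in the last digits from the table lookup above.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, sigma, n, alpha = 3250, 3200, 2200, 30_000, 0.05

z = (x_bar - mu0) / (sigma / sqrt(n))
p_value = 1 - NormalDist().cdf(z)  # right-tailed test: area to the right of z

print(f"z = {z:.4f}, p-value = {p_value:.6f}")
print("reject H0" if p_value < alpha else "retain H0")
```

The unrounded z-score is a little above 3.93 and the p-value is about 0.00004, so the conclusion is the same: the null hypothesis is rejected.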
Two-Tailed test of means: In the following example we are going to conduct a two-tailed test of the mean when the population standard deviation is known.
Question: According to the company IQ Research, the average IQ of Indians is 82, based on research using data collected from 2002 to 2006. The population standard deviation is 11.03. In a sample of 100 people from India, the sample mean IQ was estimated as 84. Conduct an appropriate hypothesis test at the 5% level of significance to validate the claim of IQ Research that the average IQ of Indians is 82.
STEP 1: H0: μ = 82, H1: μ ≠ 82
STEP 2 : It is a two-tailed test, as the rejection region lies on both sides of the distribution. Since we know the population standard deviation and the sample size is 100, we can use the z-test.
x = 84, σ = 11.03, n = 100, μ = 82
z = (x - μ) / (σ / √n)
z = (84 - 82) / (11.03 / √100)
z = 1.81
After calculating the z-score we can find the one-tailed probability from the z-table, i.e. 0.0351. As it is a two-tailed test, we multiply it by 2, as shown below:
p-value = 2*0.0351
p-value = 0.0702
Statistical Conclusion: As the p-value (0.0702) is greater than the level of significance (0.05), we fail to reject (retain) the null hypothesis.
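As a quick check of the two-tailed calculation, the sketch below repeats it in Python with the values from the question (x = 84, claimed mean 82, σ = 11.03, n = 100). It uses the unrounded z-score, so the p-value may differ from the table-based 0.0702 in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, sigma, n, alpha = 84, 82, 11.03, 100, 0.05

z = (x_bar - mu0) / (sigma / sqrt(n))
one_tail = 1 - NormalDist().cdf(abs(z))  # area beyond |z| in one tail
p_value = 2 * one_tail                   # two-tailed test: count both tails

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "retain H0")
```

The p-value comes out close to 0.07, which is above 0.05, so the sample does not give enough evidence to reject the claim that the average IQ is 82.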
Thanks for reading!
I will soon be sharing another article, in which I will be covering:
1. One tailed and Two tailed test of mean : t-test
2. Two sample hypothesis test
3. Paired sample t-test
4. Introduction to ANOVA