Picture from pixabay.com

How to approach Hypothesis Testing

Dhruv Aggarwal
Published in Analytics Vidhya
4 min read · Dec 27, 2019


Hypothesis testing is an integral step in machine learning, and it comes before building models that add business value.

This step matters for every data scientist and machine learning engineer. When we focus only on building models and improving their accuracy, we tend to forget that forming a hypothesis before actually seeing the data is crucial.

Skipping it can lead, for example, to a biased model that performs poorly later on, and we end up wondering what went wrong!

To avoid that, we generate hypotheses. I have read many articles and blogs online but did not find an easy explanation of hypothesis testing, so I will keep this one simple and understandable. :)

This post explains what hypothesis testing is and why it is needed, and simplifies it with an example.

What is Hypothesis Testing?

A hypothesis test is a statistical method that evaluates two mutually exclusive statements (hypotheses) about a population and determines which statement is better supported by the sample data.

Did you get it? Probably some of it. So, let's start with the basics.

A hypothesis is a claim or a statement about a parameter of a population.

Picture from PopOptiq.com

For example, “The Population Mean is 120”.

  • Each hypothesis implies its contradiction, or alternative.
  • A hypothesis is either true or false.
  • A hypothesis can be rejected on the basis of trial testimony, evidence, or sample data.

Types Of Hypothesis

  1. Null Hypothesis: Stating it is the first step in hypothesis testing.
  • It is denoted by H0 (pronounced “H naught”) and is usually a hypothesis of “no difference”.
  • It is assumed to be true and tested for possible rejection, and it always refers to a specified value of a population parameter.

  2. Alternate Hypothesis: It is complementary to the Null Hypothesis.
  • It is denoted by H1 (pronounced “H one”).
  • Its form decides whether to use a single tailed test or a two tailed test.

Returning to the previous example, “The population mean is 120”, the following hypotheses can be defined, where μ is the population mean:

H0 : μ = 120 (Null Hypothesis)

H1 : μ ≠ 120 (Alternate Hypothesis, Two Tailed Test)

H1 : μ > 120 (Alternate Hypothesis, Single Tailed Test, Right Tailed)

H1 : μ < 120 (Alternate Hypothesis, Single Tailed Test, Left Tailed)
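To see how the choice of alternative changes the outcome, here is a minimal Python sketch. It is not from the original post: the function name `p_value` and the example Z score of 1.8 are assumptions made for illustration, and it uses only `statistics.NormalDist` from the standard library.

```python
from statistics import NormalDist

def p_value(z, alternative):
    """Return the p-value of a Z score for the chosen alternative hypothesis.

    alternative: "two-sided" (H1: mu != mu0), "greater" (H1: mu > mu0),
    or "less" (H1: mu < mu0).
    """
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if alternative == "greater":    # right tailed test
        return 1 - std_normal.cdf(z)
    if alternative == "less":       # left tailed test
        return std_normal.cdf(z)
    if alternative == "two-sided":  # two tailed test
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative}")

z = 1.8  # hypothetical Z score, used only for illustration
for alt in ("two-sided", "greater", "less"):
    print(alt, round(p_value(z, alt), 4))
```

With this Z score, the right tailed test gives a p-value below 0.05 while the two tailed test does not, which is why the alternative should be fixed before looking at the data.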

Why Hypothesis Creation?

Forming hypotheses is an important step that every data scientist should carry out.

Picture from blueskyresume.com

Also:

  • To understand the relationships between variables in the dataset, we should form hypotheses before exploring the data.
  • Sounds counter-intuitive? It isn't. To solve the problem, we should first spend some time thinking about the business problem, gaining domain knowledge, and getting first-hand experience of the problem.
  • How does it help? This practice usually helps us build better features later, during feature engineering, because they are not biased by whatever happens to be in the dataset.
  • Is it a kind of brainstorming before seeing the data? Yes, you got me. It basically involves brainstorming and coming up with as many ideas as possible about what could affect the target variable. Get the idea-generating part of your brain working at maximum efficiency :p

So, hypothesis creation should always be done before seeing the data; otherwise you are likely to end up with biased hypotheses and a less accurate model.

Always come up with many hypotheses of your own; the more, the better. This helps us dig deeper into the problem and come up with the features that correlate most strongly with the target.

Performing Hypothesis Testing

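To perform a hypothesis test, we need a clear understanding of some basic terminology. A Type 1 error is rejecting a null hypothesis that is actually true. A Type 2 error is failing to reject a null hypothesis that is actually false. The level of significance (α) is the probability of a Type 1 error that we are willing to accept.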
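The link between the level of significance and the Type 1 error can be checked with a small simulation. The sketch below is an illustration, not part of the original post. It assumes normally distributed data with a known standard deviation, and the function name `type_1_error_rate` and all parameter values are hypothetical. It draws many samples for which the null hypothesis is true and counts how often a right tailed Z test rejects it anyway.

```python
import random
from statistics import NormalDist

def type_1_error_rate(alpha=0.05, n=30, trials=2000, seed=0):
    """Estimate how often a right tailed Z test rejects a TRUE null hypothesis."""
    rng = random.Random(seed)
    mu0, sigma = 0.0, 1.0                        # H0 is true: data really have mean mu0
    critical = NormalDist().inv_cdf(1 - alpha)   # right tailed critical value
    se = sigma / n ** 0.5                        # standard error of the mean
    rejections = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / se
        if z > critical:                         # rejecting here is a Type 1 error
            rejections += 1
    return rejections / trials

print(type_1_error_rate())
```

Under these assumptions, the printed rate should land close to the chosen α of 0.05, since α is exactly the long-run share of true null hypotheses that the test rejects.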

So, the problem statement is as follows:

A snack manufacturing company claims that the maximum saturated fat content in a packet of chips is 2 grams, with a standard deviation of 0.25 grams. A test on a sample of 35 packets reveals a mean saturated fat content of 2.1 grams. Should the claim of the snack manufacturing company be rejected?

Picture from mobile-cuisine.com

Let’s test the Null Hypothesis at the significance level of 5%.

Step 1: Set up Null Hypothesis and Alternate Hypothesis.

H0 : μ ≤ 2 (Null Hypothesis)

H1 : μ > 2 (Alternate Hypothesis, Right Tailed Test)

Step 2: Calculate the Test Statistic

As the sample size is more than 30 and the population standard deviation is known, we calculate the Z statistic.

μ = 2 (Population Mean under H0)

x̄ = 2.1 (Sample Mean)

σ = 0.25 (Population Standard Deviation)

n = 35 (Sample Size)

SE = σ/sqrt(n) = 0.25/sqrt(35) ≈ 0.0423 (Standard Error of the Mean)

Z = (x̄ - μ)/SE = (2.1 - 2)/(0.25/sqrt(35)) ≈ 2.366 (Z Score)
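The arithmetic in this step can be reproduced in a few lines of Python. This is a minimal sketch using the numbers from the problem statement; the variable names are my own choice.

```python
import math

mu = 2.0      # claimed population mean (grams)
x_bar = 2.1   # sample mean (grams)
sigma = 0.25  # population standard deviation (grams)
n = 35        # sample size

se = sigma / math.sqrt(n)  # standard error of the mean
z = (x_bar - mu) / se      # Z statistic

print(f"SE = {se:.4f}, Z = {z:.3f}")  # prints: SE = 0.0423, Z = 2.366
```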

Step 3: Calculate the Critical Value for a significance level of 0.05 (confidence level of 95%)

Zα = Z(0.05) = 1.645 (Critical Value: 5% of the standard normal distribution lies to its right)
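The critical value is usually read from a Z table, but it can also be obtained from the inverse CDF of the standard normal distribution. The sketch below is illustrative only: the helper `critical_value` and its `tail` argument are hypothetical names, and it relies on `statistics.NormalDist.inv_cdf` from the Python standard library.

```python
from statistics import NormalDist

def critical_value(alpha, tail="right"):
    """Critical Z value for a given significance level and tail."""
    std_normal = NormalDist()  # standard normal distribution
    if tail == "right":
        return std_normal.inv_cdf(1 - alpha)      # reject when Z exceeds this
    if tail == "left":
        return std_normal.inv_cdf(alpha)          # reject when Z is below this
    if tail == "two":
        return std_normal.inv_cdf(1 - alpha / 2)  # reject when |Z| exceeds this
    raise ValueError(f"unknown tail: {tail}")

print(round(critical_value(0.05), 3))         # right tailed, about 1.645
print(round(critical_value(0.05, "two"), 3))  # two tailed, about 1.96
```

For a two tailed test the 5% is split between both tails, which is why its critical value is larger than the single tailed one.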

Step 4: Compare the test statistic (in this case the Z statistic) with the critical value and conclude the test.

Z = (2.1 - 2)/(0.25/sqrt(35)) ≈ 2.366, which is greater than the critical value from Step 3.

So, Z is significant and the Null Hypothesis (H0) is rejected.

Therefore, at the 5% (0.05) significance level, the claim of at most 2 grams of saturated fat in a packet of chips should be rejected.
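The four steps can be wrapped in one function. This is a sketch under stated assumptions, not a definitive implementation: the function name `right_tailed_z_test` and the dictionary it returns are hypothetical, and it assumes a known population standard deviation. It also reports a p-value, which the walkthrough above does not use but which leads to the same decision.

```python
import math
from statistics import NormalDist

def right_tailed_z_test(x_bar, mu0, sigma, n, alpha=0.05):
    """Right tailed one-sample Z test of H0: mu <= mu0 against H1: mu > mu0."""
    std_normal = NormalDist()
    se = sigma / math.sqrt(n)                 # standard error of the mean
    z = (x_bar - mu0) / se                    # Step 2: test statistic
    critical = std_normal.inv_cdf(1 - alpha)  # Step 3: critical value
    p = 1 - std_normal.cdf(z)                 # area to the right of z
    return {
        "z": z,
        "critical": critical,
        "p_value": p,
        "reject_h0": z > critical,            # Step 4: compare and conclude
    }

result = right_tailed_z_test(x_bar=2.1, mu0=2.0, sigma=0.25, n=35)
print(result)
```

For the snack example the p-value is roughly 0.009, which is below 0.05, so the critical value comparison and the p-value both point to rejecting H0.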

Continue Learning — Never Stop Learning

This was a simple tutorial on hypothesis testing, which is an important part of modelling.

I will be writing more posts in the future. Follow me on Medium. Feedback and criticism are welcome, and I can be reached on Twitter.

Thank you for reading :)

I hope you liked it and learned something new.
