The Role of Hypothesis Testing in Linear Regression

Arpit Srivastava
5 min read · Mar 17, 2020

I recommend you go through this article before moving forward.

Alright! Now that we understand what Linear Regression is, let’s try to answer two basic questions before moving forward:

  1. What is Hypothesis testing?
  2. Why do we need it?

Hypothesis Testing

Everyone has watched at least one courtroom drama in their lifetime. All of us know about the presumption of innocence: the legal principle that one is considered innocent until proven guilty.

Let me tell you a story. There’s an allegation against John Doe that he has committed a crime. He stands before the judge. The role of his lawyer is to argue that he is not guilty. The role of the public prosecutor is to prove that he is guilty. To the judge, John Doe is not guilty unless there’s enough evidence to prove his guilt.

So we have two hypotheses,

The null hypothesis (H0): John Doe is not guilty.

The alternate hypothesis (Ha): John Doe is guilty.

Let’s say you are the public prosecutor. What is your job? You have to gather enough evidence so that the judge can reject the null hypothesis with some degree of confidence. This is exactly what we do when we test a hypothesis.

When we conduct a hypothesis test, the first thing we do is choose a confidence level. A 95% confidence level is the most common choice. That brings us to another question: why not 100%?

To answer this, we need to know what confidence intervals are.

Confidence Intervals

What is the basic idea behind inferential statistics? Let’s say you are trying to conduct an analysis on the heights of the students in your county. Is it possible to get all the data? With extreme difficulty, it is. Are we willing to put in all the time, effort and money to gather that data? In most cases, no.

So what we do instead is get a representative sample of the population. The population, in this case, includes all the students in your county. In an ideal scenario, a representative sample would be a sample of the population that includes all height groups in proportion to their distribution. It’s not an easy thing to do. So, mostly we just take random samples.

Your patience is about to pay off.

In hypothesis testing, a 95% confidence level means that the procedure we use to build a range, a.k.a. the confidence interval, captures the true population mean in about 95% of repeated random samples. Informally, we say we are 95% confident that the population mean lies within that interval. How does it help us? It tells us how often our method is right, which is the best we can hope for, given that we are not sure how representative our sample is.
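To make the idea concrete, here is a minimal sketch of a 95% confidence interval for a mean. The heights are simulated, made-up numbers, the function name is hypothetical, and the interval uses the normal approximation, which is an assumption that is reasonable only for fairly large samples.

```python
import random
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    m = mean(sample)
    se = stdev(sample) / n ** 0.5                   # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return m - z * se, m + z * se

random.seed(0)
# Hypothetical data: 200 student heights in cm from a made-up population
heights = [random.gauss(165, 8) for _ in range(200)]
low, high = mean_confidence_interval(heights)
print(f"95% CI for the mean height: ({low:.1f}, {high:.1f}) cm")
```

A different random sample would give a slightly different interval; the 95% refers to how often intervals built this way contain the true mean.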

Coming back to the question, why not just use a 100 percent confidence level? For a quantity with no hard bounds, a 100 percent confidence interval is (-∞, +∞). Does that tell us anything? We already know that the population mean has to lie between -∞ and +∞. That’s why we take a confidence level like 95%: high enough to be trustworthy, yet with an interval narrow enough to give us a good idea of where the population mean lies.
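The trade-off can be seen in a few lines. This sketch assumes a normal-approximation interval of the form mean ± z × standard error and prints the multiplier z for several confidence levels (the levels chosen are arbitrary). The multiplier keeps growing as the confidence level approaches 100%.

```python
from statistics import NormalDist

# z multiplier of a two-sided interval for each confidence level
levels = [0.90, 0.95, 0.99, 0.999, 0.99999]
multipliers = {c: NormalDist().inv_cdf(0.5 + c / 2) for c in levels}

for c, z in multipliers.items():
    print(f"confidence {c:.5f} -> mean ± {z:.2f} standard errors")
```

At exactly 100% the multiplier is unbounded, which is the (-∞, +∞) interval.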

Behold the p-value

Since we are dealing with probabilities, there is always a chance that we make an error. In the courtroom drama, what are the errors that we can make?

Type 1 error:

What if we conclude that John Doe is guilty when he’s actually not guilty? (Reject the null hypothesis when it’s actually true.)

Type 2 error:

What if we conclude that John Doe is not guilty when he’s actually guilty? (Fail to reject the null hypothesis when it’s actually false.)

To keep the Type 1 error in check, we use the p-value. The p-value is the probability, assuming the null hypothesis is true, of seeing evidence at least as extreme as what we actually observed. The smaller it is, the harder the data are to explain under the null hypothesis.
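A small worked example may help here. Suppose, hypothetically, a coin lands heads 16 times in 20 flips and the null hypothesis is that the coin is fair. The sketch below computes an exact two-sided p-value by adding up the probability of every outcome at least as far from 10 heads as the one observed. The coin scenario and the function name are illustrative assumptions, not part of the original example.

```python
from math import comb

def binomial_two_sided_p_value(heads, flips):
    """Exact p-value for H0: the coin is fair (P(heads) = 0.5)."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count every outcome at least as far from the expected count as ours
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_gap
    )
    return extreme / 2 ** flips

p = binomial_two_sided_p_value(heads=16, flips=20)
print(f"p-value = {p:.4f}")
```

The result is roughly 0.012: if the coin were fair, a result this lopsided would show up in only about 1.2% of such experiments.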

When we conduct a hypothesis test using statistical software, it gives you a p-value. The confidence level is something you choose up front, and it fixes the “α” level: the probability of rejecting the null hypothesis when the null hypothesis is true, i.e. the Type 1 error rate we are willing to accept. How do we calculate α? It is 1 - confidence level, so if our confidence level is 95%, α is 5%.

We compare our p-value to α. If the p-value is less than α, we reject the null hypothesis. If not, we fail to reject the null hypothesis, which is not the same as proving it true.
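The link between α and the Type 1 error can be checked with a quick simulation. In this sketch every sample is drawn from a population where the null hypothesis really is true (mean 0, known standard deviation 1), and we count how often the rule "reject when p-value < α" fires anyway. The z-test, the sample size, and the number of trials are assumptions picked for illustration.

```python
import random
from statistics import NormalDist, mean

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for H0: population mean == mu0, with sigma known."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(1)
alpha = 0.05
trials = 2000

# H0 is true in every trial, so each rejection is a Type 1 error
false_rejections = sum(
    z_test_p_value([random.gauss(0, 1) for _ in range(30)]) < alpha
    for _ in range(trials)
)
type1_rate = false_rejections / trials
print(f"Rejected a true H0 in {type1_rate:.1%} of trials")
```

The observed rate should land close to α, which is exactly what choosing a 95% confidence level signs us up for.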

Word of caution: This, [image], does NOT work.

If you want to read more about how hypothesis testing is done (the math behind it), pick up an undergraduate-level stats book. If you don’t want to do that, let me know in the comment section and I’ll write another article explaining hypothesis testing in detail.

OK! But what does it have to do with Linear Regression?

In the previous article, I talked about the basic principle behind Linear regression. Let’s say we have come up with an equation. What next? Are we sure that our data is actually fit for a technique like linear regression? How can we be sure?

Hypothesis testing to the rescue.

Remember the main components of the equation? β0 was the intercept and β1 was the slope. A hypothesis test is done for both the intercept and the slope to check whether each value is 0 or not. If the slope is 0, the predictor tells us nothing about the response through a straight line. Let’s see what the test looks like for the slope.

For the slope we do a hypothesis test where,

The null hypothesis (H0): β1 = 0, and

The alternate hypothesis (Ha): β1 ≠ 0

The confidence level in this case is 95%, so α = 0.05. We compute the test statistic and its p-value and check whether we can reject the null hypothesis. If the p-value is less than α, we reject it and conclude, at the 95% confidence level, that the slope is non-zero. The same test is run separately for the intercept.
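Here is a minimal sketch of the slope test on simulated data. The data, the true slope of 0.5, and the function name are made up for illustration. One simplification to flag loudly: the p-value below uses the normal curve in place of the t distribution with n - 2 degrees of freedom, which is only a reasonable assumption for large samples. Statistical packages (for example `scipy.stats.linregress` or statsmodels) report the t-based p-value for you.

```python
import random
from statistics import NormalDist, mean

def slope_test(x, y):
    """Fit y = b0 + b1*x by least squares and test H0: b1 == 0."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares; residual variance uses n - 2 degrees of freedom
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = (rss / (n - 2) / sxx) ** 0.5
    t = b1 / se_b1
    # Assumption: n is large, so the normal curve stands in for the t distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    return b1, se_b1, p_value

random.seed(2)
# Hypothetical data: true intercept 1.0, true slope 0.5, plus noise
x = [random.uniform(0, 10) for _ in range(200)]
y = [1.0 + 0.5 * xi + random.gauss(0, 3) for xi in x]

alpha = 0.05
b1, se_b1, p_value = slope_test(x, y)
print(f"slope = {b1:.3f}, standard error = {se_b1:.3f}, p-value = {p_value:.3g}")
print("reject H0 (slope is 0)" if p_value < alpha else "fail to reject H0")
```

The test statistic is simply the estimated slope divided by its standard error, so a slope that is large relative to its uncertainty yields a small p-value.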

I hope you liked the article. The next one is going to be about the assumptions of Linear regression. See you soon.

Arpit Srivastava

Senior data scientist, AI Enthusiast, Music composer, guitarist and Singer