Hypothesis Testing On Linear Regression

Ankita Banerji
Published in Nerd For Tech

5 min read · May 14, 2021


When we build a multiple linear regression model, we may have many potential predictor/independent variables. It is therefore extremely important to select the variables that are truly significant and strongly influence the outcome. To get the optimal model, we could try every possible combination of independent variables and see which model fits best, but this is time-consuming and quickly becomes infeasible as the number of variables grows. Hence, we need another way to reach a decent model. We can do this either by manual feature elimination or by using an automated approach (RFE, regularization, etc.).

In manual feature elimination, we can:

  • Build a model with all the features,
  • Drop the features that are least helpful in prediction (high p-value),
  • Drop the features that are redundant (using correlations and VIF),
  • Rebuild the model and repeat.

It is generally recommended that we follow a balanced approach, i.e., use a combination of automated (coarse tuning) and manual (fine tuning) selection in order to get an optimal model. In this blog we will discuss the second step of manual feature elimination, i.e., dropping the features that are least helpful in prediction (insignificant features).

The first question that arises is: ‘What do we mean by a significant variable?’ Let us understand it in simple linear regression first.

When we fit a straight line through the data, we get two parameters i.e., the intercept (β₀) and the slope (β₁).

Now, β₀ is not of much importance right now, but there are a few aspects around β₁ which need to be checked and verified. Suppose we have a dataset whose scatter plot looks like the following:

Scatter Plot

When we run a linear regression on this dataset in Python, Python will fit a line on the data which looks like the following:

We can clearly see that the data is randomly scattered and doesn’t seem to follow a linear trend. Python will anyway fit a line through the data using the least squares method. We can see that the fitted line is of no use in this case. Hence, every time we perform linear regression, we need to test whether the fitted line is a significant one or not (in other terms, test whether β₁ is significant or not). We will use hypothesis testing on β₁ for the same.

Steps to Perform Hypothesis testing:

  1. Set the Hypothesis
  2. Set the Significance Level, Criteria for a decision
  3. Compute the test statistics
  4. Make a decision

Step 1: We start by saying that β₁ is not significant, i.e., there is no relationship between x and y, therefore slope β₁ = 0 (the null hypothesis, H₀: β₁ = 0). The alternate hypothesis is that the slope is non-zero (H₁: β₁ ≠ 0).

Step 2: Typically, we set the Significance level at 10%, 5%, or 1%.

Step 3: After formulating the null and alternate hypotheses, the steps to follow in order to make a decision using the p-value method are as follows:

  1. Calculate the value of the t-score for the mean on the distribution:

t = (X̄ − μ) / (s/√n)

where X̄ is the sample mean, μ is the population mean, and s is the sample standard deviation, which when divided by √n is also known as the standard error.

  2. Calculate the p-value from the cumulative probability for the given t-score using the t-table.

  3. Make the decision on the basis of the p-value with respect to the given value of the significance level.
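As a sketch, the three sub-steps can be reproduced with scipy. The sample and the hypothesized mean μ = 10 below are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical sample; test H0: population mean mu = 10
sample = np.array([9.1, 10.4, 9.8, 11.2, 10.9, 9.5, 10.1, 10.7])
mu = 10.0
n = len(sample)

# 1. t-score: (sample mean - mu) divided by the standard error s/sqrt(n)
t_score = (sample.mean() - mu) / (sample.std(ddof=1) / np.sqrt(n))

# 2. two-sided p-value from the t-distribution (the "t-table" lookup)
p_value = 2 * stats.t.sf(abs(t_score), df=n - 1)

# 3. decision at the 5% significance level
reject_null = p_value < 0.05

# scipy does steps 1-2 in a single call
t_check, p_check = stats.ttest_1samp(sample, mu)
```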

Step 4: Making Decision

If

p-value < significance level (e.g., 0.05), we reject the null hypothesis;

p-value > significance level, we fail to reject the null hypothesis.

If we fail to reject the null hypothesis, we cannot conclude that β₁ is different from zero (in other words, β₁ is insignificant and of no use in the model). Similarly, if we reject the null hypothesis, it means that β₁ is not zero and the line fitted is a significant one.

NOTE: The above steps are performed automatically when we fit the model in Python; the t-statistics and p-values appear in the statsmodels regression summary.

Similarly, in multiple linear regression we will perform the same steps as in simple linear regression, except that the null and alternate hypotheses will be different. For the multiple regression model y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ, they are:

H₀: β₁ = β₂ = … = βₙ = 0

H₁: at least one βᵢ ≠ 0

Example in Python

Let us take housing dataset which contains the prices of properties in the Delhi region. We wish to use this data to optimise the sale prices of the properties based on important factors such as area, bedrooms, parking, etc.

Top five rows of dataset look something like this:

Housing dataset

After preparing, cleaning and analysing the data we will build a linear regression model by using all the variables (Fit a regression line through the data using statsmodels)

import statsmodels.api as sm

y_train = housing_dataset.pop('price')
X_train = housing_dataset
X_train_lm = sm.add_constant(X_train)
lm = sm.OLS(y_train, X_train_lm).fit()
print(lm.summary())

We get the following output:

Looking at the p-values (P>|t|), some of the variables, such as bedrooms and semi-furnished, aren’t really significant (p > 0.05). We could simply drop the variable with the highest non-significant p-value.

Conclusion: Generally, we use two main criteria to judge insignificant variables: the p-values and the VIFs (variance inflation factors).
