A Beginner’s Guide to Stepwise Multiple Linear Regression

Achyut Ramkumar · Published in Analytics Vidhya · 8 min read · Jun 10, 2020

Almost every data science enthusiast starts out with linear regression as their first algorithm. Though it may look easy to understand, it is important to get the basics right, and that foundation will help you tackle the more complex machine learning problems you come across later. In this article, we will discuss what multiple linear regression is and how to solve a simple problem in Python.

What is Multiple Linear Regression?

You may have heard of simple linear regression, where you have one input variable and one output variable (otherwise known as feature and target, independent variable and dependent variable, or predictor variable and predicted variable, respectively).

In multiple linear regression, you have one output variable but many input variables.

The goal of a linear regression algorithm is to identify a linear equation between the independent and dependent variables. This equation behaves like any other mathematical function: for any new data point, you plug in the input values and obtain a predicted output.

In linear regression, the input and output variables are related by the following equation:

y = b0 + b1*x1 + b2*x2 + … + bn*xn

(Equation source: SuperDataScience)

Here, the ‘x’ variables are the input features and ‘y’ is the output variable. b0, b1, … , bn represent the coefficients that are to be generated by the linear regression algorithm.

How does a linear regression algorithm work?

Let us understand this through a small visual experiment of simple linear regression (one input variable and one output variable). Here, we are given the size of houses (in sqft) and we need to predict the sale price.

We have sample data containing the size and price of houses that have already been sold. On plotting a graph between the price of houses (on Y-axis) and the size of houses (on X-axis), we obtain the graph below:

[Scatter plot of house price (Y-axis) against house size (X-axis). Source: 365DataScience]

We can clearly observe a linear relationship between the two variables: the price of a house increases as its size increases. Now, our goal is to identify the best line that can define this relationship.

The algorithm starts by assigning a random line of the form y = m*x + c to define the relationship. It then calculates the squared vertical distance between each data point and that line (the distances are squared so that points above and below the line do not cancel each other out). Let us call the sum of these squared distances ‘d’.

The value of ‘d’ is the error that has to be minimized. If the line passes through every data point, it is the perfect line to define the relationship, and d = 0. In most cases, however, the line will not pass through all the points and some residual error will remain.

After multiple iterations of adjusting the line to reduce this error, the algorithm arrives at the best-fit line equation y = b0 + b1*x. This is the simple linear regression equation, and minimizing the sum of squared errors in this way is called the Ordinary Least Squares (OLS) method. Shown below is the line that the algorithm determined to best fit the data.

[The best-fit line through the house price data. Source: 365DataScience]

So, if we now need to predict the price of a house of size 1100 sqft, we can simply locate 1100 on the X-axis and read off the corresponding Y-axis value on the line.
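To make this concrete, here is a minimal sketch of fitting a simple linear regression in Python. The house sizes and prices below are made-up sample values, not the data behind the figures above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up sample data: house sizes (sqft) and sale prices
sizes = np.array([850, 900, 1050, 1200, 1400, 1600]).reshape(-1, 1)
prices = np.array([120000, 130000, 155000, 175000, 205000, 230000])

# Fit the best-fit line y = b0 + b1 * x by ordinary least squares
model = LinearRegression().fit(sizes, prices)
print("Intercept (b0):", model.intercept_)
print("Slope (b1):", model.coef_[0])

# Predict the price of a 1100 sqft house
print("Predicted price for 1100 sqft:", model.predict([[1100]])[0])
```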

Feature Selection

When given a dataset with many input variables, it is not wise to include all of them in the final regression equation. Instead, a subset of those features needs to be selected that can predict the output accurately. This process is called feature selection.

Feature selection is done to reduce compute time and to remove redundant variables. Let us understand this through an example. We are supposed to predict the height of a person based on three features: gender, year of birth, and age. Here it is very obvious that the year of birth and age are directly correlated, and using both will only cause redundancy. So, instead we can choose to eliminate the year of birth variable. Now, we predict the height of a person with two variables: age and gender. This also reduces the compute time and complexity of the problem.
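As a quick sketch of how such redundancy shows up in practice (the column names and values below are hypothetical, invented just for this example), a correlation matrix makes the overlap obvious:

```python
import pandas as pd

# Hypothetical data: year of birth and age carry the same information
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "year_of_birth": [1998, 1991, 1976, 1972, 1961],
    "height_cm": [170, 165, 180, 175, 168],
})

# A correlation of -1 or +1 between two inputs signals redundancy
print(df[["age", "year_of_birth"]].corr())

# Keep age, drop the redundant year of birth column
df = df.drop(columns=["year_of_birth"])
```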

Stepwise Regression

Stepwise regression is a technique for feature selection in multiple linear regression. There are three types of stepwise regression: backward elimination, forward selection, and bidirectional elimination. Let us explore what backward elimination is.

Backward elimination is an iterative process through which we start with all input variables and eliminate those variables that do not meet a set significance criterion step-by-step.

Backward elimination proceeds as follows:

1. Set a significance level (usually alpha = 0.05).
2. Fit a multiple linear regression with all remaining features and obtain the coefficient and p-value for each variable.
3. Find the feature with the highest p-value.
4. If that p-value > alpha, remove the feature and go back to step 2; otherwise, we have reached the end of backward elimination.

Through backward elimination, we can successfully eliminate all the least significant features and build our model based on only the significant features.
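Below is a minimal sketch of this loop using the Statsmodels library. It assumes X is a Pandas DataFrame of input features and y is the target; the function name and structure are illustrative, not taken from the article.

```python
import statsmodels.api as sm

def backward_elimination(X, y, alpha=0.05):
    """Repeatedly drop the feature whose p-value is highest and above alpha."""
    features = list(X.columns)
    model = None
    while features:
        X_const = sm.add_constant(X[features])   # add an intercept column
        model = sm.OLS(y, X_const).fit()          # ordinary least squares fit
        pvalues = model.pvalues.drop("const")     # ignore the intercept's p-value
        worst = pvalues.idxmax()
        if pvalues[worst] > alpha:
            features.remove(worst)                # eliminate and refit
        else:
            break                                 # every remaining feature is significant
    return features, model
```

Calling backward_elimination(X, y) returns the names of the surviving features together with the final fitted model.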

Multiple Linear Regression with Backward Elimination — Sample Problem

In multiple linear regression, since we have more than one input variable, it is not possible to visualize all the data together in a single 2-D chart to get a sense of how it looks. However, Python offers several packages, easily used from a Jupyter Notebook, that allow us to perform the analysis without the dire necessity of visualizing the data.

Here, we have been given several features of used cars and we need to predict the price of a used car. Let us get right down to the code and explore how simple it is to solve a linear regression problem in Python!

We import the dataset using the read_csv method from Pandas. We can observe that there are 5 categorical features and 3 numerical features. Price is the output target variable.
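For illustration, assuming the data lives in a CSV file (the file name below is a placeholder, not the actual dataset path):

```python
import pandas as pd

# Placeholder file name; replace with the actual path to the used-car dataset
data = pd.read_csv("used_cars.csv")

print(data.describe(include="all"))   # quick summary of every feature
print(data.dtypes)                    # which columns are categorical vs numerical
```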

We proceed to pre-process the data by removing all records containing missing values and removing outliers from the dataset. We also remove the Model feature because it is an approximate combination of Brand, Body and Engine Type and will cause redundancy.
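A sketch of these pre-processing steps, assuming the column names Price and Model and a simple 99th-percentile cut-off for outliers (the exact outlier rule used in the article is not shown):

```python
# Drop records containing missing values
data = data.dropna()

# Remove extreme Price outliers, e.g. keep values below the 99th percentile
data = data[data["Price"] < data["Price"].quantile(0.99)]

# Drop the Model feature, which is redundant given Brand, Body and Engine Type
data = data.drop(columns=["Model"])
```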

However, we have run into a problem. The numerical features do not have a linear relationship with the output variable.

This problem can be solved by taking the natural logarithm of Price and using this new Log Price variable as the output. This is one of many tricks for overcoming non-linearity when performing linear regression.
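In code, this is a one-line transformation (continuing with the Price column assumed above):

```python
import numpy as np

# Use the natural logarithm of Price as the new target variable
data["Log Price"] = np.log(data["Price"])
data = data.drop(columns=["Price"])
```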

While Year and Engine Volume are directly proportional to Log Price, Mileage is inversely proportional to Log Price.

Next, we have several categorical variables (variables that do not have numerical data point values) which need to be converted to numerical values since the algorithm can only work with numerical values. So here, we use the concept of dummy variables. Since it is a separate topic on its own, I will not be explaining it in detail here but feel free to pause reading this article and google “dummy variables”. Once you’ve understood the intuition, you can proceed further.
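With Pandas, this can be done in a single call; drop_first=True drops one category per feature, which avoids the so-called dummy variable trap:

```python
# Convert every categorical column into 0/1 dummy variables
data = pd.get_dummies(data, drop_first=True)
```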

The next step is Feature Scaling. We will be scaling all the numerical variables to the same range, i.e. converting the values of numerical variables into values within a specific interval. This is done to eliminate unwanted biases due to the difference in magnitudes of the features. For example, the Year variable has values around 2000 whereas Engine Volume has values in the range of 1–5. So, if they are not scaled, the algorithm will behave as if the Year variable is more important (since it has higher values) for predicting price, and this situation has to be avoided.

z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the feature. (Formula source: GeeksForGeeks)

This formula is applied to each data point in every feature individually. We use the StandardScaler object from the Scikit-Learn library, which transforms each feature to have a mean of 0 and a standard deviation of 1.
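A sketch of the scaling step, continuing with the Log Price target assumed earlier; scaling every input column (including the dummies) is a simplification made here:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Separate the inputs from the target
targets = data["Log Price"]
inputs = data.drop(columns=["Log Price"])

# Standardize each input column to mean 0 and standard deviation 1
scaler = StandardScaler()
inputs_scaled = pd.DataFrame(scaler.fit_transform(inputs),
                             columns=inputs.columns,
                             index=inputs.index)
```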

Next, we split the dataset into the training set and test set to help us later check the accuracy of the model.
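Using Scikit-Learn's train_test_split; the 80/20 split and the random seed below are arbitrary choices for the sketch:

```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    inputs_scaled, targets, test_size=0.2, random_state=42
)
```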

Upon completion of all the above steps, we are ready to execute the backward elimination multiple linear regression algorithm on the data, by setting a significance level of 0.01.

We fit the regression with the Statsmodels library, whose OLS class implements the Ordinary Least Squares method we discussed earlier in this article.
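A sketch of the initial fit, reusing the scaled training data from the earlier steps; the backward_elimination helper sketched before could then be run with alpha=0.01 to drop insignificant features one at a time:

```python
import statsmodels.api as sm

# Add an explicit intercept column, then fit by Ordinary Least Squares
x_train_const = sm.add_constant(x_train)
reg = sm.OLS(y_train, x_train_const).fit()

print(reg.summary())   # coefficients, p-values (the P>|t| column), R-squared, etc.
```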

reg.summary() generates the complete descriptive statistics of the regression. We observe that the dummy variable Brand_Mercedes-Benz has a p-value = 0.857 > 0.01, so this variable is eliminated and the regression is performed again. Next, we observe that Engine-Type_Other has a p-value = 0.022 > 0.01; this variable is also eliminated and the regression is performed again. Now all remaining features have p-values < 0.01, which brings us to the end of our backward elimination.

P>|t| is the column containing p-values of all features

Now we have a regression model fitted to the training data. We predict the Log-Price values for the test set by passing the test inputs to the model's predict() method from Statsmodels.
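A sketch of this step, using the full feature set for simplicity; after backward elimination, the test inputs would be restricted to the surviving columns as well:

```python
# The test inputs need the same intercept column the model was trained with
x_test_const = sm.add_constant(x_test)
y_pred = reg.predict(x_test_const)
```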

Now comes the moment of truth! We need to check whether our regression model has fit the data accurately. To do so, we plot the actual values (targets) of the output variable Log-Price on the X-axis and the predicted values of Log-Price on the Y-axis. And voila! We can see that they have a linear relationship that closely resembles the y = x line.
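A sketch of this check with Matplotlib:

```python
import matplotlib.pyplot as plt

plt.scatter(y_test, y_pred, alpha=0.5)      # actual vs predicted Log Price
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, color="orange")        # the y = x reference line
plt.xlabel("Actual Log Price")
plt.ylabel("Predicted Log Price")
plt.show()
```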

The orange line is the y = x line

Hence, it can be concluded that our multiple linear regression backward elimination algorithm has accurately fit the given data, and is able to predict new values accurately.

Conclusion

This is just an introduction to the huge world of data science out there. I consider myself a beginner too, and am very enthusiastic about exploring the field of data science and analytics.

This is my first article on this platform, so be kind and let me know any improvements I can incorporate to better this article.

Until next time, cheers!
