Not So Simple Linear Regression: Multiple Linear Regression

Indrani Banerjee · Published in CodeX · 6 min read · Nov 6, 2022

Simple linear regression is great when we’re interested in how one independent variable affects the dependent variable, but what do we do when we have more than one independent variable? Say we are interested in house prices: quite a few different features may play a role in determining a house’s value, such as a great location, a certain number of bedrooms, the affluence of the neighbourhood, and the distance to schools, to name just a few. They all might have linear relationships with house prices, and we may want all of these features to play their part in our regression model. This is where multiple linear regression shines.

There are lots of advantages to multiple linear regression. If we have a classification problem, we can’t really use a linear regression model to draw conclusions from our dataset, but if we are planning on using machine learning algorithms, multiple linear regression is still a useful preliminary check: it shows us whether one or more of the features are highly correlated with one another. If they are, we can always drop one of them before we feed our data into a classification algorithm.

It is also worth noting that a heatmap is an easy visualisation method for getting an idea of the correlation between our features and the target variable, and even between the features themselves.
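A minimal sketch of how such a heatmap can be built with seaborn is shown below; the file name and the ‘Address’ column name are assumptions, so adjust them to match your copy of the data.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file name: adjust to wherever your copy of the data lives.
df = pd.read_csv('USA_Housing.csv')

# The address column is text, so leave it out of the correlation matrix.
corr = df.drop(columns=['Address']).corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='viridis')  # try other cmap palettes too
plt.show()
```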

Play around with the cmap to see what palette best suits your data. I find heatmaps easy to interpret, particularly with the correlations annotated. For this post, take a look at the very last column or the very bottom row. The closer the colour is to purple, the higher the correlation between that feature and house prices.

Now, we want to produce a model and come up with an equation that can predict house prices. In a previous post I discussed the differences between using Statsmodels and scikit-learn for conducting simple linear regression.

Let’s look at the same dataset and see what multiple linear regression can do. The dataset is clean and consists of continuous data, barring the address field. As linear regression needs numerical inputs, I dropped the address field. So we can expect a model with 5 independent variables, with the house price (‘Price’) as our dependent variable. This means we can expect our linear regression model to take the form of:
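In general terms, writing the five features as x1 to x5:

Price = a0 + a1x1 + a2x2 + a3x3 + a4x4 + a5x5

where a0 is the y-intercept.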

The coefficients, a1, …, a5, describe the association between the respective independent variables and the dependent variable. Their sign, positive or negative, indicates a positive or negative association between the two variables.

Consider the following:

We can suggest that as average area income increases, the mean house price also tends to increase. It’s important to discuss what the coefficients actually represent: if there is a one-unit increase in the independent variable, so in this case the average area income increases by one, then the mean house price is predicted to increase by $a1. This is calculated whilst keeping the effects of the other features constant, so the coefficient essentially isolates the association between an independent variable and the dependent variable.

As I’ve mentioned in this post, Statsmodels is great at providing insight into a dataset, so I created a multiple linear regression model using Statsmodels.

There’s a question to be asked here: should our model’s regression line pass through the origin? There is no right or wrong answer; it really depends on what you’re modelling. I am going to assume that even the most undesirable house is not going to be free, so no, our model will not pass through the origin.

So, we need to make sure we have all our necessary imports and then we are ready to go!
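For reference, a typical set of imports for this workflow looks something like the following; your notebook may need more or fewer.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
```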

I am going to be using the training and testing sets from this post, so feel free to check out the notebook. X_train consists of 80% of the dataset and holds all the numerical features, whilst y_train holds the corresponding house prices.
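As a rough sketch of how such a split can be produced with scikit-learn’s train_test_split, continuing from the DataFrame loaded earlier (the column names and random_state below are assumptions, so adjust them to match your copy of the dataset):

```python
# Assumed column names; adjust to whatever your copy of the dataset uses.
features = ['Avg. Area Income', 'Avg. Area House Age',
            'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms',
            'Area Population']

X = df[features]
y = df['Price']

# 80% of the rows go to the training set, 20% are held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```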

Next, we need to make sure we add a y-intercept to our model, as Statsmodels does not include an intercept by default (the regression line would be forced through the origin). Then we feed in our training data.
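A minimal sketch of those two steps with statsmodels, using the variable names from the split above:

```python
# Add a column of ones so the model can fit a y-intercept.
X_train_const = sm.add_constant(X_train)

# Ordinary least squares: note that the dependent variable goes first.
results = sm.OLS(y_train, X_train_const).fit()

print(results.summary())
```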

The .summary() method is great at giving us a detailed insight into the regression model.

There’s a lot to unpack here! First, let’s look at the ‘coef’ column. The first entry, ‘const’, represents the y-intercept. The rest of the terms in the same column are the coefficients for each independent variable: for example, if there is a one-unit increase in the average area number of rooms in a house, then our model predicts that the price increases by $120,500. The summary table doesn’t have any negative coefficients, which means that as each of our features increases by one unit, the mean house price also increases.

Now we must ask: how sure should we be of our model? In other words, within our dataset, do we consistently see this increase in prices? That’s where the second column comes into play: ‘std err’ is the standard error. The standard error is essentially an estimate of the standard deviation of the corresponding coefficient: we have 4000 data points in our training set here, so imagine the standard error as a measure of how much the coefficient estimate varies across those 4000 data points. The lower the standard error, the more precisely the coefficient is estimated.

So, what is our model? Our model is essentially an equation:

I have rounded everything to four significant figures, but if you look closely you’ll see that the features are all multiplied by the coefficients from our summary table.
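If you would rather not copy the numbers out of the table by hand, the fitted coefficients can be read straight from the results object; a quick sketch, continuing with the names used above:

```python
# results.params is a pandas Series: the intercept ('const') followed by
# one coefficient per feature, in the same order as the summary table.
coefs = results.params

# Round to four significant figures and print the model as an equation.
terms = [f"{coef:.4g} * {name}" for name, coef in coefs.items() if name != 'const']
print("Price =", f"{coefs['const']:.4g}", "+", " + ".join(terms))
```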

We now have a multiple linear regression model that describes how the mean house price changes as the other features change. Now what? We have already seen that the standard error helps assess the variability of each coefficient across the 4000 data points we used, which tells us the model is not exactly perfect. In this post I touched upon the common analytical tools we have for judging linear regression: the Statsmodels .summary() method actually returns values for these, and gives us a lot more to think about!
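If you’d rather pull individual diagnostics out of the fitted results object than read them off the table, it exposes them directly; for instance, continuing with the names used above:

```python
print(results.rsquared)      # coefficient of determination (R-squared)
print(results.rsquared_adj)  # adjusted R-squared
print(results.bse)           # standard errors of the coefficients
print(results.pvalues)       # p-values for each coefficient
```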

To begin with, we can see an R-squared value of 0.917. This tells us that 91.7% of the variability in our dataset can be explained by our model. That’s actually pretty good. However, just as you wouldn’t start a serious medical treatment after one bad test result, we can’t rely on a single evaluation metric like the coefficient of determination, the R-squared value, to conclude that our model is very reliable. Statsmodels, as I mentioned before, is great at providing insight into our data, so it gives us even more metrics. I’ll cover a lot of these in another post on p-values and how to interpret them, but for now, try out your own multiple linear regressions!
