Experiments in regression #1 — treat linear regression as a baseline

Alex Kirkup
6 min read · Jan 24, 2024


I have known and used ordinary least squares regression for many years now since learning it age 16–17 at college in the UK. This was back in the days before I had any coding knowledge and had to implement it entirely with a calculator.

(Of course, looking back I should have used a spreadsheet. But never mind…)

Indeed, because it is so easy to implement (for example in Python, using Scikit-Learn’s LinearRegression()), linear regression is an easy go-to today for any regression project. It can be applied as simple linear regression with one feature, or multiple linear regression when there are multiple features.
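As a minimal sketch with made-up numbers (not from any real dataset), the same Scikit-Learn estimator covers both cases:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: four observations, invented purely for illustration.
y = np.array([2.1, 3.9, 6.2, 8.1])

# Simple linear regression: one feature column.
X_single = np.array([[1], [2], [3], [4]])
print(LinearRegression().fit(X_single, y).coef_)

# Multiple linear regression: the same estimator, just more feature columns.
X_multi = np.array([[1, 5], [2, 3], [3, 8], [4, 1]])
print(LinearRegression().fit(X_multi, y).coef_)
```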

There are two main goals in applying linear regression:

  1. As with any regression model, the first goal is to make accurate predictions for the target variable.
  2. When there are multiple features, the relative impact, or weight, of each feature on the target variable is easy to understand: the model has high explainability.

However, linear regression alone is inadequate in many regression projects, because various factors get in the way of achieving either goal effectively.

In this article I am going to talk about 4 reasons why linear regression should not be considered the goal, but rather the start of a process of developing an accurate regression model. The accuracy and explainability of linear regression become the baseline by which you judge the value of any further refinements in your modelling.

I will expand upon the refinements in future posts in this series — and I outline the steps I will take at the end of this article.

My only assumption is that you know that you want to implement a regression analysis and why. In future articles I will be implementing a least squares regression in Python using Scikit-Learn’s LinearRegression() as my starting point, my baseline, and then refining it through a number of steps such as feature selection, model selection, reducing skew, normalization and standardization.

Now for the 4 reasons why you should be wary when applying linear regression.

Potential problem 1: Linear regression assumes linearity (it’s in the title!)

Linear regression is not designed to capture a non-linear relationship. If your data shows a non-linear association like the one above, you will need either some smart feature engineering or a non-linear model.

I probably don’t need to say it, but this is the most important limitation of the linear regression model.

Explore Seaborn pairplots to uncover non-linear relationships between variables: https://www.geeksforgeeks.org/python-seaborn-pairplot-method/
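As a rough sketch, assuming your data is already in a pandas DataFrame (the file name below is just a placeholder), a single pairplot call is enough to eyeball every pairwise relationship:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder: load your own dataset here.
df = pd.read_csv("my_data.csv")

# One scatter plot per pair of numeric columns: look for curved patterns
# that a straight regression line cannot capture.
sns.pairplot(df)
plt.show()
```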

Potential problem 2: Multiple linear regression depends on features that are not correlated with one another.

You can see this in a correlation matrix like the one below.
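If you want to produce one yourself, a minimal sketch (again with a placeholder file name) is a pandas correlation matrix rendered as a Seaborn heatmap:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder: load your own dataset here.
df = pd.read_csv("my_data.csv")

# Pearson correlation between every pair of numeric columns.
corr = df.corr(numeric_only=True)

# A heatmap makes highly correlated feature pairs easy to spot.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```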

If we were trying to predict ‘Stress’ from the above dataset (i.e. Stress is the target variable), we would run into the problem that the features ‘Weight’ and ‘BSA’ have a high correlation. Is one calculated using the other? Is one caused by the other?

This is called multicollinearity, and although it does not affect the accuracy of prediction, it does affect how well we can understand the importance of each feature within the model. It undermines explainability.

Indeed, as a textbook might say, the features we use must truly be independent variables.

You can test for multicollinearity using the VIF (Variance Inflation Factor). See here: https://www.geeksforgeeks.org/detecting-multicollinearity-with-vif-python/
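As a sketch of what that test looks like in practice, here is a small synthetic example. The numbers are invented, with ‘BSA’ deliberately built from ‘Weight’ to mimic the correlated pair above:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic example: 'BSA' is built mostly from 'Weight', so the two columns
# are highly correlated (mirroring the Weight/BSA pair discussed above).
rng = np.random.default_rng(0)
weight = rng.normal(70, 10, 200)
bsa = 0.02 * weight + rng.normal(0, 0.05, 200)
age = rng.normal(45, 12, 200)
X = pd.DataFrame({"Weight": weight, "BSA": bsa, "Age": age})

# VIF is usually computed on the design matrix including an intercept term.
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)  # rule of thumb: VIF above roughly 5-10 signals multicollinearity
```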

Potential problem 3: Linear regression depends on each feature’s data points being independent of previous data points.

Or as Investopedia puts it: “the degree of similarity between a given time series and a lagged version of itself over successive time intervals”.

In other words, we’re talking about the repetition of a previous pattern, for example in a sine wave:

This is called autocorrelation, and it reduces both the accuracy of prediction in linear regression and the explainability of the weights applied to each feature.

You can test for autocorrelation using the Durbin-Watson test: https://www.geeksforgeeks.org/statsmodels-durbin_watson-in-python/
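A minimal sketch, using a synthetic sine-wave signal rather than real data, might look like this:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Synthetic target: a linear trend plus a repeating sine-wave pattern.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2 * x + np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)

# Fit a plain linear trend, then test its residuals for autocorrelation.
model = sm.OLS(y, sm.add_constant(x)).fit()

# A statistic near 2 suggests little autocorrelation; values towards 0 or 4
# indicate positive or negative autocorrelation respectively.
print(durbin_watson(model.resid))
```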

Potential problem 4: Linear regression depends on constant variance.

Have a look at the following graphs:

Constant variance, or homoscedasticity, is shown in the left-hand graph. However, compare this with the situation of heteroscedasticity on the right: here the variance in the residuals changes. Statistics By Jim explains it perfectly: “heteroscedasticity means unequal scatter”.

This unequal scatter, or heteroscedasticity, undermines the ordinary least squares method, which depends on minimising the squared distances between each data point and the regression line being estimated. It reduces predictive accuracy and model explainability in linear regression, and needs to be addressed.

Heteroscedasticity can of course be visualised on a scatter graph of one feature against the target variable (e.g. a Seaborn pairplot for multiple regression), or on a scatter graph of the predictions against the target (where the distance between the regression line and each data point is called the residual).

You can also test for it using a Breusch-Pagan Test: https://www.geeksforgeeks.org/how-to-perform-a-breusch-pagan-test-in-python/
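A minimal sketch, again on synthetic data whose noise deliberately grows with the feature:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data: the error variance increases with x (heteroscedasticity).
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 3 * x + rng.normal(0, 0.5 * x, 200)

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# het_breuschpagan returns (LM statistic, LM p-value, F statistic, F p-value);
# a small p-value is evidence that the residual variance is not constant.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(lm_pvalue, f_pvalue)
```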

The solution: Treat linear regression as your starting point.

Many online regression projects on Kaggle, Medium and GitHub begin and end with linear regression, but as we have seen there are some good reasons to doubt the accuracy and explainability of a linear regression model.

If you’re like me and linear regression is your go-to, then that’s fine. We had to start somewhere. Just don’t stop there too!

Instead, begin with a linear regression using all features.

Take a measure of the accuracy you achieve (which could be mean squared error, mean absolute error or the R-squared score, depending on what you’re attempting to measure).

See what the coefficients are telling you about the importance of each feature in your dataset, and see how they compare with what you saw in your data visualisations.

And then use this as your baseline when you begin to refine your model. Does your refinement improve or reduce the accuracy or explainability? Use the original baseline to make a decision about whether or not your refinement was effective.
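Putting those steps together, a sketch of the baseline might look like this (the file name and the ‘target’ column name are placeholders for your own data):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Placeholder: load your own dataset and name your own target column.
df = pd.read_csv("my_data.csv")
X = df.drop(columns=["target"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: ordinary least squares on all features.
baseline = LinearRegression().fit(X_train, y_train)
preds = baseline.predict(X_test)

# Record these scores; every later refinement is judged against them.
print("R2: ", r2_score(y_test, preds))
print("MSE:", mean_squared_error(y_test, preds))
print("MAE:", mean_absolute_error(y_test, preds))

# Coefficients show the weight the model gives each feature (explainability).
print(pd.Series(baseline.coef_, index=X.columns).sort_values())
```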

Next steps

I am going to do exactly this and begin a regression project using linear regression, keeping track of my progress in a series of posts starting here.

My aim is not to follow a textbook formula in what I do, but be pragmatic and experimental in my approach. Reading what should be done and in what order is deeply important during the learning process, but it is only a guide, and the uniqueness of a dataset or circumstance means there is often no single correct path to take. Instead, getting value from a dataset can feel like you are in a room with many doors, and the only way to know which one to take is to walk through it and see what happens.

Which of course is the point of having a baseline: by measuring against it, you will know whether each step was worth taking or not.

And anyway, I am much better at learning by doing than learning by being told what to do…

With this approach in mind, then, I am going to try to improve my regression using the following steps:

  • feature selection
  • model selection
  • addressing skew
  • standardization and normalization

So if any of this sounds interesting, watch this space.

To learn more about the assumptions of linear regression, they are well summarised here: https://towardsdatascience.com/assumptions-of-linear-regression-fdb71ebeaa8b

