# The Pitfalls of Linear Regression and How to Avoid Them

## What to Do When the Linear Regression Assumptions Don’t Hold

You can always spot a data science newbie by the speed with which they jump to fitting a neural network.

Neural networks are cool and can do awesome things that, for many of us (myself included), are the reason why we got into data science in the first place. I mean, who goes into data science to play around with daggy old linear regression models?

Yet, the irony of the situation is that, unless you are working in a specialist field, like computer vision or natural language processing, a lot of the time, simple models, like linear regression, actually provide a better solution to your problem than complex black box models, like neural networks and support vector machines.

After all, linear regression models are:

- fast to train and query;
- not prone to overfitting and make efficient use of data, so can be applied to relatively small datasets; and
- easy to explain, even to people from a non-technical background.

I’ve heard senior data scientists, with experience working with cutting edge AI, sing the praises of linear regression for these very reasons.

One of the few downsides of linear regression, however, is the set of strict assumptions that underlies it. These assumptions can make linear regression models unsuitable in a range of very common circumstances.

Fortunately, linear regression has been around for so long (since the early 19th century) that statisticians long ago found ways of getting around assumption violations when they do occur, while still preserving many of the advantages associated with linear regression.

Yet, in order to deal with situations where one or more linear regression assumptions are violated, you first need to be able to identify when such a violation is occurring. This means understanding the assumptions at the heart of linear regression.

# Linear Regression Assumptions

Linear regression is underpinned by five key assumptions that all need to hold for the model to produce reliable predictions. Specifically:

- **Linearity**: the relationship between the input and output variables is linear. That is, we express the expected value of our output variable, Y, as a linear combination of our input variables, X₁, …, Xₖ: E(Y) = b₀ + b₁X₁ + b₂X₂ + … + bₖXₖ, where b₀, b₁, …, bₖ denote the model parameters to be fitted;
- **No Multicollinearity**: none of the input variables, X₁, …, Xₖ, are highly positively or negatively correlated with one another;
- **Normality**: observations of the output variable, Y, are assumed to be drawn from the same normal distribution. That is, Y ~ iid N(m, s²), where m = E(Y) and s² denotes the variance of Y;
- **Homoskedasticity**: the variance, s², of the output variable, Y, is assumed to be constant, regardless of the values of the input variables; and
- **Independence**: observations of the output variable, Y, are assumed to be independent of one another. That is, there is no *autocorrelation* present in our output variable.

If your data happens to follow all these assumptions, then good for you. Go off and fit your linear regression model, then take the rest of the afternoon off.

However, to quote the Rolling Stones, “you can’t always get what you want”, and if you’re dealing with anything more advanced than a simple “toy” dataset, odds are at least one of these assumptions will be violated.

At this point, you have two options (a) sulk, or (b) find a way around whatever assumption has been breached.

Assuming you choose to go with Option B, then here are four ways to get around a violation of one of the linear regression assumptions.

# Solution #1: Remove Input Variables to Deal with Multicollinearity

One of the simplest problems to identify and deal with is multicollinearity. As a general rule, if two input variables have an (absolute) correlation coefficient greater than 0.8, there’s a good chance you have a multicollinearity problem.

This can make it difficult to interpret the model coefficients, and to determine their statistical significance, because the model splits the impact of what is, for all practical purposes, a single variable across two separate input variables.
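As a quick first check, you can scan the pairwise correlations between candidate inputs before fitting anything. A minimal sketch on made-up data (the variables x1, x2 and x3 here are hypothetical, not from the dataset discussed below):

```python
import numpy as np
import pandas as pd

# Toy inputs: x2 is (almost) a rescaled copy of x1, while x3 is unrelated
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 500)
df = pd.DataFrame({
    'x1': x1,
    'x2': 2.54 * x1 + rng.normal(0, 0.1, 500),
    'x3': rng.uniform(0, 10, 500),
})

# Absolute pairwise correlations between candidate inputs
corr = df.corr().abs()
print(corr.round(2))

# Flag pairs whose absolute correlation exceeds the 0.8 rule of thumb
high_pairs = [(a, b) for a in corr.columns for b in corr.columns
              if a < b and corr.loc[a, b] > 0.8]
print(high_pairs)
```

Any pair flagged this way is a candidate for the multicollinearity fix discussed in this section.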

For example, consider a dataset, such as this one from Kaggle, which gives the height (in inches), gender and weight (in pounds) of 10,000 individuals.

Fitting a model to this dataset using Python’s statsmodels package yields the following fitted parameters:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Load data
multi_data = pd.read_csv('weight-height.csv')

# Fit linear regression model to the data
multi_model = smf.ols(formula='Weight ~ Height + C(Gender)', data=multi_data).fit()

# Print summary statistics
multi_model.summary()
```

Suppose we create a new variable, Height_cm, which gives the height of each individual in centimetres (which is clearly 100% correlated with the original Height variable, since height in centimetres = height in inches × 2.54), and refit our model to include this new variable (in addition to the original variables).

```python
multi_model2 = smf.ols(formula='Weight ~ Height + C(Gender) + Height_cm', data=multi_data).fit()
multi_model2.summary()
```

We find that the coefficient of height from the first model (5.9769) is now split between Height and Height_cm (it can be verified that the two new coefficients recombine to give the original value), affecting the interpretability of the coefficients of both variables.

The simplest solution to a multicollinearity problem is to just drop one of the highly correlated input variables (it doesn’t matter which one) from the model.
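A complementary diagnostic to pairwise correlations is the variance inflation factor (VIF), which also catches the case where one input is well predicted by a *combination* of the others. A hedged sketch using statsmodels on made-up data (the column names are hypothetical):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy design matrix: the x2 column nearly duplicates the x1 column
rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(0, 10, n)
X = np.column_stack([
    np.ones(n),                          # intercept
    x1,                                  # x1
    2.54 * x1 + rng.normal(0, 0.1, n),   # x2: near-copy of x1
    rng.uniform(0, 10, n),               # x3: unrelated
])

# A VIF above roughly 5-10 is a common warning sign of multicollinearity
for idx, name in [(1, 'x1'), (2, 'x2'), (3, 'x3')]:
    print(name, variance_inflation_factor(X, idx))
```

Here x1 and x2 produce very large VIFs, while the unrelated x3 sits near 1, pointing at which variable to drop.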

# Solution #2: Use Feature Engineering to Deal with Non-Linearity

Linear regression works by essentially fitting a (straight) line of best fit through your data. This is all well and good in the simple “toy” examples they routinely give you in introductory statistics classes, but what happens if your data looks something like the blue dots in this chart?

```python
import numpy as np
import matplotlib.pyplot as plt

# Create dataset
np.random.seed(1)
x = np.random.uniform(-10, 10, 200)
y = x + x**2 + np.random.normal(0, 5, 200)
nl_data = pd.DataFrame({'X': x, 'Y': y})

# Fit linear regression to data
nl_model = smf.ols(formula='Y ~ X', data=nl_data).fit()

# Plot data and regression line
plt.scatter(x, y, alpha=0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.plot(nl_data['X'], nl_model.predict(nl_data['X']), color='red')
plt.show()
```

Fitting a straight regression line to this data (the red line) will result in the overprediction of the output variable (y) for input variable (x) values in the middle of the range under consideration, and underprediction for x values at either extreme of the range.

To capture the true structure of this data, what we really need to do is to fit a polynomial curve to our data, but that can’t be done within the constraints of linear regression, can it?

Well, actually, by engineering new features that are functions of our existing input variables (including powers, logs, and products of pairs of variables), it *is* possible to use linear regression to fit something other than a straight line to our data.

For example, in the above example, we could create a new variable, z = x² and then fit our linear regression model using both x and z as our input variables. Our resulting model will have the form:

E(Y) = b₀ + b₁x + b₂z = b₀ + b₁x + b₂x²

which is still a linear combination of our input variables, x and z. However, since z is just a function of x, we can plot our fitted regression line in two dimensions, and will get something like this:

```python
# Create engineered variable
nl_data.loc[:, 'X2'] = nl_data['X'].apply(lambda x: x**2)

# Refit model to data, including the new variable
nl_model2 = smf.ols(formula='Y ~ X + X2', data=nl_data).fit()

# Plot the data and the new regression line
plt.scatter(x, y, alpha=0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.plot(nl_data['X'], nl_model2.predict(nl_data[['X', 'X2']]), color='red')
plt.show()
```

As an aside, this example illustrates the importance of plotting your data before trying to fit a model to it: by visualizing your data, you can get an idea of the sorts of features it would be beneficial to engineer.

# Solution #3: Use Variable Transformation or a Generalized Linear Model to Deal with Non-Normality and Heteroskedasticity

Linear regression assumes that our output variable is drawn from a normal distribution. That is, it is symmetric, continuous and defined over the entire number line.

In practice, violations of the last two characteristics aren’t such a big deal. For instance, in the height vs weight example given above, even though human weights typically only fit into a relatively narrow range and can’t be negative, we can still fit a linear regression to the data without much concern.

However, if our data is skewed, then that can lead to other violations of our linear regression assumptions, if we don’t correct it.

For example, consider a dataset (gamma_data) with positively skewed values of the output variable, Y, as plotted below:

```python
# Create gamma dataset
np.random.seed(1)
x1 = np.random.uniform(-1, 1, 200)
x2 = np.random.uniform(-1, 1, 200)
mu = np.exp(1 + x1 + 2*x2 + np.random.randn())
y = np.random.gamma(shape=2, scale=mu/2, size=200)
gamma_data = pd.DataFrame({'X1': x1, 'X2': x2, 'Y': y})

# Plot data
plt.hist(gamma_data['Y'], bins=30)
plt.ylabel('Count')
plt.xlabel('Y')
plt.show()
```

Fitting a linear regression model to this dataset without first transforming our output variable, then plotting the residuals against the fitted values of our output variable gives us the following residual plot:

```python
# Fit linear regression
non_norm_model = smf.ols(formula='Y ~ X1 + X2', data=gamma_data).fit()

# Calculate residuals
resid = gamma_data['Y'] - non_norm_model.predict(gamma_data[['X1', 'X2']])

# Plot residuals against fitted values
plt.scatter(non_norm_model.predict(gamma_data[['X1', 'X2']]), resid, alpha=0.5)
plt.xlabel('fitted')
plt.ylabel('residual')
plt.show()
```

The spread of the points on this scatterplot increases moving from left to right, which is a clear sign of heteroskedasticity (that is, a violation of the homoskedasticity assumption).

To deal with this issue, we can transform our output variable to make it symmetric prior to fitting our model, for example, by taking its log or square root, as shown below:

```python
# Transform Y by taking its log
gamma_data.loc[:, 'log_y'] = gamma_data['Y'].apply(np.log)

# Refit linear regression to the transformed data
non_norm_model2 = smf.ols(formula='log_y ~ X1 + X2', data=gamma_data).fit()

# Calculate residuals
resid = gamma_data['log_y'] - non_norm_model2.predict(gamma_data[['X1', 'X2']])

# Plot residuals against fitted values
plt.scatter(non_norm_model2.predict(gamma_data[['X1', 'X2']]), resid, alpha=0.5)
plt.xlabel('fitted')
plt.ylabel('residual')
plt.show()
```

The residuals of this model now demonstrate constant spread, indicating homoskedasticity.

Alternatively, we can fit a model that is specifically designed for non-normal data, such as a generalized linear model (GLM). I discuss GLMs in detail in my article Beyond Linear Regression: An Introduction to GLMs.

In the case of positively skewed data, a gamma GLM is typically the best choice.

```python
import statsmodels.api as sm

# Fit gamma GLM with a log link to the data
gamma_model = sm.GLM(gamma_data['Y'],
                     sm.add_constant(gamma_data[['X1', 'X2']]),
                     family=sm.families.Gamma(link=sm.families.links.Log())).fit()

# Calculate deviance residuals
resid = gamma_model.resid_deviance

# Plot residuals against fitted values
plt.scatter(gamma_model.predict(sm.add_constant(gamma_data[['X1', 'X2']])), resid, alpha=0.5)
plt.xlabel('fitted')
plt.ylabel('residual')
plt.show()
```

Fitting a gamma GLM with a log link to our data also addresses the heteroskedasticity problem identified previously.

# Solution #4: Use a Time Series Model to Deal with Autocorrelation

If you are dealing with data collected at regular intervals of time, that is, time series data (for example, weather or financial data), then odds are your data is autocorrelated. Autocorrelation means that knowing the value of one observation of your output variable can give you an indication of the value of the previous or next observation.

For example, in the context of weather data, if you knew today’s maximum temperature for your city, then you would have a rough idea of tomorrow’s maximum temperature in the same location (since the temperature generally doesn’t vary too much from day to day).

The easiest way to detect whether autocorrelation is present in your data is to plot the output variable against time.
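As a complement to the time plot, you can compute the lag-1 autocorrelation directly with pandas: values near 1 (or -1) indicate strong autocorrelation, while values near 0 suggest independence. A sketch on synthetic data (not the stock data used below):

```python
import numpy as np
import pandas as pd

# Synthetic AR(1)-style series: each value leans on the previous one
rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.9 * y[t - 1] + rng.normal()

autocorrelated = pd.Series(y)
independent = pd.Series(rng.normal(size=300))

# Lag-1 autocorrelation: near 1 for the AR series, near 0 for noise
print(autocorrelated.autocorr(lag=1))
print(independent.autocorr(lag=1))
```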

Consider the S&P 500 dataset sourced from Kaggle. If we focus on a single stock (with stock exchange code “AAL”) and plot the closing price against time for the duration of the dataset we get:

```python
# Read in dataset
sandp_data = pd.read_csv('all_stocks_5yr.csv')

# Only keep data for AAL
sandp_data = sandp_data[sandp_data['Name'] == 'AAL']

# Transform date to datetime format
sandp_data['date'] = pd.to_datetime(sandp_data['date'])

# Sort data by date
sandp_data.sort_values(by=['date'], inplace=True)

# Plot data
plt.plot(sandp_data['date'], sandp_data['close'])
plt.show()
```

From this plot, we can clearly see that there is autocorrelation present in this data, which a standard linear regression model can’t handle.

If we did try to fit a linear regression model to this data, using Year and Month as our input variables, we would end up with the red line shown below, which provides a less than ideal fit to our data:

```python
# Create year and month variables
sandp_data['Year'] = sandp_data['date'].map(lambda x: x.year)
sandp_data['Month'] = sandp_data['date'].map(lambda x: x.month)

# Fit linear regression to data
sandp_model = smf.ols(formula='close ~ Year + C(Month)', data=sandp_data).fit()

# Plot data with regression line overlay
plt.plot(sandp_data['date'], sandp_data['close'])
plt.plot(sandp_data['date'], sandp_model.predict(sandp_data[['Year', 'Month']]), color='red')
plt.show()
```

In order to allow for the autocorrelation in this dataset, we need to use a model which is specifically designed for dealing with time series, such as an ARIMA model.

Given the output variable at time t, Y(t), an ARIMA(p, d, q) model first takes the difference between consecutive observations of our output variable d successive times to produce a transformed output variable, y(t). This is done in order to remove any long-term trends from the data, such as the increasing trend in the plot shown above.

For example, if d = 1, y(t) = Y(t) - Y(t-1); if d = 2, y(t) = z(t) - z(t-1), where z(t) = Y(t) - Y(t-1); and so on.
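In pandas, each of these differences is a single call. A small illustration on a made-up five-point series:

```python
import pandas as pd

# Toy series to illustrate d = 1 and d = 2 differencing
Y = pd.Series([10.0, 12.0, 15.0, 14.0, 18.0])

# d = 1: y(t) = Y(t) - Y(t-1)
first_diff = Y.diff()
print(first_diff.tolist())   # [nan, 2.0, 3.0, -1.0, 4.0]

# d = 2: difference the already-differenced series once more
second_diff = Y.diff().diff()
print(second_diff.tolist())  # [nan, nan, 1.0, -4.0, 5.0]
```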

Once we have differenced our data d times, we then model the resulting transformed output variable as a linear combination of the p immediately prior observations of y(t) and the q immediately prior model residuals (that is, the differences between the actual and predicted values of y(t)).

There is a lot of theory around how to set appropriate values for the parameters p, d and q, which is beyond the scope of this article.

For the sake of this example, let’s assume d = 1, p = 5 and q = 0. That is, after differencing once, to allow for the overall increasing trend in our data, the differenced output values can be modelled as a linear combination of the five immediately prior (differenced) output values.

This model can be fitted using the tsa.ARIMA function in the Python statsmodels package and produces the following fitted model:

```python
# Index dataset by date
sandp_data2 = sandp_data.set_index('date')

# Fit ARIMA(5, 1, 0) model to the data
ts_model = sm.tsa.ARIMA(sandp_data2['close'], order=(5, 1, 0)).fit()

# Plot actual vs fitted values
ts_model.plot_predict(dynamic=False)
plt.show()
```

This model tracks our data extremely well, with the forecast and actual lines tracking so closely that it is almost impossible to distinguish between the two.

As an aside, it should be noted that this strong performance is in part due to the fact that we are predicting within the sample we used to fit our data, and such performance is unlikely to continue out of sample.

# Conclusion

The principle of Occam’s razor states that if there exist two explanations for an occurrence, the one that requires the least speculation is usually correct.

In the context of data science, this means that if you are trying to model a dataset, you should start with the simplest plausible model and only move on to more complicated models if the simpler ones prove unsuitable.

For a regression problem, more often than not, the simplest model you can start with is a linear regression model. However, in many cases, violation of one or more of the strict linear regression assumptions can make using this model inappropriate.

In this article we have provided several workarounds for violations of the linear regression assumptions, allowing you to continue to use this highly versatile and easily understood model (or related models, such as GLMs or time series models) before moving on to resource-hungry black box techniques, such as neural networks.

Linear regression models may not be cool, but they have a proven track record, and as a data scientist, that’s all that should really matter.

# About the Author

I am a data scientist with over 15 years’ experience working in the data industry. I have a PhD in Statistics from the Australian National University and a Masters in Computer Science (Machine Learning) from Georgia Tech. You can connect with me on LinkedIn here.