Understanding Linear Regression

Machine Learning Series

Myrnelle Jover
Decision Data
10 min read · Jun 1, 2021


A still from The Curious Case of Benjamin Button.

The curious case of linear regression

This evening I re-watched “The Curious Case of Benjamin Button” — a film where the protagonist regresses in age as time passes. Amongst various philosophical thoughts, I began pondering regression in the statistical sense.

Fig 1: Timeline of events in Benjamin Button’s life. (Data source: Timetoast)

Linear regression is considered one of the simpler machine learning algorithms used to explain the relationships between predictor and response variables. In this article, we will explore the concepts, assumptions and metrics required to understand, validate and evaluate a linear regression model.


You can find example applications of linear regression in R and Python on my GitHub account.

Overview of linear regression

Linear regression is a regression technique that predicts (continuous) numerical outcomes on data with a linear relationship between the d predictor variables, X, and the response variable, y. Expressed mathematically, a regression problem takes on the form:

Eqn 1: Regression techniques use predictor variables, X, to predict a numerical response variable, y.
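
The equation image is not reproduced here; written out, and assuming the usual additive error term, the general form the caption describes is:

```latex
y = f(X) + \epsilon, \qquad X \in \mathbb{R}^{n \times d}, \quad y \in \mathbb{R}^{n}
```
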
Fig 2: The position of linear regression in the machine learning hierarchy.

Recall that supervised learning methods take labelled data as input to predict or classify outcomes, whereas unsupervised learning methods take unlabelled data as input to discover patterns.

Fig 3: Since linear regression takes labelled data as input, it is a supervised learning method. (Data source: Boston Housing).

For a regression problem to be linear, its function, f(X), must be linear in the model coefficients, so that the response, y, is modelled as a weighted sum of the predictors plus an error term. We call this simple linear regression when there is only one predictor variable (i.e. d=1) and multiple linear regression when there are multiple predictor variables (i.e. d≥2). Given this information, we can re-write the original formula as:

Eqn 2: Linear regression techniques have a linear function with respect to the response variable, y.
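
Written out (the image is again not reproduced here), the linear forms referred to above are:

```latex
% Simple linear regression (d = 1):
y = \beta_0 + \beta_1 x_1 + \epsilon

% Multiple linear regression (d >= 2):
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_d x_d + \epsilon
```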

Concepts

Coefficients

The coefficients, β, multiply the predictor variables, X. Each coefficient has a sign (+/-) and magnitude (number) indicating the respective direction and strength of the relationship between a predictor, x, and the response, y. The coefficient represents the mean change in the response, y, for each unit change in x, holding all other predictor variables constant.
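
As a minimal sketch of how these coefficients are estimated and read in practice, the snippet below fits an ordinary least squares model with statsmodels on made-up data (the values and the true coefficients 3.0, 2.0 and -1.5 are invented for illustration; they are not taken from the article's Boston Housing example):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data for illustration: two predictors and a response
# generated from a known linear relationship plus noise.
rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 2))                                   # predictors x1, x2
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

X_design = sm.add_constant(X)                                 # add the intercept column
model = sm.OLS(y, X_design).fit()

# Each fitted coefficient is the estimated mean change in y for a
# one-unit change in the corresponding predictor, holding the others fixed.
print(model.params)     # [intercept, beta_1, beta_2]
print(model.summary())
```

In the printed output, the coefficient on the first predictor should be close to 2.0, meaning that a one-unit increase in x1 is associated with roughly a two-unit increase in the mean of y, holding x2 fixed.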

Variance

The variance is the average of the squared differences between the observed response values and the mean of the response variable (the sum of squared deviations divided by n, or by n-1 for the sample variance). It tells us the spread of the observed response values around their mean.

Eqn 3: The variance tells us the spread of the observed response values around their mean.
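
The equation image is not shown here; the usual definition, for n observations with mean ȳ, is the population form below (dividing by n-1 instead gives the sample variance):

```latex
\operatorname{Var}(y) \;=\; \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^{2}
```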

Residuals

The residual (or error component, e) of an observed value is its deviation from the value predicted by the model; in other words, it is the part of the variation in y that the model leaves unexplained.

Eqn 4: A residual is the difference between an observed and predicted response value.
Fig 4: Both the predicted response values and the residuals are calculated row-wise, corresponding to the relationship between the predictor variables, X, and the actual response values, y.

We can think of a residual as the vertical distance between an observed value and the regression line. Similar to coefficients, each residual has a sign and a magnitude, which indicate whether the data point sits above or below the regression line and by how much.

Fig 5: Residuals are the vertical distance between an observed and predicted value. Predicted values (unfilled data points) lie on the regression line, whereas observed values (filled data points) may not.
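
As a tiny self-contained illustration (the numbers below are made up), the residuals are simply the element-wise differences between observed and predicted values:

```python
import numpy as np

# Toy observed responses and the values a fitted regression line predicts for them.
y_obs  = np.array([3.1, 4.9, 7.2, 8.8])
y_pred = np.array([3.0, 5.0, 7.0, 9.0])

# e_i = y_i - y_hat_i: positive residuals lie above the line, negative below it.
residuals = y_obs - y_pred
print(residuals)   # [ 0.1 -0.1  0.2 -0.2]
```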

Assumptions

There are four assumptions (in no particular order) that must be satisfied before fitting a linear regression model in the standard way:

  1. Linearity: There is a linear relationship between the predictor variables and the mean of the response variable. If the relationships are not linear, the entire premise for using linear regression is invalidated. However, a linear model can still be a good starting point: we can add extra terms (for example, transformed or interaction terms) so that the model incorporates those non-linear relationships.
  2. Independence: The residuals are independent of each other. If they are not, the estimated standard errors will be biased, typically appearing smaller than they should be, because the model treats every observation as a separate piece of information. An extreme example would be a dataset full of duplicates created by the data collection method. We can address this by adding terms so that, conditional on them, the residuals no longer exhibit such a pattern.
  3. Normality: The residuals are normally distributed. A violation is still acceptable so long as the dataset is large enough and the residual distribution is close enough to normal; otherwise, the accuracy of the coefficient estimates and confidence intervals will suffer.
  4. Homoscedasticity: The variance of the residuals is constant around the regression line. If it is not, the standard errors will typically be underestimated (the absolute values of the test statistics will be larger than they should be, and the p-values smaller than they should be).

Note: There are also non-standard ways of fitting a linear model, such as those used in time series analysis, which do not require the independence assumption. In practice, every assumption except linearity can be relaxed to some degree, since linearity is the defining requirement of a linear model, and transformations can be used to make a relationship linear if necessary.

These assumptions can be checked using the following plots.

Fig 6: Plots to check linear regression assumptions.
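
As a sketch of how these plots can be produced in Python (using matplotlib and statsmodels on made-up data; the article's own worked examples live in the R and Python notebooks linked above), something along these lines works:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

# Made-up data for illustration; fit an ordinary least squares model.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([3.0, 2.0, -1.5]) + rng.normal(scale=0.5, size=200)
model = sm.OLS(y, X).fit()

fitted = model.fittedvalues
influence = OLSInfluence(model)
std_resid = influence.resid_studentized_internal   # standardised residuals
leverage = influence.hat_matrix_diag

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# 1. Residuals vs Fitted: checks linearity (no pattern around y = 0).
axes[0, 0].scatter(fitted, model.resid, s=10)
axes[0, 0].axhline(0, linestyle="--")
axes[0, 0].set(title="Residuals vs Fitted", xlabel="Fitted values", ylabel="Residuals")

# 2. Normal Q-Q: checks normality of the residuals.
sm.qqplot(std_resid, line="45", ax=axes[0, 1])
axes[0, 1].set(title="Normal Q-Q")

# 3. Scale-Location: checks homoscedasticity (roughly horizontal trend).
axes[1, 0].scatter(fitted, np.sqrt(np.abs(std_resid)), s=10)
axes[1, 0].set(title="Scale-Location", xlabel="Fitted values",
               ylabel="sqrt(|standardised residuals|)")

# 4. Residuals vs Leverage: flags influential points (high leverage, large residual).
axes[1, 1].scatter(leverage, std_resid, s=10)
axes[1, 1].set(title="Residuals vs Leverage", xlabel="Leverage",
               ylabel="Standardised residuals")

plt.tight_layout()
plt.show()
```

In R, calling plot() on a fitted lm object produces the same four panels.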

Assumption #1: Linearity

In figure 5, we observe a relationship that seems to be linear. To check that it is, we use a residual plot. The absence of a distinguishable pattern in the residuals around the horizontal line y = 0 indicates that the relationship between the predictors, X, and the response, y, is linear; the presence of a pattern suggests that a non-linear model may be a better fit.

Fig 7: The residuals do not follow a distinguishable pattern around y=0; therefore the relationship is linear.

Assumption #2: Independence

We cannot easily check, either visually or numerically, whether the residuals are independent; instead, we must review the data collection method and satisfy ourselves, to the best of our knowledge, that it leads to independent residuals.

Note: In most applications, the data collection method will inform our validation of independence, so we don’t pursue this topic further at this point. However, there are exceptions including (but not limited to) time series and spatial analyses.

Assumption #3: Normality

We can use a Normal Q-Q (quantile-quantile) plot to check the residuals. We are checking that the standardised residuals correspond to the theoretical quantiles of the normal distribution. If the data points roughly align with the quantile-quantile line, then we can say that the residuals are normally distributed.

Fig 8: Since the data is large and the residuals do not veer off the line overmuch, we can assume they follow a normal distribution.

Note: We cannot simply use a histogram of the residuals to check normality of the error distribution because the chosen number of bins can alter the shape of the histogram, as shown in the following examples.

Fig 9: The bin size of a histogram can alter the shape of the histogram, so it is not a true visual check for normality.

We can also use a Shapiro-Wilk test to check whether the residuals originate from a normal distribution. However, if the dataset is large enough and the residuals are not too far from normality then, by the Central Limit Theorem, the normality assumption becomes less important and inference from the model will still be approximately valid.
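
As a minimal sketch (on made-up data, not the article's), the Shapiro-Wilk test is available in scipy:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Made-up data for illustration; fit a model and test its residuals.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([3.0, 2.0, -1.5]) + rng.normal(scale=0.5, size=200)
model = sm.OLS(y, X).fit()

# Shapiro-Wilk: the null hypothesis is that the residuals come from a
# normal distribution, so a small p-value suggests non-normality.
stat, p_value = stats.shapiro(model.resid)
print(f"W = {stat:.3f}, p = {p_value:.3f}")
```

Note that with very large samples the test will reject even for trivial departures from normality, which is exactly where the Central Limit Theorem argument above carries the weight.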

Assumption #4: Homoscedasticity

We can use a Scale-Location plot to check the homoscedasticity of the errors, that is, whether the residual variance stays constant as the fitted values increase. If the square root of the absolute value of the standardised residuals* is randomly scattered and the fitted line is approximately horizontal, then the variance of the residuals is constant and the errors are homoscedastic.

*Recall that we cannot take the square root of a negative number, which is why the absolute value is used.

Fig 10: Since there is no clear pattern and the regression line is roughly horizontal, we can assume the errors are homoscedastic.

If the error variance appears to increase with the fitted values, we may need to back up the visual check with a formal test for heteroscedasticity, such as the Breusch-Pagan test or the Koenker-Bassett test.
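
As a minimal sketch (again on made-up data), the Breusch-Pagan test is available in statsmodels; the Koenker-Bassett test is a studentised variant of the same idea that is more robust when the residuals are not normal:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Made-up data for illustration; fit a model and test its residuals.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([3.0, 2.0, -1.5]) + rng.normal(scale=0.5, size=200)
model = sm.OLS(y, X).fit()

# Breusch-Pagan: the null hypothesis is homoscedasticity (constant error
# variance), so a small p-value suggests heteroscedasticity.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"LM p-value = {lm_pvalue:.3f}, F p-value = {f_pvalue:.3f}")
```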

Bonus: Influential cases

The Residuals vs Leverage plot is primarily used to identify influential points using Cook's Distance, which estimates how much the regression model would change if an observation were removed from it. Highly influential observations, which may warrant investigation or removal, appear beyond the dashed Cook's Distance lines.

Fig 11: This plot shows that removing observation #369 will significantly change the model.
Table 1: Diagnostic plots and tests to check linear regression assumptions.
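
As a minimal sketch (made-up data again), Cook's Distance for every observation can be pulled from statsmodels' influence measures:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

# Made-up data for illustration; fit a model and compute influence measures.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([3.0, 2.0, -1.5]) + rng.normal(scale=0.5, size=200)
model = sm.OLS(y, X).fit()

# Cook's Distance for each observation: how much the fitted model would
# change if that observation were removed.
cooks_d, _ = OLSInfluence(model).cooks_distance

# One common rule of thumb flags observations with D > 4/n for a closer look.
threshold = 4 / len(y)
influential = np.flatnonzero(cooks_d > threshold)
print(influential, cooks_d[influential])
```

The 4/n cut-off used here is only one convention; others flag distances greater than 0.5 or 1.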

Evaluation metrics

There are many ways to evaluate a linear regression model; just as model selection depends on the underlying data, so does metric selection.

Coefficient of determination (R²)

The coefficient of determination is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). In simple linear regression, we can calculate it by squaring Pearson's correlation coefficient, R, between x and y. Alternatively, we can derive R² from the equation:

Eqn 5: Formula for the coefficient of determination.
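
The equation image is not shown here; the usual form, written in terms of the residual and total sums of squares, is:

```latex
R^{2} \;=\; 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
      \;=\; 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}
```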

R² is non-decreasing in the number of predictors in the model: adding a predictor never lowers it, no matter how little information that predictor provides. For ordinary least squares with an intercept, evaluated on the training data, the coefficient of determination lies within the interval 0 ≤ R² ≤ 1.

Adjusted coefficient of determination (adjusted R²)

The adjusted R² modifies the coefficient of determination, R², by adding a penalty for every predictor variable in the model. This means that if a predictor does not increase the R² value by enough to offset the penalty, the adjusted R² will decrease. The adjustment is given by the equation:

Eqn 6: Formula for the adjusted coefficient of determination.
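
The image is not shown here; the standard adjustment, for n observations and d predictors, is:

```latex
\bar{R}^{2} \;=\; 1 - (1 - R^{2})\,\frac{n - 1}{n - d - 1}
```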

The adjusted coefficient of determination is bounded above by 1 (adjusted R² ≤ R² ≤ 1) but, unlike R², it can become negative when the model explains very little of the variance.

Mean absolute error (MAE)

The mean absolute error (MAE) is the average magnitude of the errors in the model's predictions compared with the observed response values. Each individual error is weighted equally, so outliers receive no extra penalty.

Eqn 7: Formula for the mean absolute error.

The mean absolute error is always ≥ 0.

Mean squared error (MSE)

The mean squared error (MSE) is the average of the squared differences between the model's predictions and the observed values. It is a natural evaluation metric for linear regression, because minimising the MSE is exactly the least-squares criterion used to fit the model. Squaring the errors places a heavy penalty on large errors, which are magnified, and it also means that the MSE is expressed in squared units of the response, with the equation:

Eqn 8: Formula for the mean squared error.

The mean squared error is always ≥ 0.

Root mean squared error (RMSE)

The root mean squared error is the square root of the mean squared error, MSE, which means the error is expressed in the same units as the response variable. The RMSE has a one-to-one (monotonic) relationship with the MSE.

Eqn 9: Formula for the root mean squared error.

The root mean squared error is always ≥ 0.
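
As a minimal sketch (the arrays below are invented for illustration), all three error metrics, plus R², can be computed with scikit-learn and numpy:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy observed and predicted values for illustration only.
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3, 10.5])

mae = mean_absolute_error(y_true, y_pred)   # mean(|y_i - y_hat_i|), same units as y
mse = mean_squared_error(y_true, y_pred)    # mean((y_i - y_hat_i)^2), squared units of y
rmse = np.sqrt(mse)                         # sqrt(MSE), back in the units of y
r2 = r2_score(y_true, y_pred)               # proportion of variance explained

print(f"MAE = {mae:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
```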

To facilitate comparison, I have summarised the space of possible values for each metric in Table 2, along with their advantages, disadvantages and use cases. The metric selected will determine the objective function: We want to maximise R² and Adjusted R², and minimise MAE, MSE and RMSE.

Table 2: Use cases for each linear regression evaluation metric.

From the table above, it is worth mentioning that:

  • We can use Adjusted R² instead of R² in all situations.
  • We can use RMSE instead of MSE in most situations, since they have a one-to-one relationship and the RMSE expresses errors in the same units as the response variable.

Summary of advantages and disadvantages

Advantages

Linear regression is simple to implement and it is easy to interpret its output coefficients. In comparison to most other algorithms, it has low complexity.

Disadvantages

Linear regression comes with restrictive assumptions, and in practice it is not always clear-cut whether they are satisfied. Not all real-world relationships are linear, which means this model can oversimplify real-world problems. The following are therefore examples of valid questions to ask when using linear regression:

  • How do we decide when regression assumptions are met?
  • Based on our data, which assumptions are most important?

Additionally, linear regression is sensitive to outliers and prone to overfitting when there are many predictor variables in the model compared to the number of observations.

Conclusions

Linear regression is a supervised learning method that predicts a continuous value. Simple linear regression has one predictor variable and one response variable, whereas multiple linear regression has more than one predictor variable for one response variable.

To implement linear regression, we must understand the concepts of coefficients, variance and residuals.

To fit a linear model, the data must satisfy the following four assumptions: linearity, independence, normality, and homoscedasticity. To test for linearity, we use a residual plot. To test for independence, we must review the data collection method. To test for normality, we can use a Normal Q-Q plot, a Shapiro-Wilk test, or rely on the Central Limit Theorem in the case of large, approximately-normal residuals. To test for homoscedasticity, we can use a Scale-Location plot, a Breusch-Pagan test, or a Koenker-Bassett test if the data are small or skewed.

We evaluate the performance of a linear regression model against the following metrics: adjusted coefficient of determination (Adjusted R²), mean absolute error (MAE), and root mean square error (RMSE).

The advantages of linear regression lie in its simplicity of implementation and interpretation, though its disadvantages include sensitivity to outliers and susceptibility to overfitting.


Myrnelle Jover
Decision Data

I am a data scientist and former mathematics tutor with a passion for reading, writing and teaching others. I am also a hobbyist poet and dog mum to Jujubee.