Metrics and Plots for Analyzing Linear Regression Models

Sayed Ahmed
School of ML
Published Aug 28, 2020 · 5 min read

This article discusses some of the metrics and plots used to analyse a linear regression model and to judge whether the model is a good fit for your dataset.

It briefly covers the following topics:

Metrics: MSE, RMSE, MAE, R-Squared, Adjusted R-Squared

Plots: Actual vs. Predicted graph, Histogram of Residuals, Residuals vs. Fitted Values Plot, Normal Q-Q Plot, Scale-Location Plot, Residuals vs. Leverage

Metrics For Linear Regression Models

I’m briefly introducing some of the metrics used for evaluating the performance of linear regression models.

1. Mean Square Error (MSE)

Mean Square Error can help you understand how much your predicted values deviate from the actual ones.

It tends to amplify the impact of outliers on the error score. For example, if the actual y is 10 and the predicted y is 30, the contribution to the MSE is (30 − 10)² = 400.

Sometimes this makes it less useful: a single bad prediction can dominate the score, especially when the dataset contains a lot of noise. On the other hand, it is a good choice when large errors (unexpectedly high or low values) are particularly undesirable and should be penalised heavily.
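As a quick sketch (with made-up numbers), MSE is just the average of the squared differences:

```python
import numpy as np

# Hypothetical actual and predicted values for illustration.
y_true = np.array([10.0, 12.0, 15.0])
y_pred = np.array([30.0, 12.0, 15.0])

# MSE averages the squared differences; the single bad prediction
# (30 vs. 10) contributes (30 - 10)**2 = 400 and dominates the score.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 400 / 3 ≈ 133.33
```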

2. Root Mean Square Error (RMSE)

Root Mean Square Error (RMSE) is the square root of MSE. It can be interpreted as how far, on average, the residuals are from zero.

RMSE is more useful when large errors are present and drastically affect the model’s performance; unlike MSE, it is expressed in the same units as the target variable.
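A minimal sketch (the numbers are made up) showing the relationship between the two:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 8.0])

# RMSE is the square root of MSE, so it is in the same units as y.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
print(rmse)  # 0.75
```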

3. Mean Absolute Error (MAE)

MAE is not very sensitive to outliers in comparison to MSE, since it doesn’t punish huge errors disproportionately.

Because it squares the errors, MSE penalises big prediction errors much more heavily, whereas MAE treats all errors the same.
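A small, made-up comparison of how a single outlier affects MAE versus MSE:

```python
import numpy as np

y_true   = np.array([10.0, 10.0, 10.0, 10.0])
clean    = np.array([11.0,  9.0, 11.0,  9.0])   # all errors of size 1
with_out = np.array([11.0,  9.0, 11.0, 30.0])   # one huge error of size 20

mae_clean = np.mean(np.abs(y_true - clean))      # 1.0
mae_out   = np.mean(np.abs(y_true - with_out))   # (1+1+1+20)/4 = 5.75
mse_clean = np.mean((y_true - clean) ** 2)       # 1.0
mse_out   = np.mean((y_true - with_out) ** 2)    # (1+1+1+400)/4 = 100.75

# The single outlier roughly multiplies MAE by 6 but MSE by 100.
```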

4. R-Squared and Adjusted R-Squared

R-squared is a good measure of how well the model fits the dependent variable.

It measures how much of the variability in the dependent variable can be explained by the model. R-squared typically lies between 0 and 1, and a bigger value indicates a better fit.

Here, 0 means the model explains none of the variability in the dependent variable, and 1 means it explains all of it.

However, R-squared never decreases when more predictors are added, even uninformative ones, so it doesn’t guard against overfitting. That’s where adjusted R-squared comes in handy: it penalises predictors that don’t improve the fit.
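A minimal, NumPy-only sketch on synthetic data (the data, noise level, and number of predictors are made up) computing both quantities, using R² = 1 − SS_res / SS_tot and adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                       # 50 samples, 3 predictors (only 1 informative)
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column.
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_pred = X1 @ coef

ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R-squared penalises the two uninformative predictors,
# so it comes out slightly below plain R-squared.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```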

Plots For Linear Regression Analysis

1. Actual vs Predicted graph
Actual vs Predicted graph for Linear regression

From a scatter plot of actual vs. predicted values you can tell how well the model is performing. For an ideal model, the points should lie close to the diagonal line.

From the above image you can see that the left model is performing better than the right one.

If the model has a higher R-squared value, the points will be closer to the diagonal line; with a lower R-squared, the points will scatter further away from it. See the image below.

Actual vs Predicted graph with different r-squared values
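A sketch of how such a plot could be drawn with matplotlib (the data here is synthetic, standing in for a model’s predictions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # non-interactive backend for scripts
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
y_true = rng.uniform(0, 10, 100)
y_pred = y_true + rng.normal(scale=0.8, size=100)  # a reasonably good model

fig, ax = plt.subplots()
ax.scatter(y_true, y_pred, alpha=0.6)
lims = [y_true.min(), y_true.max()]
ax.plot(lims, lims, "r--", label="ideal: predicted = actual")
ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
ax.legend()
fig.savefig("actual_vs_predicted.png")
```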

2. Histogram of Residuals

Residuals in a statistical or machine learning model are the differences between observed and predicted values of data.

One of the assumptions of linear regression is that the residuals are normally distributed. If your model’s residuals don’t form a roughly bell-shaped curve, your model may be biased, and in that case linear regression may not be an appropriate choice for your dataset.

Your residuals should be normally distributed for Linear regression
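A sketch of how a residual histogram could be drawn (the residuals are simulated here; in practice they would come from your fitted model):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # non-interactive backend for scripts
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_true = rng.uniform(0, 10, 200)
y_pred = y_true + rng.normal(scale=1.0, size=200)

# Residuals = observed minus predicted; look for a bell shape around 0.
residuals = y_true - y_pred
fig, ax = plt.subplots()
ax.hist(residuals, bins=20)
ax.set_xlabel("Residual")
ax.set_ylabel("Frequency")
fig.savefig("residual_histogram.png")
```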

3. Residual vs. Fitted Values Plot

In this scatter plot the y-axis represents the residuals and the x-axis represents the fitted (predicted) values. This plot is used to detect non-linearity, unequal error variances, and outliers in the model.

For an ideal model, this plot should show no pattern. If a pattern is visible, such as a curve or U shape, it indicates non-linearity in the dataset.

Residual vs. Fitted Values Plot

The above graph shows a funnel-shaped pattern, which indicates that the data suffers from heteroskedasticity, meaning the error terms have non-constant variance.

For an ideal model, the residuals “bounce randomly” around the 0 line, which suggests that the assumption of a linear relationship is reasonable; that is not the case for the above graph.

If the residuals form a roughly “horizontal band” around the 0 line, it suggests that the variances of the error terms are equal.
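A sketch of a residuals-vs-fitted plot on synthetic data (here a simple one-variable fit via `np.polyfit`, as an illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # non-interactive backend for scripts
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 150)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=150)

# Fit a straight line, then compute fitted values and residuals.
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted

fig, ax = plt.subplots()
ax.scatter(fitted, residuals, alpha=0.6)
ax.axhline(0, color="red", linestyle="--")   # residuals should bounce around 0
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
fig.savefig("residuals_vs_fitted.png")
```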

4. Normality Q-Q Plot

This plot is used to check whether the errors are normally distributed.
For normally distributed residuals, the points should lie approximately on a straight line. If the data is non-normal, the points form a curve that deviates from the straight line, which signals a problem.
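A sketch using SciPy’s `probplot` to draw a normal Q-Q plot (the residuals are simulated as normal noise here, so the points hug the line):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # non-interactive backend for scripts
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
residuals = rng.normal(size=200)   # stand-in for a model's residuals

fig, ax = plt.subplots()
# probplot returns ((theoretical, ordered), (slope, intercept, r)).
result = stats.probplot(residuals, dist="norm", plot=ax)
fig.savefig("qq_plot.png")
```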

5. Scale Location Plot

This plot shows if residuals are spread equally along the ranges of predictors. Using this graph the assumption of equal variance or homoscedasticity can be checked. It’s good if you see a horizontal line with equally (randomly) spread points.

In the left graph, the residuals are randomly spread. But in the right graph, the residuals spread wider along the x-axis as it passes around 5, which indicates heteroskedasticity.
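A sketch of a scale-location plot on synthetic data. Note this uses a simplified standardisation (dividing by the residuals’ standard deviation); the textbook version also accounts for each point’s leverage:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # non-interactive backend for scripts
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 150)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=150)

slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted

# Plot sqrt(|standardised residuals|) against the fitted values;
# a flat, evenly spread band suggests homoscedasticity.
std_resid = residuals / residuals.std()
fig, ax = plt.subplots()
ax.scatter(fitted, np.sqrt(np.abs(std_resid)), alpha=0.6)
ax.set_xlabel("Fitted values")
ax.set_ylabel("sqrt(|standardised residuals|)")
fig.savefig("scale_location.png")
```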

6. Residuals vs Leverage

This plot helps find influential cases, if there are any. Not all outliers (points with extreme values) are influential in linear regression, and some cases can be very influential even though their values lie within a reasonable range. Removing or excluding such cases can change the results.

Residuals vs Leverage

To find influential cases, look for outlying points in the upper-right or lower-right corners of this graph. These areas can contain points that are influential against the regression line.

The dashed line shown in the graph is called Cook’s distance. Cases outside of it, i.e. with high Cook’s distance scores, are influential to the regression results: the results will be altered if we exclude those cases.
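A NumPy-only sketch of how leverage and Cook’s distance can be computed by hand on synthetic data (in practice, statsmodels’ `get_influence()` provides these directly). Leverage values are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ, and Cook’s distance combines each residual with its leverage:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.uniform(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(size=n)

# Design matrix with an intercept column; hat matrix H = X (X'X)^-1 X'.
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)                      # h_ii, each between 0 and 1

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
p = X.shape[1]                             # number of fitted parameters
s2 = resid @ resid / (n - p)               # estimated error variance

# Cook's distance: how much each point influences the fitted coefficients.
cooks_d = (resid ** 2 / (p * s2)) * leverage / (1 - leverage) ** 2
```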
