Diagnose your Linear Regression Model — With Python

Vahid Naghshin
Analytics Vidhya · May 31, 2021

Regression diagnostics is a critical step in reaching a meaningful regression model.

One of the main interests of statisticians and data scientists is examining the association of a single variable, called the target or response, with another set of variables, called predictors, for prediction or profiling purposes. The common steps involved in reaching this goal include, but are not limited to:

  • Pre-processing data
  • Examining the data
  • Model selection
  • Training the model
  • Model diagnostics

Most articles in the machine learning domain cover the first four steps but put less emphasis on the last one, which might be the most important step in the decision process. Diagnosing the model comprises many logical and repeatable processes for monitoring and evaluating the output of the model by comparing it with the true values. The key quantity in this process is the residual error, defined as the difference between the true value and the predicted one. Using the residuals to help guide the regression fitting is a fundamental step in the modeling process. In this article, we briefly shed light on linear regression diagnostics to improve our understanding of our model and, eventually, the model itself.


In order to demonstrate the practicality of these methods, we will use Python for the implementation. To this end, we use the “Medical Cost Personal Datasets” dataset, which can be found on Kaggle here. The dataset consists of six predictors and one target outcome named charges, the amount each person pays for medical insurance coverage. Each row holds the data of one contributor. The age, sex, and bmi (body mass index) columns are self-explanatory. The children column is the number of dependents the contributor has. Finally, smoker and region denote whether the contributor smokes and where they live, respectively.
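Before turning to the trained model, it is worth taking a quick look at the raw data. A minimal sketch, assuming the Kaggle file has been downloaded locally as insurance.csv (the same file the modelling code below reads):

import pandas as pd

df = pd.read_csv('insurance.csv')

# Column types, the first few rows, and summary statistics
print(df.dtypes)
print(df.head())
print(df.describe(include='all'))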

We devote most of the discussion to the diagnostic process, as training the model is already covered here. The full Python code can be found here. Hence, we start from the trained model:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('insurance.csv')
predictors = ['age', 'sex', 'bmi', 'children', 'smoker', 'region']
outcome = 'charges'

# One-hot encode the categorical predictors, dropping the first level of each
df_encoded = pd.get_dummies(df[predictors], drop_first=True)

df_lm_factor = LinearRegression()
df_lm_factor.fit(df_encoded, df[outcome])

print(f'Intercept: {df_lm_factor.intercept_:.3f}')
print('Coefficients:')
for name, coef in zip(df_encoded.columns, df_lm_factor.coef_):
    print(f' {name}: {coef}')

The output will be:

[output]:
Intercept: -11938.539
Coefficients:
age: 256.85635253734864
bmi: 339.1934536108373
children: 475.5005451491269
sex_male: -131.3143593951132
smoker_yes: 23848.534541912835
region_northwest: -352.9638994246546
region_southeast: -1035.0220493878253
region_southwest: -960.0509913008365

The intercept and coefficients of the model are shown above. Since some of the predictors in our model are encoded as dummy variables, interpreting the model is a little bit tricky. For more details on treating discrete predictors, please refer here.

Interpreting the Regression Model

The most important task in data science is predicting a variable. Interpreting the linear regression model, however, is mainly part of profiling, or data-based explanation. Looking at the coefficients, we see that they are positive for age, bmi, children, and smoker_yes, indicating that an increase in any of these variables results in an increase in charges. For a numerical predictor such as bmi, we can say that a one-unit increase in bmi results in an increase of about 340 units in charges. For the non-numerical variables, note that smoker_yes is a factor variable converted via dummy encoding into a numerical value so that it can be used in the regression model, and it should therefore be interpreted with reference to smoker = no. The coefficient of smoker_yes is about 23848, which means that being a smoker increases the charges by about 23848 compared with a person who does not smoke. The same reading applies to the factor variables with negative coefficients.
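As a quick sanity check on this reading, we can predict charges for two hypothetical, otherwise identical contributors who differ only in smoking status. A minimal sketch; the example rows below are made up for illustration, and the difference in predictions should equal the smoker_yes coefficient:

# Two made-up rows that differ only in smoker_yes
example = pd.DataFrame(
    [[40, 28.0, 1, 1, 0, 0, 1, 0],   # 40-year-old male, bmi 28, one child, non-smoker, southeast
     [40, 28.0, 1, 1, 1, 0, 1, 0]],  # the same person, but a smoker
    columns=df_encoded.columns)

pred = df_lm_factor.predict(example)
print(pred[1] - pred[0])  # roughly 23848, the smoker_yes coefficient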

Correlated Predictors

In multiple linear regression, with more than one predictor, different predictors may be correlated with each other. As an example, a person who smokes (smoker = yes) might have a higher bmi, so knowing that an individual smokes tells us something (with higher certainty) about their bmi as well. In a linear regression model, predictors that are highly correlated can lead to ill-behaved modelling. This is a critical point to keep in mind, as it cannot be alleviated by cross-validation or the other metrics most often used to improve a model. Highly correlated predictors are usually recognised through a broader understanding of the context, such as domain knowledge. If a positive or negative coefficient makes no sense in the model, be suspicious of correlated variables.
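A simple first check, not shown in the original notebook, is to inspect the pairwise correlations among the encoded predictors; a minimal sketch:

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations among the encoded predictors; values close to +/-1
# flag potentially problematic pairs
corr = df_encoded.corr()
print(corr.round(2))

sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.tight_layout()
plt.show()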

Multicollinearity

Correlated predictors lead to an important concept in model validation called multicollinearity. Multicollinearity simply means that there are redundant variables in the model, and eliminating one of them can improve the model dramatically. Multicollinearity in regression must be addressed: variables should be removed until the multicollinearity is gone [1]. A regression does not have a well-defined solution in the presence of perfect multicollinearity. Many software packages, including R and Python, automatically handle certain types of multicollinearity. For example, if age were included twice in the regression on this data, the results would be the same as for the df_lm_factor model.
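A common numerical check for multicollinearity, not used in the article itself, is the variance inflation factor (VIF); values above roughly 5–10 are usually taken as a warning sign. A minimal sketch with statsmodels:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Include a constant so the VIFs correspond to a model with an intercept
X_vif = df_encoded.assign(const=1).astype(float)
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns)
print(vif.drop('const').round(2))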

Confounding Variables

The confounding variable problem is the counterpart of multicollinearity. With multicollinearity we include more variables than necessary, while with confounding the model suffers from the exclusion of necessary variables. For example, in our model an increase in bmi appears to lead to higher charges. However, it might be a reduction in metabolism that leads to the higher bmi and hence to the higher charges. Confounding is not a simple problem and must be considered whenever causality is inferred in a data-based decision-making process. Sometimes a confounding variable can be tackled with feature engineering, by generating new variables from existing ones. In most cases, however, dealing with confounding requires domain knowledge and consideration of the context in which the problem arises.

Testing the Assumptions

Linear regression is founded on some critical assumptions, most of which can be verified by looking at the residual errors. In the following, we cover some of the metrics and notions used in diagnosing linear regression models.

Outliers

An outlier is generally a sample that is distant from most of the other observations. Depending on the application, outliers can be interesting; in fraud detection, for example, an outlier is a sign of a fraudulent transaction and can trigger many follow-up actions. In the linear regression context, an outlier is a record with a large mismatch between the predicted value and the true value. The distance can be quantified by the standardised residual, which is the residual divided by the standard error of the residuals. Standardised residuals can be interpreted as “the number of standard errors away from the regression line”. The Python code for outlier detection based on the standardised residual is as follows:

import numpy as np

res = df[outcome] - df_lm_factor.predict(df_encoded)
sres = np.abs(res / res.std())
idx_sres, max_sres = sres.idxmax(), sres.max()
df.iloc[idx_sres]

Running the code above identifies the sample with the highest standardised residual:

[Output]:
age 45
sex male
bmi 30.36
children 0
smoker yes
region southeast
charges 62592.87309
Name: 1300, dtype: object

Removing this sample might improve the model accuracy, as it lies about four standard errors away from the regression line. Please note that this approach to outlier detection assumes that the residuals follow a normal distribution.
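To gauge whether this record actually matters, one option is to refit the model without it and compare the goodness of fit; a minimal sketch using the objects defined above (note that the two R² values are computed on slightly different samples, so this is only a rough comparison):

from sklearn.linear_model import LinearRegression

# Refit without the flagged record and compare R^2
mask = df_encoded.index != idx_sres
lm_no_outlier = LinearRegression()
lm_no_outlier.fit(df_encoded[mask], df[outcome][mask])

print('R^2 with the record   :', df_lm_factor.score(df_encoded, df[outcome]))
print('R^2 without the record:', lm_no_outlier.score(df_encoded[mask], df[outcome][mask]))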

Influential Values

An influential value is an observation that has a dramatic effect on the linear regression model; excluding it leads, for example, to a dramatic change in the slope of the fitted line and in the residual errors in general. Such a sample is called an influential observation. It is not necessarily an outlier, as it can have a small standardised residual (i.e. lie close to the regression line). Such data is said to have high leverage on the regression.

A record can be unusual in the outcome space or in the predictor space. The former is called an outlier and the latter is called leverage. In other words, either the predictor values of a given record can differ from those of other records, or its target outcome can differ from the others. Identifying these characteristics of the data helps in interpreting the results and validating the obtained model. To be an influential value, a sample needs to be unusual in both the predictor and outcome spaces.

Influential point

In the figure above, point 8 is far from the other points both in the predictor space and in the outcome space, so it is an influential point.

Statisticians have developed several metrics to determine the influence of a single record on a regression. A common measure of leverage is the hat value; values above 2(P + 1)/n indicate a high-leverage data value [1]. Here, P is the number of predictors and n is the total number of records. A hat value is simply a standardised measure of how far a given sample lies from the centre of the predictor space. Hat values are bounded between 1/n and 1, with 1 denoting the highest leverage (the greatest distance from the mean).

One common way to examine influential points is the influence plot, or bubble plot, which plots the studentised residuals vs. the hat values. Studentised residuals are calculated from the residual error and standard error with the potentially influential sample left out, together with the corresponding hat values. Another metric is Cook’s distance, which defines influence as a combination of leverage and residual size. A rule of thumb is that an observation has high influence if its Cook’s distance exceeds 4/(n − P − 1) [1].

The bubble plot for our dataset can be implemented using the statsmodels package. The code is as follows:

import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import OLSInfluence

# Refit with statsmodels; the const column adds the intercept
df_outlier = sm.OLS(df[outcome], df_encoded.assign(const=1))
result = df_outlier.fit()
influence = OLSInfluence(result)

fig, ax = plt.subplots(figsize=(5, 5))
ax.axhline(-1.96, linestyle='--', color='C1')
ax.axhline(1.96, linestyle='--', color='C1')
ax.scatter(influence.hat_matrix_diag, influence.resid_studentized_internal,
           s=1000 * np.sqrt(influence.cooks_distance[0]),
           alpha=0.5)
ax.set_xlabel('hat values')
ax.set_ylabel('studentized residuals')
plt.tight_layout()
plt.show()
Bubble plot

Higher hat values represent higher leverage, while the studentised residuals point to possible outliers. If we use a 95% confidence interval for the studentised residuals, any points that fall outside [-1.96, 1.96] are potential outliers. Another possibility is that the points with higher studentised residuals simply need a more sophisticated model for accurate prediction.
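The two rules of thumb mentioned earlier can also be applied numerically to the influence object from the code above; a minimal sketch (the thresholds follow [1], using the number of encoded predictors as P):

# Count observations exceeding the rule-of-thumb thresholds
n, P = df_encoded.shape
hat_threshold = 2 * (P + 1) / n
cooks_threshold = 4 / (n - P - 1)

high_leverage = influence.hat_matrix_diag > hat_threshold
high_cooks = influence.cooks_distance[0] > cooks_threshold
print('high-leverage points:', high_leverage.sum())
print("high Cook's distance:", high_cooks.sum())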

Heteroskedasticity

Heteroskedasticity generally deals with the variance of the residuals. For a good model, the residuals are expected to be normally distributed with no correlation among samples (samples can be normally distributed and yet still exhibit some correlation). The residuals are also expected to have constant variance across different ranges of the predictions. Heteroskedasticity is the lack of constant variance in the residual errors; in other words, some parts of the data have higher residual errors than others. This is particularly critical when evaluating hypothesis tests or prediction intervals. Heteroskedasticity indicates that prediction errors differ for different ranges of the predicted value, and may suggest an incomplete model [1].

For our data, we can plot the predicted values versus the residual errors to look for possible heteroskedasticity. The following code plots the absolute value of the residuals versus the predicted values.

import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 5))
sns.regplot(x=result.fittedvalues, y=np.abs(result.resid),
            scatter_kws={'alpha': 0.25}, line_kws={'color': 'C1'},
            lowess=True, ax=ax, ci=95)
plt.show()
absolute residual versus prediction
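Beyond the visual check, a formal test such as the Breusch–Pagan test can quantify heteroskedasticity. It is not used in the original article, but a minimal sketch with the statsmodels fit from above would look like this (a small p-value would point to non-constant variance):

from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test of the residuals against the model's regressors
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(result.resid, result.model.exog)
print(f'Breusch-Pagan LM p-value: {lm_pvalue:.4f}')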

As can be seen from the plot, the variation of the residual errors around the regression line is almost constant for a wide range of prediction values. This demonstrates that the model we have developed does not suffer from heteroskedasticity. The normality of the residual errors can be verified by looking directly at their distribution. The code below plots the distribution of the studentised residuals.

fig, ax = plt.subplots(figsize=(4, 4))
pd.Series(influence.resid_studentized_internal).hist(ax=ax)
ax.set_xlabel('std. residual')
ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
Hist of studentised residual

As can be seen from the figure above, the histogram of the studentised residuals is positively skewed, as the residuals stretch toward positive values. This should be taken into account when determining prediction intervals.
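A Q-Q plot offers another view of how far the studentised residuals depart from normality; a minimal sketch using statsmodels (not part of the original article):

import statsmodels.api as sm
import matplotlib.pyplot as plt

# Q-Q plot of the studentised residuals against the standard normal
sm.qqplot(influence.resid_studentized_internal, line='45')
plt.tight_layout()
plt.show()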

Partial Residual Plots

How can we check for non-linearity in the relation between the outcome and the predictors? Partial residual plots provide good insight into this question. They are a way to visualise how well the estimated fit explains the relationship between a predictor and the outcome. The basic idea of a partial residual plot is to isolate the relationship between a predictor variable and the response, taking into account all of the other predictor variables [1]. In other words, partial residual plots can be used to qualitatively assess the fit for each regression term, possibly leading to an alternative model specification.

Care should be taken if the predictor is highly correlated with any of the other independent variables. If this is the case, the variance evident in the plot will be an underestimate of the true variance.

Using the statsmodels package, the partial residual plots for bmi and age can be obtained as below:
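The exact plotting code is in the linked notebook; as an alternative, a minimal sketch using statsmodels’ component-plus-residual (CCPR) plots, which are one form of partial residual plot, could look like this:

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Component-plus-residual (partial residual) plots for age and bmi
sm.graphics.plot_ccpr(result, 'age', ax=axes[0])
sm.graphics.plot_ccpr(result, 'bmi', ax=axes[1])
plt.tight_layout()
plt.show()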

partial plot for `age` and `bmi`

The plots above show that age has a stronger association with the outcome than bmi.

Wrap-up

In this article, we have briefly presented the diagnostic approach in linear regression to analyse and evaluate the resulting model.

Reference

[1] Bruce, Peter, Andrew Bruce, and Peter Gedeck. Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python. O’Reilly Media, 2020.
