Evaluation of Linear Regression Model

Mukesh Chaudhary
8 min read · Jul 31, 2020

How to better analyze a regression model’s performance with metrics

Hi everybody. In this blog, I would like to discuss some metrics for analyzing a regression model more carefully, especially with overfitting and underfitting in mind. Model evaluation is very important in data science. It helps you understand the performance of your model and makes it easier to present your model to other people. We can easily get all of these metrics from the sklearn library. However, I think we should know how to use them properly for regression problems, because regression works on a different concept than classification. In data science projects, many of the problems we meet are classification problems, and we use the confusion matrix and related metrics most of the time to evaluate them. In a classification problem the model predicts a class such as yes or no, so a prediction is simply right or wrong and the model is easy to evaluate. In a regression problem, accuracy is harder to illustrate: you cannot expect to predict the exact value, so the question is how close your prediction is to the real value. In this blog, I want to emphasize that we should look at every metric, not only R squared, to make sure the model performs well. Sometimes, when we get a good R squared value like 0.95, we assume that the model predicts accurately, but that is not always true. So how do we evaluate a regression model? Let’s start:

1. R Square/Adjusted R Square

2. Mean Square Error(MSE)/Root Mean Square Error(RMSE)

3. Mean Absolute Error(MAE)

4. Illustrate the residuals of the model as a normal distribution (bell shape)

5. OLS from statsmodels.formula

R Square/Adjusted R Square :

This is the first measure most of us look at when evaluating a regression model, because it is easy to interpret: the score usually lies between 0 and 1. If we see a score close to 1, we assume that the model is a good fit. Of course, R Square is a good measure of how well the model fits the dependent variable. However, it does not take overfitting into consideration. If your regression model has many independent variables, the model may be too complicated: it may fit the training data very well but perform badly on the testing data. So I recommend looking at the model from every perspective for a better evaluation. Let’s talk about what R² actually means. R² is calculated as 1 minus the sum of squared prediction errors divided by the total sum of squares, where the total sum of squares replaces the model’s prediction with the mean of the actual values. On the training data of a linear model with an intercept, R² lies between 0 and 1 (on test data it can even be negative when the model does worse than predicting the mean), and a bigger value indicates a better fit between prediction and actual value.
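To make the definition concrete, here is a minimal sketch that computes R² by hand with NumPy. The function name r_squared and the sample numbers are made up for illustration; on real data the result should agree with sklearn’s r2_score(y_true, y_pred).

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper, for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # sum of squared prediction errors
    ss_res = np.sum((y_true - y_pred) ** 2)
    # sum of squared errors when we always predict the mean of y_true
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# made-up values, not taken from the dataset used in this blog
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

score = r_squared(y_true, y_pred)
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])   # always predict the mean
worse = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])      # predictions in reverse order

print(score)     # close to 1: good fit
print(baseline)  # 0: no better than the mean
print(worse)     # negative: worse than the mean
```

The last two calls show why the score is read relative to the mean: a model that only predicts the mean scores 0, and a model that does worse than that goes below 0.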

Sometimes R² is more helpful than Mean Square Error (MSE) or Mean Absolute Error (MAE) for judging a model, because it does not depend on the scale of the dependent variable. For instance, R² is a good measure for summarizing how well the model fits in a case like the figure below.

Here I also want to focus on the Adjusted R² measure. Sometimes R² and Adjusted R² show nearly the same score. But when we fine-tune the model to get better accuracy, Adjusted R² helps us understand the result better. Adjusted R² penalizes independent features that do not improve the model, so when we add more features, or remove features because of overfitting, we can see the two scores diverge.
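The usual formula is Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p is the number of independent features. A small sketch, with a hypothetical helper name and made-up inputs, shows how the same R² is penalized more as features are added:

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Hypothetical helper for illustration; r2 is an ordinary R^2 score.
    """
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same R^2 of 0.95 on 100 samples, with 4 features versus 40 features
few = adjusted_r_squared(0.95, n_samples=100, n_features=4)
many = adjusted_r_squared(0.95, n_samples=100, n_features=40)

print(few)   # slightly below 0.95
print(many)  # noticeably lower: many features, same fit
```

With few features and many samples the two scores are almost identical, which matches what we usually see on a small model.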

Mean Square Error(MSE)/Root Mean Square Error(RMSE):

While R² is a relative measure of how well the model fits the dependent variable, Mean Square Error is an absolute measure of the fit. MSE is calculated as the mean of the squared prediction errors, where the prediction error is the difference between the true value and the predicted value. The error is squared so that negative and positive errors do not cancel out. The result tells us how far the predictions deviate from the actual numbers. Because of the squaring, MSE is expressed in squared units of the dependent variable, so the number can look unusually large, and you might ask why the error score is so big.

For example

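import math
from sklearn.metrics import mean_squared_error

# Y_test holds the true values, Y_pred the model's predictions
print(mean_squared_error(Y_test, Y_pred))
print(math.sqrt(mean_squared_error(Y_test, Y_pred)))
# MSE: 109.86374118394116
# RMSE: 10.48159058463653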
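To see what these two calls compute, here is a rough sketch of MSE and RMSE written by hand with NumPy. The helper names and the sample values are made up for illustration and are not the data from this blog.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(errors ** 2)

def rmse(y_true, y_pred):
    """Square root of MSE, back in the units of the target."""
    return np.sqrt(mse(y_true, y_pred))

# made-up values: every prediction is off by exactly 10
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 310.0]

mse_value = mse(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(mse_value)   # 100.0 (squared units)
print(rmse_value)  # 10.0 (same units as the target)
```

Every prediction here misses by 10, and RMSE reports exactly that, while MSE reports the square of it.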

In the example above, the MSE looks like a very big score. To deal with this, we can use the Root Mean Square Error (RMSE), which is in the same units as the dependent variable and is therefore easier to interpret. One more question comes to mind: how does the number tell us whether the model is a good or a bad fit? In theory, if MSE or RMSE is 0.0, the model has no error. But in a real-life project we never get an error of 0.0. We always get some error score, so how do we evaluate the number? First, we calculate the mean value of the dependent variable. Then we compare that mean with the RMSE, which shows the typical deviation as a percentage of the real values. For example, take the dependent values from the example below, which have a mean of 499.31, and the RMSE of 10.48 that we got from the model. The RMSE is then about 2.1% of the mean (10.48 / 499.31), so the predictions typically deviate from the true values by about 2.1%.

y = df['Yearly Amount Spent']
y
# Output:
0 587.951054
1 392.204933
2 487.547505
3 581.852344
4 599.406092
...
495 573.847438
496 529.049004
497 551.620145
498 456.469510
499 497.778642
Name: Yearly Amount Spent, Length: 500, dtype: float64
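The comparison between RMSE and the mean of the dependent variable is a single division. A small sketch using the two numbers quoted in this blog (the variable names are only for illustration):

```python
# Numbers quoted in this blog
rmse = 10.48159058463653   # RMSE of the model on the test set
mean_target = 499.31       # mean of 'Yearly Amount Spent'

# RMSE expressed as a percentage of the mean target value
relative_rmse = rmse / mean_target * 100
print(f"RMSE is {relative_rmse:.2f}% of the mean target value")
```

This comes out at roughly 2.1%. On your own data, np.mean(y) gives the mean to divide by.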

A Python example can be found in the sklearn documentation for mean_squared_error (see References).

Mean Absolute Error(MAE):

This is almost the same as the Mean Square Error metric, except that MAE takes the absolute value of each prediction error instead of its square to avoid negative scores. Because nothing is squared, MAE is already in the same units as the dependent variable, so we don’t need to take a root of the MAE score. We can interpret the score directly against the real values.

A Python example can be found in the sklearn documentation for mean_absolute_error (see References).
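As a rough illustration of how MAE differs from RMSE, the sketch below computes both by hand on made-up values that contain one large miss. The helper name mae and the numbers are assumptions for this example only.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute prediction errors."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(errors))

# made-up values: three small misses and one large miss of 40
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([98.0, 203.0, 299.0, 440.0])

mae_value = mae(y_true, y_pred)
rmse_value = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(mae_value)   # 11.5
print(rmse_value)  # larger than MAE, because squaring weights the big miss more
```

When MAE and RMSE are far apart like this, it is a hint that a few predictions have much larger errors than the rest.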

Explore Residual :

Finally, we can also explore the residuals, which are the differences between the true values and the predicted values, with a scatter plot or a distplot from the seaborn library or matplotlib. If the scatter plot of predicted against real values shows a linear shape, or the distplot of the residuals shows a bell shape centered on zero, then we can be fairly confident that the model fits well and predicts close to the real values. For example:

# explore residual
residual = y_test - y_pred
sns.distplot(residual)  # deprecated in newer seaborn releases; sns.histplot(residual, kde=True) is the replacement

Output: [figure: distribution plot of the residuals]

Scatter plot by matplotlib:

plt.scatter(y_test, y_pred)
plt.xlabel("Real Values")
plt.ylabel("predicted values")

Output: [figure: scatter plot of real values against predicted values]
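Alongside the plots, a few numbers can back up the visual impression of a bell shape. The sketch below is only a rough check on synthetic data (the helper residual_summary and all values are assumptions, not the data from this blog): for roughly normal residuals, the mean should be near zero, about 68% of residuals should lie within one standard deviation, and about 95% within two.

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Rough numeric check of the bell shape of the residuals (hypothetical helper)."""
    residual = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mean, std = residual.mean(), residual.std()
    centered = np.abs(residual - mean)
    return {
        "mean": mean,
        "std": std,
        "within_1_std": np.mean(centered <= std),       # about 0.68 if normal
        "within_2_std": np.mean(centered <= 2 * std),   # about 0.95 if normal
    }

# synthetic data: predictions equal the true values plus normal noise
rng = np.random.default_rng(0)
y_true = rng.normal(500, 80, size=2000)
y_pred = y_true - rng.normal(0, 10, size=2000)

summary = residual_summary(y_true, y_pred)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

This does not replace the plots or a formal normality test; it is just a quick way to notice residuals that are clearly off-center or heavy-tailed.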

Once all of the evaluations show that the model predicts well, we can check the coefficients of the independent features, which tell us the answer to our problem. For example, here I took dummy data from Kaggle, fit a regression model, and evaluated the model with all of the metrics.

OLS from statsmodels.formula:

With the statsmodels library, we can explore the whole summary in one place: R², Adjusted R², coefficients, etc.

For example

All code is in python:

## Evaluating a regression model with all metrics
## The data is taken from Kaggle; it is dummy data for practice only.
# import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# suppress warning messages
import warnings
warnings.filterwarnings("ignore")
# import data
df_original = pd.read_csv("Ecommerce Customers.csv")
df = df_original.copy()
df.head()
# check datatype of features and null values
df.info()
# check statistics overall
df.describe()
# data Exploration
sns.pairplot(df)
# check correlation (numeric columns only, in case the dataset also contains text columns)
sns.heatmap(df.select_dtypes("number").corr(), annot=True)
# split data
X = df[['Avg. Session Length','Time on App','Time on Website','Length of Membership']]
X.head()
y = df['Yearly Amount Spent']
y
# mean values
np.mean(df['Yearly Amount Spent'])
# import classes and methods from the sklearn library
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# split data for training and test set data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 42)
# check the dimensions of every split dataset
print(df.shape)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# fit data in linearRegression
lg = LinearRegression()
lg.fit(X_train,y_train)
lg.score(X_train,y_train)
#0.9854240629700333
# predict values
y_pred = lg.predict(X_test)
# import evaluation metrics from the sklearn library
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
# R2 score: the closer to 1, the better the fit
print("R2 Score")
# note: the signature is r2_score(y_true, y_pred); the value below was reported
# from the original call r2_score(y_pred, y_test), so the corrected call may differ slightly
r2_score(y_test, y_pred)
#0.9782625350414402
# mean square error
print("Mean Square Error")
mean_squared_error(y_test,y_pred)
#109.86374118394116
print("Root Mean Square Error")
print(np.sqrt(mean_squared_error(y_test,y_pred)))
#10.48159058463653
# Mean Absolute Error
print("Mean Absolute Error")
mean_absolute_error(y_test,y_pred)
#8.558441885315286
# explore residual
residual = y_test - y_pred
sns.distplot(residual)  # deprecated in newer seaborn releases; sns.histplot(residual, kde=True) is the replacement
sns.scatterplot(x=y_test, y=y_pred)
plt.scatter(y_test, y_pred)
plt.xlabel("Real Values")
plt.ylabel("predicted values")
# Coefficients of model
coefficients = pd.DataFrame(lg.coef_, X.columns)
coefficients.head()
# Avg. Session Length 25.596259
# Time on App 38.785346
# Time on Website 0.310386
# Length of Membership 61.896829
# evaluate from statsmodels
import statsmodels.formula.api as smf
# merge data X_train and y_train for ols formula
train_data = pd.merge(y_train, X_train, left_index=True, right_index=True)
# remove all white space from column names by renaming
train_data.rename(columns={'Yearly Amount Spent':'Yearly_Spent', 'Avg. Session Length':'Avg_SessionLeg', 'Time on App':'Time_App',
'Time on Website':'Time_Website', 'Length of Membership':'Length_Mem'},inplace= True)
df_dependent = train_data['Yearly_Spent']
df_independent = train_data.drop(labels = ['Yearly_Spent'],axis =1)
# making formula
featureFormula = "+".join(df_independent.columns)
sm_formula = "Yearly_Spent ~ " + featureFormula
# fit model
results = smf.ols(sm_formula, data=train_data).fit()
results.summary()
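The coefficients in both the sklearn model and the OLS summary are ordinary least squares estimates. As a rough illustration of where such numbers come from, the sketch below fits the same kind of model with plain NumPy instead of statsmodels. The data is synthetic and the "true" coefficients (500, 25 and 60) are made up for this example.

```python
import numpy as np

# synthetic data: two features, a known intercept and known coefficients
rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 2))
true_intercept = 500.0
true_coef = np.array([25.0, 60.0])
y = true_intercept + X @ true_coef + rng.normal(0, 1.0, size=n)

# add a column of ones so the model can estimate an intercept
X_design = np.column_stack([np.ones(n), X])

# least squares solution: minimizes the sum of squared residuals
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

print("intercept:", intercept)
print("coefficients:", coefs)
```

The estimates land close to the values used to generate the data. What the statsmodels summary adds on top is the inference around them, such as standard errors, p-values and confidence intervals.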

All of the code and data above can also be found at the following link.

Conclusion:

I tried to explore how to evaluate a regression model, because I believe it requires different techniques than a classification problem. We have to check all of the metrics and explore the residuals before drawing any conclusion about a model. In addition, we can compare the evaluation metric scores with those of other algorithms such as RandomForestRegressor, which gives us an idea of how they compare. The OLS formula API, imported from the statsmodels library, is also good for seeing an overall summary.

References:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html
https://www.w3schools.com/python/
https://www.youtube.com/watch?v=urbORR5XuY4
