How to evaluate the performance of a machine learning model

Vijay Choubey
12 min read · Apr 25, 2020

Well, in this blog I'm super excited to start with the concept of evaluation metrics. We will discuss the different metrics used to evaluate regression and classification problems. So let's first understand each metric and then implement it using Python.

How do I measure the performance of the models ?

A good fitting model is one where the difference between the actual values and the predicted values is small and unbiased for the train, validation and test data sets.

1.RMSE

The most commonly used metric for regression tasks is RMSE (root mean squared error). It is defined as the square root of the average squared difference between the actual and predicted values:

RMSE = sqrt( (1/n) * Σ (y_i - ŷ_i)² )

Here, y_i denotes the true value for the i-th data point, and ŷ_i denotes the predicted value. One intuitive way to understand this formula is that it is the Euclidean distance between the vector of true values and the vector of predicted values, divided by √n, where n is the number of data points.

Implementation using Python:

import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# assumes `data` is a pandas DataFrame with TV, Radio, Newspaper and Sales columns
# (e.g. the Advertising dataset)

# include Newspaper
X = data[['TV', 'Radio', 'Newspaper']]
y = data.Sales
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Instantiate model
lm2 = LinearRegression()
# Fit model
lm2.fit(X_train, y_train)
# Predict
y_pred = lm2.predict(X_test)
# RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
1.40465142303

# exclude Newspaper
X = data[['TV', 'Radio']]
y = data.Sales
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Instantiate model
lm2 = LinearRegression()
# Fit model
lm2.fit(X_train, y_train)
# Predict
y_pred = lm2.predict(X_test)
# RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
1.38790346994

2.Mean Squared Error

Mean Squared Error (MSE) is the average of the squared differences between the estimated values and the actual results. The prediction comes from the model's equation and tells you what to expect on average, but the result you actually get can differ slightly from that prediction. MSE measures the typical size of this error and therefore tells you how good the estimates based on your equation are.
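As a minimal sketch (reusing the y_test and y_pred arrays from the RMSE example above), MSE can be computed directly with scikit-learn, and taking its square root recovers the RMSE:

from sklearn.metrics import mean_squared_error

# MSE: the average squared difference between actual and predicted values
mse = mean_squared_error(y_test, y_pred)
print('MSE :', mse)
# RMSE is simply the square root of MSE
print('RMSE:', mse ** 0.5)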

3.Mean Absolute Error

Mean Absolute Error (MAE) measures the difference between two continuous variables: it is the average of the absolute differences between the actual values and the predicted values. Geometrically, it is the average vertical distance between each actual data point and the line that best matches the data.


Implementing Linear Regression

Implementation using Python:

# importing Linear Regression and the mean absolute error metric
from sklearn.linear_model import LinearRegression as LR
from sklearn.metrics import mean_absolute_error as mae

Training Model:

# Creating an instance of Linear Regression with normalised data
# (assumes train_x, train_y and test_x, test_y have already been created;
#  note: the `normalize` argument was removed in newer scikit-learn versions)
lr = LR(normalize = True)
# Fitting the model
lr.fit(train_x, train_y)

Output:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)

Predicting over the train set

# Predicting over the Train Set and calculating error
train_predict = lr.predict(train_x)
k = mae(train_predict, train_y)
print('Training Mean Absolute Error', k )
Training Mean Absolute Error 822.5458775969962

Predicting over the test set

# Predicting over the Test Set and calculating error
test_predict = lr.predict(test_x)
k = mae(test_predict, test_y)
print('Test Mean Absolute Error    ', k )
Test Mean Absolute Error     872.4151667761614

4.R² (Coefficient of Determination)

R², also called the coefficient of determination, is the regression score function. It measures how well the actual outcomes are replicated by the model or the regression line, based on the proportion of the total variation in the outcomes that the model explains. R² normally lies between 0 and 1, i.e. between 0% and 100%.

Sum of squared errors (SSE): how far the predicted values are from the actual values.

SSE = Σ (actual value - predicted value)²

Sum of squares total (SST): how far the actual values are from their mean.

SST = Σ (actual value - mean value)²

Sum of squares due to regression (SSR): how far the predicted values are from the mean.

SSR = Σ (predicted value - mean value)²

R² = 1 - SSE / SST, so if the error in prediction is low, SSE will be low and R² will be close to 1.

A note of caution here: when we add more independent variables, R² gets a higher value. R² keeps increasing with the addition of independent variables even when they have no real impact on the predictions, and this does not help us build a good model.

To overcome this issue, we use Adjusted R². Adjusted R² penalizes the model for every addition of an insignificant independent variable.
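A rough sketch of such a calculation, using the standard adjusted R² formula with n observations and p predictors (the helper function name is mine, purely for illustration):

from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

Unlike R², this value can fall when a newly added variable contributes little explanatory power.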

A value close to 1 for R² means a good fit. We can also calculate the root mean squared error, referred to as RMSE, as discussed earlier.

Implementation using Python:

For the performance_metric function in the code cell below, you will need to implement the following:

  • Use r2_score from sklearn.metrics to perform a performance calculation between y_true and y_predict.
  • Assign the performance score to the score variable.
# Import 'r2_score'
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between
    true and predicted values based on the metric chosen. """
    # Calculate the performance score between 'y_true' and 'y_predict'
    score = r2_score(y_true, y_predict)
    # Return the score
    return score

# Calculate the performance of this model
score = performance_metric([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
print("Model has a coefficient of determination, R^2, of {:.3f}.".format(score))
Model has a coefficient of determination, R^2, of 0.923.

Answer:

  • R² = 92.3%
  • This implies that 92.3% of the variation in the target variable is explained by the model, which seems high.
  • Potential pitfall: we only have five points here, so it may be hard to draw a statistically significant conclusion.

5.Accuracy and the Confusion Matrix

We have seen evaluation metrics for regression; we now explore the evaluation metrics for classification. For classification, the most common metric is accuracy.

Accuracy simply measures how often the classifier makes the correct prediction: it is the ratio between the number of correct predictions and the total number of predictions. While accuracy is easy to understand, it is not well suited to unbalanced classes, so we also need to explore other metrics for classification. A confusion matrix is a structure that summarises the predictions of a classifier, and it forms the basis of many classification metrics.

There are 4 important terms:

True Positives: The cases in which we predicted YES and the actual output was also YES.

True Negatives: The cases in which we predicted NO and the actual output was NO.

False Positives: The cases in which we predicted YES and the actual output was NO.

False Negatives: The cases in which we predicted NO and the actual output was YES.

Accuracy for the matrix can be calculated as the sum of the values lying on the "main diagonal" (true positives and true negatives) divided by the total number of predictions, i.e.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
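As a small illustrative sketch (the labels below are made up purely for demonstration), the confusion matrix and accuracy can be computed with scikit-learn:

from sklearn.metrics import confusion_matrix, accuracy_score

# hypothetical true and predicted binary labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# accuracy = (TP + TN) / total predictions
print(accuracy_score(y_true, y_pred))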

6.Area Under Curve

One of the widely used metrics for binary classification is the Area Under Curve (AUC). AUC represents the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. The AUC is based on a plot of the false positive rate vs the true positive rate, which are defined as:

FPR = FP / (FP + TN)    TPR = TP / (TP + FN)

The AUC is the area under the curve obtained when the false positive rate is plotted against the true positive rate, as in the ROC plot shown later in this post.

AUC ranges between 0 and 1.

A value of 0 means that 100% of the model's predictions are incorrect. A value of 1 means that 100% of the model's predictions are correct.

AUC has a range of [0, 1]. The greater the value, the better the performance of the model, because the curve sits closer to the top-left corner (a high true positive rate at a low false positive rate). The curve shows how many additional correct positive classifications can be gained by allowing more false positives, and the advantage of considering the area under the curve rather than the whole curve is that a single number is easier to compare across similar scenarios.

Another pair of metrics commonly used is precision and recall. Precision answers the question, "Out of the items that the classifier predicted to be relevant, how many are truly relevant?" Recall answers the question, "Out of all the items that are truly relevant, how many are found by the ranker/classifier?" As with the AUC, we want a single numeric value to compare similar scenarios.
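As a minimal sketch (reusing the made-up labels from the confusion matrix example above), precision and recall are available directly in scikit-learn:

from sklearn.metrics import precision_score, recall_score

# precision: of the predicted positives, how many are truly positive
print(precision_score(y_true, y_pred))
# recall: of the truly positive items, how many did the classifier find
print(recall_score(y_true, y_pred))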

7.F1 score

A single number that combines precision and recall is the F1 score, which is the harmonic mean of the two:

F1 = 2 * (precision * recall) / (precision + recall)

Implementation using Python:

# imports for classifiers and metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score
# train/test split (sklearn.cross_validation was removed; use model_selection instead)
from sklearn.model_selection import train_test_split

# assumes X (features) and y (binary labels) have already been loaded
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision Tree Classifier

# instantiate
dtc = DecisionTreeClassifier()

# fit
dtc.fit(X_train, y_train)

# predict
y_pred = dtc.predict(X_test)

# f1 score
score = f1_score(y_test, y_pred)

# print
print("Decision Tree F1 score: {:.2f}".format(score))
Decision Tree F1 score: 0.55

# Gaussian Naive Bayes

# instantiate
gnb = GaussianNB()

# fit
gnb.fit(X_train, y_train)

# predict
y_pred_2 = gnb.predict(X_test)

# f1 score
score_2 = f1_score(y_test, y_pred_2)

# print
print("GaussianNB F1 score: {:.2f}".format(score_2))
GaussianNB F1 score: 0.53

8.Area Under Curve Analysis

We can also use the area under the curve to judge the performance of the model by comparing the trained model against a perfect model and a random model (a comparison sometimes called cumulative accuracy profile, or CAP, analysis):

  1. Calculate the area between the perfect model (aP) and the random model (a).
  2. Calculate the area between the prediction model (aR) and the random model (a).
  3. Calculate the accuracy rate (AR) = aR / aP.

The closer the value of AR to 1, the better.

from sklearn.metrics import auc

# Assumes: `total` is the number of test samples, `class_1_count` is the number of
# positive (class 1) samples, and x_values, y_values trace the cumulative gains of the
# trained classifier.
# Area under the random model
a = auc([0, total], [0, class_1_count])
# Area between the perfect and random models
aP = auc([0, class_1_count, total], [0, class_1_count, class_1_count]) - a
# Area between the trained and random models
aR = auc(x_values, y_values) - a
print("Accuracy Rate for Support Vector Classifier: {}".format(aR / aP))
Accuracy Rate for Support Vector Classifier: 0.9688542825361512

9.Accuracy Analysis and Testing Data Science Models

ROC Curve Analysis

The ROC analysis curve is very important both in statistics and in data science. It summarises the performance of a test or model by plotting its sensitivity (true positive rate) against its fall-out (false positive rate).

This is crucial when determining the viability of a model.

Like many great leaps in technology, this was developed due to war.

In World War 2 it was used to detect enemy aircraft from radar signals. Its usage has since spread into many other fields: it has been used to compare bird songs, characterise the responses of neurons, assess the accuracy of tests and much, much more.

How does ROC work?

When you run a machine learning model, some of its predictions are inaccurate. Some of these errors occur because an instance that should have been labeled true was instead labeled false.

Others were labeled true when they should have been false.

Since predictions and statistics are really just very well supported guesses, what is the probability your prediction is correct?

It is important to have an idea of how right you are!

Using the ROC curve, you can see how accurate your predictions are, and by looking at the two score distributions (one for the positive class and one for the negative class) you can figure out where to put your threshold.

Your threshold is where you decide whether your binary classification is positive or negative, true or false.

It is also what determines the x and y values along your ROC curve.

As the two score distributions get closer and closer together, your curve will lose the area underneath it.

This means your model is less and less accurate, no matter where you put your threshold.

The ROC curve is one of the first tests used when modeling with most algorithms. It helps detect problems early on by telling you whether or not your model is accurate.
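To make the threshold idea concrete, here is a tiny sketch with made-up probabilities: moving the threshold changes which scores become positive predictions, and therefore which (false positive rate, true positive rate) point on the ROC curve you operate at.

import numpy as np

# hypothetical predicted probabilities for the positive class
probs = np.array([0.15, 0.40, 0.55, 0.70, 0.90])

# a lower threshold labels more examples as positive (higher TPR, but usually higher FPR too)
for threshold in (0.3, 0.5, 0.7):
    predictions = (probs >= threshold).astype(int)
    print(threshold, predictions)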

Receiver Operating Characteristic (ROC) Curve

The Receiver Operating Characteristic, better known as the ROC curve, is an excellent method of measuring the performance of a classification model. The true positive rate is plotted against the false positive rate over the range of thresholds applied to the classifier's predicted probabilities, and the area under the plot is calculated.

The greater the area under the curve, the better the model is at distinguishing between the classes.

The steps are as follows:

  1. Calculate probabilities of the classification using predict_proba
  2. Select a class you want to plot, in this case the second class (with label 1.0)
  3. Using sklearn.metrics.roc_curve calculate the True Positive Rate (TPR) and the False Positive Rate(FPR).
  4. Plot TPR on the y-axis and FPR on the x-axis.
  5. Calculate the area under this curve using sklearn.metrics.auc.

Implementation using Python:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

plt.figure(figsize = (20, 12))
plt.plot([0, 1], [0, 1], 'r--')

# assumes supportVectorClassifier is an already-fitted SVC created with probability=True
probs = supportVectorClassifier.predict_proba(X_test)
# Reading the probability of the second class
probs = probs[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)

label = 'Support Vector Classifier AUC:' + ' {0:.2f}'.format(roc_auc)
plt.plot(fpr, tpr, c = 'g', label = label, linewidth = 4)
plt.xlabel('False Positive Rate', fontsize = 16)
plt.ylabel('True Positive Rate', fontsize = 16)
plt.title('Receiver Operating Characteristic', fontsize = 16)
plt.legend(loc = 'lower right', fontsize = 16)
plt.show()


10.Log Loss/Binary Crossentropy

Log loss is a good evaluation metric for binary classifiers, and it is often also the optimization objective, as in the case of logistic regression and neural networks.

Binary log loss for a single example is given by the formula below, where y is the true label (0 or 1) and p is the predicted probability of class 1:

Log loss = -( y * log(p) + (1 - y) * log(1 - p) )

As you can see, the log loss decreases as we become more confident in predicting 1 when the true label is indeed 1.

When to Use?

When the output of a classifier is prediction probabilities. Log Loss takes into account the uncertainty of your prediction based on how much it varies from the actual label. This gives us a more nuanced view of the performance of our model. In general, minimizing Log Loss gives greater accuracy for the classifier.

How to Use?

Implementation using Python:

from sklearn.metrics import log_loss
# where y_pred are probabilities and y_true are binary class labels
log_loss(y_true, y_pred, eps=1e-15)
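As a quick illustrative sketch with made-up values, confident correct predictions give a small loss, while confident wrong predictions are punished heavily:

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
# predicted probabilities of class 1
y_pred = [0.9, 0.1, 0.8, 0.65]
print(log_loss(y_true, y_pred))   # small loss: mostly confident and correct
print(log_loss(y_true, [0.1, 0.9, 0.2, 0.4]))  # much larger loss: confident and wrong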

Limitations

It is sensitive to imbalanced datasets. You might have to introduce class weights to penalize errors on the minority class more, or use it only after balancing your dataset.

11.Categorical Crossentropy

The log loss also generalizes to the multiclass problem. The classifier in a multiclass setting must assign a probability to each class for all examples. If there are N samples belonging to M classes, then the Categorical Crossentropy is the summation of the -y_ij * log(p_ij) values over all samples and classes, usually averaged over the N samples:

Categorical Crossentropy = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)

where y_ij is 1 if sample i belongs to class j and 0 otherwise, and p_ij is the probability our classifier predicts for sample i belonging to class j.

When to Use?

When the output of a classifier is multiclass prediction probabilities. We generally use Categorical Crossentropy in case of Neural Nets. In general, minimizing Categorical cross-entropy gives greater accuracy for the classifier.

How to Use?

Implementation using Python:

from sklearn.metrics import log_loss  
# Where y_pred is a matrix of probabilities with shape = (n_samples, n_classes) and y_true is an array of class labels
log_loss(y_true, y_pred, eps=1e-15)
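As a small illustrative sketch with made-up values for a three-class problem, y_pred is a matrix with one probability per class for each sample and y_true holds the integer class labels:

from sklearn.metrics import log_loss

y_true = [0, 2, 1]
# one row per sample, one probability per class (each row sums to 1)
y_pred = [[0.8, 0.1, 0.1],
          [0.2, 0.2, 0.6],
          [0.3, 0.5, 0.2]]
print(log_loss(y_true, y_pred))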

Limitation

As with binary log loss, it is sensitive to imbalanced datasets.

Conclusion

The choice of an evaluation metric should be well aligned with the business objective, so it is somewhat subjective. You can also come up with your own evaluation metric.

Thanks for the read. I am going to be writing more beginner-friendly posts in the future too. Follow me to be informed about them.

Clap if you liked the article!


Vijay Choubey

Data Scientist @ Accenture AI|| Medium Blogger || NLP Enthusiast || Freelancer LinkedIn: https://www.linkedin.com/in/vijay-choubey-3bb471148/