Understanding Performance Metrics for Machine Learning Algorithms

Performance metrics explained — How do they work and when to use which?

Kiran Parte · Published in Analytics Vidhya · Aug 6, 2020

Performance metrics are used to evaluate the overall performance of machine learning algorithms and to understand how well our models perform on given data under different scenarios. Choosing the right metric is essential to understanding the behavior of a model and making the changes needed to improve it further. There are many types of performance metrics; in this article, we’ll look at some of the most widely used ones.

Confusion Matrix.

A confusion matrix is used to evaluate the performance of classification algorithms.

[Figure: Confusion matrix for binary classification]

As we can see from the image above, a confusion matrix has two rows and two columns for binary classification. The number of rows and columns of a confusion matrix is equal to the number of classes. Columns are the predicted classes, and rows are the actual classes.

Now let’s look at each block of our confusion matrix:

1) True Positives (TP): In this case, the actual value is 1 and the value predicted by our classifier is also 1.

2) True Negatives (TN): In this case, the actual value is 0 and the value predicted by our classifier is also 0.

3) False Positives (FP) (Type 1 error): In this case, the actual value is 0 but the value predicted by our classifier is 1.

4) False Negatives (FN) (Type 2 error): In this case, the actual value is 1 but the value predicted by our classifier is 0.

[Image source: Effect Size FAQs by Paul Ellis]

The end goal of our classification algorithm is to maximize the true positives and true negatives, i.e., correct predictions, and to minimize the false positives and false negatives, i.e., incorrect predictions.
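As a quick illustration (not from the original article), here is a minimal sketch of computing a confusion matrix with scikit-learn, assuming it is installed; the toy labels are invented for demonstration:

```python
from sklearn.metrics import confusion_matrix

# Toy ground-truth and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# For binary problems, the four cells can be unpacked directly
tn, fp, fn, tp = cm.ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```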

False negatives can be especially worrisome in medical applications. For example, consider an application that has to detect breast cancer in patients. Suppose a patient has cancer but our model predicts that she doesn’t. This is dangerous because the patient is cancer positive, yet our model failed to detect it.

Accuracy.

Accuracy is the most commonly used performance metric for classification algorithms. Accuracy is defined as the number of correct predictions divided by the total number of predictions. It can be calculated directly from the confusion matrix as Accuracy = (TP + TN) / (TP + TN + FP + FN).

Accuracy works well when the classes are balanced, i.e., there is an equal number of samples for each class. But if the classes are imbalanced, i.e., the number of samples per class is unequal, accuracy may not be the right metric.

Why is accuracy an unreliable metric for imbalanced data?

Let’s consider a binary classification problem with two classes, cats and dogs, where cats make up 90% of the total population and dogs 10%. Here cat is our majority class and dog is our minority class. If our model predicts every data point as a cat, we still get a very high accuracy of 90%.

This can be worrisome, especially when the cost of misclassifying the minority class is very high, e.g., in applications such as credit card fraud detection, where fraudulent transactions are far fewer in number than non-fraudulent ones.
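The cats-vs-dogs scenario can be reproduced with a short sketch (the labels below are made up purely to mirror the 90/10 split, and scikit-learn is assumed to be available):

```python
from sklearn.metrics import accuracy_score

# 90 cats (label 0) and 10 dogs (label 1) -- a 90/10 class imbalance
y_true = [0] * 90 + [1] * 10

# A "model" that blindly predicts the majority class (cat) for every sample
y_pred = [0] * 100

# Accuracy looks great even though not a single dog was detected
print(accuracy_score(y_true, y_pred))  # 0.9
```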

Recall or Sensitivity.

Recall is defined as the number of correct positive predictions divided by the sum of correct positive predictions and incorrect negative predictions (false negatives); it is also called the true positive rate. The recall value ranges from 0 to 1.

Recall can be calculated from the confusion matrix as Recall = TP / (TP + FN). The recall metric is particularly useful when the classes are imbalanced.

Recall answers the following question: Out of all the actual positive class samples, how many did we correctly predict as positive, and how many should have been predicted as positive but were incorrectly predicted as negative?

Recall is all about minimizing false negatives (Type 2 error), so when our objective is to minimize false negatives, we choose recall as a metric.

Why is recall a good metric for imbalanced data?

Let’s consider an example of an imbalanced dataset: there are 1,100 total samples, of which about 91% belong to the negative class. The TP, TN, FP, and FN values are:

True Positives (TP) = 20

True Negatives (TN) = 800

False Positives (FP) = 200

False Negatives (FN) = 80

Now, if we put these values into the recall formula, we get recall = 20 / (20 + 80) = 0.2. This means that out of all the actual positive class samples, only 20% were correctly predicted as positive; the remaining 80% should have been predicted as positive but were incorrectly predicted as negative.

Here, we can see that despite a fairly high accuracy of 74.5%, the recall score is very low because the number of false negatives is greater than the number of true positives.
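As a quick check on these numbers, here is a minimal sketch; the label arrays are synthetic and constructed only to reproduce the TP/TN/FP/FN counts above:

```python
from sklearn.metrics import recall_score, accuracy_score

# Rebuild labels matching TP=20, TN=800, FP=200, FN=80
y_true = [1] * 20 + [0] * 800 + [0] * 200 + [1] * 80
y_pred = [1] * 20 + [0] * 800 + [1] * 200 + [0] * 80

print(recall_score(y_true, y_pred))    # 0.2   -> TP / (TP + FN) = 20 / 100
print(accuracy_score(y_true, y_pred))  # ~0.745 despite the poor recall
```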

Specificity.

Specificity is defined as the number of correct negative predictions divided by the sum of correct negative predictions and incorrect positive predictions (false positives); it is also called the true negative rate. The specificity value ranges from 0 to 1.

Specificity can be calculated from the confusion matrix as Specificity = TN / (TN + FP).

Specificity answers the following question: Out of all the actual negative class samples, how many did we correctly predict as negative, and how many should have been predicted as negative but were incorrectly predicted as positive?

Let’s consider the same imbalanced dataset as above; the TP, TN, FP, and FN values are:

True Positives (TP) = 20

True Negatives (TN) = 800

False Positives (FP) = 200

False Negatives (FN) = 80

If we put these values into the specificity formula, we get specificity = 800 / (800 + 200) = 0.8. This means that out of all the actual negative class samples, 80% were correctly predicted as negative; the remaining 20% should have been predicted as negative but were incorrectly predicted as positive.
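scikit-learn has no dedicated specificity function, but specificity is simply computed from the TN and FP cells of the confusion matrix; a rough sketch reusing the same synthetic counts might look like this:

```python
from sklearn.metrics import confusion_matrix

# Same synthetic labels as in the recall example (TP=20, TN=800, FP=200, FN=80)
y_true = [1] * 20 + [0] * 800 + [0] * 200 + [1] * 80
y_pred = [1] * 20 + [0] * 800 + [1] * 200 + [0] * 80

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)  # true negative rate
print(specificity)            # 0.8
```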

Precision.

Precision is defined as the number of correct positive predictions divided by the sum of correct positive predictions and incorrect positive predictions (false positives). The precision value ranges from 0 to 1.

Precision can be calculated from the confusion matrix as Precision = TP / (TP + FP). The precision metric is also useful when classes are imbalanced.

Precision answers the following question: Out of all the positive predictions, how many were actually positive, and how many were actually negative but incorrectly predicted as positive?

Precision is all about minimizing false positives (Type 1 error), so when our objective is to minimize false positives, we choose precision as a metric.

Why is Precision a good metric for imbalanced data?

Let’s consider the same imbalanced dataset as above; the TP, TN, FP, and FN values are:

True Positives (TP) = 20

True Negatives (TN) = 800

False Positives (FP) = 200

False Negatives (FN) = 80

Now, if we put these values into the precision formula, we get precision = 20 / (20 + 200) ≈ 0.09. This means that out of all the positive predictions, only 9% were actually positive; the remaining 91% were actually negative but were incorrectly predicted as positive.

Here, we can see that despite a fairly high accuracy of 74.5%, the precision score is very low because the number of false positives is greater than the number of true positives.
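The same synthetic counts give the 9% figure directly, as in this minimal sketch:

```python
from sklearn.metrics import precision_score

# Same synthetic labels as before (TP=20, TN=800, FP=200, FN=80)
y_true = [1] * 20 + [0] * 800 + [0] * 200 + [1] * 80
y_pred = [1] * 20 + [0] * 800 + [1] * 200 + [0] * 80

print(precision_score(y_true, y_pred))  # ~0.09 -> TP / (TP + FP) = 20 / 220
```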

What do different values of precision and recall mean for a classifier?

High precision (few false positives) + high recall (few false negatives):

This model predicts the classes properly: it identifies most of the positive samples, and most of its positive predictions are correct.

High precision (few false positives) + low recall (many false negatives):

This model makes only a few positive predictions, but most of the predictions it does make are correct.

Low precision (many false positives) + high recall (few false negatives):

This model identifies most of the positive samples, but many of its positive predictions are incorrect.

Low precision (many false positives) + low recall (many false negatives):

This model is a no-skill classifier and can’t predict any class properly.

F1-Score.

The F1-score uses both precision and recall: it is the harmonic mean of the precision and recall scores, calculated as F1 = 2 × (Precision × Recall) / (Precision + Recall).

The F1-score gives a balance between precision and recall, and it works best when the precision and recall scores are comparable. We always want our classifiers to have both high precision and high recall, but there is a trade-off between the two when tuning a classifier.

Let’s say we have two different classifiers, one with a high precision score and the other with a high recall score. If we want to express the performance of each classifier as a single number, we can use the F1-score, since it combines the precision and recall scores into one value.

If precision = 0 and recall = 1 (i.e., 100%), then the F1-score = 0.

The F1-score always gives more weight to the lower of the two numbers: if either precision or recall is low, the F1-score is pulled down.

There are different variants of the F1-score:

a) Macro F1-score / macro-averaged F1-score

For the macro F1-score, we first calculate the F1-score for each class and then take a simple (unweighted) average of the individual class F1-scores.

The macro F1-score can be used when you want to know how your classifier is performing across all classes, treating each class equally.

b) Weighted-average F1-score

For the weighted-average F1-score, we first weight the F1-score of each class by the number of samples in that class, then add these values and divide by the total number of samples.
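As a rough sketch with scikit-learn (the toy multi-class labels below are invented for illustration), the average parameter selects between these variants:

```python
from sklearn.metrics import f1_score

# Toy 3-class problem with an imbalanced class distribution
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 2]

print(f1_score(y_true, y_pred, average="macro"))     # simple mean of per-class F1 scores
print(f1_score(y_true, y_pred, average="weighted"))  # per-class F1 weighted by class size
print(f1_score(y_true, y_pred, average=None))        # individual per-class F1 scores
```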

F-beta score.

Sometimes it is important to give more importance to either precision or recall when combining them into a single score; in such cases we can use the F-beta score. The F-beta score is a generalization of the F1-score, calculated as F-beta = (1 + beta²) × (Precision × Recall) / (beta² × Precision + Recall), where the beta parameter controls how much weight is given to precision versus recall depending on the application.

The default beta value is 1, which is the same as the F1-score. A smaller beta value gives more weight to precision and less weight to recall whereas a larger beta value gives less weight to precision and more weight to recall in the calculation.

For beta = 0: the F-beta score equals precision.

For beta = 1: we get the harmonic mean of precision and recall, i.e., the F1-score.

For beta approaching infinity: the F-beta score approaches recall.

For 0 < beta < infinity: a beta close to 0 gives an F-beta score closer to precision, and a larger beta gives an F-beta score closer to recall.
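A minimal sketch of the effect of beta, using scikit-learn's fbeta_score on made-up labels:

```python
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(fbeta_score(y_true, y_pred, beta=0.5))  # weights precision more than recall
print(fbeta_score(y_true, y_pred, beta=1))    # identical to the F1-score
print(fbeta_score(y_true, y_pred, beta=2))    # weights recall more than precision
```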

ROC (Receiver Operating Characteristic) Curve.

The ROC curve is used to visualize the performance of a binary classifier. It is built from the classifier’s predicted probabilities and plots the true positive rate against the false positive rate at different classification thresholds. The AUC score is the area under the ROC curve: the higher the AUC score, the better the classifier. AUC values range from 0 to 1.

The true positive rate is calculated as TPR = TP / (TP + FN). A larger true positive rate means that more positive points are classified correctly.

The false positive rate is calculated as FPR = FP / (FP + TN). A smaller false positive rate means that more negative points are classified correctly.

[Figure: ROC curve plot]

The above plot shows the ROC curve:

A smaller value on the x-axis means fewer false positives and more true negatives, which means more negative points are classified correctly.

A larger value on the y-axis means more true positives and fewer false negatives, which means more positive points are predicted correctly.

What do different values of the AUC score mean?

AUC < 0.5: the classifier is worse than a no-skill classifier.

AUC = 0.5: the classifier is a no-skill classifier; it cannot separate the positive and negative classes and effectively predicts a class at random.

0.5 < AUC < 1: the classifier is skillful; there is a good chance it can separate the positive and negative classes, since a higher AUC score means a low FPR and a high TPR.

AUC = 1: the classifier is a perfect-skill classifier and can classify all negative and positive points accurately.
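A minimal sketch of computing the ROC curve and AUC score with scikit-learn; the labels and predicted probabilities below are made up for illustration:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Ground truth and predicted probabilities for the positive class (toy values)
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.4, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # area under the ROC curve
# fpr and tpr can be plotted against each other (e.g., with matplotlib) to draw the curve
```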

Precision-Recall Curve.

Precision quantifies the proportion of positive predictions that are actually correct, while recall quantifies the proportion of actual positive samples that were correctly predicted, out of all the positive predictions that could have been made.

The precision-recall curve is a plot of precision versus recall: recall is on the x-axis and precision is on the y-axis. A larger area under the precision-recall curve means both high precision and high recall, where high precision corresponds to few false positives and high recall corresponds to few false negatives.

[Figure: Precision-recall curve plot]

The above plot shows a precision-recall curve. A no-skill classifier appears as a horizontal line on the plot, a skillful classifier is represented by a curve bowing towards (1, 1), and a perfect-skill classifier is depicted as a single point at the coordinate (1, 1).
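A minimal sketch of computing the precision-recall curve and the area under it with scikit-learn, reusing the same made-up probabilities as in the ROC example:

```python
from sklearn.metrics import precision_recall_curve, auc

# Ground truth and predicted probabilities for the positive class (toy values)
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.4, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))  # area under the precision-recall curve
# precision and recall can be plotted against each other to draw the curve
```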

Log Loss.

Log loss (logarithmic loss) is also called logistic loss or cross-entropy loss. It heavily penalizes confident misclassifications, and in general, the lower the log loss, the better the model. Log loss values range from 0 to infinity.

To calculate log loss, the model must assign a probability to each class: the classifier’s output needs to be probability values between 0 and 1, which express how confident the model is that a data point belongs to a certain class. Log loss measures the uncertainty of our predictions based on how much they deviate from the actual labels.

The log loss for multi-class classification is calculated as log loss = -(1/N) · Σᵢ Σⱼ y_ij · log(p_ij), summing over the N samples i and the M classes j,

where,

M = Number of classes

N = Number of samples

y_ij = a binary indicator: 1 if observation i belongs to class j, and 0 otherwise

p_ij = the predicted probability that observation i belongs to class j

For reference, suppose a no-skill classifier that predicts classes at random has a log loss of 0.6 on a dataset; then any classifier with a log loss higher than 0.6 on that dataset is a poor classifier.
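A minimal sketch of computing log loss with scikit-learn; the labels and class probabilities below are made up for a small 3-class example:

```python
from sklearn.metrics import log_loss

# True labels and predicted class probabilities for a 3-class problem (toy values)
y_true = [0, 1, 2, 2]
y_prob = [
    [0.8, 0.1, 0.1],  # confident and correct
    [0.2, 0.7, 0.1],  # correct
    [0.1, 0.2, 0.7],  # correct
    [0.6, 0.3, 0.1],  # confidently wrong -- penalized heavily
]

print(log_loss(y_true, y_prob))
```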

MSE (Mean Squared Error).

Mean squared error is the average of the squared differences between the actual values and the predicted values. It is a metric used for regression analysis.

MSE gives us the mean squared distance between the actual points and the points predicted by our regression model, i.e., how far our predicted values are from the actual values.

Computing the gradient is easier with MSE, and it penalizes larger errors heavily.

The squaring ensures that all error terms are non-negative, so positive and negative errors do not cancel each other out.

The lower the MSE, the more accurate the model and the closer the predicted values are to the actual values.

The formula is MSE = (1/N) · Σᵢ (Yᵢ − Ŷᵢ)², where

Y = actual values

Ŷ = predicted values

N = number of samples
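A minimal sketch of computing MSE with scikit-learn; the actual and predicted values are made up for illustration:

```python
from sklearn.metrics import mean_squared_error

# Actual and predicted values from some regression model (toy values)
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

print(mean_squared_error(y_true, y_pred))  # average of the squared errors
```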

MAE (Mean Absolute Error).

MAE is another metric used for regression analysis. It is quite similar to mean squared error: it is the average of the absolute differences between the actual values and the predicted values.

Unlike MSE, MAE gives us the average absolute distance between the actual points and the points predicted by our regression model, i.e., how far our predicted values are from the actual values.

Computing the gradient is not as convenient with MAE (it is not differentiable at zero), and larger errors are not penalized more heavily than smaller ones.

The lower the MAE, the more accurate the model and the closer the predicted values are to the actual values.

The formula is MAE = (1/N) · Σᵢ |Yᵢ − Ŷᵢ|, where

Y = actual values

Ŷ = predicted values

N = number of samples
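A minimal sketch of computing MAE with scikit-learn, reusing the same toy values as the MSE example:

```python
from sklearn.metrics import mean_absolute_error

# Same toy actual and predicted values as in the MSE example
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

print(mean_absolute_error(y_true, y_pred))  # average of the absolute errors
```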


Thank you for reading this article. For any suggestions or queries please leave a comment.


If you like this article, do share it with others.
