Evaluation Metrics for Classification Models

The most common metrics used to evaluate a classification model

Shweta Goyal
Analytics Vidhya
11 min read · Jul 20, 2021



Introduction

Evaluation metrics measure the quality of a model, and how to evaluate a model is one of the most important topics in machine learning. When you build a model, it is crucial to measure how accurately it predicts the outcome you expect.

Different families of machine learning algorithms call for different evaluation metrics: we evaluate classification models with classification metrics and regression models with regression metrics. In this article, I’ll talk only about classification metrics. Before we dive deep into them, let’s cover some basics:

Some Warm-Up

There are various evaluation metrics available for both supervised and unsupervised machine learning algorithms.

Supervised Learning - We have labeled data, and the task is to train a model that predicts the label for new inputs. These algorithms are further classified into two categories, Classification and Regression:

  • Classification: Based on some inputs, it predicts a category. These problems attempt to assign a data point to a specific category/class, so the target outcome is a discrete/categorical value like Yes/No or Spam/Not spam. For example, will a customer default on their loan or not?
  • Regression: Based on some inputs, it predicts a number. These problems map input variables to a continuous output, so the target outcome is a quantity/real value such as a sales figure, height, or weight. For example, what is the expected loss if a customer defaults, or what is the price of a house?

Unsupervised Learning - Here we have no target labels. These algorithms analyze and cluster unlabeled data sets, helping us discover hidden patterns in the data.

Train/Validation/Test Split:


Before we dive into metrics, we have certain model evaluation procedures. We need to know how well a model will generalize to out-of-sample data.

  • Training and Testing on the same data: When you train and test on the same data, you reward models that overfit the training data, and the resulting score tells you nothing about how the model generalizes.

So when evaluating your model, it’s best not to train it on the entire dataset.

  • Train and Test split: A typical train/test split uses 70% of the data for training and 30% for testing. Evaluating on held-out data is important for finding the best parameters of a model, but if we also tune those parameters against the test set, the settings that look best there may not generalize well, because we have effectively fit them to the test data.

Splitting the data into two parts (train and test) is good; splitting it into three parts (train, validation and test) is better.

  • Train/Validation/Test split: To evaluate the model while still building and tuning it, we create a third subset, the validation set. A common split uses 60% of the data for training, 20% for validation, and 20% for testing. The validation set is used to compare the model’s performance under different hyperparameter values and to detect overfitting during the training stage.

It’s useful to shuffle the data before making the splits so that each split is a representative sample of the data.
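As a sketch of how this could be done with scikit-learn’s train_test_split (the feature matrix and labels below are purely illustrative), the data can be split twice: first carving off the test set, then carving a validation set out of what remains:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 samples, 2 features, binary labels
X = np.random.rand(100, 2)
y = np.random.randint(0, 2, size=100)

# First split: hold out 20% of the data as the test set
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=42)

# Second split: 25% of the remaining 80% = 20% of the full data for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, shuffle=True, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```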

Let’s dive into metrics.

Why do we need evaluation metrics?

Evaluation metrics can help you assess your model’s performance, monitor your ML system in production, and control your model to fit your business needs.

Our goal is to create and select a model that performs well on out-of-sample data. It’s crucial to use multiple evaluation metrics, because a model may score well on one metric while scoring poorly on another.

Classification Evaluation Metrics

Here, I’ll discuss some common classification metrics used to evaluate models.

Classification Accuracy:

The simplest metric for model evaluation is Accuracy. It is the ratio of the number of correct predictions to the total number of predictions made for a dataset.

Accuracy is useful when the target class is well balanced but is not a good choice with unbalanced classes.

For example, take a dataset with two target classes and 100 samples, where 98 samples belong to class A and 2 belong to class B. A model that simply predicts class A for every sample achieves 98% accuracy while never detecting class B. That’s why we need to look at more metrics to get a fuller picture.

Here, we have a Python implementation of Accuracy:

Computing the Accuracy
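A minimal sketch, assuming plain Python lists of true and predicted labels (the values are illustrative):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(accuracy(y_true, y_pred))  # 0.8 -> 8 of 10 predictions are correct
```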

Accuracy gives us an overall picture of how much we can rely on our model’s prediction. This metric is blind to the difference between classes and types of errors. That’s why it is not good enough for imbalanced datasets.

Logarithmic Loss or Log Loss:

Log Loss can be used when the output of the classifier is a numeric probability instead of a class label. It measures how far the predicted probabilities are from the true labels, heavily penalizing predictions that are both confident and wrong.

Log loss for a binary classifier with N samples, true labels y_i ∈ {0, 1} and predicted probabilities p_i:

LogLoss = -(1/N) * Σ [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

Log loss for multi-class classification, with N samples and M classes:

LogLoss = -(1/N) * Σ_i Σ_j [ y_ij * log(p_ij) ]

where

y_ij indicates whether sample i belongs to class j (1 if it does, 0 otherwise)

p_ij is the predicted probability of sample i belonging to class j

How does it work?

Consider two cases comparing the actual (target) probability with the predicted probability:

  • A poor prediction: the predicted probability is far from the actual one, which gives a large log loss. Here, the function penalizes a wrong answer that the model is confident about.
  • A good prediction: the predicted probability is close to the actual one, which gives a small log loss. Here, the function rewards a correct answer that the model is confident about.

Log loss has no upper bound; it lies in the range [0, ∞). Minimizing log loss generally yields a more accurate classifier.

Here, we have a scikit-learn implementation of Log Loss:

Computing the Log loss
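A minimal sketch using sklearn.metrics.log_loss; the true labels and predicted probabilities below are illustrative:

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]             # actual class labels
y_prob = [0.9, 0.1, 0.8, 0.35, 0.2]  # predicted probability of class 1

# Confident, correct predictions keep the loss small;
# a confident, wrong prediction would blow it up.
print(log_loss(y_true, y_prob))  # ≈ 0.34
```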

Confusion Matrix:

A confusion matrix (or error matrix) is a table that shows the number of correct and incorrect predictions made by the model compared with the actual classifications in the test set, i.e. what types of errors are being made.

This matrix describes the performance of a classification model on test data for which the true values are known. It is an n×n matrix, where n is the number of classes, and it can be generated after making predictions on the test data.

Conventions differ between sources: some lay the matrix out with columns as the actual classes and rows as the predicted classes, while scikit-learn (used below) puts the actual classes on the rows and the predicted classes on the columns.

Let’s take an example of a classification problem where we are predicting whether a person is having diabetes or not. Let’s give a label to our target variable:

1: A person is having diabetes | 0: A person is not having diabetes

Four possible outcomes could occur while performing classification predictions:

  • True Positives (TP): Number of outcomes that are actually positive and are predicted positive.

For example: In this case, a person is actually having diabetes(1) and the model predicted that the person has diabetes(1).

  • True Negatives (TN): Number of outcomes that are actually negative and are predicted negative.

For example: In this case, a person actually doesn’t have diabetes(0) and the model predicted that the person doesn’t have diabetes(0).

  • False Positives (FP): Number of outcomes that are actually negative but predicted positive. These errors are also called Type 1 Errors.

For example: In this case, a person actually doesn’t have diabetes(0) but the model predicted that the person has diabetes(1).

  • False Negatives (FN): Number of outcomes that are actually positive but predicted negative. These errors are also called Type 2 Errors.

For example: In this case, a person actually has diabetes(1) but the model predicted that the person doesn’t have diabetes(0).

Positive and Negative refer to the prediction itself; True and False refer to whether that prediction was correct.

Here, we have a scikit-learn implementation of the Confusion matrix:

Calculating the Confusion matrix
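A minimal sketch using sklearn.metrics.confusion_matrix, with illustrative labels following the diabetes example (1 = has diabetes, 0 = does not):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Rows are actual classes, columns are predicted classes:
# [[4 1]   -> TN = 4, FP = 1
#  [1 4]]  -> FN = 1, TP = 4
```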

From Scikit-learn official doc:

The diagonal elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier. The higher the diagonal values of the confusion matrix the better, indicating many correct predictions.

We can get 4 classification metrics from the Confusion Matrix:

1.) Accuracy:

For binary classification, it can also be calculated in terms of positives and negatives:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

It doesn’t grant us much information regarding the distribution of false positives and false negatives.

Here’s a scikit-learn implementation of accuracy score:

Calculating the accuracy score
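A minimal sketch using sklearn.metrics.accuracy_score, reusing the illustrative labels from above:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# (TP + TN) / (TP + TN + FP + FN) = (4 + 4) / 10
print(accuracy_score(y_true, y_pred))  # 0.8
```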

2.) Precision or Positive Predictive Value (PPV):

It is the ratio of True Positives to all the positives predicted by the model:

Precision = TP / (TP + FP)

It is useful for skewed and unbalanced datasets: the more false positives the model predicts, the lower the precision.

For example, we have a medical test of 20 patients and the test identifies 8 of them have the disease. Of the 8 identified by the test, 5 actually had the disease (true positives) while the other 3 did not (false positives). We later find out that the test missed the 4 additional patients who turned out to have the disease (false negatives).

The values are TP=5, FP=3, FN=4, TN=8.

Precision = 5 / (5 + 3) = 5/8 = 0.625

Here’s a scikit-learn implementation of Precision:

Calculating the Precision score
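A minimal sketch using sklearn.metrics.precision_score; the labels are constructed to reproduce the medical-test example above (TP=5, FP=3, FN=4, TN=8):

```python
from sklearn.metrics import precision_score

# 5 true positives, 3 false positives, 4 false negatives, 8 true negatives
y_true = [1]*5 + [0]*3 + [1]*4 + [0]*8
y_pred = [1]*5 + [1]*3 + [0]*4 + [0]*8

# TP / (TP + FP) = 5 / 8
print(precision_score(y_true, y_pred))  # 0.625
```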

3.) Recall or Sensitivity or True Positive Rate(TPR):

It is the ratio of True Positives to all the actual positives in the dataset:

Recall = TP / (TP + FN)

It measures the model’s ability to detect positive samples: the more false negatives the model produces, the lower the recall.

From the previous example of precision, the values are TP=5, FP=3, FN=4, TN=8.

Recall = 5 / (5 + 4) = 5/9 ≈ 0.56

Here’s a scikit-learn implementation of Recall:

Calculating the Recall score
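A minimal sketch using sklearn.metrics.recall_score on the same illustrative labels:

```python
from sklearn.metrics import recall_score

# Same confusion-matrix counts as before: TP=5, FP=3, FN=4, TN=8
y_true = [1]*5 + [0]*3 + [1]*4 + [0]*8
y_pred = [1]*5 + [1]*3 + [0]*4 + [0]*8

# TP / (TP + FN) = 5 / 9
print(recall_score(y_true, y_pred))  # ≈ 0.56
```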
  • The precision takes into account how both the positive and negative samples were classified, but the recall only considers the positive samples in its calculations. In other words, the precision is dependent on both the negative and positive samples, but the recall is dependent only on the positive samples (and independent of the negative samples).
  • The precision considers when a sample is classified as Positive, but it does not care about correctly classifying all positive samples. The recall cares about correctly classifying all positive samples, but it does not care if a negative sample is classified as positive.

4.) F1-score or F-measure:

It is a single metric that combines Precision and Recall. It lies in the range [0, 1], and the higher the F1-score, the better the performance of the model.

The F1-score is the harmonic mean of precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

A classifier only gets a high F1-score if both precision and recall are high, so this metric favors classifiers whose precision and recall are similar.

Here’s a scikit-learn implementation of the F1-score:

Calculating the f1-score
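A minimal sketch using sklearn.metrics.f1_score on the same illustrative labels (precision 0.625, recall ≈ 0.56):

```python
from sklearn.metrics import f1_score

y_true = [1]*5 + [0]*3 + [1]*4 + [0]*8
y_pred = [1]*5 + [1]*3 + [0]*4 + [0]*8

# Harmonic mean of precision (0.625) and recall (~0.556)
print(f1_score(y_true, y_pred))  # ≈ 0.59
```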

Special Case: F-score with factor β

The F1-score is a special case of the more general F-score, which has a factor β:

Fβ = (1 + β²) * (Precision * Recall) / (β² * Precision + Recall)

The factor β defines how much influence recall has relative to precision in the evaluation:

  • β < 1: Precision oriented evaluation
  • β > 1: Recall oriented evaluation

The F1-score is the special case where β = 1, meaning precision and recall are weighted equally.

Here’s a scikit-learn implementation of the F-beta score:

Calculating the fbeta-score
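A minimal sketch using sklearn.metrics.fbeta_score on the same illustrative labels, comparing a precision-oriented β with a recall-oriented one:

```python
from sklearn.metrics import fbeta_score

y_true = [1]*5 + [0]*3 + [1]*4 + [0]*8
y_pred = [1]*5 + [1]*3 + [0]*4 + [0]*8

# beta < 1 weights precision more, beta > 1 weights recall more
print(fbeta_score(y_true, y_pred, beta=0.5))  # ≈ 0.61
print(fbeta_score(y_true, y_pred, beta=2.0))  # ≈ 0.57
```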

ROC Curve:

A ROC curve (Receiver Operating Characteristic curve) is a graph showing the performance of a classification model. It visualizes the tradeoff between the True Positive Rate (TPR) and the False Positive Rate (FPR) across different decision thresholds (the probability cutoff above which a prediction is labeled positive).

This threshold controls the tradeoff between TPR and FPR. Raising the threshold generally increases the precision but decreases the recall.

First, let’s see TPR and FPR:-

  • True Positive Rate (TPR / Sensitivity / Recall): the proportion of positive data points that are correctly classified as positive, out of all positive data points: TPR = TP / (TP + FN).
  • False Positive Rate (FPR): the proportion of negative data points that are mistakenly classified as positive, out of all negative data points: FPR = FP / (FP + TN).

They both have values in the range of [0,1] which are computed at varying threshold values.

A perfect classifier has a true positive rate of 1 and a false positive rate of 0, so its ROC curve hugs the top-left corner of the plot.

The closer the ROC curve bows toward that top-left corner, and the further it sits above the diagonal line of a random-guessing classifier, the better the model:
  • Any model whose ROC curve lies above the random-guessing line is better than random.
  • Any model whose ROC curve lies below the random-guessing line can be rejected outright.

Plotting the curve means evaluating TPR and FPR at many classification thresholds, which would be inefficient to do naively. Fortunately, there is an efficient, sorting-based algorithm that summarizes this information in a single number: the AUC.

Here’s a scikit-learn implementation of the ROC Curve:

Computing the ROC Curve
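A minimal sketch using sklearn.metrics.roc_curve with illustrative labels and predicted probabilities; matplotlib is only used to draw the curve:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.4, 0.7, 0.35, 0.6, 0.8, 0.9]  # predicted P(class 1)

# TPR and FPR at every threshold implied by the scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)

plt.plot(fpr, tpr, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```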

AUC:

AUC stands for Area Under the Curve; here it means the area under the ROC curve, which is why it is often written roc_auc.

AUC summarizes the whole ROC curve with a single number and is commonly used for binary classification problems.

Note: AUC measures the entire two-dimensional area underneath the ROC curve, from (0, 0) to (1, 1).

AUC is equal to the probability that the classifier ranks a random positive example more highly than a random negative example.

Here’s a scikit-learn implementation of AUC:

Computing the AUC
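A minimal sketch using sklearn.metrics.roc_auc_score on the same illustrative scores:

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.4, 0.7, 0.35, 0.6, 0.8, 0.9]

# Probability that a random positive is ranked above a random negative
print(roc_auc_score(y_true, y_score))  # ≈ 0.81
```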

AUC helps compare different models since it summarizes the data from the whole ROC curve. AUC has a range of [0,1]. The greater the value, the better is the performance of our model.

Conclusion

In this article, we covered a lot of ground. We learned that accuracy is only part of the story of a model’s performance, especially when the data is imbalanced or when a false positive (or false negative) carries a larger cost. We also discussed several other widely used classification metrics.

Check out the scikit-learn official documentation for other metrics that are not covered in this post.

Scikit-learn: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics
