Different metrics to evaluate the performance of a Machine Learning model

Swapnil Vishwakarma
Published in Analytics Vidhya · Mar 14, 2021

In this article, we will understand what performance metrics are, look at the various performance metrics used for machine learning models, and cover their advantages and drawbacks.

To understand what a performance metric means, I would like to give you the example of a swimming competition. All the competitors swim the length of the pool, trying to reach the finish line as quickly as possible. Here the performance metric of a swimmer is the time he/she takes to reach the finish line: the less time taken, the better the performance. But in a 90-minute football game, time is a constraint and not the performance metric; the performance metric is the number of goals scored, the higher the better. You can see how the metric changes with the situation, and likewise we have different performance metrics for different problem statements in machine learning. So what does a performance metric in machine learning mean? A performance metric is a measure of how well a model performs on unseen data. Typically the performance is measured after splitting the whole dataset into train and test sets in an 80:20 ratio, i.e. 80% of the dataset is used for training and the remaining 20% is used for testing.
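If you use scikit-learn, this split is a one-liner. Here is a minimal sketch; the feature matrix X and labels y below are dummy placeholders, not a real dataset:

```python
# A minimal sketch of an 80:20 train/test split with scikit-learn.
# X and y are dummy placeholders, not a real dataset.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)             # 100 points, 5 features
y = np.random.randint(0, 2, size=100)  # binary class labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
print(X_train.shape, X_test.shape)     # (80, 5) (20, 5)
```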

Choosing the right metric is as important as choosing the right Machine Learning algorithm, so that the model's performance does not degrade even after deployment. Hence, to choose the right metric, we need to know the pros and cons of each metric and when to use which. First, we will focus on classification tasks, then we will see which metrics are suitable for regression tasks.

We will understand the following metrics:

Accuracy

Confusion Matrix

Precision, Recall and F1-Score

ROC and AUC

Log Loss

R² or Coefficient of Determination

Median Absolute Deviation of Errors

Distribution of Errors

Metrics for Classification task!

A classification problem can be solved in two ways: either by predicting class labels directly, or by predicting probability scores.

Consider a binary classification with positive and negative classes.

1) Accuracy

It is defined as the ratio of the total no. of correctly classified points to the total no. of points in the test dataset.

It always lies between 0 and 1 (both inclusive), 0 being the worst and 1 being the best possible value.

Consider the football game again: suppose a player is trying to score as many goals as possible, given 10 chances. If the player manages to score 6 of the 10, then the accuracy is simply 6/10, i.e. 60%.

Similarly, to determine the accuracy of a machine learning model, suppose I have 100 points in the test dataset, of which 60 belong to the positive class and 40 to the negative class. The model correctly predicts 53 of the positive points and 35 of the negative points. Summing all the correctly classified points gives 88, which means the accuracy of our model is 88%.
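Here is a quick sketch of that calculation with scikit-learn; the toy label arrays below are an assumed reconstruction of the 60/40 example above:

```python
# Accuracy = no. of correctly classified points / total no. of points.
import numpy as np
from sklearn.metrics import accuracy_score

# 60 positive and 40 negative points; the model gets 53 positives
# and 35 negatives right (assumed arrays mirroring the example).
y_true = np.array([1] * 60 + [0] * 40)
y_pred = np.array([1] * 53 + [0] * 7 + [0] * 35 + [1] * 5)

print(accuracy_score(y_true, y_pred))  # 0.88
print((y_true == y_pred).mean())       # same calculation by hand
```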

Advantage:

It is one of the easiest metrics to understand.

Drawbacks:

1) Imbalanced dataset

Out of 100 points in the test dataset, say 95 belong to the positive class and only 5 to the negative class. Now suppose I have a dumb model that predicts every point as positive. Even this dumb model gets an accuracy of 95%, which is obviously not sensible. Therefore we should never use accuracy as a measure when we have an imbalanced dataset.

2) Accuracy cannot use probability score.

Suppose we have two models to compare, each of which returns the probability of a point belonging to a particular class. Accuracy looks only at the final class label assigned to each point and completely ignores the probability with which the model assigned it.

For example, suppose the true class label is positive; Model 1 gives a probability of 0.9 for this point belonging to the positive class, whereas Model 2 gives 0.6. If we use accuracy as the measure of performance, both models classify the point as positive and look identical, yet from the probability scores it is clear that Model 1 is better than Model 2.

So, if the model returns a probability score of a point belonging to a particular class, then do not use an accuracy metric.

2) Confusion Matrix

Before we move forward, keep note of these abbreviations:

I) TP: True Positive

II) FP: False Positive

III) FN: False Negative

IV) TN: True Negative

V) TPR: True Positive Rate

VI) TNR: True Negative Rate

VII) FPR: False Positive Rate

VIII) FNR: False Negative Rate

Again we’ll consider the binary classification task where we have two possibilities i.e. 0 (negative class) and 1 (positive class).

So we create a matrix of size 2x2. The limitation of the Confusion matrix is that it cannot process the probability scores.

Confusion Matrix

Given xᵢ’s and yᵢ’s the model predicts Yᵢ’s that belongs to one of the class labels.

Matrix: one axis corresponds to the actual class labels and the other to the predicted class labels, and each cell counts the number of test points falling into that (actual, predicted) combination.

For multi-class classification (‘c’ number of classes), we have a c×c confusion matrix.

If the model is sensible, the principal diagonal values should be high, and the off-diagonal values should ideally be zero.

For binary classification, each cell of this matrix is given a name: a point that is actually positive and predicted positive is a TP; actually negative but predicted positive is an FP; actually positive but predicted negative is an FN; and actually negative and predicted negative is a TN.

Here most of us get confused by the terminology, so here is a neat trick to remember it. Consider TN as two parts: the second part (N) tells you what label the model predicted, and the first part (T) tells you whether that prediction was correct. So in this case the model predicted the Negative class label and that prediction was actually true, which means the point was correctly classified.

FP is also known as Type 1 error and FN is also called Type 2 error.

Before moving ahead to understand the 4 ratios, it is important to introduce a few terms.

The sum of FN and TP is called the “Total no. of positives (P)”.

The sum of TN and FP is called the “Total no. of negatives (N)”.

n (total no. of points) = P + N.

a) True Positive Rate (TPR) = TP/P

b) True Negative Rate (TNR) = TN/N

c) False Positive Rate (FPR) = FP/N

d) False Negative Rate (FNR) = FN/P

Let us take the example of an imbalanced test dataset with 1000 points, of which 900 belong to the negative class and only 100 to the positive class. Constructing the confusion matrix for the model gives TP = 94, FN = 6, FP = 50, TN = 850.

Let us now see those 4 ratios discussed earlier for a sensible model.

TPR = 94/100 = 94%

TNR = 850/900 = 94.44%

FPR = 50/900 = 5.55%

FNR = 6/100 = 6%

Let us now see those 4 ratios again but for a dumb model where it predicts every point to be negative class.

TPR = 0/100 = 0%

TNR = 900/900 = 100%

FPR = 0/900 = 0%

FNR = 100/100 = 100%

Ideally, we want TPR and TNR to be high and FPR and FNR to be low. For the dumb model, the TNR and FPR look perfect, but the TPR of 0% and FNR of 100% immediately give it away. So, unlike accuracy, a confusion matrix along with these 4 ratios helps us understand the performance of our model even when the dataset is imbalanced.
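The sketch below reconstructs the sensible model's confusion matrix from the counts above (TP = 94, FN = 6, FP = 50, TN = 850; the label arrays themselves are assumed) and computes the four ratios with scikit-learn:

```python
# Confusion matrix and the four ratios for the 1000-point example
# (900 negative, 100 positive): TP = 94, FN = 6, FP = 50, TN = 850.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1] * 100 + [0] * 900)
y_pred = np.array([1] * 94 + [0] * 6       # 94 positives caught, 6 missed
                  + [1] * 50 + [0] * 850)  # 50 false alarms, 850 correct rejections

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
P, N = tp + fn, tn + fp
print(f"TPR = {tp / P:.2%}, TNR = {tn / N:.2%}, "
      f"FPR = {fp / N:.2%}, FNR = {fn / P:.2%}")
# TPR = 94.00%, TNR = 94.44%, FPR = 5.56%, FNR = 6.00%
```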

Now you must be wondering: instead of looking at just one number, you have to analyze 4 different ratios, so which of them matters most, and when? Let me tell you that it is very domain-specific.

Let’s take a medical application where the model diagnoses if a person has cancer or not. Given the person has cancer the model should classify it as TP hence TPR must be very high. Given the person has cancer the model classified it as FN hence FNR must be close to zero (ideally 0). Given the person doesn’t have cancer and the model predicts he/she has cancer then FPR will be high. This is O.K. as the person can go for further powerful tests to determine he/she doesn’t have cancer. So, in this case, the model mustn’t miss a cancer patient and hence the FNR becomes a very important ratio to look at.

3) Precision, Recall and F1-Score


Precision and Recall are mostly used in information retrieval and they care only about the positive class labels. For example, Google search has a huge amount of information but only the relevant information is shown according to the user’s query.

Precision = TP / (TP + FP)

Precision is also called the Positive Predictive Value.

It means: out of all the points the model predicted to be positive, what percentage are actually positive. It should be as high as possible. It lies between 0 and 1.

Recall = TPR = TP / (TP + FN) = TP / P

It means: out of all the points that actually belong to the positive class, what percentage did the model detect as positive. It should be as high as possible. It lies between 0 and 1.
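As a rough sketch, both quantities can be computed directly with scikit-learn; the toy labels below are assumptions chosen to give Precision = 0.60 and Recall = 0.90:

```python
# Precision and Recall from predicted class labels.
from sklearn.metrics import precision_score, recall_score

# Assumed toy labels: 10 actual positives, 10 actual negatives;
# the model finds 9 of the 10 positives but raises 6 false alarms.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 9 + [0] * 1 + [1] * 6 + [0] * 4

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 9/15 = 0.60
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 9/10 = 0.90
```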

Depending on the problem statement, either Precision or Recall is the more important one to track. Let me give you an example of each:


a) Say you are working on a spam email detection problem, where classifying an email wrongly, i.e. the email is not spam but the model classifies it as spam, can make you miss important mail and be very costly. Here the False Positives should be watched keenly as they have the bigger impact, so Precision becomes important.

b) Say you are working on a cancer detection problem, where not predicting cancer for a cancerous patient would be life-threatening. Here the False Negatives should be watched keenly as they have the bigger impact, so Recall becomes important.

There is a way to combine Precision and Recall into a single measure, known as the F1-Score, used when the impacts of FP and FN are equally important.

Now you might be wondering where this metric comes from. It is the harmonic mean of Precision and Recall: take the average of the inverses of Precision and Recall, and then invert that. Mathematically, F1 = 2 · Precision · Recall / (Precision + Recall).

F1-Score is high if Precision and Recall both are high. It also lies between 0 and 1. This metric is often used in Kaggle competitions.

F_β is a more generalized score that can be tuned according to the β value: F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall). A higher β weights Recall more heavily, while a lower β weights Precision more heavily.

There are three cases:

a) Select β = 1 when the impacts of FN and FP are equal. This gives the F1-Score.

b) Select β between 0 and 1 (typically 0.5) when the impact of FP is more. This gives, for example, the F0.5-Score.

c) Select β greater than 1 (typically 2) when the impact of FN is more. This gives, for example, the F2-Score.
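Continuing with the same assumed toy labels as above (Precision = 0.60, Recall = 0.90), here is a small sketch of how the choice of β moves the score between Precision and Recall:

```python
# F1 and F-beta on the same toy labels: beta < 1 pulls the score
# towards Precision, beta > 1 pulls it towards Recall.
from sklearn.metrics import f1_score, fbeta_score

y_true = [1] * 10 + [0] * 10                    # Precision = 0.60, Recall = 0.90
y_pred = [1] * 9 + [0] * 1 + [1] * 6 + [0] * 4

print(f1_score(y_true, y_pred))               # ~0.72 (harmonic mean of P and R)
print(fbeta_score(y_true, y_pred, beta=0.5))  # ~0.64, closer to Precision
print(fbeta_score(y_true, y_pred, beta=2.0))  # ~0.82, closer to Recall
```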

4) ROC and AUC

The Receiver Operating Characteristic (ROC) curve was developed by electrical and radar engineers during the Second World War to measure how well enemy objects could be detected from radar signals. Machine Learning borrows such concepts from physics, statistics, electronics, and many other domains to solve real-world problems. ROC is mostly used for binary classification; there is an extension for multi-class classification, but it is not used often. For binary classification, the model predicts a score, such as a probability score in the range 0 to 1: the higher the score, the higher the chance of the point belonging to the positive class. AUC is simply the Area Under the ROC Curve, which also ranges between 0 and 1.

Comparing two Machine Learning Models

From the above graph, it is clear that Model 1 having a ROC-1 curve is much better than Model 2 having a ROC-2 curve.

Let us see the magic behind the curtains. First, sort the predicted scores in descending order. Take the threshold τ₁ equal to the first (highest) score in this sorted list, classify every point with a score greater than or equal to τ₁ as positive and the rest as negative, and calculate the FPR and TPR for τ₁. Repeat this with each of the n scores as the threshold to get n pairs of FPR and TPR. Plotting these pairs on a TPR vs. FPR graph gives a curve known as the Receiver Operating Characteristic curve. Since both TPR and FPR lie between 0 and 1, the area under the diagonal is exactly 0.5 units, and the ROC curve of a sensible model should lie well above this diagonal. If the model is a random model, i.e. it predicts the class label randomly, its ROC curve will be a straight diagonal line.
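A minimal sketch of this procedure using scikit-learn's roc_curve and roc_auc_score; the labels and scores below are synthetic assumptions generated only for illustration:

```python
# ROC curve and AUC from probability scores (synthetic data).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                  # random 0/1 labels
# Scores: positives tend to score higher than negatives, with some overlap.
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, 200), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) pair per threshold
print(roc_auc_score(y_true, y_score))              # between 0.5 (random) and 1.0 (perfect)
```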

Properties of AUC:

a) For a highly imbalanced dataset, AUC can be high even when the model is not actually good, so interpret it carefully in that setting.

b) AUC does not depend on the actual predicted scores of the model, only on their ordering.

c) AUC of a random model = 0.5 units.

d) If AUC < 0.5 this implies the model is worse than a random model. This might happen when there is some mistake in the modeling. One simple fix is to swap the predicted class labels to get a better AUC value that is greater than 0.5.

5) Log Loss

This is the first metric we have seen so far that uses the actual probability scores.

For a binary classification of n points in a test dataset, the log-loss is given as:

log-loss = −(1/n) · Σᵢ [ yᵢ · log(pᵢ) + (1 − yᵢ) · log(1 − pᵢ) ]

Here pᵢ is the model’s predicted probability of the i-th point belonging to the positive class, and yᵢ is its actual class label (1 for positive, 0 for negative). We want the log-loss to be as small as possible; it lies between 0 and infinity.

Let’s take an example of two points belonging to the positive class and the model gives a score of 0.9 and 0.6 respectively. Now from the above formula, penalizes the log-loss for a small deviation in probability score, i.e. for 0.9, the log-loss is 0.0457 (which is quite small and good) whereas for 0.6 it is 0.22 (which is not small).

This can easily be extended to multi-class log-loss, where the model predicts not just two but ‘c’ probabilities for a c-class classification problem. Since log-loss ranges from 0 to infinity, its interpretability is lower than that of the other metrics, but if the model returns probability scores it is one of the most powerful metrics for judging performance.
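A small sketch with scikit-learn's log_loss, reusing the 0.9 and 0.6 scores from the example above; the remaining labels and scores are assumptions added only to make the snippet self-contained:

```python
# Log-loss from predicted probability scores.
import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 1, 0, 0]            # actual class labels (assumed toy example)
p_pos = [0.9, 0.6, 0.2, 0.4]     # predicted probability of the positive class

# Per-point penalty is -[y*log(p) + (1 - y)*log(1 - p)], then averaged.
print(log_loss(y_true, p_pos))     # ~0.34
print(-np.log(0.9), -np.log(0.6))  # ~0.105 and ~0.511, as in the example above
```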

Metrics for Regression task!

6) R² or Coefficient of Determination

So far we have seen metrics only for classification models; now let’s look at regression models, where, given a query point xᵢ, the model predicts a real number ŷᵢ. We have an error for every point, i.e. the difference between the actual and the predicted value. Mathematically, eᵢ = yᵢ − ŷᵢ.

Before I introduce R², let’s understand a few terms related to it.

a) Total Sum of Squares (SS)ₜ:

(SS)ₜ = Σᵢ (yᵢ − yₐ)²

where yₐ is the average value of the yᵢ’s in the dataset.

The simplest possible regression model just predicts this average value for every query point.

So (SS)ₜ is the sum of squared errors using the simple mean model.

b) Sum of Squares of residuals (SS)ᵣₑₛ:

(SS)ᵣₑₛ = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ eᵢ²

where ŷᵢ is the model’s prediction for the i-th point.

Now R² is defined as:

R² = 1 − (SS)ᵣₑₛ / (SS)ₜ

Let us now see what are the best and the worst cases for R²:

Case 1) (SS)ᵣₑₛ = 0

This means the sum of squared errors is 0, so R² equals 1. This is the best possible value for R².

Case 2) (SS)ᵣₑₛ < (SS)ₜ

In this case, R² will lie between 0 and 1. This is the typical case that we come across.

Case 3) (SS)ᵣₑₛ = (SS)ₜ

This will give us the value of R² to be equal to 0. So in this case the model is the same as the simple mean model as discussed above.

Case 4) (SS)ᵣₑₛ > (SS)ₜ

In this case, R² will be a negative number which implies that the model is worse than a simple mean model.
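These cases can be reproduced with scikit-learn's r2_score; the toy targets and predictions below are assumptions picked to show a good fit, the mean-model case, and the worse-than-mean case:

```python
# R² for a regression model, covering the good, mean-model, and worse-than-mean cases.
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.0, 9.0]   # assumed toy targets, mean = 6.0

print(r2_score(y_true, [2.8, 5.1, 7.2, 8.9]))  # ~0.995: close to 1, good fit
print(r2_score(y_true, [6.0, 6.0, 6.0, 6.0]))  # 0.0: same as always predicting the mean
print(r2_score(y_true, [9.0, 3.0, 9.0, 3.0]))  # -3.0: worse than the mean model
```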

Drawback: R² is built from squared errors, and (SS)ₜ uses the mean of the yᵢ’s, both of which are very sensitive to outliers, so even a single outlying yᵢ can greatly distort the R² value. This drawback is overcome by the metric we will look at next, the Median Absolute Deviation of errors.

7) Median Absolute Deviation of Errors

We know that the mean and standard deviation are heavily impacted by even a single outlier, and for a regression problem we only care about the errors. Keeping this in mind, we use the median and the median absolute deviation (MAD) of the errors instead. As the name suggests, it is the median absolute deviation of the errors.

Median(eᵢ) is the central value of the errors and plays a role similar to the mean, while MAD = median(|eᵢ − median(eᵢ)|) plays a role similar to the standard deviation. Both are robust to outliers. If these two values are small, we can say that the model is performing well.
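A short sketch with NumPy, using assumed toy values where one prediction is a gross outlier, to show how the median of errors and MAD stay stable while the mean and standard deviation do not:

```python
# Median of errors and Median Absolute Deviation (MAD) of errors.
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # assumed toy targets
y_pred = np.array([2.8, 5.1, 7.2, 8.9, 25.0])  # last prediction is a gross outlier

errors = y_true - y_pred
median_err = np.median(errors)
mad = np.median(np.abs(errors - median_err))

print(median_err, mad)              # ~ -0.1 and 0.2: barely affected by the outlier
print(errors.mean(), errors.std())  # mean and std are dragged far off by it
```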

8) Distribution of errors

Apart from the metrics seen so far for regression models, we can also plot the PDF and CDF of the errors to see what percentage of errors are small or large, which helps us analyze the performance of the model. It also makes it very easy to compare how two different models perform on the same dataset, just by looking at the plot.
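One way to draw such a plot is with an empirical CDF of the absolute errors; the sketch below uses synthetic (assumed) error samples for two hypothetical models:

```python
# Empirical CDF of absolute errors for two hypothetical models.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
errors_model_1 = np.abs(rng.normal(0, 1.0, 500))  # synthetic errors, tighter model
errors_model_2 = np.abs(rng.normal(0, 2.0, 500))  # synthetic errors, sloppier model

for errs, label in [(errors_model_1, "Model 1"), (errors_model_2, "Model 2")]:
    x = np.sort(errs)
    cdf = np.arange(1, len(x) + 1) / len(x)       # fraction of points with error <= x
    plt.plot(x, cdf, label=label)

plt.xlabel("absolute error")
plt.ylabel("fraction of points")
plt.legend()
plt.show()   # the curve that rises faster (Model 1) is the better model
```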


Don’t forget to clap and share if you learned something new or liked reading this article.
