Performance Metrics of Supervised Learning

Sandeep Panchal · Published in Analytics Vidhya · Jul 6, 2019

A warm welcome to all readers. This is my first blog on a subject related to machine learning. Let's start with the general picture. What comes to your mind when you hear 'Machine Learning'? Without going into technical definitions, in general terms, Machine Learning is nothing but training a machine on certain data so that it can learn from it and analyze future, unseen data.

Coming to the main theme of this blog, which is 'Performance Metrics'. There are many performance metrics we use to check how well our model has performed. Suppose we are in a cooking competition where our dish is rated x out of y. Here x denotes the rating, based on which we can judge how well we have cooked.

Similarly, in Machine Learning, we have performance metrics to check how well our model has performed. There are various performance metrics such as the Confusion Matrix, Precision, Recall, F1-Score, Accuracy, AUC-ROC, Log-Loss, etc.

In this blog, I am going to discuss:

  1. Precision
  2. Recall / Sensitivity
  3. F1-Score
  4. AUC-ROC Curve
  5. Log-Loss

Before getting into what precision, recall, and F1-score are, we first need to understand the confusion matrix. Without going deep into the confusion matrix, I am going to give a brief idea of what it is.

Confusion Matrix:

Well, a confusion matrix is an N×N matrix in which one axis represents the 'Actual' label while the other axis represents the 'Predicted' label.

Confusion matrix diagram (source: Google Images)

From the above diagram, the confusion matrix is of 2×2 dimensions. The X-axis represents the 'Predicted' label and the Y-axis represents the 'Actual' label.

For a better understanding of what TP, FP, TN, and FN are, we will consider the example 'Is a received mail spam or ham?'

Positive — Mail received is ham

Negative — Mail received is spam

True Positive (TP): The predicted label is positive and the actual label is also positive (correctly predicted). We predicted the received mail is 'ham' (positive) and the actual mail is also 'ham' (positive).

True Negative (TN): The predicted label is negative and the actual label is also negative (correctly predicted). We predicted the received mail is 'spam' (negative) and the actual mail is also 'spam' (negative).

False Negative (FN): The predicted label is negative but the actual label is positive (wrongly predicted). We predicted the received mail is 'spam' (negative) but the actual mail is 'ham' (positive).

False Positive (FP): The predicted label is positive but the actual label is negative (wrongly predicted). We predicted the received mail is 'ham' (positive) but the actual mail is 'spam' (negative).

The confusion matrix is the most intuitive and basic metric, from which we can obtain various other metrics like precision, recall, accuracy, F1-score and AUC-ROC.
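As a quick illustration, here is a minimal sketch of obtaining these four counts with scikit-learn (the library choice and the toy spam/ham labels below are my own, purely for illustration):

```python
# Minimal sketch: building a 2x2 confusion matrix for the spam/ham example.
# Labels are made up: 1 = ham (positive), 0 = spam (negative).
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() unpacks the matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=4, TN=4, FP=1, FN=1
```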

Now let us dive into the Precision, Recall and F1-Score metrics.

1. Precision:

General Definition: Precision measures what proportion of the predicted positive labels is actually positive.

To explain precision and its use case, we shall consider a 'Machine Learning Course Recommendation' example, i.e. we have to recommend an xyz machine learning course to students.

Positive — Course recommended to students

Negative — Course not recommended to students

Precision in terms of True Positive and False Positive:

Precision = TP / (TP + FP)

From the above formula, we can see that as 'False Positive' decreases, precision increases, and vice-versa. Let us see what 'True Positive', 'False Positive' and 'False Negative' are in the 'Machine Learning Course Recommendation' example.

True Positive (TP): The course is predicted as recommended to a student, and it actually should be recommended.

False Positive (FP): The course is predicted as recommended to a student, but it actually should not be recommended.

False Negative (FN): The course is predicted as not recommended to a student, but it actually should be recommended.

When to use Precision?

Precision is used when we want to focus mostly on false positives, i.e. to decrease the false-positive value and thereby increase the precision value. A question might arise: why do we want to focus mostly on false positives and not on false negatives? To answer this question, let us consider the 'Machine Learning Course Recommendation' example.

False Positive (FP): The predicted label is positive but the actual label is negative (wrongly predicted). Applying a false positive to our example, it means we predicted that the course should be recommended to a student, but in reality it should not have been recommended. If our false-positive value is high, it means we are recommending the course to many of the wrong students. This is a loss to the institute(s), which wastes effort on students who are not a fit for the course. So, we focus mostly on the false-positive value and try to decrease it to the least possible value.

False Negative (FN): The predicted label is negative but the actual label is positive (wrongly predicted). Applying a false negative to our example, it means we predicted that the course should not be recommended to a student, but in reality it should have been. At worst, a suitable student misses the recommendation for now and can simply be recommended the course later, which is not much of an issue. So, we do not focus as much on the false-negative value.

Conclusion: Keeping the above two reasons in mind, we focus mostly on the false-positive value and try to decrease it to the least possible value, thereby increasing the precision value.
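To make this concrete, here is a minimal sketch of computing precision with scikit-learn (the labels below are made up for the course-recommendation example: 1 = recommend, 0 = do not recommend):

```python
# Minimal sketch: precision = TP / (TP + FP) on made-up labels.
from sklearn.metrics import precision_score

y_actual    = [1, 1, 0, 1, 0, 0, 1, 0]  # ground truth: should the course be recommended?
y_predicted = [1, 0, 0, 1, 1, 0, 1, 0]  # model's recommendation

# Only the predicted positives are inspected: TP = 3, FP = 1 -> 3 / 4 = 0.75
print("Precision:", precision_score(y_actual, y_predicted))
```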

2. Recall / Sensitivity

General Definition: Recall measures what proportion of the actual positive labels is correctly predicted as positive.

To explain recall and its use case, we shall consider a 'Cancer Diagnosis' example, i.e. we have to predict whether a patient has cancer or not.

Positive — Patient diagnosed with cancer.

Negative — Patient not diagnosed with cancer.

Recall in terms of True Positive and False Negative:

Recall = TP / (TP + FN)

From the above formula, we can see that as 'False Negative' decreases, recall increases, and vice-versa. Let us see what 'True Positive', 'False Positive' and 'False Negative' are in the 'Cancer Diagnosis' example.

True Positive (TP): The patient is predicted as having cancer, and the patient actually has cancer.

False Positive (FP): The patient is predicted as having cancer, but the patient actually does not have cancer.

False Negative (FN): The patient is predicted as not having cancer, but the patient actually has cancer.

When to use Recall?

Recall is used when we want to focus mostly on false negatives, i.e. to decrease the false-negative value and thereby increase the recall value. A question might arise: why do we want to focus mostly on false negatives and not on false positives? To answer this question, let us consider the 'Cancer Diagnosis' example.

False Negative (FN): The predicted label is negative but the actual label is positive (wrongly predicted). Applying a false negative to our example, it means we predicted that the patient does not have cancer, but the patient actually has cancer. If this is the case, the patient, as per the prediction, might not get treatment to cure the cancer, even though the truth is that the patient does have cancer. Our wrong negative prediction could lead to the death of the patient. So, we focus mostly on the false-negative value and try to decrease it to the least possible value.

False Positive (FP): The predicted label is positive but the actual label is negative (wrongly predicted). Applying a false positive to our example, it means we predicted that the patient has cancer, but the patient actually does not have cancer. If this is the case, the patient, as per the prediction, will go for a further check-up, and to his relief he will come to know that he does not have cancer. So, we do not focus as much on the false-positive value.

Conclusion: Keeping the above two reasons in mind, we focus mostly on the false-negative value and try to decrease it to the least possible value, thereby increasing the recall value.
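Here is a matching minimal sketch for recall with scikit-learn (again with made-up labels: 1 = patient has cancer, 0 = patient does not):

```python
# Minimal sketch: recall = TP / (TP + FN) on made-up labels.
from sklearn.metrics import recall_score

y_actual    = [1, 1, 1, 0, 0, 1, 0, 0]  # ground truth diagnosis
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0]  # model's prediction

# Only the actual positives are inspected: TP = 3, FN = 1 -> 3 / 4 = 0.75
print("Recall:", recall_score(y_actual, y_predicted))
```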

3. F1-Score:

The F1-score is another good performance metric, one which leverages both precision and recall. The F1-score is obtained by simply taking the 'Harmonic Mean' of precision and recall. Unlike precision, which focuses mostly on false positives, and recall, which focuses mostly on false negatives, the F1-score accounts for both false positives and false negatives.

To explain the F1-score and its use case, we shall consider a 'Rose & Jasmine Flowers' example, i.e. we have to predict whether a flower is a rose or a jasmine.

Positive — Flower classified as a rose.

Negative — Flower classified as jasmine.

F1-score in terms of Precision and Recall:

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

Let us see what 'False Positive' and 'False Negative' are in the 'Rose & Jasmine Flowers' example.

False Positive (FP): The flower is predicted as a rose, but the actual flower is a jasmine.

False Negative (FN): The flower is predicted as a jasmine, but the actual flower is a rose.

When to use F1-score:

As mentioned above, the F1-score accounts for both false positives and false negatives: at any cost, we do not want a rose to be classified as a jasmine or a jasmine to be classified as a rose. In this case, we focus on both false positives and false negatives, and try to decrease both, thereby increasing the F1-score.

Conclusion: Keeping the above reason in mind, our focus should be on both false positives and false negatives, and we try to decrease both, thereby increasing the F1-score.
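The sketch below (again with made-up labels, scikit-learn assumed: 1 = rose, 0 = jasmine) shows that f1_score is exactly the harmonic mean of the precision and recall values:

```python
# Minimal sketch: F1 as the harmonic mean of precision and recall.
from sklearn.metrics import precision_score, recall_score, f1_score

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 1, 0, 1, 0]

p = precision_score(y_actual, y_predicted)
r = recall_score(y_actual, y_predicted)

print("Harmonic mean:", 2 * p * r / (p + r))
print("f1_score     :", f1_score(y_actual, y_predicted))  # same value
```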

4. AUC-ROC Curve

AUC-ROC is one of the most important performance metrics used to check model performance. AUC-ROC can be used for binary as well as multi-class classification, but it is mostly used for binary classification problems. In this blog, we will consider binary classification.

AUC stands for Area Under the Curve and ROC stands for Receiver Operating Characteristic. It is also known as AUROC, which stands for Area Under the Receiver Operating Characteristic.

AUC-ROC is a graphical representation of model performance. The ROC is a probability curve and the AUC is a measure of separability. Depending on the threshold set, we can analyze how well our model separates the two classes. The higher the AUC, the better our model is at separating the two classes.

Graphical representation of AUC-ROC:

ROC curve plot (source: Google Images)

Referring to the above image, we can see that the AUC-ROC curve is plotted with FPR (False Positive Rate) on the X-axis and TPR (True Positive Rate) on the Y-axis. The green curve represents the ROC curve, while the area/region under the ROC curve (the green curve) represents the AUC. The black diagonal line passing through the origin is the baseline of a random classifier, and the points along the ROC curve correspond to different threshold settings.

Let’s first understand what TPR and FPR are:

True Positive Rate (TPR):

TPR is nothing but Recall / Sensitivity. The formula for TPR is as follows:

TPR = TP / (TP + FN)

False Positive Rate (FPR):

The formula for FPR is as follows:

FPR = FP / (FP + TN)
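As a minimal sketch (toy labels, scikit-learn assumed), TPR and FPR can be computed directly from the confusion-matrix counts:

```python
# Minimal sketch: TPR and FPR from confusion-matrix counts.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 1, 0, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

tpr = tp / (tp + fn)   # True Positive Rate (recall / sensitivity)
fpr = fp / (fp + tn)   # False Positive Rate
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```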

Interpretation of AUC-ROC curve:

Let's now look at the analysis of binary classification based on the AUC score and the ROC curve. For this, we will take the earlier 'Cancer Diagnosis' example, i.e. we have to predict whether a patient has cancer or not.

Red curve: Positive — Patient diagnosed with cancer.

Green curve: Negative — Patient not diagnosed with cancer.

Example — 1:

The threshold is set to 0.5. There is no overlap between the two curves (green and red). This is the best model, with an AUC score of 1.0. This indicates that the probability that the model separates the positive and negative classes is 1.0. In other words, there is a 100% chance the model can separate the positive and negative classes.

ROC curve with AUC score of 1.0

Example — 2:

The threshold is set to 0.5. There is a little bit of overlap between the two curves (green and red). This is a good model, with an AUC score of 0.8. This indicates that the probability that the model separates the positive and negative classes is 0.8. In other words, there is an 80% chance the model can separate the positive and negative classes.

ROC curve with AUC score of 0.8

Example — 3:

The threshold is set to 0.5. We can see a full overlap between the two curves (green and red). This is a bad model, with an AUC score of 0.5. This indicates that the probability that the model separates the positive and negative classes is 0.5. In other words, there is a 50% chance the model can separate the positive and negative classes, which is no better than random guessing.

ROC curve with AUC score of 0.5

Example — 4:

The threshold is set to 0.5. We can see that the two curves (green and red) overlap and have effectively swapped sides, i.e. the model is largely predicting the opposite class. This is the worst model, with an AUC score of 0.2. This indicates that the probability that the model separates the positive and negative classes is 0.2. In other words, there is a 20% chance the model can separate the positive and negative classes.

ROC curve with AUC score of 0.2

What makes the AUC-ROC curve different from other metrics is that we can choose the classification threshold, and setting the threshold value depends on business needs and importance. The bar for a good score also depends on the domain: in the medical field, for example, the requirement is very strict, say 0.95, so an AUC score above 0.95 is considered good and anything below it is not.
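A minimal sketch of how this looks in code (assuming scikit-learn and matplotlib are available; the probability scores below are made up): roc_curve returns the FPR/TPR pairs obtained by sweeping the threshold, and roc_auc_score gives the area under that curve.

```python
# Minimal sketch: ROC curve and AUC score on made-up probability scores.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_actual = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
y_scores = [0.9, 0.8, 0.35, 0.7, 0.6, 0.2, 0.75, 0.3, 0.55, 0.4]  # P(positive class)

fpr, tpr, thresholds = roc_curve(y_actual, y_scores)  # one point per threshold
print("AUC:", roc_auc_score(y_actual, y_scores))

plt.plot(fpr, tpr, label="ROC curve")
plt.plot([0, 1], [0, 1], "k--", label="Random classifier (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```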

When to use AUC — ROC curve:

Well, this is sometimes quite confusing due to contradictory statements: on one hand, we can use AUC-ROC when the data is balanced, and on the other hand, we can use it when the data is imbalanced, too.

  1. Balanced data: The clear picture is that AUC-ROC can be used when the data is almost balanced, say when the proportion of the positive and negative classes is around 60:40 or 70:30. It gives a good interpretation of the True Positive Rate (TPR) and False Positive Rate (FPR).
  2. Imbalanced data: On the other hand, AUC-ROC can also be used when the data is imbalanced, by leveraging the flexibility of setting the threshold value based on the data and business needs.

Note: In reference to the second point above, AUC-ROC is rarely used when the data is imbalanced. Instead, the Precision-Recall curve is used, as it makes more sense than the AUC-ROC curve (see the sketch below). In the case of imbalanced data, it is sometimes hard to select a metric, and it largely depends on the data and business needs. One of the main advantages of AUC-ROC is that we can set the threshold based on business needs.
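For completeness, here is a minimal sketch of the Precision-Recall curve mentioned above (scikit-learn assumed; the imbalanced toy labels and scores are made up):

```python
# Minimal sketch: Precision-Recall curve and its area on imbalanced toy data.
from sklearn.metrics import precision_recall_curve, auc

y_actual = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]                         # mostly negatives
y_scores = [0.8, 0.3, 0.4, 0.7, 0.2, 0.1, 0.35, 0.25, 0.6, 0.15]  # P(positive class)

precision, recall, thresholds = precision_recall_curve(y_actual, y_scores)
print("Area under the Precision-Recall curve:", auc(recall, precision))
```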

5. Log-Loss

Log-loss (Logarithmic Loss) is another good metric used to check the performance of a model. Log-loss penalizes false classifications by taking the predicted probability of the classification into account.

Unlike other metrics, log-loss uses probability scores. It is used in both binary and multi-class classification. As the log-loss increases, the predicted probabilities deviate further from the actual labels. The lower the log-loss, the better the model. So, our objective is to minimize the log-loss as far as we can. A model with a log-loss of 0 is said to be a perfect model.

The log-loss formula (for binary classification) is as follows:

Log-Loss = -(1/N) × Σ [ y_i × log(p_i) + (1 - y_i) × log(1 - p_i) ]

where N is the number of samples, y_i is the actual label (0 or 1) and p_i is the predicted probability that sample i belongs to the positive class.
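Here is a minimal sketch checking this formula against scikit-learn's log_loss (the probabilities are made up):

```python
# Minimal sketch: log-loss by hand vs. sklearn.metrics.log_loss.
import numpy as np
from sklearn.metrics import log_loss

y_actual = np.array([1, 0, 1, 1, 0])
p_pred   = np.array([0.9, 0.2, 0.7, 0.6, 0.4])  # predicted P(y = 1)

# -(1/N) * sum( y*log(p) + (1 - y)*log(1 - p) )
manual = -np.mean(y_actual * np.log(p_pred) + (1 - y_actual) * np.log(1 - p_pred))

print("Manual log-loss :", manual)
print("sklearn log_loss:", log_loss(y_actual, p_pred))  # same value
```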

Graphical representation of log-loss:

Graphical Representation of Log — Loss

When to use Log-Loss?

  1. Log-loss is used when the model outputs the probability of the binary classes {0, 1}. The main advantage of log-loss is that it penalizes false classifications or wrong predictions.
  2. Log-loss can be used for both balanced and imbalanced data. Since log-loss is impacted by imbalanced data, it is always good to balance the data using techniques like over-sampling, under-sampling, etc.

Summary:

  1. Precision: Precision measures what proportion of the predicted positive labels is actually positive. We focus mostly on the false-positive value and try to decrease it to the least possible value, thereby increasing the precision value.
  2. Recall: Recall measures what proportion of the actual positive labels is correctly predicted as positive. We focus mostly on the false-negative value and try to decrease it to the least possible value, thereby increasing the recall value.
  3. F1-Score: The F1-Score is the 'Harmonic Mean' of precision and recall. We focus on both the false-positive and false-negative values and try to decrease both, thereby increasing the F1-Score.
  4. AUC-ROC curve: It is a graphical representation of the ROC curve and the region/area under the curve, i.e. the AUC. It is mostly used in binary classification. It interprets the probability or percentage of separability of the positive and negative classes. The higher the AUC-ROC, the better our model is at separating the positive and negative classes.
  5. Log-Loss: Log-loss, also called logarithmic loss, is used for both binary and multi-class classification. Unlike the other metrics, it is based on probability scores. The lower the log-loss, the better our model.

