Performance Metrics: Machine Learning through visuals.

Amey Naik
Machine Learning through visuals
7 min read · Aug 15, 2018

#3: What are some performance metrics used in ML?

Welcome to “Machine Learning through visuals”. In this series, I want the reader to quickly recall and more importantly retain the concepts through simple visual cues shown in this article. A large body of research indicates that visual cues help us to better retrieve and remember information. Here, the concepts are not discussed in detail. The assumption is that the reader already knows the concept but wants a quick recap of it.

The purpose of a metric is to evaluate a machine learning model. Depending on the application, we might need to evaluate different characteristics in the results.

In this post, you will see all the well-known metrics on a page, with examples for each and a visual cue to recall. The example will help you recollect the importance of a metric and describe a scenario in which it is useful. Visual cues will help you clarify the confusion matrix (which I assume you know) and beyond.

Let’s start by listing all the classification metrics commonly used:

  1. Accuracy
  2. Precision
  3. Recall
  4. Specificity (Whoa! It’s a tongue-twister. Say it 5 times now!)
  5. F1 Score
  6. AUC ROC
  7. Logarithmic Loss

Ok, so let’s start with —

1. Accuracy — Ratio of correct predictions to total predictions made. Suitable only when there are roughly equal numbers of observations in each class.

Example: Assume there is a set of images containing roughly equal (need not be exactly equal) numbers of cats and dogs, and we want an algorithm to correctly classify each image. Accuracy is a reasonable metric here. On the other hand, suppose the dataset has a significant disparity between the number of positive and negative labels: if we have 95 dog images and just 5 cat images, an algorithm which classifies all of them as ‘dog’ will have an accuracy of 95%, which is very misleading!

Ideally we want the accuracy to be as high as possible → 100%

Formula for Accuracy
Visual Cue for Accuracy
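To make this concrete, here is a minimal sketch of the cat/dog example above (assuming string labels and using scikit-learn’s accuracy_score; a plain ratio of correct predictions gives the same number):

```python
from sklearn.metrics import accuracy_score

# 95 dog images and 5 cat images (the imbalanced case above)
y_true = ["dog"] * 95 + ["cat"] * 5
# A lazy classifier that predicts 'dog' for everything
y_pred = ["dog"] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- misleadingly high
```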

2. Precision — Of all the samples classified as positive (TP + FP), i.e. “Predicted Positive”, how many truly belong to the positive class (TP).

Example: Repeating the example mentioned above, suppose we have 95 dog images and just 5 cat images, and an algorithm which classifies all of them as ‘cat’. Precision for classifying cats for this algorithm has TP = 5, FP = 95. So Precision (for classifying cats) = 5/(5+95) = 5%. Here the precision metric indicates something is terribly wrong with the algorithm.

Having high precision means that when you classify an image as ‘cat’, you’re usually right about it.

Ideally we want the precision to be as high as possible → 100% (for both the classes, or at least for the class of interest). Precision is also called Positive Predictivity or Positive Predictive Value (PPV).

Formula for Precision
Visual Cue for Precision
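A minimal sketch of the precision calculation above, assuming the same string labels and scikit-learn’s precision_score:

```python
from sklearn.metrics import precision_score

# 95 dog images and 5 cat images
y_true = ["dog"] * 95 + ["cat"] * 5
# An algorithm that classifies everything as 'cat'
y_pred = ["cat"] * 100

# Precision for the 'cat' class: TP / (TP + FP) = 5 / (5 + 95)
print(precision_score(y_true, y_pred, pos_label="cat"))  # 0.05
```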

3. Recall — Of all the samples that are positive (TP + FN) i.e. “Actually Positive”, how many were correctly classified as positive (TP).

Example: Repeating the example mentioned above, suppose we have 95 dog images and just 5 cat images, and a stupid algorithm which classifies all of them as ‘cat’. Recall for classifying cats for this algorithm has TP = 5, FN = 0. So Recall (for classifying cats) = 5/(5+0) = 100%. But the precision of such a model is 5%.

Having high recall means that you can identify most of the ‘cats’ given in a data set.

From the above example it is clear, Recall is about capturing all actual cat images as cats. Recall is also called Sensitivity.

Formula for Recall
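The same example as a sketch, assuming scikit-learn’s recall_score with ‘cat’ as the positive class:

```python
from sklearn.metrics import recall_score

# 95 dog images and 5 cat images
y_true = ["dog"] * 95 + ["cat"] * 5
# An algorithm that classifies everything as 'cat'
y_pred = ["cat"] * 100

# Recall for the 'cat' class: TP / (TP + FN) = 5 / (5 + 0)
print(recall_score(y_true, y_pred, pos_label="cat"))  # 1.0
```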

I know, recall and precision still sound confusing. Consider the following two applications:

a. Consider an intrusion detection application that sounds an alarm if an intruder is detected outside the door of a house. In this type of use case, ideally we want all intrusions to be detected, presumably at any cost. This is to say we want False Negatives (FN) → 0, which means we want a Recall of 100%. Precision is a little less important in this application, as we can tolerate a few false positives that cause no harm. In a dataset of 100 examples, let’s say there are 5 intrusion cases and 95 cases of no intrusion. If we use an algorithm which classifies all 100 cases as intrusions, we achieve 100% recall but a terrible precision = 5/(5+95) = 5%. Such low precision is not desirable either: a few false positives are acceptable, but here there are 95, which is far too many. So, in such applications, we try to make recall very high without letting precision get too bad (see the sketch after the next example).

b. Consider a device such as Amazon Alexa. Such a device triggers on the voice command “Alexa”. In this case, we wouldn’t want the device to get triggered every now and then by background speech. This means we want the False Positives (FP) → 0, i.e. we want precision to be very high so that the device doesn’t trigger randomly. Here it is okay to have some False Negatives (FN), i.e. the device occasionally not responding when you say “Alexa”. In other words, we try to make precision very high without letting recall get too bad.
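A quick numeric check of the intrusion example in (a), assuming the hypothetical encoding 1 = intrusion, 0 = no intrusion:

```python
from sklearn.metrics import precision_score, recall_score

# 5 intrusions and 95 non-intrusion cases; a detector that flags everything
y_true = [1] * 5 + [0] * 95
y_pred = [1] * 100

print(recall_score(y_true, y_pred))     # 1.0  -- every intrusion caught
print(precision_score(y_true, y_pred))  # 0.05 -- 95 false alarms
```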

4. Specificity — Of all the samples that are negative (TN + FP) i.e. “Actually Negative”, how many were correctly classified as negative (TN).

Formula for Specificity

It is the opposite of what recall is. Just interchange Positive ↔️ Negative.
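To my knowledge, scikit-learn has no dedicated specificity function, but since specificity is just recall computed on the negative class, one sketch (assuming 1 = positive, 0 = negative) is:

```python
from sklearn.metrics import recall_score

# 5 positives and 95 negatives; a classifier that predicts positive for everything
y_true = [1] * 5 + [0] * 95
y_pred = [1] * 100

# Specificity = TN / (TN + FP) = recall of the negative class
print(recall_score(y_true, y_pred, pos_label=0))  # 0.0 -- no negatives identified
```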

5. F1 Score — Harmonic mean of Precision and Recall. It is a single score that represents both Precision and Recall.

So why Harmonic Mean, why not Arithmetic Mean? Because it punishes extreme values more.

As explained here, consider a trivial method (e.g. always returning class A). There are infinite data elements of class B, and a single element of class A:

Precision: 0.0
Recall: 1.0

When taking the arithmetic mean of precision and recall, this trivial method would score 0.5, despite being the worst possible outcome! With the harmonic mean, however, the F1-score is 0.

Arithmetic mean: 0.5
Harmonic mean: 0.0

In other words, to have a high F1-Score, you need to have both high precision and recall.
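A small sketch of the harmonic-versus-arithmetic-mean argument, with nothing assumed beyond precision = 0 and recall = 1:

```python
precision, recall = 0.0, 1.0

arithmetic_mean = (precision + recall) / 2
# Harmonic mean (the F1 score); guard the division when both terms are 0
f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(arithmetic_mean)  # 0.5 -- looks deceptively reasonable
print(f1)               # 0.0 -- correctly flags the useless classifier
```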

6. AUC-ROC (Area Under the Curve — Receiver Operating Characteristic)

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. Axes for ROC are TPR (True Positive Rate) and FPR (False Positive Rate).

TPR Formula
FPR Formula
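Written out, the two axes are (standard definitions, consistent with the recall and specificity formulas above):

$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad \mathrm{FPR} = \frac{FP}{FP + TN} = 1 - \mathrm{Specificity}$$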

The curve tells us how well the model can distinguish between the two classes. A good model separates the two classes cleanly, whereas a poor model has difficulty distinguishing between them.

Ideally we want AUC ROC to be → 1.0

The key point to note is that the area under the curve (AUC) is highest when the two distribution curves (one for each class) are farthest apart, with little overlap.

ROC AUC demo — Source http://www.navan.name/roc/

Here is a good article explaining AUC-ROC in detail.
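A minimal sketch of computing AUC-ROC with scikit-learn’s roc_auc_score, using made-up labels and predicted probabilities purely for illustration:

```python
from sklearn.metrics import roc_auc_score

# True labels and hypothetical predicted probabilities for the positive class
y_true  = [0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80]

print(roc_auc_score(y_true, y_score))  # 0.75
```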

7. Logloss — Log Loss quantifies the accuracy of a classifier by penalizing false classifications. Mathematically Log Loss is defined as —
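In its usual multi-class form (a standard formulation, matching the symbols described in the next paragraph):

$$\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij})$$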

Where N is the number of samples or instances, M is the number of possible labels, y_ij is a binary indicator of whether or not label j is the correct classification for instance i, and p_ij is the model’s predicted probability of assigning label j to instance i.

Log Loss has no upper bound and exists on the range [0, ∞). A Log Loss nearer to 0 indicates higher accuracy, whereas a Log Loss further from 0 indicates lower accuracy. The log operation uses the natural base.

Logloss is easier to understand for the two-class scenario (i.e. M = 2). The formula reduces to —

logloss for M=2
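Written out for the binary case (y is 0 or 1, and p is the predicted probability that y = 1):

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\,y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\,\Bigr]$$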

Here let’s assume N = 4 (i.e. we only have 4 data points), as follows —

Datapoint 1: actual output y = 1 ; predicted probability p = 0.3

logloss(1) = -(1*log(0.3) + (1-1)*log(1-0.3)) = 1.2039

Datapoint 2: actual output y = 1; predicted probability p = 0.5

logloss(2) = -(1*log(0.5) + (1-1)*log(1-0.5)) = 0.6931

Datapoint 3: actual output y = 0; predicted probability p = 0.01, so the predicted probability of the negative class is 1-p = 0.99 (because p is the predicted probability of y = 1)

logloss(3) = -(0*log(0.01) + (1-0)*log(0.99)) ≈ 0.01

Datapoint 4: actual output y = 1; predicted probability p = 1

logloss(4) = -(1*log(1) + (1-1)*log(1-1)) = 0.0 (taking 0*log(0) as 0)

Total logloss = [logloss(1) + logloss(2) + logloss(3) + logloss(4)]/4 = 0.4768
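The four data points above, reproduced as a short sketch (natural log, as noted earlier; printed values rounded to four decimals):

```python
import math

# (actual label y, predicted probability p of y = 1)
data = [(1, 0.3), (1, 0.5), (0, 0.01), (1, 1.0)]

def logloss(y, p, eps=1e-15):
    p = min(max(p, eps), 1 - eps)  # clip p to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

losses = [logloss(y, p) for y, p in data]
print([round(l, 4) for l in losses])        # [1.204, 0.6931, 0.0101, 0.0]
print(round(sum(losses) / len(losses), 4))  # 0.4768
```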

Thanks for reading. Always open to suggestions and comments. You can comment on how you would remember any of these metrics.
