
Practical Classification Metrics

13 min read · Apr 26, 2021


Photo by Lachlan Donald on Unsplash

A quick Google search will tell you there are hundreds of different papers/blogs/articles on ‘classification metrics’, viz., precision, recall, F1, True Positive Rate, False Positive Rate, PPV, NPV, etc. These articles are a great starting point if you don’t understand the definitions of these metrics. For instance, this blog talks about various classification metrics in sklearn; this blog charts out 19 different metrics for binary classification (many of which I have never even heard of); this one takes a ‘confusion matrix’ point of view to explain some metrics; and then there are blogs for niche cases like imbalanced classification. But these blogs either present too much information, leaving the onus of selecting the right metrics on the user, or provide an academic view of metrics, which might or might not be actionable in an industrial setting (e.g., an AUC metric does not give you an operating point). Here I provide a recipe for deciding which metrics to use for various scenarios in practical (binary and multi-class) classification problems. It is intended for users who know the basic definitions, but don’t know how to pick the metric for the desired impact.


Q: What metrics should I choose? A: Try the Precision-Recall-Coverage (PRC) spectrum.

Given the popularity of Machine Learning (ML) in the corporate world, many product managers are now familiar with the concept of precision-recall, or precision-coverage, and often confuse the two. The problem is that even after working in the space of machine learning for 15 years, I still get confused about what precision ‘exactly’ means for the problem at hand. In general, precision means ‘what percentage of the predictions is correct’. But one can come up with two very different definitions of precision for a multi-class problem, both valid interpretations of ‘percentage of correct predictions’. Recall and coverage are two other metrics which are often (incorrectly) used interchangeably. Recall is essentially ‘what percentage of the primary class was I able to retrieve’, and coverage is essentially ‘what percentage of the population did I make a prediction for’. There are times when one of them does not make sense for the problem; e.g., it is not common to use coverage in a binary setting (although I will show that it makes a lot of sense from a practical point of view).

So how does one go about deciding which metric to choose? By working backwards from the customer. First, understand what the metrics mean in different problem settings (which is where this blog will help you); next, decide what the implications of these classification metrics are for the application and the end customer. For example, when the task is classifying products into a set of categories, you might ask: Am I using the classification system for automated classification or as an input to a manual workflow? Is there a particular category that is of special interest to me, or are all categories equally important? Or maybe the more popular classes, i.e., classes with a higher incidence rate, should get higher weights? Is it okay to not classify some of the products? What about not supplementing low-confidence items with manual labels? The answers to some of these questions will define how to go about choosing the right metric. Let us take a stab at understanding various metrics in different scenarios.

A word of caution before we deep-dive: this blog primarily focuses on various precision-recall-coverage metrics for classification problems. In my experience they are more actionable and have a simpler interpretation as far as impact on the end customer is concerned. Other metrics such as AUC, ROC, TPR, FPR, etc., may find use in specific contexts; e.g., I would still prefer PR-AUC or ROC-AUC to decide between two algorithms, as they are more robust measures across various operating points. But precision-recall-coverage-based metrics are a good starting point for most practical classification problems, where the classification outputs are used in some downstream application.


Precision-Recall-Coverage metrics are very practical for Classification Problems in business settings.

TL;DR

The tables below summarize the various metrics that can be used in different settings:

Binary Classification

Table-1: Various Precision-Recall-Coverage metrics for binary classification. Usually, binary classifiers have a primary class which is the main class of interest; e.g., spam vs no-spam has ‘spam’ as the primary class. Column-1 lists the various PRC metrics that can be used, Column-2 defines the metric on which to choose the ‘operating point’, Column-3 lists the desired primary metrics that a user needs to input for the operating point, and Column-4 provides an insight into how the corresponding metric can be computed.

Multi-Class Classification

Table-2: Various Precision-Recall-Coverage metrics for multi-class classification. Column-1 lists the various PRC metrics that can be used, Column-2 defines the metric on which to choose the ‘operating point’, Column-3 lists the desired primary metrics that a user needs to input for the operating point, and Column-4 provides an insight into how the corresponding metric can be computed.

Multi-Label Classification

In the next blog.

Binary Classification

Binary classification is performed when you have two classes. Usually, one of the two classes is the primary class (say with label ‘1’) and the second class is the absence of the primary class (say with label ‘0’), e.g., spam vs no-spam, electronics vs non-electronics. In some cases, both classes might be of interest, e.g., men’s vs women’s. But in both scenarios only a single classifier is learned to separate the two classes. Let us say there are ‘N’ items in the test set, and we store the predictions for both classes. This results in an output matrix ‘O’ which is ‘N’ x 2 dimensional.

Looking at Table-1 above, we can compute three different variants of Precision-Recall-Coverage metrics, which use different operations on the output matrix ‘O’. For reference, we can compute the ‘accuracy’ metric by taking the max of the two scores and comparing the corresponding predicted label with the ground truth.

Figure-1: Various output scores. a) Input Data with ground truth, b) output scores for both the classes, c) Score for the primary class (column-2), and d) Max per-row and the corresponding index of max as the predicted label.

For instance, for the above toy problem, the classification accuracy is 75%, i.e., 6/8, as 6 predicted labels match the ground truth out of a total of 8, independent of whether the label is 0 or 1. Also note that in this blog I have taken a simplistic version of accuracy where the max of the two scores gives the predicted output, i.e., a threshold of 0.5. It is perfectly valid to choose different thresholds, which will give different accuracy numbers, but it is better to define PRC-based metrics in such scenarios. Accuracy is still a popular choice in academic settings, especially when the two classes are balanced in the dataset.
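To make this concrete, here is a minimal NumPy sketch of the accuracy computation on an N x 2 output matrix. The scores and labels below are made up for illustration; they are not the exact values from Figure-1.

```python
import numpy as np

# Illustrative N x 2 output matrix 'O' (scores for class 0 and class 1 per row)
# and ground-truth labels; made-up values, not the Figure-1 data.
O = np.array([
    [0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3],
    [0.1, 0.9], [0.6, 0.4], [0.3, 0.7], [0.8, 0.2],
])
y_true = np.array([0, 1, 0, 0, 1, 1, 1, 0])

# Predicted label = index of the per-row max score (equivalent to a 0.5 threshold).
y_pred = O.argmax(axis=1)
accuracy = (y_pred == y_true).mean()   # fraction of labels matching the ground truth
print(f"accuracy = {accuracy:.2%}")
```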

Precision-Recall for the Primary Class

The precision-recall (PR) curve in the binary setting is one of the most common metrics for binary classification. One can compute two different PR curves, one for each of the classes, although usually it is computed only for the primary class. In most problems, achieving both high precision and high recall is not possible, and one has to decide which has the ‘desired impact’ on the downstream application: precision, recall, or a combination of the two (e.g., F1-score). For example, for the product classification problem, if you are using your classifier to enable better ‘browse’, i.e., a customer selects a category (say ‘laptop’) and the system shows all the products labeled with that category (e.g., show all laptops), then compromising recall for precision might make more sense if you don’t want to show ‘desktops’ instead of ‘laptops’ (this is what happens on Google search anyway). One example where a high-recall regime might be desired is ‘offensive’ products such as guns, drugs, etc. Even showing one offensive product can lead to trust/public-relations issues, so if you are building an offensive-product classifier, you might want to operate at a higher recall. If you don’t have a clear picture of what is more important for you, a good option is to find the operating point with the highest F1-score, defined as (2 * Precision * Recall)/(Precision + Recall). Again, you need to deep-dive into how the classifier is used in the downstream application and how it impacts the end customer to decide between a high-precision, high-recall, or high-F1 operating point.

Figure-2: [Left] Scores and labels sorted based on primary score, [Middle] Precision-Recall plot for the toy problem, [Right] Precision and Recall for different choice of thresholds.

For the toy problem, Figure-2 (left) shows the scores ranked by the primary-class score, (middle) the Precision-Recall curve, and (right) the precision and recall at different thresholds. The threshold plot is needed for deciding an operating point. Say we want to operate at 90% precision; looking at the PR plot, we can see that the recall is 50%. The corresponding threshold is ‘0.6’, which means that we should classify an example as the ‘primary’ class if the score of the classifier is ≥0.6 and the rest as the ‘secondary’ class. The plots themselves are interpolated, as for this toy problem we have only 8 examples. The best F1-score of 0.75 is achieved at P=0.75, R=0.75, th=0.3.
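If you use scikit-learn, the PR curve and the two kinds of operating points discussed above (a precision target, and the best F1) can be obtained roughly as follows. This is a sketch on the made-up scores from the earlier snippet, so the printed numbers will differ from the Figure-2 toy problem.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Ground truth and primary-class scores (column-2 of the made-up matrix 'O' above).
y_true = np.array([0, 1, 0, 0, 1, 1, 1, 0])
primary_scores = np.array([0.1, 0.8, 0.6, 0.3, 0.9, 0.4, 0.7, 0.2])

precision, recall, thresholds = precision_recall_curve(y_true, primary_scores)

# Operating point 1: lowest threshold that meets a precision target.
target_precision = 0.90
eligible = np.where(precision[:-1] >= target_precision)[0]  # precision has one extra entry
if eligible.size:
    i = eligible[0]
    print(f"P>={target_precision}: th={thresholds[i]:.2f}, "
          f"P={precision[i]:.2f}, R={recall[i]:.2f}")

# Operating point 2: threshold with the best F1-score.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = f1.argmax()
print(f"best F1={f1[best]:.2f} at th={thresholds[best]:.2f}")
```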

Note that PR curves are sensitive to the class proportions in the test set. Two PR curves cannot be compared across two different test sets if the class proportions are different: a test set with a 50:50 proportion will in general have higher scores than a test set with a 10:90 proportion for the primary class. This is why some people prefer TPR/FPR-based metrics, as they normalize for the class imbalance. But if, in a practical setting, you expect the imbalance to be present, PRC metrics might be more interpretable.

Precision-Recall-Coverage for the Primary Class

Figure-3: Precision-Recall-Coverage for the Primary Class

In the above scenario, we ‘always’ made a decision: primary or secondary. At any operating threshold (th), be it at a particular precision, or recall, or based on the best F1-score, all test items with scores ≥th are labelled ‘primary’ and those <th are labelled ‘secondary’. What if we are given the flexibility to ‘not’ classify an item? Maybe we want to ‘ensure’ that we are operating at both high precision and high recall. Maybe we have audit bandwidth to label some items, but want to pick the ones that are most difficult for the system. Given that we have already spent enough time optimizing the features, classifier, parameters, training set, etc., ‘not classifying’ is one option to ensure both high precision and high recall. The percentage of items we do classify is called coverage in this case. The tradeoff here is how to maximize coverage at high precision and high recall. The idea is to not classify an item if its score falls between an ‘upper’ and a ‘lower’ threshold, and to compute precision and recall on the remaining samples. For example, in Figure-3, if we choose the upper threshold to be ≥0.7 and the lower threshold to be <0.1, then we achieve P=1.0, R=1.0, and C=3/8=37.5%. In general, it is better to treat the secondary class as an alternate class and adopt the precision-coverage metrics shown below instead.
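A rough sketch of this computation, using the same made-up binary scores as before: items whose primary score falls between the lower and upper thresholds are left unclassified, and precision/recall are computed only on the items that do get a label.

```python
import numpy as np

def prc_with_abstention(y_true, primary_scores, upper, lower):
    """Precision, recall and coverage when items with lower <= score < upper
    are left unclassified. Illustrative helper, not code from the original post."""
    as_primary = primary_scores >= upper
    as_secondary = primary_scores < lower
    decided = as_primary | as_secondary

    coverage = decided.mean()
    tp = (as_primary & (y_true == 1)).sum()
    precision = tp / max(as_primary.sum(), 1)
    # Recall over the covered items only, as described above.
    recall = tp / max((decided & (y_true == 1)).sum(), 1)
    return precision, recall, coverage

y_true = np.array([0, 1, 0, 0, 1, 1, 1, 0])
primary_scores = np.array([0.1, 0.8, 0.6, 0.3, 0.9, 0.4, 0.7, 0.2])
print(prc_with_abstention(y_true, primary_scores, upper=0.7, lower=0.1))
```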

Precision-Coverage for both Primary and Secondary Class (PPV, NPV)

Precision-Coverage for both the primary and secondary class is also a useful metric. It is used in situations where both classes are equally important (e.g., accurately classifying into men’s and women’s). These metrics are also called PPV (positive predictive value) and NPV (negative predictive value). For binary problems they can be computed in two different ways, both leading to the same answer. The first approach is the same as that for the ‘Precision-Recall-Coverage’ plots: using two different thresholds, upper and lower, on the sorted primary scores. Items ≥upper are classified as primary and items <lower are labeled secondary. The second is to compute precision-coverage of the primary and secondary classes independently, and to pick two thresholds based on the two plots. For instance, using the same thresholds as in Figure-3 (upper ≥0.7 and lower <0.1), we achieve P_primary=2/2=1.0, P_secondary=1/1=1.0, and C=3/8=37.5%. An alternate choice could be upper ≥0.7, lower ≤0.2, which leads to P_primary=2/2=1.0, P_secondary=3/4=0.75, and C=6/8=75%. The same metrics can be computed in an alternate way which helps extend per-class precision-coverage metrics to the multi-class setting. Here, instead of sorting the primary score, we sort the max score. Then, using the classification into primary and secondary, we find two different thresholds for the desired precision.

Figure-4: Alternate approach to compute Precision-Coverage for both primary and secondary Class.

As shown in Figure-4, we collect the items classified as primary and secondary separately and then find two different thresholds based on the desired precision for each of the classes. In the figure, the threshold for primary is ≥0.7 and the threshold for secondary is ≥0.8 (which is the same as ≤0.2 on the primary score), leading to P_primary=2/2=1.0, P_secondary=3/4=0.75, and C=6/8=75%.
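Below is a hedged sketch of this Figure-4 style computation on the made-up binary matrix ‘O’: split items by predicted class and apply one max-score threshold per class. The helper name and the threshold values are mine, chosen only for illustration.

```python
import numpy as np

def per_class_precision_coverage(y_true, O, class_thresholds):
    """Precision per predicted class and overall coverage, with one max-score
    threshold per class (Figure-4 style). Illustrative sketch, not library code."""
    max_scores = O.max(axis=1)
    y_pred = O.argmax(axis=1)
    covered = np.array([max_scores[i] >= class_thresholds[int(c)]
                        for i, c in enumerate(y_pred)])
    precisions = {}
    for c in range(O.shape[1]):
        mask = covered & (y_pred == c)
        precisions[c] = (y_true[mask] == c).mean() if mask.any() else float("nan")
    return precisions, covered.mean()

O = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3],
              [0.1, 0.9], [0.6, 0.4], [0.3, 0.7], [0.8, 0.2]])
y_true = np.array([0, 1, 0, 0, 1, 1, 1, 0])
# e.g., demand >=0.7 confidence for 'primary' (class 1) and >=0.8 for 'secondary' (class 0)
print(per_class_precision_coverage(y_true, O, class_thresholds={1: 0.7, 0: 0.8}))
```

Since the helper only relies on the max score and the predicted class, the same sketch carries over unchanged to the multi-class case discussed later.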

Multi-class Classification

A multi-class classification problem is when we want to pick one out of ‘C’ classes as the label. Usually, all classes are of interest. Multi-class classifiers can be trained in a one-vs-all fashion, which amounts to learning one classifier per class, or as a single joint classifier, e.g., using a softmax loss function, which returns a probability distribution over the classes. As with binary classification, ‘classification accuracy’ is the simplest metric one can choose to evaluate a multi-class classifier. There are two variants of classification accuracy: micro and macro. Micro-accuracy averages over each instance, which means classes with more instances get higher weight and consequently contribute more to the performance. To compute macro-accuracy, we first compute the accuracy per class and then average, ensuring each class gets the same weight. For the product classification problem, micro-accuracy makes sense if you want to ensure the catalog as a whole meets the desired performance. Macro-accuracy makes sense if you want to ensure each category in the catalog is accurate.

Figure-5: 3-class dataset a) Input Data with ground truth, b) output scores for all three classes, and c) Max per-row and the corresponding index of max as the predicted label.
Figure-6: Confusion Matrix

For instance, for the above toy problem, the micro-classification accuracy is 75%, i.e., 6/8, as 6 labels match from a total of 8. The per-class accuracy is best computed using a confusion matrix, as shown in Figure-6: Class-1 accuracy = 3/4, Class-2 accuracy = 1/2, Class-3 accuracy = 2/2, so macro-accuracy = (0.75 + 0.5 + 1)/3, which incidentally also turns out to be 75%.
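A short sketch of the micro/macro computation using scikit-learn’s confusion_matrix. The 3-class scores below are made up, but chosen so that the per-class accuracies mirror the toy numbers above (3/4, 1/2, 2/2).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up 3-class scores with the same per-class accuracies as the toy problem.
S = np.array([
    [0.70, 0.20, 0.10],  # true 0, predicted 0
    [0.60, 0.30, 0.10],  # true 0, predicted 0
    [0.50, 0.30, 0.20],  # true 0, predicted 0
    [0.20, 0.60, 0.20],  # true 0, predicted 1  (error)
    [0.10, 0.80, 0.10],  # true 1, predicted 1
    [0.45, 0.35, 0.20],  # true 1, predicted 0  (error)
    [0.10, 0.20, 0.70],  # true 2, predicted 2
    [0.20, 0.20, 0.60],  # true 2, predicted 2
])
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = S.argmax(axis=1)

micro_acc = (y_pred == y_true).mean()           # 6/8 = 0.75
cm = confusion_matrix(y_true, y_pred)           # rows = true class, cols = predicted
per_class_acc = cm.diagonal() / cm.sum(axis=1)  # [0.75, 0.5, 1.0]
macro_acc = per_class_acc.mean()                # 0.75
print(micro_acc, per_class_acc, macro_acc)
```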

Precision-Coverage (Global, and Per-class)

One practical metric for a multi-class classification problem is to classify the data at a desired precision and compute what percentage of the catalog is accurately classified, i.e., the coverage. As with micro- and macro-accuracy, the precision can be computed at the catalog level (where dominant classes will dominate the precision), or, if we want every class to be precise, we can pick a per-class threshold such that each class is above the desired precision and then compute the coverage. Both these metrics are computed by sorting the max scores.

Figure-7: [Left] Scores and labels sorted based on max score, [Middle] Precision-Coverage plot for the toy problem, [Right] Precision and Coverage for different choice of thresholds.

For the toy problem, Figure-7 (left) shows the scores ranked by the max score, (middle) the Precision-Coverage (PC) curve, and (right) the precision and coverage at different thresholds. The threshold plot is needed for deciding an operating point. Say we want to operate at 95% precision; looking at the PC plot, we can see that the coverage is 55% (interpolated). The corresponding threshold is ‘0.7’, which means that we should classify an example if the max score is ≥0.7 and mark the rest as ‘Unable to Classify’ (UTC). The plots themselves are interpolated, as for this toy problem we have only 8 examples.

Figure-8: Precision-Coverage per-class.

To compute a per-class threshold, we pivot the output based on the predicted class and then compute a threshold per class, so that each class is at 95% precision. Note that this is similar to Figure-4’s alternate approach for computing precision-coverage for both the primary and secondary class. For the toy multi-class problem, this leads to Class-1 coverage (at >95% precision) = 3/3, Class-2 coverage (at >95% precision) = 0/2, Class-3 coverage (at >95% precision) = 2/2, so overall coverage (at >95% precision) = (3 + 0 + 2)/8 = 62.5%. Note that the increase in coverage comes from Class-3, where we can afford a lower threshold of 0.4 in the toy problem. In the real world you might want a minimum count per class to ensure robustness.
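Here is a sketch of the per-class threshold search, on the made-up 3-class scores from the previous snippet (redefined so the block runs on its own). The helper sorts each predicted class by confidence and keeps the lowest threshold at which that class’s precision still meets the target; it is an illustration of the idea, not the exact procedure behind Figure-8.

```python
import numpy as np

def per_class_thresholds(y_true, S, target_precision):
    """For each predicted class, the lowest max-score threshold at which that
    class's precision meets the target. Illustrative sketch only."""
    max_scores, y_pred = S.max(axis=1), S.argmax(axis=1)
    thresholds = {}
    for c in range(S.shape[1]):
        idx = np.where(y_pred == c)[0]
        order = idx[np.argsort(-max_scores[idx])]   # most confident first
        prec_at_k = (y_true[order] == c).cumsum() / np.arange(1, len(order) + 1)
        ok = np.where(prec_at_k >= target_precision)[0]
        thresholds[c] = max_scores[order[ok[-1]]] if ok.size else float("inf")
    return thresholds

S = np.array([[0.70, 0.20, 0.10], [0.60, 0.30, 0.10], [0.50, 0.30, 0.20],
              [0.20, 0.60, 0.20], [0.10, 0.80, 0.10], [0.45, 0.35, 0.20],
              [0.10, 0.20, 0.70], [0.20, 0.20, 0.60]])
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])

ths = per_class_thresholds(y_true, S, target_precision=0.95)
covered = S.max(axis=1) >= np.array([ths[int(c)] for c in S.argmax(axis=1)])
print(ths, "coverage:", covered.mean())
```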

Precision-Recall (Per-class)

One can treat the C-class multi-class classifier as C independent binary classifiers and report precision-recall metrics for each of the binary classifiers, treating a particular class as the primary class and all other classes as secondary.

One can use the confusion matrix (Figure-6) for a point estimate of PR. For each class, the percentage of instances correctly classified into the class is the precision, and the percentage of the class’s instances retrieved is the recall. For example, for Class-1 the precision and recall are both 3/4 = 75%; Class-2: P=1/2, R=1/2; and Class-3: P=2/2, R=2/2 (again, precision and recall are coincidentally the same). To compute the entire PR curve, each column of the output matrix can serve as the score for the primary class. Following the approach used for binary PR curves, we can construct C different PR curves, one for each of the classes.

Figure-9: [Left] Scores for all the classes, [Middle] Example of sorting based on Class-1 scores, and [Right] Precision-Recall curves for all the classes.

For the 3-class problem above, Figure-9 (left) shows the scores for all the classes. In the middle, an example of sorting based on Class-1 scores is highlighted, and on the right the Precision-Recall curves for all the classes are plotted.
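A sketch of these one-vs-rest PR curves with scikit-learn, again on the made-up 3-class scores: each column of the score matrix is treated as the primary-class score for the corresponding class.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up 3-class scores and labels (same as the earlier multi-class sketch).
S = np.array([[0.70, 0.20, 0.10], [0.60, 0.30, 0.10], [0.50, 0.30, 0.20],
              [0.20, 0.60, 0.20], [0.10, 0.80, 0.10], [0.45, 0.35, 0.20],
              [0.10, 0.20, 0.70], [0.20, 0.20, 0.60]])
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])

# One PR curve per class: column c is the score for the binary task "class c vs rest".
for c in range(S.shape[1]):
    p, r, th = precision_recall_curve((y_true == c).astype(int), S[:, c])
    print(f"class {c}: precision={np.round(p, 2)}, recall={np.round(r, 2)}")
```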

Multi-Label Classification

In the next blog.
