A Discussion of Frequently Used Metrics for Classification Problems

Baris Gül
Nov 23, 2022 · 5 min read


In this article, I want to focus on evaluation metrics for classification problems. Of course, there are many resources that explain these metrics. What we want to focus on here is:

— What are these metrics, in short?

— When and where should each one be used, and when does it matter most?

— Which metric fits which case?

Questions like these are what this article aims to make clear.

1. Confusion Matrix

A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model.

In other words, it is a tabular summary of the numbers of correct and incorrect predictions made by a classifier, and it forms the basis for performance metrics such as accuracy, precision, recall, and F1-score.

Confusion Matrix
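As a minimal sketch (assuming scikit-learn is available; the label arrays are made-up toy data, not the output of a real model), the matrix can be computed like this:

```python
# Minimal sketch: confusion matrix for a binary problem with scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes (toy data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # classes predicted by the model (toy data)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```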

2. Classification Measures

Accuracy

  • Accuracy is useful and meaningful when the classification problem is well balanced, i.e., when there is no significant class imbalance.

Examples

  1. Predicting whether an asteroid will hit the Earth. A model that always answers “no” can reach 99% accuracy, yet it is not valuable at all, because it misses the one event that matters (see the sketch below).
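A minimal sketch of this case, on a made-up, heavily imbalanced dataset:

```python
# Minimal sketch: high accuracy on an imbalanced, made-up dataset.
from sklearn.metrics import accuracy_score

# 1 = asteroid impact (rare), 0 = no impact
y_true = [0] * 99 + [1]
y_pred = [0] * 100            # a "model" that always predicts "no impact"

print(accuracy_score(y_true, y_pred))  # 0.99, yet the one event that matters is missed
```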

Precision

  • Precision answers the question: what proportion of predicted positives is truly positive? It is used when we want to be very sure of our positive predictions.

Examples

  1. We want to decide whether to decrease the credit limit on a particular account. In this case, we want to be very sure about the prediction in order to avoid customer dissatisfaction.
  2. Assume we want to detect spam e-mails. Here precision should be as high as possible, because a false positive means a legitimate mail lands in the spam folder and gets overlooked (see the sketch after this list).
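A minimal sketch of the precision definition, TP / (TP + FP), on made-up spam labels:

```python
# Minimal sketch: precision = TP / (TP + FP), on made-up labels (1 = spam).
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Of all mails predicted as spam, how many really are spam?
print(precision_score(y_true, y_pred))
```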

Recall (Sensitivity)

Recall answers the question: what proportion of actual positives is correctly classified? It is a very useful measure when we want to capture as many positives as possible.

Examples

  1. We want to identify patients with cancer. We want to catch the disease even if we are not very sure.
  2. We want to reach all potential customers. The predictions will not all be correct, so precision is low, but recall is high and no potential customer is missed. Assume we want to print promotional cards: the printing cost hardly matters, so we can print 1,000 cards to reach only 100 real prospects. What matters is reaching every potential customer.
  3. We want to identify all residents infected with COVID-19 in order to prevent further infections. We may flag many residents who are not actually infected, but every infected resident will be inside the flagged group, and that is the focus.
  4. In another scenario, we have to identify bad loans. If we miss some of them, the bank keeps waiting for repayments that will never arrive, so we have to catch all bad loans. If we weighted precision instead, we would only flag loans that are already unpaid, which is not what we want (see the sketch after this list).
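A minimal sketch of the recall definition, TP / (TP + FN), on made-up disease labels:

```python
# Minimal sketch: recall = TP / (TP + FN), on made-up labels (1 = sick).
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 1, 0]

# Of all actually sick patients, how many did we catch?
print(recall_score(y_true, y_pred))
```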

— One final illustration for both metrics:

If we aim for high recall when detecting poisonous foods, we may label non-poisonous foods as poisonous as well. In that case we throw them away, and an unnecessary cost arises.

If we aim for high precision instead, some poisonous foods may remain among the non-poisonous ones. In that case, a poisoning might occur.

“Decide for yourself which cost is more dangerous.”
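This trade-off can be inspected directly by sweeping the decision threshold over predicted probabilities. Below is a minimal sketch with made-up scores; the values are purely illustrative.

```python
# Minimal sketch: precision/recall trade-off across thresholds (made-up scores).
from sklearn.metrics import precision_recall_curve

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                     # 1 = poisonous
y_score = [0.9, 0.4, 0.65, 0.8, 0.3, 0.55, 0.7, 0.2]   # predicted probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
# Lowering the threshold raises recall (fewer poisonous foods slip through)
# but tends to lower precision (more edible food is thrown away).
```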

F1 Score

The F1 score is the harmonic mean of precision and recall, F1 = 2 · (precision · recall) / (precision + recall), and lies between 0 and 1. We want a model with both good precision and good recall: if precision is low, F1 is low, and if recall is low, F1 is low as well.

Examples

  1. We return to the question of whether an asteroid will hit the Earth. If the model always answers “NO”, precision is 0 and recall is 0, hence the F1 score is also 0.
  2. As the police, you want to catch criminals. You want to be sure that the person you catch is actually a criminal (precision), and you also want to capture as many criminals as possible (recall).
  3. The bad-loan example also applies to the F1 score. With recall, we want to find all potential bad loans, and with precision, we want to flag only actual bad loans; both should be high for the goal. A good F1 score means you have few false positives and few false negatives, so you correctly identify real threats and are not disturbed by false alarms. An F1 score of 1 is perfect, while 0 means the model is a total failure.
  4. Consider a disease: failing to detect all cases would be dangerous, which argues for recall. Now assume the disease is cured by a certain treatment, but that treatment is harmful to people who do not have the disease. In this case we need a model that is both sensitive in detecting positive cases and precise in its detections. That is when the F1 score, the harmonic mean of precision and recall, comes into play (see the sketch below).
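A minimal sketch of F1 on made-up labels, computed both with scikit-learn and directly from the harmonic-mean formula:

```python
# Minimal sketch: F1 as the harmonic mean of precision and recall (made-up labels).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

print(f1_score(y_true, y_pred))   # library value
print(2 * p * r / (p + r))        # same value from the formula
```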

AUC

AUC is the area under the ROC curve. The ROC AUC indicates how well the predicted probabilities of the positive class are separated from those of the negative class. The ROC curve can also be used to decide on a threshold value.

The choice of threshold value will also depend on how the classifier is intended to be used.

AUC
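A minimal sketch of ROC AUC and of the threshold candidates along the ROC curve, using made-up probabilities:

```python
# Minimal sketch: ROC AUC and ROC-curve thresholds (made-up probabilities).
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.5, 0.8, 0.3, 0.55, 0.7, 0.2]

# How well the positive-class scores are separated from the negative-class scores.
print(roc_auc_score(y_true, y_score))

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  TPR={t:.2f}  FPR={f:.2f}")
# Each point on the curve corresponds to one candidate threshold.
```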

Micro Average

We reviewed how to choose evaluation metrics for binary classification. But what do we do when the target is not just yes or no but consists of multiple categories? One way is to count each outcome globally, regardless of the within-class distribution, and calculate the metric once. We can accomplish this by using the micro average.
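A minimal sketch of micro averaging on made-up multi-class labels: all true and false positives are pooled globally before the metric is computed.

```python
# Minimal sketch: micro-averaged precision on made-up multi-class labels.
from sklearn.metrics import precision_score

y_true = ["cat", "dog", "bird", "cat", "dog", "bird", "cat", "dog"]
y_pred = ["cat", "dog", "cat",  "cat", "bird", "bird", "cat", "dog"]

# 'micro': pool all true/false positives over all classes, then compute once.
print(precision_score(y_true, y_pred, average="micro"))
```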

Macro Average

Another way to deal with multi-class targets is to simply calculate the binary measures for each class. For example, if our target variable can be cat, dog, or bird, we get a binary yes-or-no answer for each class: Is this a cat? Is this a dog? Is this a bird? This leads to as many scores as there are target classes. We can then aggregate these scores into a single metric using a macro average or a weighted average.
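A minimal sketch of macro and weighted averaging on the same made-up labels: a per-class score is computed first and then averaged.

```python
# Minimal sketch: macro vs. weighted averaging on made-up multi-class labels.
from sklearn.metrics import f1_score

y_true = ["cat", "dog", "bird", "cat", "dog", "bird", "cat", "dog"]
y_pred = ["cat", "dog", "cat",  "cat", "bird", "bird", "cat", "dog"]

# 'macro': unweighted mean of the per-class F1 scores.
print(f1_score(y_true, y_pred, average="macro"))
# 'weighted': per-class F1 scores weighted by each class's support.
print(f1_score(y_true, y_pred, average="weighted"))
```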

Conclusion

An important step in building a machine learning pipeline is evaluating different models against each other. A bad choice of evaluation metric can wreak havoc on your whole system.

So always be mindful of what you are predicting and how the choice of evaluation metric might affect or alter your final predictions.

Also, the choice of an evaluation metric should be well aligned with the business objective; hence, it is a bit subjective.

Kind regards

Baris Gül
