How To Choose the Right Evaluation Metric in a Balanced Target Dataset Classification Task?

Erlando
6 min read · Aug 24, 2022



In classification, selecting the right metric is very important. If you choose the wrong metric to evaluate your models, you are likely to pick a poor model or, in the worst case, be misled about its expected performance. This article discusses several examples of choosing supervised learning evaluation metrics for a balanced dataset. Before digging further into evaluation metrics, we need to know what balanced and imbalanced data are.

Whether a dataset is balanced or imbalanced comes down to the ratio of the target classes in a classification problem. Binary classification is the easiest example: it is the task of classifying the elements of a set into two groups based on a classification rule. Examples are predicting whether someone has cancer or not, predicting whether someone is a good loan borrower or not, and so on. So what counts as a balanced dataset? Using the cancer prediction case, the dataset is balanced when the numbers of people with cancer and healthy people are roughly equal. The exact cut-off is debatable, but a balanced dataset usually has a target class ratio around 40:60, 35:65, or 50:50. What about an imbalanced dataset? The ratio is also debatable, but it is typically something like 90:10, 85:15, or 93:7. When selecting evaluation metrics, knowing whether the dataset is balanced or imbalanced matters, but in this article we will only focus on evaluation metrics for the balanced dataset.

The first thing we have to understand is the confusion matrix. For a binary classification problem, the confusion matrix is a 2 x 2 matrix, and it is one of the tools used to assess the correctness and accuracy of a model; it also generalises to problems with more than two classes. Suppose we want to predict whether someone has cancer or not. Using that cancer prediction case, let me introduce four new terms: TP, TN, FP, FN.

True Positive (TP)

Let’s start with the term TRUE: a True Positive is when our prediction of the POSITIVE class is TRUE (correct). Using the cancer prediction case, our positive class is someone who has cancer, and the negative class is someone who does not. So a True Positive is the condition where our model correctly predicts that someone has cancer.

True Negative (TN)

A True Negative happens when our prediction of the NEGATIVE class is TRUE (correct). Here the negative class is someone who has no cancer, so in simpler terms, a True Negative is the condition where our model correctly predicts that someone does not have cancer.

False Positive (FP)

This one is a little different: a False Positive is when our prediction of the POSITIVE class is incorrect (FALSE). If the positive class is someone who has cancer, a False Positive is when our model predicts that someone has cancer (POSITIVE) but is wrong; in fact, that person does not have cancer.

False Negative (FN)

In simple terms, a False Negative is when our prediction of the NEGATIVE class is incorrect (FALSE). In the cancer prediction case, our model predicts that someone has no cancer (NEGATIVE) but is wrong; in reality, that person has cancer.

So the question is, how do we know which class is which? How do we determine the positive and negative classes? Simply put, the positive class is the class we want to predict, the class we want to focus on. To make this easier to understand, here are some examples. In the cancer prediction case, we focus on people who have cancer, so that becomes the positive class; in a customer churn prediction case, we focus on churned customers, so that becomes the positive class; and in an employee promotion prediction case, we focus on promoted employees, so that becomes the positive class.
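To make these four terms concrete, here is a minimal scikit-learn sketch using made-up labels for the cancer example (1 = has cancer, the positive class; 0 = healthy). The labels and counts are purely illustrative.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and model predictions (1 = has cancer, 0 = healthy)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=4, TN=4, FP=1, FN=1
```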

Now let’s jump into our main topic, evaluation metrics. We will discuss accuracy, precision, recall, and the F1 score from a balanced dataset perspective.

Accuracy

This is the most basic evaluation score: the number of correct predictions divided by the total number of predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
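As a quick sketch, using the hypothetical counts from the confusion matrix above (TP=4, TN=4, FP=1, FN=1), accuracy computed by hand matches scikit-learn's accuracy_score:

```python
from sklearn.metrics import accuracy_score

# Counts from the illustrative confusion matrix above
tp, tn, fp, fn = 4, 4, 1, 1
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.8

# Equivalent, computed straight from the labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))  # 0.8
```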

Precision

Precision is the proportion of correct positive predictions out of all positive predictions. In the cancer prediction case, precision is the number of people correctly predicted to have cancer divided by the total number of people predicted to have cancer (whether correctly or incorrectly).

Precision = TP / (TP + FP)

When do we suggest using precision in a data science project? It actually depends on the business case, so let me give an example. If we are working on a customer churn model, we want to minimise False Positives, because we want to minimise the number of customers who are predicted to churn but in fact do not. Why minimise it? In this business case, the usual way to deal with customers predicted to churn is to retain them through discount offers, so the company spends budget to keep them. If we have a large number of false positives, the company spends more to retain customers who were never going to churn in the first place, which is a loss and an ineffective spend.
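Here is a small illustrative sketch of that churn scenario with hypothetical labels (1 = churn, 0 = stays); the two false positives, stayers flagged as churners, are exactly what drags precision down.

```python
from sklearn.metrics import precision_score

# Hypothetical churn labels: 1 = churn, 0 = stays
y_true = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]  # two false positives

# Precision = TP / (TP + FP): of everyone we flagged as churn, how many actually churn?
print(precision_score(y_true, y_pred))  # 4 / (4 + 2) ≈ 0.67
```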

Recall

Recall is the proportion of correct positive predictions out of all actual positives. In the cancer prediction case, recall is the number of people correctly predicted to have cancer divided by the total number of people who actually have cancer.

Recall = TP / (TP + FN)

To answer the question “when do we suggest using recall in a data science project?”, the answer remains the same: it depends! For example, take a customer campaign purchase prediction case, where we predict whether a customer will purchase or not. Our positive class is customers who purchase/make a transaction. Here our evaluation metric should be the recall score, because we want to minimise False Negatives. A false negative here is a customer who is predicted not to purchase but actually does. The concern is that customers predicted not to purchase get special treatment, such as discount offers or other promotions, with the goal of getting them to buy. So if we have a lot of False Negatives, we basically waste the company’s marketing budget on customers who would have purchased anyway.
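And a matching sketch for the campaign case, again with hypothetical labels (1 = customer purchases, 0 = does not); the two false negatives, buyers predicted as non-buyers, are what pull recall down.

```python
from sklearn.metrics import recall_score

# Hypothetical campaign labels: 1 = customer purchases, 0 = does not
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # two false negatives

# Recall = TP / (TP + FN): of everyone who actually purchases, how many did we catch?
print(recall_score(y_true, y_pred))  # 3 / (3 + 2) = 0.6
```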

F1 Score

The F1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). In other words, the F1 score is useful when we care about precision and recall equally, i.e. when false positives and false negatives (in a binary classification case) carry the same urgency.

Suppose a bank wants to evaluate the effectiveness of its telemarketing strategy by predicting whether customers will purchase a loan or not. In this case, the bank treats missing a customer who would purchase and chasing a customer who will not with the same urgency, so the F1 score is a good metric to use. The F1 score is also commonly used in imbalanced classification tasks.
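As a final sketch, with hypothetical loan-purchase labels (1 = purchases the loan, 0 = does not), scikit-learn's f1_score gives the same value as computing the harmonic mean of precision and recall by hand:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical loan-purchase labels: 1 = purchases the loan, 0 = does not
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)  # 0.8
r = recall_score(y_true, y_pred)     # 0.8

# F1 is the harmonic mean of precision and recall
print(f1_score(y_true, y_pred), 2 * p * r / (p + r))  # 0.8 0.8
```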

In the following articles, we will talk about evaluation metrics for imbalanced classification in supervised learning. I hope this article was helpful, see you!

Original Article : https://brian-insights.site/evaluation-metrics/
