Interpreting Performance Evaluation Metrics for Classification Problems

Özge Ersöyleyen
Published in HYPATAI · Aug 27, 2020 · 4 min read

There are several methods for evaluating model performance on classification problems. This story focuses on the idea behind these evaluation metrics and how to interpret them. Note that the descriptions assume we are dealing with a binary classification problem.

AUC is a good indicator of the performance of a classification model. It measures the model's ability to distinguish between classes. Two important calculations needed for computing AUC are TPR (true positive rate) and FPR (false positive rate). These are obtained from the confusion matrix according to the following formulas:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

where TP, FP, TN, and FN stand for true positives, false positives, true negatives, and false negatives.
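To make these formulas concrete, here is a minimal sketch (not the post's original code) of reading TPR and FPR off a confusion matrix with scikit-learn; the labels and predictions below are hypothetical:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = income >=50K (positive), 0 = income <50K (negative)
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])  # hypothetical predicted classes

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels (0, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # recall: correctly predicted positives over actual positives
fpr = fp / (fp + tn)  # erroneously predicted positives over actual negatives
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```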

Let's remember the income level classification problem explained on Hypatai here. Our goal was to correctly classify people with income ≥50K or <50K. If we take ≥50K as the positive class and <50K as the negative one, then TPR (or recall) is correctly predicted positives over actual positives. In other words, it answers: of all the people with actual income ≥50K, at what percentage do we correctly predict income ≥50K? FPR, on the other hand, answers: of all the people with actual income <50K, at what percentage do we erroneously predict income ≥50K? Note that FPR corresponds to the type-1 (false positive) error rate, while TPR moves inversely with the type-2 (false negative) error rate, since TPR = 1 − FNR.

The ROC curve plots TPR against FPR at various threshold settings, and the area under the ROC curve is called AUC. The higher the TPR/FPR ratio at each threshold, the further the ROC curve sits above the flip coin line (the 45-degree diagonal), meaning the model does a better job of distinguishing the negative and positive classes. Another intuition is that when this ratio is high, we expect both the type-1 and type-2 errors to be small.
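As a short sketch of how such a curve can be traced with scikit-learn, using synthetic labels and scores as stand-ins for real model output:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic validation labels and model scores, just to make the sketch runnable
rng = np.random.default_rng(0)
y_valid = rng.integers(0, 2, 200)
scores = np.clip(0.6 * y_valid + rng.normal(0.3, 0.25, 200), 0, 1)

# roc_curve sweeps the score threshold and returns the (FPR, TPR) pairs
fpr, tpr, thresholds = roc_curve(y_valid, scores)
auc = roc_auc_score(y_valid, scores)

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="flip coin line (AUC = 0.5)")
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend()
plt.show()
```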

Below are the AUC values for the validation and test sets for our problem. The area under the flip coin line is 0.5, while the calculated AUC values are 0.932 and 0.926 for the validation and test sets respectively. It is a good sign that our values are close to 1 and well away from 0.5: the model performed well and successfully distinguished the classes.

Remember that our model outputs a probability score, not the predicted classes themselves. Therefore, we need to determine an optimal score threshold to obtain the classes. The approach I have chosen is to select the threshold that maximizes the difference between TPR and FPR (a quantity also known as Youden's J statistic). The function named "prediction_with_optimal_auc" calculates the optimal threshold, and the rest of the code can be found on my github page here. On the ROC plot, the red dot marks the optimal point for the validation set. It is no surprise that this is the point farthest from the 45-degree line.
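The actual prediction_with_optimal_auc function lives on the linked GitHub page; a minimal sketch of the same idea, with a hypothetical function name, could look like this:

```python
import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(y_true, scores):
    """Return the threshold maximizing TPR - FPR, plus the TPR and FPR there."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    best = np.argmax(tpr - fpr)  # farthest point above the 45-degree line
    return thresholds[best], tpr[best], fpr[best]
```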

Of course, there are other approaches to determining the optimal threshold. Another option is to choose the value that maximizes the F1 score. For instance, H2O automatically constructs confusion matrices after selecting the threshold based on the maximum F1 score.
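A sketch of this alternative, again with a hypothetical function name (this is not H2O's implementation):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def f1_optimal_threshold(y_true, scores):
    """Return the threshold that maximizes the F1 score."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The last precision/recall pair has no associated threshold, so drop it;
    # the small epsilon guards against division by zero
    f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
    return thresholds[np.argmax(f1)]
```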

There will be other stories on Hypatai about modeling the income level prediction problem with the H2O framework, and we will talk about this in more detail there.

The optimal threshold is calculated as 0.2050605 from the validation set, meaning predicted scores above this value will be assigned to the positive class and those below it to the negative class. Now we are ready to construct our confusion matrix.
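As an illustration of this step, here is a sketch that applies the threshold and builds the confusion matrix; y_valid and scores are synthetic stand-ins for the real validation data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic validation labels and scores (stand-ins for the real data)
rng = np.random.default_rng(0)
y_valid = rng.integers(0, 2, 200)
scores = np.clip(0.6 * y_valid + rng.normal(0.3, 0.25, 200), 0, 1)

threshold = 0.2050605                       # optimal threshold from the validation set
y_pred = (scores > threshold).astype(int)   # 1 = income >=50K, 0 = income <50K
tn, fp, fn, tp = confusion_matrix(y_valid, y_pred).ravel()
print(f"TPR = {tp / (tp + fn):.2f}, FPR = {fp / (fp + tn):.2f}")
```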

For the validation set, the confusion matrix gives a TPR of 0.89 and an FPR of 0.19. In other words, our model correctly classified 89% of the people with actual income ≥50K. However, even though we have calculated a high AUC value, an FPR close to 20% may not be low enough for the company's purposes. In that case, they may go back to the model building phase and optimize the model or its outputs. For our case, we accept the results above and continue by constructing the confusion matrix for the test set, using the optimal threshold determined from the validation data.

The test set results are similar: TPR is calculated as 0.88 and FPR as 0.19. These ratios show that we have a strong model, and that it performs well even on data it has never seen before.

Stay tuned and don’t stop following Hypatai!
