Evaluation Metrics for Classification Models Series — Part 1:

Ana Preciado · Published in The Startup · Aug 16, 2020

The evaluation metrics for classification models series consists of multiple linked articles geared toward teaching you best practices for evaluating classification model performance.

For our practice example, we’ll be using the breast cancer dataset available through “sklearn”. We take the following steps in preparing our data:
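The original snippets are embedded in the article, so here is a minimal sketch of that preparation step. It assumes pandas and scikit-learn, and it recodes the target so that 1 = malignant (the positive class) and 0 = benign, which is the convention used in the rest of this article (sklearn's raw encoding is the reverse):

import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset as a DataFrame of features plus a target Series.
data = load_breast_cancer(as_frame=True)
X = data.data

# Recode the target so that 1 = malignant (positive class) and 0 = benign,
# matching the convention used throughout this article.
y = (data.target == 0).astype(int)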

Next, we will proceed to split our data and train a binary classification model to evaluate.
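A sketch of that step, using a logistic regression purely as a stand-in for whatever binary classifier you prefer:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hold out a test set for evaluation; stratify to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Train the binary classifier we will be evaluating.
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)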

We will create a table called evaluation_table with the actual and predicted values.
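Something along these lines (the column names are my own, not from the original):

# Actual labels from the test set next to the model's predicted labels.
evaluation_table = pd.DataFrame({
    "actual": y_test.values,
    "predicted": clf.predict(X_test),
})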

We look up the percent of malignant vs benign observations.

By looking at the percentage of observations in our sample that are actually malignant vs the percentage that are actually benign, we can see how balanced our sample truly is.
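One way to look this up, assuming the evaluation_table built above:

# Share of malignant (1) vs benign (0) observations in the evaluation sample.
print(evaluation_table["actual"].value_counts(normalize=True))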

The imbalance we are dealing with is only 65:35. The sample is slightly imbalanced, but not enough to be of much concern. In a future article, I’ll be illustrating how to deal with highly imbalanced datasets.

Accuracy:

The proportion of the observations that were predicted correctly.
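In code, accuracy can be computed directly from the evaluation table or with scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

# Accuracy = correctly predicted observations / total observations.
print((evaluation_table["actual"] == evaluation_table["predicted"]).mean())
print(accuracy_score(evaluation_table["actual"], evaluation_table["predicted"]))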

Accuracy is viable to use when the dataset has enough positive and negative observations. In this example it is a viable metric because we have 65% malignant and 35% benign observations.

Accuracy stops being a viable evaluation metric when we have an ‘imbalanced dataset’: a dataset with considerably more negative than positive observations, or considerably more positive than negative observations.

Why?

Because the model will likely predict almost all the observations to be positive (or negative, depending on the direction of the imbalance) and be correct in most cases. If this happens, the model will have a high accuracy regardless of how poorly it predicts the under-represented negative (or positive) observations.

Thresholds:

We can set our binary classifier model to predict a probability instead of an integer. We do this by using the predict_proba() method of our model instead of predict().

The result from using predict_proba() will be a 2D array with two columns: the probability of the cancer being benign and the probability of the cancer being malignant. Each row sums to 1. To check the order of the columns in the resulting array, we look up the clf.classes_ attribute.

We will be creating a column in our existing evaluation data frame with the probability of the cancer being malignant (the second column of our array).
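A sketch of those three steps, keeping the convention that 1 = malignant (the column name prob_malignant is my own):

# Probabilities for each class; the column order follows clf.classes_.
probabilities = clf.predict_proba(X_test)
print(clf.classes_)  # [0 1] -> column 0 is benign, column 1 is malignant

# Keep the probability of the positive (malignant) class in the evaluation table.
evaluation_table["prob_malignant"] = probabilities[:, 1]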

When we begin predicting with the probability of an event taking place instead of a binary prediction, we have the flexibility of choosing the probability threshold that serves as the cutoff point between 0 and 1. The ideal cutoff point will vary depending on the problem statement at hand and the business need the model is looking to address.

For example, if the benefit of a True Positive is large and the cost of a False Positive is low, there is more benefit in choosing a low cutoff point (e.g., 0.2 or 0.3) than a high cutoff point (e.g., 0.7 or 0.8).
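For instance, applying a cutoff of 0.3 to the probabilities stored above:

# Predict malignant (1) whenever the predicted probability clears the cutoff.
threshold = 0.3
evaluation_table["predicted_at_threshold"] = (
    evaluation_table["prob_malignant"] >= threshold
).astype(int)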

Nevertheless, to more easily compare the performance of two models, we want an evaluation metric that works independently from the chosen cutoff point.

ROC, AUC, and Gini:

The ROC curve plots the relationship between the True Positive Rate (Recall, also called Sensitivity) and the False Positive Rate (1 - Specificity). In other words, we are comparing the proportion of actual positive observations that were predicted correctly (TP/(TP+FN)) vs the proportion of actual negative observations that were wrongly predicted positive (FP/(TN+FP)).

Why?

Because the ideal model would outperform other models by having a higher TPR (Recall) for any given FPR (the proportion of negative values that were wrongly predicted positive), regardless of the chosen threshold.

If we choose a threshold of 0.80, a larger number of observations will be predicted negative (predicted 0, i.e., the cancer is predicted to be benign). This means that a higher proportion of the actual negative values is being predicted correctly (because at that threshold we are predicting a larger share of observations to be negative). This translates into a low FPR, because the proportion of actual negative values that are wrongly predicted (predicted to be malignant/positive/1 but actually benign/negative/0) becomes lower.

Likewise, if we choose a lower threshold (for example, 0.20), we have the opposite effect. A higher proportion of the observations are predicted positive/malignant/1, and therefore we have a higher FPR (a higher proportion of actual negative/benign/0 observations being wrongly predicted as positive/malignant/1).
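A quick sketch that makes the effect visible, reusing the probability column from above:

from sklearn.metrics import confusion_matrix

# Compare TPR and FPR at a high and a low threshold.
for threshold in (0.80, 0.20):
    predicted = (evaluation_table["prob_malignant"] >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(evaluation_table["actual"], predicted).ravel()
    print(f"threshold={threshold}: TPR={tp / (tp + fn):.2f}, FPR={fp / (fp + tn):.2f}")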

The AUC is the area under the ROC curve. The higher the area under the curve, the better the predictions are. Among the advantages of using this metric:
- It is comparable across models trained on samples of different sizes (it speaks in proportions of the sample).
- It offers a comparison across models regardless of the chosen threshold.
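To plot the ROC curve and compute the AUC from the predicted probabilities (a sketch assuming matplotlib is available):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# TPR and FPR at every possible threshold, plus the area under the curve.
fpr, tpr, thresholds = roc_curve(evaluation_table["actual"], evaluation_table["prob_malignant"])
auc = roc_auc_score(evaluation_table["actual"], evaluation_table["prob_malignant"])

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()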
