Evaluating classification algorithms

Mélanie Fouesnard
10 min read · Oct 8, 2023


During the previous week, we saw how to perform binary classification to classify customers as churning or not churning.

Now, we want to know whether the model performs well and is reliable: how do we do that?

Good news: specific metrics exist to evaluate our model's performance!

We will review some of them here in a simple way:

  • The confusion matrix
  • Precision and recall
  • The ROC and ROC-AUC
  • Cross-validation

If you want more information about these metrics and about other classification metrics, there are many resources on the internet, like this blog article.

Here, we will see what is covered on this topic during week 4 of the Machine Learning Zoomcamp, by Alexey Grigorev from DataTalks.Club.

All related content is freely available on GitHub (course, notebooks and even homework with corrections), and the videos are available on YouTube. Do not hesitate to watch them and refer to the notebooks!

The confusion matrix

Remember: we trained our classification model on the training dataset. We can evaluate it on the validation dataset.

We have:

  • True validation target values: these are the target values we know already
  • Predicted validation target values: these values are obtained after running the trained model on the validation features

By comparing them, we can get an idea of how well our model predicts the two classes (in our example, whether the customer will churn (class = 1) or not (class = 0)).
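
As a concrete (hypothetical) setup, here is a minimal sketch of how these two arrays might be produced with scikit-learn; X_train, X_val and y_train are assumed to come from last week's preparation, and LogisticRegression simply stands in for whatever model was trained:

from sklearn.linear_model import LogisticRegression

# Assumed setup: X_train/X_val are the feature matrices, y_train/y_val the known targets
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predicted validation values: the probability of churning (class 1) for each customer
y_pred = model.predict_proba(X_val)[:, 1]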

OK, but how do we compare them?

In the following animation, inspired by the diagrams drawn in this YouTube video, I represent the true and predicted y_val values, each separated into the 0 and 1 classes.

Intuitive representation of the difference between true validation target values and predicted validation target values. By superimposing the two spaces representing these two sets of target values, we can define four different regions: true positives (1 predicted as 1), true negatives (0 predicted as 0), false positives (0 predicted as 1) and false negatives (1 predicted as 0).

As you can see, there is a difference between the true and predicted y_val: some customers are predicted as 0 but really are 1, while others are predicted as 1 but really are 0. This is represented by the different placement of the separation line between 0 and 1.

We can superimpose them and clearly see where the values were correctly predicted (either 0 predicted as 0 or 1 predicted as 1) and where they were not (1 predicted as 0 or 0 predicted as 1). There are 4 different cases:

  • True positives: true 1 predicted as 1 by the trained model
  • True negatives: true 0 predicted as 0 by the trained model
  • False positives: true 0 predicted as 1 by the trained model
  • False negatives: true 1 predicted as 0 by the trained model

In a real setting, we have a count of correct and incorrect predictions for each of the 4 cases. We can put these numbers in a table: this is the confusion matrix.

Confusion matrix. Each row corresponds to an actual class (positive in the first row, negative in the second) and each column to a predicted class, so the four cells give the numbers of true positives, false negatives, false positives and true negatives. Source of this illustration: https://www.evidentlyai.com/classification-metrics/confusion-matrix

How is it useful in our case?

We can easily see how well our model performs and whether it makes the right predictions. If the false negative and false positive counts are high, the model does not perform well on this classification task.

In the specific case of class imbalance, we can clearly see the impact:

  • Accuracy will be high, since the majority class is so frequent that predicting it is almost always correct
  • However, the confusion matrix will show that the number of wrong predictions is high for the minority class (see the sketch below)
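
To make this concrete, here is a minimal sketch with made-up numbers (not the Zoomcamp data) of a "lazy" classifier on an imbalanced dataset:

import numpy as np

# Hypothetical, heavily imbalanced validation set: 95 customers do not churn, 5 do
y_val_toy = np.array([0] * 95 + [1] * 5)

# A "lazy" model that always predicts the majority class (no churn)
y_pred_toy = np.zeros_like(y_val_toy)

# Accuracy looks great...
print((y_val_toy == y_pred_toy).mean())  # 0.95

# ...but the confusion matrix shows that the minority class is never caught
tp = ((y_pred_toy == 1) & (y_val_toy == 1)).sum()
tn = ((y_pred_toy == 0) & (y_val_toy == 0)).sum()
fp = ((y_pred_toy == 1) & (y_val_toy == 0)).sum()
fn = ((y_pred_toy == 0) & (y_val_toy == 1)).sum()
print(np.array([[tp, fn], [fp, tn]]))  # [[0, 5], [0, 95]]: all 5 churners are missed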

Several classification metrics can be calculated from the confusion matrix. We will look at precision and recall below.

So, OK, now we have the theory. How do we implement it in Python? We can do it using NumPy!

I assume that we already have y_val and y_pred, as well as a threshold value to determine which class a customer belongs to based on the predicted probability.

import numpy as np

# y_val is our validation target array (true values)
y_val

# y_pred contains the predicted probabilities for the validation dataset
y_pred

# Boolean masks of the actual positive and actual negative customers in the validation dataset
true_positive = (y_val == 1)
true_negative = (y_val == 0)

# Boolean masks of the predicted positive and negative customers, using a specified threshold
threshold = 0.5
predict_positive = (y_pred >= threshold)
predict_negative = (y_pred < threshold)

# We deduce the four counts needed to build the confusion matrix
tp = (predict_positive & true_positive).sum()  # True positives
tn = (predict_negative & true_negative).sum()  # True negatives
fp = (predict_positive & true_negative).sum()  # False positives
fn = (predict_negative & true_positive).sum()  # False negatives

# We can build the confusion matrix with the number of customers directly:
confusion_matrix = np.array([[tp, fn],
                             [fp, tn]])

# Or build a confusion matrix using customer proportions:
(confusion_matrix / confusion_matrix.sum()).round(2)

Example outputs:

The obtained confusion matrix.
The corresponding proportions from the above confusion matrix.

NB: you can easily plot the confusion matrix using sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html. This is easier to read than raw NumPy arrays.
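
For instance, a minimal sketch (assuming the y_val and y_pred arrays from above, and a recent scikit-learn version that provides ConfusionMatrixDisplay.from_predictions) could be:

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# ConfusionMatrixDisplay expects hard labels, so we apply our threshold first
churn_decision = (y_pred >= 0.5)

# Build and draw the confusion matrix directly from the true and predicted labels
ConfusionMatrixDisplay.from_predictions(y_val, churn_decision)
plt.show()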

Precision and recall

As said above, we can calculate some useful metrics from the values of the confusion matrix. For example, we can calculate precision and recall.

  • Precision = fraction of positive predictions that are correct
  • Recall = fraction of correctly identified positive examples

Precision and recall give us valuable information about our model.

Intuitively, we can represent the calculation of precision and recall like this:

Red customers are customers that really churned and black customers are customers that did not churn (true y values). The customers are ordered by their predicted value (orange triangles): here, a threshold of 0.7 indicates where the model separates the two classes (0 or 1). Customers to the right of the threshold are predicted as churning and customers to the left are predicted as not churning. The precision corresponds to the fraction of positive predictions that are correct: here, 2 out of 3 positive predictions are correct. The recall corresponds to the fraction of positive examples that are correctly identified: here, two positive customers are identified as positive, whereas two other positive customers are identified as negative.

The positive customers that were predicted as positive constitute the true positives (in the illustration, the red customers on the right of the threshold).

The negative customers that were predicted as positive constitute the false positives (in the illustration, the black customer on the right of the threshold).

The positive customers that were predicted as negative constitute the false negatives (in the illustration, the two red customers on the left of the threshold).

Using the values from the confusion matrix, we can thus calculate precision and recall like this:

Precision = TP / (TP + FP) and Recall = TP / (TP + FN). Source: Wikipedia

Then, we can translate this simply into Python:

# Deduce the precision and recall values
precision = tp/(tp+fp)
recall = tp/(tp+fn)
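
If you prefer not to compute them by hand, scikit-learn also provides precision_score and recall_score; here is a small sketch assuming the same y_val, y_pred and a 0.5 threshold:

from sklearn.metrics import precision_score, recall_score

# Both functions expect hard labels, so we apply the threshold first
churn_decision = (y_pred >= 0.5)

precision = precision_score(y_val, churn_decision)
recall = recall_score(y_val, churn_decision)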

ROC and ROC AUC

We saw that we can calculate the number of true positives, true negatives, false positives and false negatives for one specific threshold.

We can do it for every possible threshold between 0 and 1.

This is used to see how the true positive rate and the false positive rate evolve with the threshold. Indeed, what we want is to minimize the false positive rate and maximize the true positive rate to get the best possible model (the fewest errors).

  • True positive rate: also called sensitivity, corresponds to the proportion of positive examples that are classified as positive by the model. It also corresponds to the recall!
  • False positive rate: equal to 1 minus the specificity, corresponds to the proportion of negative examples that are classified as positive by the model

We can calculate them directly:

# True positive rate
tpr = tp/(tp+fn)

# False positive rate
fpr = fp/(tn+fp)
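
To see how these two rates evolve with the threshold, here is a minimal NumPy sketch (assuming the same y_val and y_pred) that sweeps candidate thresholds:

import numpy as np

# Compute (threshold, tpr, fpr) for thresholds between 0 and 1
rates = []
for t in np.linspace(0, 1, 101):
    tp = ((y_pred >= t) & (y_val == 1)).sum()
    fp = ((y_pred >= t) & (y_val == 0)).sum()
    fn = ((y_pred < t) & (y_val == 1)).sum()
    tn = ((y_pred < t) & (y_val == 0)).sum()
    rates.append((t, tp / (tp + fn), fp / (fp + tn)))

# Each tuple contains (threshold, tpr, fpr): plotting tpr against fpr gives the ROC curve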

By plotting the true positive rate as a function of the false positive rate, we obtain the Receiver Operating Characteristic (ROC) curve. Here is an illustration from this article that clearly explains how to interpret the obtained ROC curve:

ROC curve interpretation. A random classifier has a true positive rate equal to its false positive rate. A perfect classifier has a false positive rate equal to 0 and a true positive rate equal to 1 (we usually represent it as a point). A model is better than a random classifier when its curve is above the random one: this means that the true positive rate is higher than the false positive rate. The higher the curve, the better the result! Source of the illustration: https://medium.com/@ilyurek/roc-curve-and-auc-evaluating-model-performance-c2178008b02

In Python, we can easily plot this curve using sklearn!

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt


# The function outputs the fpr, tpr and the corresponding thresholds
fpr, tpr, thresholds = roc_curve(y_val, y_pred)

# Corresponding plot
plt.figure(figsize=(5, 5))

plt.plot(fpr, tpr, label='Model')
plt.plot([0, 1], [0, 1], label='Random', linestyle='--')

plt.xlabel('FPR')
plt.ylabel('TPR')

plt.legend()

Example output:

Example output using the values from the Machine Learning Zoomcamp course. We can see that the model performs rather well, but it could be better: there is still a gap between the ideal model and the actual model!

Another way to interpret and compare the ROC curve is to determine the corresponding Area Under the Curve (AUC).

  • For a perfect model: the ROC-AUC is equal to 1 (the curve completely fills the plot!).
  • For a random model: the ROC-AUC is equal to 0.5 (the corresponding ROC curve splits the plot in half!).

Concretely, the ROC-AUC tells us the probability that a randomly selected positive example has a score higher than a randomly selected negative example.

In python, the sklearn library provides useful tools:

from sklearn.metrics import auc

# We can directly calculate the corresponding auc using tpr and fpr
auc(fpr, tpr)

Example output: 0.843850505725819.

This means that a randomly selected positive example has a probability of about 0.84 of getting a higher score than a randomly selected negative example. This probability is quite high, which means that our model classifies the customers fairly well.
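
We can even check this interpretation empirically with a small sketch (assuming the same y_val and y_pred): we repeatedly draw one positive and one negative example at random, and the fraction of pairs where the positive example gets the higher score should be close to the ROC-AUC.

import numpy as np

np.random.seed(1)

# Predicted scores of the actual positive and actual negative examples
pos = y_pred[y_val == 1]
neg = y_pred[y_val == 0]

# Draw 10,000 random (positive, negative) pairs and compare their scores
n = 10000
pos_sample = np.random.choice(pos, size=n, replace=True)
neg_sample = np.random.choice(neg, size=n, replace=True)

# Fraction of pairs where the positive example scores higher: an estimate of the ROC-AUC
(pos_sample > neg_sample).mean()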

One function allows us to calculate the ROC-AUC directly, without having to compute the FPR and TPR first:

from sklearn.metrics import roc_auc_score


# Calculate the ROC AUC directly from the y_val and y_pred values
roc_auc_score(y_val, y_pred)

Example output: the same as above, 0.843850505725819.

Here, we saw how to easily calculate and plot the ROC curve and the ROC-AUC. If you want to know how to do it using NumPy, check this notebook!

Cross validation

Cross-validation is used to see how stable the model's predictions are. Indeed, we saw how to calculate the ROC AUC (for example) for one case: one training dataset and one validation dataset. However, this may not be representative of all our data: with a different split between training and validation data, we might get a different ROC AUC.

To answer that question, we compute the ROC AUC with several training and validation datasets. Here is an illustration showing how the data is split across 3 different iterations, here with train and test datasets:

The data are composed of, for example, 12 customers. They constitute the complete dataset. These customers are shuffled, then split according to the chosen train/test proportion. A model is trained on each split. This operation is repeated k times, here with k=3 (k is the number of folds). Source: Wikipedia.

With the 3 trained models, we obtain 3 different ROC AUC values, along with their standard deviation, which indicates how stable our model is.

We can also use cross-validation to test different parameter values and see which one gives the best metric, with more confidence than testing on a single fold.

In Python, we can use (again!) a sklearn function:

from sklearn.model_selection import KFold
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# We can also import tqdm: this library displays a progress bar (with elapsed time) while a loop runs
from tqdm.auto import tqdm

# Definition of our training function, with the regularization parameter C
def train(df_train, y_train, C=1.0):
    dicts = df_train[categorical + numerical].to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)

    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)

    return dv, model

# Initialize dv and our model with C=0.001
dv, model = train(df_train, y_train, C=0.001)

# Definition of our predict function that takes as input our data, dv and model
def predict(df, dv, model):
    dicts = df[categorical + numerical].to_dict(orient='records')

    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]

    return y_pred

# Set the number of splits (k, the number of folds)
n_splits = 5

# Here we iterate over a set of regularization parameters C
for C in tqdm([0.001, 0.01, 0.1, 0.5, 1, 5, 10]):

    # Here we determine how to split our dataset according to the number of splits
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)

    # The list that will contain the scores of the different folds (for this value of C)
    scores = []

    # For each fold, we get the ROC AUC value on the validation dataset
    for train_idx, val_idx in kfold.split(df_full_train):
        df_train = df_full_train.iloc[train_idx]
        df_val = df_full_train.iloc[val_idx]

        y_train = df_train.churn.values
        y_val = df_val.churn.values

        dv, model = train(df_train, y_train, C=C)
        y_pred = predict(df_val, dv, model)

        # The ROC AUC is determined and added to the scores list
        auc = roc_auc_score(y_val, y_pred)
        scores.append(auc)

    # For the selected value of C, the mean of the scores and the standard deviation are printed
    print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))

Output example:

For each value of C, the corresponding mean ROC AUC and the associated standard deviation. We can see that the best ROC AUC is here 0.841 ± 0.008, obtained for C=0.1, C=1, C=5 and C=10. We can therefore keep the default value of C, which is 1, since it does not change the performance of the model.

Now that we have determined the best value of C, we can train our final model with C=1! As Josh Starmer would say, hooray!

# We train our model with C=1.0
dv, model = train(df_full_train, df_full_train.churn.values, C=1.0)
y_pred = predict(df_test, dv, model)

# We compute the corresponding ROC AUC
auc = roc_auc_score(y_test, y_pred)
auc

Example output: 0.8572386167896259.

Our journey through evaluating a binary classifier with the ML Zoomcamp ends here! I hope you enjoyed it and found some useful information. If you want to go deeper, you can check the GitHub repository and watch the corresponding videos on YouTube.

PS: did you like this article? Check out last week's article, where we built a binary classification model!
