DATA SCIENCE THEORY | MODEL EVALUATION | KNIME ANALYTICS PLATFORM

Confusion Matrix and Class Statistics

Spinning like a hamster in a wheel through the Data Science Life Cycle? Not sure when to stop training your model?

Maarit Widmann
Low Code for Data Science

--

Model evaluation — or model scoring — is an important part of a data science project: it is the part that quantifies how good your model is, how much it has improved over the previous version, how much better it is than your colleague’s model, and how much room for improvement there still is.

In this article we talk about the confusion matrix — a compact representation of model performance and the source of many scoring metrics for classification models.

A classification model predicts two or more known classes, for example, customer churn/no churn, spam/normal email, red/white wine, or malignant/benign tumor. The confusion matrix shows the distribution of actual and predicted classes, and it is the starting point for evaluating a classification model of any nature. The classes can be equally important, for example, when classifying wines as red or white. Or, predicting one class correctly can be more important, for example, when trying to find malignant tumors. In the latter case we are often dealing with imbalanced data and trying to predict the minority class correctly.

Email classification: spam vs. useful

Let’s take a look at an example classification problem where the goal is to classify incoming emails into two classes: spam and useful (or “normal”). For that, we use the Spambase Data Set provided by the UCI Machine Learning Repository. This dataset contains 4601 emails described through 57 features, such as text length and the presence of specific words like “buy”, “subscribe”, and “win”. The “Spam” column provides two possible labels for the emails: “spam” (1) and “normal” (0).
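For readers who want to follow along outside of KNIME, here is a minimal Python sketch of the data loading step. The pandas usage and the download URL are assumptions for illustration, not part of the workflow described below.

```python
import pandas as pd

# Assumed UCI download location; the data can also be read from a local copy.
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"

# The file has no header row; the last of the 58 columns is the target.
data = pd.read_csv(URL, header=None)
data = data.rename(columns={data.columns[-1]: "Spam"})  # 1 = spam, 0 = normal

print(data.shape)                   # expected: (4601, 58)
print(data["Spam"].value_counts())  # class distribution
```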

Figure 1 shows a workflow that covers the steps to build a classification model: reading and preprocessing the data, partitioning it into a training set and a test set, training the model, applying the model to make predictions, and evaluating the prediction results.

Download the workflow Evaluating Classification Model Performance from the KNIME Hub.

Figure 1: A KNIME workflow building, applying and evaluating a supervised classification model that classifies incoming emails as spam or useful. Download the workflow for free from the KNIME Hub.
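As a rough, non-KNIME counterpart of these steps, the sketch below partitions the data, trains a classifier, and makes predictions with scikit-learn. The choice of model and split ratio are placeholder assumptions, not the settings of the workflow in Figure 1; the sketch continues from the loading snippet above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# "data" comes from the loading sketch above.
X = data.drop(columns="Spam")
y = data["Spam"]

# Partition into a training set and a test set (80/20 is an arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Train the model and make predictions for the test set.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
```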

The last step, model scoring, is based on comparing the actual and predicted target column values in the test set. The whole scoring process boils down to a match count: how many data rows the model has classified correctly and how many it has classified incorrectly. These counts are summarized in the confusion matrix.

By looking at the confusion matrix we can then see the performance of the model in absolute numbers, for example, how many of the actual spam emails were predicted as spam. On the other hand, a number of class statistics and overall accuracy statistics, which show the performance of the model as relative measures, are calculated based on the numbers in the confusion matrix. The confusion matrix and class statistics are displayed in the interactive view of the Scorer (JavaScript) node as shown in Figure 2.

Figure 2: Confusion matrix and class statistics in the interactive view of the Scorer (JavaScript) node.

Confusion matrix

Let’s now see what the numbers in a confusion matrix are.

The confusion matrix was initially introduced to evaluate the results of a binomial classification. Thus, the first thing to do is to take one of the two classes as the class of interest, i.e. the positive class: in the target column, we choose (arbitrarily) one value as the positive class, and the other value is then automatically considered the negative class. Keep in mind that the class statistics will show different values if we change the positive class. Here we chose the spam emails as the positive class and the normal emails as the negative class.

The confusion matrix in Figure 3 reports the count of:

  • Spam emails classified correctly as spam (the positive class). These are called True Positives (TP). The number of true positives is placed in the top left cell of the confusion matrix.
  • Spam emails classified incorrectly as normal (the negative class). These are called False Negatives (FN). The number of false negatives is placed in the top right cell of the confusion matrix.
  • Normal emails classified incorrectly as spam. These are called False Positives (FP). The number of false positives is placed in the lower left cell of the confusion matrix.
  • Normal emails classified correctly as normal. These are called True Negatives (TN). The number of true negatives is placed in the lower right cell of the confusion matrix.

Therefore, the correct predictions are on the diagonal with a gray background; the incorrect predictions are on the off diagonal with a white background:

Figure 3: A confusion matrix showing actual and predicted positive and negative classes in the test set.
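In the Python sketch, the same four counts can be extracted with scikit-learn. Fixing the label order to [1, 0] arranges the matrix exactly as in Figure 3, with the positive class in the first row and column.

```python
from sklearn.metrics import confusion_matrix

# "y_test" and "y_pred" come from the sketch above.
# labels=[1, 0] puts the positive class (spam = 1) in the first row/column.
cm = confusion_matrix(y_test, y_pred, labels=[1, 0])

tp, fn = cm[0]  # actual spam:   predicted spam / predicted normal
fp, tn = cm[1]  # actual normal: predicted spam / predicted normal
print(cm)
```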

Measures for Class Statistics

Now, using the four counts in the confusion matrix, we can calculate a few class statistics to quantify the model performance.

The class statistics, as the name implies, summarize the model’s performance for the positive and negative classes separately. This is the reason why their values and interpretation change with a different definition of the positive class and why they are often expressed in pairs: sensitivity & specificity and recall & precision. These pairs of statistics provide a more comprehensive view of the model’s performance.

Notice that both pairs of statistics are characterized by an inverse relationship: improving one often comes at the cost of reducing the other. For example, if we use a stricter spam filter, we’ll reduce the number of spam emails in the inbox, but increase the number of normal emails that have to be retrieved from the spam folder afterwards.

Sensitivity and Specificity

Figure 4: Sensitivity and specificity values and their formulas, which are based on the values in the confusion matrix.

Sensitivity measures the model’s prediction performance for the positive class. So, given that spam emails are the positive class, sensitivity quantifies what proportion of the actual spam emails is correctly predicted as spam.

We divide the number of true positives by the number of all positive events in the test set: the positive class events predicted correctly (TP) and the positive class events predicted incorrectly (FN). The model in this example reaches a sensitivity value of 0.882. This means that about 88 % of the spam emails in the test set were correctly predicted as spam.

Specificity measures the model’s prediction performance for the negative class, i.e. what proportion of the actual normal emails is correctly predicted as normal.

We divide the number of true negatives by the number of all negative events in the test set: the negative class events predicted incorrectly (FP) and the negative class events predicted correctly (TN). The model reaches a specificity value of 0.964, so less than 4 % of all normal emails are predicted incorrectly as spam.
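Written out, the two formulas shown in Figure 4 are:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad \text{Specificity} = \frac{TN}{TN + FP}$$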

Recall, Precision and F-Measure

Figure 5: Recall and precision values and their formulas, which are based on the values shown in the confusion matrix.

Similarly to sensitivity, recall measures the prediction performance for the positive class. Therefore, the formula and interpretation for recall are the same as for sensitivity.

Precision measures how reliable the model’s positive predictions are: that is, what proportion of the emails predicted as spam are actually spam.

We divide the number of true positives by the number of all events assigned to the positive class, i.e. the sum of true positives and false positives. The precision value for the model is 0.941. Therefore, about 94 % of the emails predicted as spam were actually spam emails.
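Written out, the two formulas shown in Figure 5 are:

$$\text{Recall} = \frac{TP}{TP + FN} \qquad \text{Precision} = \frac{TP}{TP + FP}$$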

Recall and precision can also be combined into a single measure. One example is the F-measure, which is the harmonic mean of recall and precision:
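$$F = 2 \cdot \frac{\text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$

Continuing the Python sketch, all of these class statistics follow directly from the four counts extracted from the confusion matrix:

```python
# "tp", "fn", "fp", "tn" come from the confusion matrix sketch above.
sensitivity = recall = tp / (tp + fn)  # performance on the positive class
specificity = tn / (tn + fp)           # performance on the negative class
precision = tp / (tp + fp)             # reliability of positive predictions
f_measure = 2 * recall * precision / (recall + precision)

print(f"Sensitivity/Recall: {sensitivity:.3f}")
print(f"Specificity:        {specificity:.3f}")
print(f"Precision:          {precision:.3f}")
print(f"F-measure:          {f_measure:.3f}")
```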

Multinomial Classification Model

In case of a multinomial classification model, the target column has three or more values. The emails could be labeled as “spam”, “ad”, and “normal”, for example.

Similarly to a binomial classification model, the target class values are assigned to the positive and the negative class. The difference is that several classes now share the same label: here we define spam as the positive class, and the normal and ad emails together as the negative class. Now, the confusion matrix looks as shown in Figure 6.

Figure 6: Confusion matrix showing the distribution of predictions to true positives, false negatives, false positives, and true negatives for a multinomial classification model (3 classes).

To calculate the class statistics, we have to re-define the true positives, false negatives, false positives, and true negatives using the values in the multinomial confusion matrix:

  • The cell identified by the row and column for the positive class contains the True Positives, i.e. where the actual and predicted class is spam.
  • Cells identified by the row for the positive class and columns for the negative class contain the False Negatives, where the actual class is spam, and the predicted class is normal or ad.
  • Cells identified by rows for the negative class and the column for the positive class contain the False Positives, where the actual class is normal or ad, and the predicted class is spam.
  • Cells outside the row and column for the positive class contain the True Negatives, where the actual class is ad or normal, and the predicted class is ad or normal. An incorrect prediction within the negative classes (for example, an ad email predicted as normal) is still considered a true negative.

Now, these four statistics can be used to calculate class statistics using the formulas introduced in the previous section.
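As an illustration, here is a small self-contained sketch of this re-definition for a three-class problem. The label arrays are made-up toy data, not the Spambase predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy actual and predicted labels for a 3-class problem (hypothetical data).
y_true3 = ["spam", "ad", "normal", "spam", "normal", "ad", "spam", "normal"]
y_pred3 = ["spam", "normal", "normal", "ad", "normal", "ad", "spam", "spam"]

labels = ["spam", "ad", "normal"]  # "spam" is the positive class
cm3 = confusion_matrix(y_true3, y_pred3, labels=labels)

pos = labels.index("spam")
tp = cm3[pos, pos]             # actual spam, predicted spam
fn = cm3[pos, :].sum() - tp    # actual spam, predicted ad or normal
fp = cm3[:, pos].sum() - tp    # actual ad or normal, predicted spam
tn = cm3.sum() - tp - fn - fp  # all remaining cells, incl. ad/normal mix-ups

print(tp, fn, fp, tn)  # with the toy data above: 2 1 1 4
```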

Summary

In this article, we’ve shown how to evaluate a classification model with the confusion matrix and class statistics. The confusion matrix lays the foundation for the evaluation of a classification model by showing the counts of correct and incorrect predictions for each target class. The class statistics, such as sensitivity and specificity, recall and precision, and the F-measure, are calculated from these counts.

The confusion matrix and class statistics were originally defined for binomial classification problems. However, we have shown how they can easily be extended to address multinomial classification problems.

--

As previously published on the KNIME Blog: https://www.knime.com/blog/from-modeling-to-scoring-confusion-matrix-and-class-statistics


Maarit Widmann
Low Code for Data Science

I am a data scientist in the evangelism team at KNIME, the author behind the KNIME self-paced courses, and a teacher at KNIME.