Introduction to Model Evaluation Methods in Classification Problems

Applied examples in Python: Confusion Matrix, Accuracy, Precision, Recall and F1-Score

Published in

CodeX

5 min read · Oct 28, 2021

In this story, we will look at how to measure the success of classification models, with a brief introduction to Logistic Regression.

As mentioned before, we use regression models to predict continuous variables. Although Logistic Regression has “regression” in its name, we use it to predict categorical variables, so it solves classification problems. In this story, I use Logistic Regression to produce the classification scores.

The target class is encoded as 1 in the target variable. For example, if we use the diabetes dataset to predict whether a patient has diabetes, the target variable is 1 for observations with diabetes.

Confusion Matrix

I prefer to explain this with an example. Assume we have 990 normal transactions and 10 fraud transactions, and we create a model to predict whether a transaction is fraud or not. Of the 10 real fraud transactions, the model predicts 5 correctly as fraud and 5 incorrectly as non-fraud. Of the 990 real non-fraud transactions, it predicts 900 correctly as non-fraud and 90 incorrectly as fraud.

At this point, we can create a confusion matrix. In this matrix, rows represent the actual classes and columns represent the predicted classes of the target variable.

True Positive (TP): The class predicted as 1 is actually 1. A fraud transaction is predicted as fraud, so the prediction is correct.
False Positive (FP): The class predicted as 1 is actually 0. A non-fraud transaction is predicted as fraud, so the prediction is wrong.
False Negative (FN): The class predicted as 0 is actually 1. The model predicted non-fraud for a real fraud transaction, so the prediction is wrong.
True Negative (TN): The class predicted as 0 is actually 0. The model predicted non-fraud for a non-fraud transaction, so the prediction is correct.

The terms Positive and Negative can be confusing. What do they mean for us? Positive refers to the target class (1), and Negative refers to the non-target class (0). True or False then tells us whether the prediction matched the actual class. In the matrix layout used here, the left column holds the predicted Positives and the right column holds the predicted Negatives.
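To make the four cells concrete, here is a minimal sketch in plain Python that rebuilds the fraud example as two label lists and counts each cell. The helper name `count_cells` and the list construction are my own assumptions for illustration; only the counts (10 fraud, 990 non-fraud, 5 caught, 90 false alarms) come from the example.

```python
# Rebuild the fraud example as label lists (1 = fraud, 0 = non-fraud).
# Actual classes: 10 fraud transactions followed by 990 non-fraud ones.
y_true = [1] * 10 + [0] * 990
# Predictions: 5 fraud caught, 5 fraud missed, 90 false alarms, 900 correct non-fraud.
y_pred = [1] * 5 + [0] * 5 + [1] * 90 + [0] * 900


def count_cells(y_true, y_pred):
    """Count TP, FP, FN and TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
    fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
    fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)
    tn = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = count_cells(y_true, y_pred)

# Rows are actual classes, columns are predicted classes, positives first.
matrix = [[tp, fn],
          [fp, tn]]
print(matrix)  # [[5, 5], [90, 900]]
```

The first row holds the real fraud transactions and the second row the real non-fraud ones, so each row sums to the number of actual observations in that class.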

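You may ask: which one is more critical, False Positive (FP) or False Negative (FN)? In this fraud example, I would say False Negative is more critical than False Positive. To illustrate: if we say non-fraud for a fraud transaction, someone may get away with stealing a large amount of money, while a False Positive only means a normal transaction gets flagged for review.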

Now we can move on to the metrics used to evaluate classification models.

Measurement Metrics of Classification Models

We have a bunch of metrics to measure the success of classification models.

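The formulas for these metrics, written in terms of the confusion matrix counts, are:

Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 × Precision × Recall / (Precision + Recall)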

Accuracy: What percentage of all predictions were correct, that is, 1 predicted as 1 and 0 predicted as 0? For instance, the model classified 5 of the 10 real fraud transactions as fraud and 900 of the 990 non-fraud transactions as non-fraud. We sum both and divide by the count of all observations: (5 + 900) / 1000 = 90.5%.
Precision: What percentage of the values predicted as 1 are actually 1? For instance, the model flagged 5 fraud transactions and 90 non-fraud transactions as fraud, so only 5 of its 95 fraud predictions are correct: 5 / 95 ≈ 5.3%.
Recall: What percentage of the values that are actually 1 did the model predict as 1? For instance, we have 10 fraud transactions. The model predicted 5 of them as fraud and 5 as non-fraud, so it caught 5 / 10 = 50% of the real fraud transactions.
F1-Score: The harmonic mean of precision and recall. For the fraud example it comes out to about 9.5%.
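The four metrics can be computed directly from the confusion matrix counts. Below is a minimal sketch in plain Python using the fraud example's numbers (TP = 5, FP = 90, FN = 5, TN = 900). The function name `classification_metrics` and the choice to return 0.0 when a denominator is zero are my own assumptions for illustration, not something a library prescribes.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        # Harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts from the fraud example.
metrics = classification_metrics(tp=5, fp=90, fn=5, tn=900)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.905
# precision: 0.053
# recall: 0.500
# f1: 0.095
```

Notice that accuracy looks high at 90.5% while precision and recall are poor. With only 10 fraud cases out of 1000, the many correct non-fraud predictions dominate accuracy, which is why the other metrics are worth checking on imbalanced data.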

Applied Example in Python

I won’t dive deep into what Logistic Regression is or how to use it.

I’m just going to import the libraries we will need. As you can see, we use sklearn.metrics to measure the model’s success.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

I’ll quickly skip the introductory steps and create a model directly.

df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
X = df.drop(['Outcome'], axis=1)  # features
y = df['Outcome']  # target: 1 if the observation has diabetes
robscaler = RobustScaler()
X = robscaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=6)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # to get the confusion matrix
print(classification_report(y_test, y_pred))  # to get metrics such as precision, recall etc.
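If you want to check the report against the formulas by hand, the sketch below pulls the four counts out of scikit-learn's confusion matrix and recomputes the metrics. To keep it self-contained, it assumes a synthetic dataset from `make_classification` in place of the diabetes CSV; the sample size, feature count and `max_iter` value are arbitrary choices of mine. One thing to keep in mind: scikit-learn orders the matrix by sorted label, so with labels 0 and 1 the layout is [[TN, FP], [FN, TP]], with negatives first.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption: stands in for the diabetes CSV).
X, y = make_classification(n_samples=500, n_features=8, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=6)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]], so ravel() yields this order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

# Recompute the metrics from the counts.
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

# Compare with scikit-learn's own metric functions.
print("accuracy ", manual_accuracy, accuracy_score(y_test, y_pred))
print("precision", manual_precision, precision_score(y_test, y_pred))
print("recall   ", manual_recall, recall_score(y_test, y_pred))
print("f1       ", manual_f1, f1_score(y_test, y_pred))
```

Each printed pair should agree, which confirms that the numbers in classification_report for class 1 are the same precision, recall and F1 defined earlier.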

Finally

I know this may have moved a bit too fast, but I didn’t want to make things confusing. I wanted to focus on the main topic, the evaluation metrics themselves, rather than the Python code. Hopefully, you picked up some useful tidbits along the way.

You can get the Google Spreadsheet I created for you below. You can also use it to play with the confusion matrix.