Evaluation Metrics for Classification Problems with Implementation in Python

Venu Gopal Kadamba
Published in Analytics Vidhya
6 min read · Feb 7, 2021

This article covers the most commonly used evaluation metrics for classification problems, explains which metric to use depending on the data and the use case, and demonstrates each one on the breast cancer dataset.

Classification is a supervised learning technique that involves predicting the class label for given input data. In a classification problem, we understand the problem, explore the data, preprocess it, and then build a classification model using a machine learning algorithm or a deep learning technique. In either case, it is always best practice to evaluate the model: by doing so we can measure its quality and see how well it performs for our use case. The metrics covered in this article are:

  • Accuracy
  • Confusion Matrix
  • Precision
  • Recall
  • F1 Score
  • AUC-ROC Curve

Let us consider the breast cancer dataset and understand each metric using a classifier built for breast cancer prediction.
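For the code examples that follow, let us assume a setup along these lines: scikit-learn's built-in breast cancer dataset, an 80/20 train/test split, and a logistic regression pipeline. The model choice, split ratio, and random seed here are illustrative assumptions, so your numbers may differ slightly from the outputs shown below.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the built-in breast cancer dataset (binary target: 0 = malignant, 1 = benign)
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set for evaluation (split ratio and seed are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train a simple classifier: feature scaling followed by logistic regression
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Predicted labels and positive-class probabilities for the test set
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]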

Accuracy

The accuracy of a classifier is the ratio of the number of correctly predicted samples to the total number of samples.

Accuracy = Number of Correct Predictions / Total Number of Predictions

The accuracy metric can be used to evaluate a classifier when the dataset is balanced; it should not be used when the dataset is imbalanced. Consider a dataset with two target classes containing 100 samples, of which 95 belong to class 1 and 5 belong to class 2. A classifier built on this dataset will be biased towards class 1 and may end up predicting every sample as class 1. This still yields an accuracy of 95%, which is misleading. To avoid this mistake, the accuracy metric should only be used on balanced datasets.
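To see why this is a problem, here is a quick sketch with made-up labels matching the example above; a classifier that blindly predicts class 1 still scores 95% accuracy:

import numpy as np
from sklearn.metrics import accuracy_score

# Toy imbalanced labels: 95 samples of class 1 and 5 samples of class 2
y_true = np.array([1] * 95 + [2] * 5)

# A degenerate "classifier" that predicts class 1 for every sample
y_pred_majority = np.ones_like(y_true)

# Prints 0.95 even though every class 2 sample is misclassified
print(accuracy_score(y_true, y_pred_majority))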

Now let us look into the code to get the accuracy of a classifier:
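A minimal sketch using scikit-learn's accuracy_score and the setup shown earlier (the exact value depends on the model and split used):

from sklearn.metrics import accuracy_score

# Fraction of test samples whose predicted label matches the true label
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the classifier is:", accuracy)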

Output:
Accuracy of the classifier is: 0.9473684210526315

Confusion Matrix

A confusion matrix is an N × N square matrix, where N is the total number of target classes or categories. A confusion matrix can be used to evaluate a classifier even when the dataset is imbalanced. Let us consider a binary classification problem, i.e. the number of target classes is 2. A typical confusion matrix with two target classes (say “Yes” and “No”) looks like:

               Predicted: No     Predicted: Yes
Actual: No     True Negative     False Positive
Actual: Yes    False Negative    True Positive

There are four important terms in a confusion matrix:

  1. True Positives (TP): cases where the model predicted “Yes” and the actual class was “Yes”.
  2. True Negatives (TN): cases where the model predicted “No” and the actual class was “No”.
  3. False Positives (FP): cases where the model predicted “Yes” but the actual class was “No”.
  4. False Negatives (FN): cases where the model predicted “No” but the actual class was “Yes”.

Now let us look into the code to generate and plot the confusion matrix for our breast cancer classifier.
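A minimal sketch using scikit-learn's confusion_matrix and ConfusionMatrixDisplay (the plotting style is an assumption; a seaborn heatmap works just as well):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Plot the matrix as a heatmap
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.show()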

Output:
[[39 2]
[ 4 69]]
Confusion Matrix for Breast Cancer Classifier

From the above confusion matrix:

  • True Positives (TP): 69
  • False positives (FP): 2
  • True Negatives (TN): 39
  • False Negatives (FN): 4

The accuracy of the classifier can also be calculated from the confusion matrix using the formula below:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

The accuracy of our classifier is: (69+39) / (69+39+2+4) = 0.947 = 94.7%
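The same value can be recovered directly from the confusion matrix computed above:

# For a binary problem, ravel() returns TN, FP, FN, TP in scikit-learn's convention
tn, fp, fn, tp = cm.ravel()
print("Accuracy from confusion matrix:", (tp + tn) / (tp + fp + tn + fn))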

Precision (or Positive Predictive Value)

Precision is the ratio of true positives (TP) to the sum of true positives (TP) and false positives (FP).

Precision = TP / (TP + FP)

Consider a dataset with two target classes (say positive and negative). Precision tells us, out of all the samples predicted as positive, how many were actually positive. Whether precision is the right metric depends on the use case. Take spam detection, where spam is the positive class: if our model flags a mail as spam when it was not actually spam, the user might miss an important mail, so false positives should be reduced. In this use case, precision is the metric to use to measure the quality of our classifier.

Now let us look into the code to calculate the precision score for our breast cancer classifier:
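A minimal sketch using scikit-learn's precision_score (class 1, i.e. benign, is treated as the positive class by default):

from sklearn.metrics import precision_score

# TP / (TP + FP)
precision = precision_score(y_test, y_pred)
print("Precision Score of the classifier is:", precision)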

Output:
Precision Score of the classifier is: 0.971830985915493

Recall (or Sensitivity or True Positive Rate)

Recall is the ratio of true positives (TP) to the sum of true positives (TP) and false negatives (FN).

Recall = TP / (TP + FN)

Consider a dataset with two target classes (say positive and negative). Recall tells us, out of all the samples that are actually positive, how many our classifier predicted as positive. Similar to precision, whether recall is the right metric depends on the use case. Take cancer prediction: if a person who actually has cancer is predicted as a non-cancer patient by our classifier, it can lead to the person going untreated, so false negatives should be reduced. In this case, recall is the metric to use to measure the quality of our classifier.

Now let us look into the code to calculate the recall score for our breast cancer classifier:
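A minimal sketch using scikit-learn's recall_score:

from sklearn.metrics import recall_score

# TP / (TP + FN)
recall = recall_score(y_test, y_pred)
print("Recall Score of the classifier is:", recall)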

Output:
Recall Score of the classifier is: 0.9452054794520548

F1 Score

The F1 score should be used when both precision and recall are important for the use case. The F1 score is the harmonic mean of precision and recall, and its value lies in the range [0, 1].

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

The F1 score is derived from the Fβ score, which is the weighted harmonic mean of precision and recall (a short fbeta_score sketch follows the list below).

Fβ Score = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)

  • If both False Positives (FP) and False Negatives (FN) are important then β = 1.
  • If False Positive (FP) is important then β lies between 0 and 1.
  • If False Negative (FN) is important then β > 1.
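For the general case, scikit-learn provides fbeta_score; a brief sketch with β = 2, an illustrative value that weights recall more heavily:

from sklearn.metrics import fbeta_score

# beta > 1 weights recall higher; beta < 1 weights precision higher
f2 = fbeta_score(y_test, y_pred, beta=2)
print("F2 Score of the classifier is:", f2)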

Now let us look into the code to calculate the f1 score for our breast cancer classifier:
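A minimal sketch using scikit-learn's f1_score:

from sklearn.metrics import f1_score

# Harmonic mean of precision and recall
f1 = f1_score(y_test, y_pred)
print("F1 Score of the classifier is:", f1)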

Output:
F1 Score of the classifier is: 0.9583333333333334

AUC-ROC Curve

The AUC-ROC curve is a performance metric used to measure how well a classification model performs at different threshold values. ROC stands for Receiver Operating Characteristic curve and AUC for Area Under the Curve. The higher the AUC value, the better our classifier is at separating the classes. AUC-ROC is mostly used in binary classification problems.

The ROC curve is plotted with the True Positive Rate (TPR) on the y-axis and the False Positive Rate (FPR) on the x-axis. AUC is the area under the ROC curve. An excellent classifier has an AUC value near 1, whereas a poorly performing classifier has an AUC value near 0. A classifier with an AUC score of 0.5 has no class separation capacity at all.

True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)

Now let us look into the code to calculate and plot ROC AUC for our breast cancer classifier:
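A minimal sketch using scikit-learn's roc_auc_score and roc_curve; note that AUC is computed from predicted probabilities (or scores) rather than hard class labels, so the value depends on the model used:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# AUC from the positive-class probabilities
auc = roc_auc_score(y_test, y_prob)
print("AUC of the classifier is:", auc)

# ROC curve: FPR vs. TPR across all probability thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label=f"Classifier (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Breast Cancer Classifier")
plt.legend()
plt.show()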

Output:
AUC of the classifier is: 0.9769462078182426
ROC AUC

If you liked this article, please follow me. If you notice any mistakes in the formulas, code, or content, please let me know.

You can find me at: LinkedIn, GitHub
