Evaluation Metrics for Classification Problems

Danishaman
Jul 19, 2022


Classification through machine learning has many applications: we can classify between dead or alive (the classic Titanic problem!), between different types of products based on their features, or between different species of animals or plants based on their traits. But the question arises: how do we evaluate our model? How accurate is it? We can’t just compare each data point manually, so we need a more systematic solution. In this article, I will discuss some of the major measures used for model evaluation.

Accuracy:

The classical approach to model evaluation is checking its accuracy. The accuracy of a model tells us how many of our samples were predicted correctly; the formula for accuracy is shown below.

Fig-1 : Accuracy

Accuracy checks whether each prediction equals the actual value; if they match, a 1 is counted, and the summation adds up all the ones and divides by the total number of samples. We can compute accuracy in scikit-learn:

from sklearn.metrics import accuracy_score

print(accuracy_score(y_true, y_pred))

Here, y_true holds the true class labels and y_pred holds the labels predicted by the model.
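For illustration, here is a tiny sketch with made-up labels (the arrays below are hypothetical) that computes the same ratio by hand and compares it with accuracy_score:

from sklearn.metrics import accuracy_score

# hypothetical toy labels, only for illustration
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# count the matching positions and divide by the total number of samples
manual = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

print(manual)                          # 0.75
print(accuracy_score(y_true, y_pred))  # 0.75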

Confusion Matrix:

Fig-2 : Confusion Matrix

This matrix is very useful for understanding a model’s predictions. It has four sections. TP means true positives: in binary classification, these are the samples correctly classified as 1. FN means false negatives: samples classified as 0 that were actually 1. FP means false positives: samples that were actually 0 but were classified as 1. TN means true negatives: samples that were 0 and were correctly classified as 0. These four counts are the building blocks of many useful evaluation measures.

We can use scikit-learn to compute the confusion matrix:

from sklearn.metrics import confusion_matrix

TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()
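As a quick sketch with made-up labels (hypothetical, for illustration only), this shows how the four counts line up with scikit-learn’s output, where rows are true labels and columns are predicted labels:

from sklearn.metrics import confusion_matrix

# hypothetical toy labels, only for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

print(confusion_matrix(y_true, y_pred))
# [[4 2]
#  [1 3]]

# flatten the 2x2 matrix into the four counts
TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()
print(TN, FP, FN, TP)  # 4 2 1 3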

Balanced Accuracy:

Sometimes our data is not balanced: when classifying types of plants, for example, there may be far more samples of plant1 than of plant2. Classic accuracy can give misleadingly optimistic results in this case. This problem can be addressed with balanced accuracy, which averages the recall obtained on each class.

Fig-3 : Balanced Accuracy

We can use scikit-learn to compute balanced accuracy:

from sklearn.metrics import balanced_accuracy_score

print(balanced_accuracy_score(y_true, y_pred))
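To see why this matters, here is a hypothetical sketch with heavily imbalanced labels: a model that always predicts the majority class gets a high plain accuracy, but balanced accuracy exposes it:

from sklearn.metrics import accuracy_score, balanced_accuracy_score

# hypothetical imbalanced labels: nine samples of class 0, one of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # model always predicts 0

print(accuracy_score(y_true, y_pred))           # 0.9 -> looks good
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -> reveals the bias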

Precision:

We can define precision as the number of correct positive predictions divided by the total number of predictions the model marked as positive. For example, suppose the model has to predict whether a fruit is a mango or not. If it predicts 55 mangoes but only 45 of those predictions are actually mangoes, the precision is 45/55. The formula for precision is the following:

Fig-4 : Precision

We can calculate precision through scikit-learn:

from sklearn.metrics import precision_score

print(precision_score(y_true, y_pred))
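As a small hypothetical sketch (the labels below are made up), the model marks five samples as positive but only three of them are truly positive:

from sklearn.metrics import precision_score

# hypothetical toy labels, only for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# precision = TP / (TP + FP) = 3 / (3 + 2)
print(precision_score(y_true, y_pred))  # 0.6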

Recall:

We can define recall as the number of positives the model correctly predicted divided by the total number of positives that actually exist. Returning to the mango example, if the model correctly identifies 55 mangoes but there are actually 60 mangoes in total, the recall is 55/60. The formula for calculating recall is the following:

Fig-5 : Recall

We can calculate recall through scikit-learn:

from sklearn.metrics import recall_score

print(recall_score(y_true, y_pred))
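Using the same hypothetical labels as in the precision sketch, there are four actual positives and the model finds three of them:

from sklearn.metrics import recall_score

# hypothetical toy labels, only for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# recall = TP / (TP + FN) = 3 / (3 + 1)
print(recall_score(y_true, y_pred))  # 0.75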

F1-Score:

A model’s F1-score is computed using only precision and recall: it is the harmonic mean of the two, typically reported for the positive (often minority) class. Its value lies between 0 and 1, with 1 being the best. It describes how well the model balances precision and recall, and because it is a harmonic mean it also works well on imbalanced datasets, since a low precision or a low recall pulls the score down. The formula for calculating the F1-score is:

Fig-7 : F1-Score

We can calculate the F1-score through scikit-learn:

from sklearn.metrics import f1_score

print(f1_score(y_true, y_pred))
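Sticking with the same hypothetical labels, the F1-score is simply the harmonic mean of the precision and recall computed above:

from sklearn.metrics import f1_score, precision_score, recall_score

# hypothetical toy labels, only for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

p = precision_score(y_true, y_pred)  # 0.6
r = recall_score(y_true, y_pred)     # 0.75

print(2 * p * r / (p + r))       # 0.666... harmonic mean by hand
print(f1_score(y_true, y_pred))  # same value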

Jaccard Similarity Index:

The Jaccard similarity measures the similarity between two sets of data, showing which members are shared and which are distinct. It is calculated by dividing the number of observations in both sets (the intersection) by the number of observations in either set (the union). The formula for calculating Jaccard similarity is the following:

Fig-8 : Jaccard Index

The numerator counts the values shared between the predicted and actual labels, and the denominator counts all the values in either of them. We can calculate Jaccard similarity through scikit-learn:

from sklearn.metrics import jaccard_score

print(jaccard_score(y_true, y_pred))
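With the same hypothetical labels, the intersection is the three correctly predicted positives and the union is TP + FP + FN = 3 + 2 + 1 = 6:

from sklearn.metrics import jaccard_score

# hypothetical toy labels, only for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Jaccard index for the positive class = TP / (TP + FP + FN) = 3 / 6
print(jaccard_score(y_true, y_pred))  # 0.5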

Conclusion:

There are many different ways to calculate and evaluate the models you have created and trained. Which metric is the right choice depends on your model and the data you are using.

