F1 Score: A Way to Evaluate a Model's Performance

Pooja Sharma · Published in Analytics Vidhya · Dec 24, 2020

One of the challenges with building a machine learning system is that there are so many things we could try and so many things we could change, for example tuning hyperparameters. Whether we are tuning hyperparameters, trying out different ideas for learning algorithms, or just exploring different options for building the system, one thing that reliably speeds up progress is having a single real-number evaluation metric that lets us quickly tell whether the idea we just tried works better or worse than the last one.


One reasonable way to evaluate the performance of a classifier is to look at its precision and recall. Precision is the fraction of correct positives among all predicted positives. For example, if classifier A has 95% precision, then when classifier A says an image contains a dog, there is a 95% chance it really is a dog. High precision corresponds to a low false-positive rate. Recall is the fraction of correct positives among all actual positives in the dataset: of all the images that really are dogs, what percentage were correctly recognized by the classifier?


Precision: of all the images we predicted y = 1, what fraction actually have dogs?
Precision = True positives / Number of predicted positives = TP / (TP + FP)

Recall: of all the images that actually have dogs, what fraction did we correctly identify?
Recall = True positives / Number of actual positives = TP / (TP + FN)
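
As a quick illustration, here is a minimal sketch of how these two formulas translate into code (the counts below are made up, purely for illustration):

# Hypothetical confusion-matrix counts for a dog-vs-not-dog classifier
tp = 90   # dogs correctly predicted as dogs (true positives)
fp = 5    # non-dogs wrongly predicted as dogs (false positives)
fn = 10   # dogs the classifier missed (false negatives)

precision = tp / (tp + fp)   # TP / (TP + FP) ≈ 0.947
recall = tp / (tp + fn)      # TP / (TP + FN) = 0.900
print(precision, recall)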

Likewise, suppose classifier A has 95% precision and 90% recall, while classifier B has 98% precision and 85% recall: when A says there is a dog, it is right 95% of the time and it correctly detects 90% of the dogs, whereas B is right 98% of the time but only detects 85% of the dogs.
The problem with precision and recall as evaluation metrics is that if classifier A does better on precision while classifier B does better on recall, we cannot be sure which classifier is best. And in a real scenario, when we have many different ideas, many hyperparameters to try, and a dozen classifiers, comparing two numbers per classifier is not practical.
So, rather than the two numbers precision and recall, it is recommended to pick a classifier using a single evaluation metric that combines both: the F1 score, which is the harmonic mean of precision P and recall R.

F1 Score = 2 / (1/P + 1/R) = (2 × P × R) / (P + R)
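
Plugging in the numbers above gives a quick worked example, using the precision and recall values quoted for classifiers A and B:

# F1 for classifier A: precision 0.95, recall 0.90
f1_a = (2 * 0.95 * 0.90) / (0.95 + 0.90)   # ≈ 0.924
# F1 for classifier B: precision 0.98, recall 0.85
f1_b = (2 * 0.98 * 0.85) / (0.98 + 0.85)   # ≈ 0.910
print(f1_a, f1_b)

On this single metric, classifier A comes out ahead even though B has the higher precision, so we can pick A without having to weigh two numbers by hand.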

The following is a sample Python snippet that computes the F1 score with scikit-learn (a built-in dataset is loaded here only so that the example runs end to end):

# Import f1_score along with the other pieces needed to run the example
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Load a binary classification dataset and split it into train and test sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Instantiate the classifier
clf = RandomForestClassifier(random_state=42)
# Fit to the training data
clf.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = clf.predict(X_test)
# Print the F1 score
print(f1_score(y_test, y_pred))
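
Note that f1_score computes the binary F1 score by default; for a multiclass problem you would pass an averaging strategy such as average='macro' or average='weighted'.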

Sometimes a single evaluation metric is not enough. Suppose we care about accuracy as well as the running time of the classifier; in such cases it is good to set up a satisficing metric alongside the optimizing metric.

For example, suppose an image classification system requires a running time of less than 100 ms, and when we try different classifiers we find that classifiers A, B, and C give 90%, 93%, and 96% accuracy with running times of 80 ms, 95 ms, and 500 ms respectively. Here accuracy is the optimizing metric, because we want to maximize it, and running time is the satisficing metric, because it only needs to stay under the threshold. With the metrics defined this way, we can conclude that classifier B is the best classifier in this case: it has the highest accuracy among the classifiers that meet the 100 ms requirement, as the sketch below shows.
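
A minimal sketch of this selection rule, using the hypothetical accuracy and running-time numbers from the example above:

# Hypothetical results from the example: (name, accuracy, running time in ms)
classifiers = [("A", 0.90, 80), ("B", 0.93, 95), ("C", 0.96, 500)]

# Keep only the classifiers that satisfy the running-time constraint (< 100 ms) ...
satisficing = [c for c in classifiers if c[2] < 100]
# ... then pick the one with the highest accuracy (the optimizing metric)
best = max(satisficing, key=lambda c: c[1])
print(best)   # ('B', 0.93, 95)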
