Evaluating ML Models: Precision, Recall, F1 and Accuracy

Using these metrics to determine if a model is effective

GreekDataGuy
Analytics Vidhya
4 min read · Sep 2, 2019


In the last post we discussed how accuracy can be a misleading metric for gauging AI model performance.

So what metrics should we use instead of accuracy? Precision, Recall and F1.

Our example

We’re going to explain accuracy, precision, recall and F1 using the same example, along with the pros and cons of each.

Example:

We’ve built a model that predicts which companies will survive longer than 3 years.
It has made predictions for the 10 companies seen below.

1 = survived
0 = failed

### Tabular Format ###
Company No.:  1  2  3  4  5  6  7  8  9  10
Actual:       1  1  0  1  0  0  0  0  0  0
Predicted:    1  1  1  0  1  0  0  0  0  0
### In Python ###
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
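
Before walking through each metric, it can help to count the four outcomes that all of them are built from: true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). This is an optional sketch, not part of the original walkthrough, using sklearn’s confusion_matrix to get the counts directly:

### Python ###
from sklearn.metrics import confusion_matrix

# Rows are actual values (0, 1), columns are predicted values (0, 1),
# so ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)
>>> 2 2 1 5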

Accuracy

The percentage of instances where the model predicted the correct value. (Aka. the percent of companies for which the model correctly predicted survival or failure.)

### Python ###
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
>>> 0.7
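
If you’d rather see the arithmetic, a minimal hand-rolled check (not part of the original snippet) gives the same number: 7 of the 10 predictions match the actual labels.

### Python ###
# Accuracy = correct predictions / total predictions = (TP + TN) / 10
correct = sum(t == p for t, p in zip(y_true, y_pred))
print(correct / len(y_true))
>>> 0.7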

At first glance this might seem like a good result, since the model was correct most of the time. But whether it’s good enough depends on the context.

If you had used this model to pick companies to invest in, you would have lost your money on 50% of your 4 investments. Accuracy is a bad metric for evaluating the model in that context.

Precision

The fraction of retrieved instances that are actually relevant. Aka. the fraction of companies predicted to succeed that actually succeeded: TP / (TP + FP).

You can calculate this metric for both i) cases where the model predicted 1, and ii) cases where the model predicted 0. Both are shown below, but the positive case is the one most relevant to our example.

### Python ###
from sklearn.metrics import precision_score

positive = precision_score(y_true, y_pred, pos_label=1)
print(positive)
>>> 0.5
negative = precision_score(y_true, y_pred, pos_label=0)
print(negative)
>>> 0.8333333333333334
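
To make the formula concrete, here is a small hand-rolled sketch of the positive case (again, just an illustrative check): of the 4 companies the model predicted would succeed, 2 actually did, so precision = TP / (TP + FP) = 2 / 4.

### Python ###
# Positive-class precision computed by hand
tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)  # 2
fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)  # 2
print(tp / (tp + fp))
>>> 0.5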

If we stick with the same context of picking companies to invest in, then precision (for the positive case) is actually a good metric for evaluating this model.

You would know not to invest in the companies the model recommends, because you’re aware it’s only correct 50% of the time.

Recall

The fraction of relevant instances that were retrieved, out of all relevant instances. Aka. what percent of all successful companies did the model find: TP / (TP + FN).

Again, either the 1s or the 0s can be treated as the relevant class; we calculate both below.

### Python ###
from sklearn.metrics import recall_score

positive = recall_score(y_true, y_pred, pos_label=1)
print(positive)
>>> 0.6666666666666666
negative = recall_score(y_true, y_pred, pos_label=0)
print(negative)
>>> 0.7142857142857143
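
The arithmetic is simple enough to verify by hand: 3 companies actually succeeded and the model found 2 of them, so recall = TP / (TP + FN) = 2 / 3. A minimal sketch:

### Python ###
# Positive-class recall computed by hand
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 2
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1
print(tp / (tp + fn))
>>> 0.6666666666666666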

Here we see that out of the successful companies, the model found 67%. Recall doesn’t consider how many companies we incorrectly predicted would succeed, only how many of the truly successful companies we discovered.

We probably wouldn’t use recall to evaluate our model if the context is making investments, unless we’re a VC firm making a large number of small investments with potential 1000x payoffs and we don’t want to miss any company that might become successful.

F1

F1 is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall).

F1 takes both precision and recall into account.

I think of it as a conservative average: a low value on either side drags the F1 down. For example (verified in the short sketch below):
The F1 of 0.5 and 0.5 = 0.5.
The F1 of 1 and 0.5 ≈ 0.67.
The F1 of 1 and 0.01 ≈ 0.02.
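
A tiny sketch of that formula (with a hypothetical helper, not sklearn) reproduces those numbers:

### Python ###
# Harmonic mean of precision and recall, rounded to 2 decimals
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.5, 0.5), 2), round(f1(1, 0.5), 2), round(f1(1, 0.01), 2))
>>> 0.5 0.67 0.02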

Again, we can calculate it for both positive and negative cases.

### Python ###
from sklearn.metrics import f1_score

positive = f1_score(y_true, y_pred, pos_label=1)
print(positive)
>>> 0.5714285714285715
negative = f1_score(y_true, y_pred, pos_label=0)
print(negative)
>>> 0.7692307692307692
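
As a quick cross-check, plugging the positive-class precision (0.5) and recall (2/3) from earlier into the formula by hand gives the same value sklearn reports:

### Python ###
precision, recall = 0.5, 2 / 3
print(round(2 * precision * recall / (precision + recall), 4))
>>> 0.5714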

We typically use F1 when both precision and recall matter. In my real-world experience, they often both do.

Conclusion

Before I let you go: we could have printed all of these scores at once with sklearn’s classification report.

### Python ###
from sklearn.metrics import classification_report

report = classification_report(y_true, y_pred)
print(report)
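
If you want the numbers programmatically rather than as formatted text, recent versions of scikit-learn (0.20+) also accept output_dict=True. A small sketch, assuming such a version is installed:

### Python ###
# Returns the same report as a nested dict keyed by class label
report_dict = classification_report(y_true, y_pred, output_dict=True)
print(round(report_dict["1"]["f1-score"], 4))
>>> 0.5714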

In conclusion, context always matters when choosing an evaluation metric. Getting it right often requires careful thought, domain knowledge and an understanding of what you’re trying to achieve.
