Calculating Precision & Recall for Multi-Class Classification

Mehul Gupta
Data Science in your pocket
4 min read · Apr 21, 2020


Anyone associated with Data Science must have heard of the terms Precision & Recall. We come across these terms quite often whenever we are stuck with a classification problem. If you have spent some time exploring Data Science, you will also have an idea of how accuracy alone can often be misleading when analyzing the performance of a model. I won’t be discussing that here.

My debut book “LangChain in your Pocket” is out now!!

Find the vlog version of the post below

The formulae for Precision & Recall won’t be alien to you either, but let’s have a quick recap:
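In their standard binary form, with TP, FP & FN denoting True Positives, False Positives & False Negatives:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)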

But can you smell something fishy in these formulae?

These formulae can be used only for binary classification problems (something like Titanic on Kaggle, where the label is ‘yes’ or ‘no’, or any problem with just 2 labels, e.g. Black or Red, where we take one as 1 & the other as 0).

What about Multi-Class Problems?

Say I have a classification problem with 3 or more classes, e.g. Black, Red, Blue, White, etc.

The above formulae just won’t fit!! Calculating accuracy, though, is still not a problem.

Then how can you calculate Precision & Recall for problems with Multiple classes as labels?

Let us first consider the situation. Assume we have a 3 Class classification problem where we need to classify emails received as Urgent, Normal or Spam.

Now let us calculate Precision & Recall for this using the below methods:

MACRO AVERAGING:

The row labels (index) are the output labels (system output) & the column labels are the gold labels, i.e. the actual labels. Hence,

  • [urgent, normal] = 10 means 10 Normal (actual label) mails have been classified as Urgent.
  • [spam, urgent] = 3 means 3 Urgent (actual label) mails have been classified as Spam.

The mathematics isn’t tough here. Just a few things to consider:

  • Dividing the diagonal value of a row by the sum of that row gives us Precision for that class. For example, precision_u = 8/(8+10+1) = 8/19 = 0.42 is the precision for class Urgent.

Similarly for precision_n (Normal) & precision_s (Spam).

  • Dividing the diagonal value of a column by the sum of that column gives us Recall for that class. For example:

recall_s = 200/(1+50+200) = 200/251 = 0.796. Similarly for recall_u (Urgent) & recall_n (Normal).

Now, to calculate the overall (macro-averaged) precision, average the three per-class values obtained; the same goes for recall, as shown in the sketch below.
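To make the arithmetic concrete, here is a minimal sketch in Python (using numpy). The confusion-matrix values are reconstructed from the numbers quoted in this post (8, 10, 1, 50, 200, 3 and the pooled diagonal sum 8 + 60 + 200); the two remaining off-diagonal cells (5 & 30) are assumptions chosen so that the pooled false-positive count used later comes out to 99.

```python
import numpy as np

# Confusion matrix: rows = system output (predicted), columns = gold (actual),
# class order [urgent, normal, spam].
# Cells 8, 10, 1, 50, 3, 200 and 60 come from the numbers used in this post;
# the remaining cells (5 and 30) are assumed for consistency with pooled FP = 99.
conf = np.array([
    [  8,  10,   1],   # predicted Urgent
    [  5,  60,  50],   # predicted Normal
    [  3,  30, 200],   # predicted Spam
])
classes = ["urgent", "normal", "spam"]

precision = conf.diagonal() / conf.sum(axis=1)   # diagonal / row sums
recall    = conf.diagonal() / conf.sum(axis=0)   # diagonal / column sums

for c, p, r in zip(classes, precision, recall):
    print(f"{c:>6}: precision={p:.2f}  recall={r:.2f}")

print("macro precision:", precision.mean())   # (0.42 + 0.52 + 0.86) / 3 ≈ 0.60
print("macro recall:   ", recall.mean())      # (0.50 + 0.60 + 0.80) / 3 ≈ 0.63
```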

MICRO AVERAGING:

Micro averaging follows the one-vs-rest approach. For each class, we build a 2×2 confusion matrix treating that class as Positive (predicted class = actual class) and everything else as Negative, irrespective of which wrong class a sample was assigned to. Building one such matrix for each of the 3 classes makes the idea clearer.

Now, we add all these matrices element-wise to produce the final confusion matrix for the entire data, i.e. Pooled. Looking at cell [0,0] of the Pooled matrix = Urgent[0,0] + Normal[0,0] + Spam[0,0] = 8 + 60 + 200 = 268.

Now, using the old formula, precision = TruePositive(268) / (TruePositive(268) + FalsePositive(99)) = 268/367 ≈ 0.73.

Similarly we can calculate Recall as well.
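A short sketch of the same pooling in Python, using the same reconstructed confusion matrix as in the macro sketch above (cells 5 and 30 remain assumptions):

```python
import numpy as np

# Rows = predicted, columns = actual, order [urgent, normal, spam].
conf = np.array([
    [  8,  10,   1],
    [  5,  60,  50],
    [  3,  30, 200],
])

# One-vs-rest counts per class, then pooled across classes.
tp = conf.diagonal()          # [8, 60, 200]
fp = conf.sum(axis=1) - tp    # samples wrongly predicted AS this class
fn = conf.sum(axis=0) - tp    # samples of this class predicted as something else

pooled_tp, pooled_fp, pooled_fn = tp.sum(), fp.sum(), fn.sum()
print(pooled_tp, pooled_fp, pooled_fn)   # 268 99 99

print("micro precision:", pooled_tp / (pooled_tp + pooled_fp))   # 268/367 ≈ 0.73
print("micro recall:   ", pooled_tp / (pooled_tp + pooled_fn))   # 268/367 ≈ 0.73
# With single-label predictions, pooled FP == pooled FN,
# so micro precision, micro recall and overall accuracy all coincide.
```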

Which one should I choose?

As we can see in the above calculations, the micro average is dominated by the majority class (in our case, Spam), & therefore it might not depict the performance of the model on all classes (especially minority classes like ‘Urgent’, which have fewer samples in the test data). If you observe, the model performs poorly for ‘Urgent’, yet the overall number obtained by micro averaging is a misleading ~73% precision, while for class Urgent the actual precision is just 42%. Hence macro averaging does have an edge over micro averaging here.
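In practice you rarely compute these by hand: scikit-learn’s precision_score & recall_score expose both strategies through the average argument. A tiny sketch (the label lists here are made up purely to show the API):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels, purely to illustrate the API.
y_true = ["urgent", "normal", "spam", "spam", "urgent", "normal"]
y_pred = ["urgent", "spam",   "spam", "spam", "normal", "normal"]

print(precision_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class precision
print(precision_score(y_true, y_pred, average="micro"))  # pooled TP / (TP + FP)
print(recall_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="micro"))
```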
