Metrics for Multi-Label Classification

Lohithmunakala · Published in Analytics Vidhya · Aug 28, 2020

Metrics play an important role in Machine Learning and Deep Learning. We start a problem by selecting a metric so that we know the baseline score of a particular model. In this blog, we look at the best and most common metrics for Multi-Label Classification and how they differ from the usual metrics.

Let me first explain what Multi-Label Classification is, in case you need it. Suppose we have data about the features of a dog and we have to predict both the breed and the pet category it belongs to. Since each sample is assigned more than one label, this is a multi-label problem.

In the case of Object Detection, Multi-Label Classification gives us a list of all the objects in the image, as follows. We can see that the classifier detects 3 objects in the image. If the total number of trained classes is 4, i.e. [dog, human, bicycle, truck], this can be written as the list [1 0 1 1].

[Image: Object Detection output (Multi-Label Classification)]

This kind of classification is known as Multi-Label Classification.
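As a quick sketch, the multi-hot list from the example above could be built like this (the class names are just the ones used in the example):

```python
# Label universe from the example above.
classes = ["dog", "human", "bicycle", "truck"]

# The classifier detected a dog, a bicycle and a truck, but no human.
detected = {"dog", "bicycle", "truck"}

# Multi-hot encoding: 1 if the class is present in the image, else 0.
encoding = [1 if c in detected else 0 for c in classes]
print(encoding)  # [1, 0, 1, 1]
```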

The most common metrics that are used for Multi-Label Classification are as follows:

  1. Precision at k (P@k)
  2. Average precision at k (AP@k)
  3. Mean average precision at k (MAP@k)
  4. Sampled F1 Score

Let’s get into the details of these metrics.

Precision at k (P@k):

Given a list of actual classes and a list of predicted classes, precision at k is defined as the number of correct predictions among only the top k elements of the predicted list, divided by k. The values range between 0 and 1.

Here is an example explaining the same in code:
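This is a minimal sketch following the definition above (the pk function and the example lists are illustrative, not a library implementation):

```python
def pk(y_true, y_pred, k):
    """Precision at k: correct predictions among the top k, divided by k."""
    if k == 0:
        return 0
    # Consider only the top k predicted classes.
    pred_set = set(y_pred[:k])
    true_set = set(y_true)
    # Count how many of the top-k predictions are actually correct.
    n_correct = len(pred_set.intersection(true_set))
    return n_correct / k


# Actual vs. predicted classes for a single sample.
y_true = [1, 2, 3]
y_pred = [1, 1, 3]  # the class 2 was wrongly predicted as 1

print(pk(y_true, y_pred, k=3))  # 0.666...
```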

Running the above code, we get a P@3 of about 0.67. In this case, the class 2 was predicted as 1, thus resulting in the score going down.

Average Precision at K (AP@k):

It is defined as the average of all precision-at-k values for k = 1 up to k, i.e. the mean of P@1, P@2, ..., P@k. The values range between 0 and 1. To make it clearer, let's look at some code.
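A minimal sketch of AP@k, reusing the pk function from the previous snippet:

```python
def apk(y_true, y_pred, k):
    """Average precision at k: the mean of P@1, P@2, ..., P@k."""
    pk_values = [pk(y_true, y_pred, i) for i in range(1, k + 1)]
    if not pk_values:
        return 0
    return sum(pk_values) / len(pk_values)
```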

Here we check the AP@k for k = 1 to 4.
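A short check loop (the lists are the earlier hypothetical example extended to four classes); the printed values are shown as comments:

```python
y_true = [1, 2, 3, 4]
y_pred = [1, 1, 3, 4]

for k in range(1, 5):
    print(f"AP@{k} = {apk(y_true, y_pred, k):.3f}")

# AP@1 = 1.000
# AP@2 = 0.750
# AP@3 = 0.722
# AP@4 = 0.729
```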

This gives us a clear understanding of how the code works.

Mean Average Precision at K (MAP@k):

The average of AP@k over all samples in the data is known as MAP@k. This gives us a more representative picture of how good the predictions are across the whole dataset. Here is some code for the same.

The values range between 0 and 1.
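A minimal sketch building on the apk function above; the label lists are hypothetical and deliberately contain many wrong predictions:

```python
def mapk(y_true, y_pred, k):
    """Mean average precision at k: AP@k averaged over all samples."""
    apk_values = [apk(t, p, k) for t, p in zip(y_true, y_pred)]
    return sum(apk_values) / len(apk_values)


# Hypothetical actual and predicted classes for three samples.
y_true = [[1, 2, 3], [2, 4], [1]]
y_pred = [[0, 1, 2], [1, 3], [2, 3]]

print(mapk(y_true, y_pred, k=2))  # ~0.083
```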

Running the above code, we get a score of about 0.08.

Here, the score is bad as the prediction set has many errors.

F1 — Samples:

This metric calculates the F1 score for each instance in the data and then calculates the average of the F1 scores. We will be using sklearn’s implementation of the same in the code.

The details are in sklearn's documentation for f1_score. The values range between 0 and 1.

We first convert the data into a binary (multi-hot) format and then compute the F1 score on it. This gives us the required values.
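A minimal sketch using sklearn's MultiLabelBinarizer and f1_score; the label lists are hypothetical and chosen so that the samples-averaged F1 works out to the 0.45 discussed below:

```python
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical multi-label data: each sample can have several classes.
y_true = [[1, 2], [1, 2], [2, 3], [1, 2, 3, 4], [1]]
y_pred = [[1, 2], [1, 3], [3, 4], [1, 5, 6, 7], [2]]

# Convert the label lists into binary indicator (multi-hot) matrices,
# fitting the binarizer on every label seen in either list.
mlb = MultiLabelBinarizer()
mlb.fit(y_true + y_pred)
y_true_bin = mlb.transform(y_true)
y_pred_bin = mlb.transform(y_pred)

# average="samples" computes F1 per sample and then averages the scores.
print(f1_score(y_true_bin, y_pred_bin, average="samples"))  # 0.45
```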

Running the above code gives a score of 0.45. We know that the F1 score lies between 0 and 1; here the value is low because the prediction set is bad. If we had a better prediction set, the value would be closer to 1.

Hence, depending on the problem, we usually use Mean Average Precision at K, the Sampled F1 Score, or Log Loss, and set up the metrics for the problem accordingly.

I would like to thank Abhishek Thakur for his book Approaching (Almost) Any Machine Learning Problem, without which this blog wouldn't have been possible.
