Multilabel Classification for Class Sets

Moustafa Ayoub
Published in Ixor
Oct 28, 2020

Our IxorThink “Toolbox” can classify and detect word entities in documents, which in turn improves complex document-processing flows.

One of the problems we faced was that some words could belong to multiple classes. However, a standard multilabel classification approach was not an optimal solution, since not all of our classes are independent of each other. More specifically, the classes can be separated into groups, referred to from here on as class sets.

This article focuses mainly on the theoretical approach to the multilabel classification problem for class sets, with a brief comparison against the other common classification problems.

MultiClass Classification:

“Select only one of the following options…”

This is one of the most popular classification problems in machine learning. Here you assume that the output classes are dependent, meaning that only one of the classes can be selected as the output.

An example of a multiclass classification problem is classifying a car's brand out of 6 brands (labels): [BMW, VW, Mercedes, Ford, Tesla, Toyota]. Because a car cannot belong to two brands, the output of the classifier should point to exactly one of the brands and not a combination of them.

The most common loss function used in the multiclass classification is the Categorical Cross Entropy in combination with a softmax function as an activation function for the output layer.

The softmax function normalises the output of the network into a probability distribution over all of the predicted output classes. This is the function that enforces the dependency between the labels, our main assumption.

An example output of a multiclass classification problem with 6 classes would look like this: [0, 0, 1, 0, 0, 0]. One class has a value of one and the others are zero; this is what we typically call one-hot encoding. Translated to an integer output it would be class [2], which corresponds to the label [Mercedes] from our example above.

Simple MultiClass Classification Network
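As a minimal sketch of such an output setup, here is how it could look in PyTorch (the article names no framework, and the feature size, batch size, and single linear layer are assumptions made for illustration; note that PyTorch's CrossEntropyLoss combines the softmax and the categorical cross entropy internally):

```python
import torch
import torch.nn as nn

# Hypothetical sizes for the car-brand example: 128 input features
# (made up for illustration) and the 6 brand classes.
num_features, num_classes = 128, 6

output_layer = nn.Linear(num_features, num_classes)  # produces raw logits
loss_fn = nn.CrossEntropyLoss()   # softmax + categorical cross entropy in one

features = torch.randn(4, num_features)  # a batch of 4 examples
targets = torch.tensor([2, 0, 5, 1])     # class indices, e.g. 2 = Mercedes

logits = output_layer(features)
loss = loss_fn(logits, targets)          # softmax is applied internally

probs = torch.softmax(logits, dim=1)     # one distribution per example, sums to 1
prediction = probs.argmax(dim=1)         # exactly one class per example
```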

Multilabel Classification:

“Select zero or one or even more options”

Another classification problem is the multilabel classification problem, where you assume that each output class is independent of the other classes, meaning that each class is a binary classification problem of its own.

An example of a multilabel classification problem is to assign movie genres to a movie. Most movies can be categorised in multiple genres e.g. an action-comedy movie. So in this problem, the output of the classifier can point to one or more categories.

The most common loss function used in multilabel classification is the Binary Cross Entropy, with a sigmoid function as the activation function for the output layer.

The sigmoid function maps each output of the network independently to a probability value between zero and one.

An output example for a multilabel classification problem with 6 classes, e.g. [Comedy, Thriller, Horror, Mystery, Romance, Action], would look like this: [0, 1, 1, 0, 0, 0]. Here multiple classes have a value of one; this is what we typically call multi-hot encoding. In integer output space, it would be represented as [1, 2], which corresponds to the labels [Thriller, Horror] from our example above.

Simple MultiLabel Classification Network
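A comparable sketch for the multilabel case, again in PyTorch with made-up sizes, swaps in a sigmoid and binary cross entropy (combined in PyTorch's BCEWithLogitsLoss):

```python
import torch
import torch.nn as nn

num_features, num_labels = 128, 6  # the 6 genres from the example

output_layer = nn.Linear(num_features, num_labels)
loss_fn = nn.BCEWithLogitsLoss()   # sigmoid + binary cross entropy per label

features = torch.randn(4, num_features)
# Multi-hot targets: e.g. [0, 1, 1, 0, 0, 0] = Thriller + Horror
targets = torch.tensor([[0., 1., 1., 0., 0., 0.],
                        [1., 0., 0., 0., 0., 1.],
                        [0., 0., 0., 1., 0., 0.],
                        [1., 1., 0., 0., 0., 0.]])

logits = output_layer(features)
loss = loss_fn(logits, targets)

probs = torch.sigmoid(logits)      # independent probability per label
predictions = (probs > 0.5).int()  # threshold each label separately
```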

Multilabel Classification with Class Sets:

“Best of both worlds”

This approach is a combination of the above two classification problems, where we can't assume that the outputs are completely independent or completely dependent. The assumption we can make is that labels of the same class set are dependent on each other and independent of labels in the other class sets.

An example of a multilabel classification problem with class sets is classifying a vehicle's brand out of 6 brands (labels): [BMW, VW, Mercedes, Ford, Tesla, Toyota], plus the vehicle's type out of 3 types (labels): [motorcycle, car, truck].

Now to keep things clear, we have two class sets:

  • Class set A (brands) is [BMW, VW, Mercedes, Ford, Tesla, Toyota]
  • Class set B (type) is [motorcycle, car, truck]

We want the classifier to output 2 labels: one for the brand (not a combination of brands) and one for the type (not a combination of types). The way to approach this problem is to split it into two separate multiclass classification problems that are handled by one classifier.

We will use:

  • Two activation functions (softmax), one per class set. Each creates a probability distribution over the labels of its own class set, so that eventually exactly one class is selected per class set.
  • Two loss functions (Categorical Cross Entropy) to calculate two losses, one for each class set.

An output example for a multilabel classification with class sets would look like this: [0, 0, 1, 0, 0, 0] for class set A and [0, 0, 1] for class set B.
If that were translated to an integer output, it would represent classes [2, 2], which correspond to the labels [Mercedes, truck] from our example above.

Simple MultiClass Class Set Classification Network
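A minimal sketch of this two-headed setup in PyTorch (an illustration with invented sizes and targets, not our production implementation):

```python
import torch
import torch.nn as nn

class ClassSetClassifier(nn.Module):
    """One shared feature input, one softmax head per class set."""
    def __init__(self, num_features, set_a_size=6, set_b_size=3):
        super().__init__()
        self.head_a = nn.Linear(num_features, set_a_size)  # brands
        self.head_b = nn.Linear(num_features, set_b_size)  # vehicle types

    def forward(self, features):
        return self.head_a(features), self.head_b(features)

model = ClassSetClassifier(num_features=128)
loss_fn = nn.CrossEntropyLoss()  # one categorical cross entropy per class set

features = torch.randn(4, 128)
targets_a = torch.tensor([2, 0, 5, 1])  # e.g. 2 = Mercedes
targets_b = torch.tensor([2, 1, 1, 0])  # e.g. 2 = truck

logits_a, logits_b = model(features)
loss = loss_fn(logits_a, targets_a) + loss_fn(logits_b, targets_b)

pred_a = torch.softmax(logits_a, dim=1).argmax(dim=1)  # one brand per example
pred_b = torch.softmax(logits_b, dim=1).argmax(dim=1)  # one type per example
```

The two losses are simply summed, so one backward pass trains both heads at once.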

The main advantage of tackling this multilabel classification problem as one with class sets is that it leverages the power of the softmax function. Identifying the appropriate situations to use the softmax will improve your model's performance. This is likely because the softmax function generates stronger gradients than the sigmoid function.

To illustrate this, let's examine the example below for a model with two output neurons:

Target: [1,0]
Model logits: [1,2] (the output before the activation function)

Output after softmax: [0.2689, 0.7311]
Output after sigmoid: [0.7311, 0.8808]

As you can see, the difference between the model logits is amplified by the softmax compared to the sigmoid function's output. Since cross-entropy loss functions use the difference between the output and the target as the basis for the gradient, the gradients will be larger in the case of the softmax.
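These numbers are easy to reproduce. The NumPy sketch below also prints the (output − target) differences, which are exactly the gradients with respect to the logits for softmax + categorical cross entropy and for sigmoid + binary cross entropy respectively:

```python
import numpy as np

logits = np.array([1.0, 2.0])
target = np.array([1.0, 0.0])

# Softmax normalises the logits into a single probability distribution.
softmax = np.exp(logits) / np.exp(logits).sum()
print(softmax)           # ~ [0.2689 0.7311]

# Sigmoid squashes each logit independently.
sigmoid = 1.0 / (1.0 + np.exp(-logits))
print(sigmoid)           # ~ [0.7311 0.8808]

# (output - target) is the gradient of the logits in both setups.
print(softmax - target)  # ~ [-0.7311  0.7311]
print(sigmoid - target)  # ~ [-0.2689  0.8808]
```

Note in particular the gradient on the target class: −0.7311 with the softmax versus −0.2689 with the sigmoid, which is the stronger training signal referred to above.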

Conclusion

Identifying the correct type of classification, especially in multilabel problems, can boost your model's performance significantly.

At IxorThink we are constantly trying to improve our methods to create state-of-the-art solutions. As a software company, we can provide stable and fully developed solutions. Feel free to contact us for more information.
