An Introduction to Multi-Label Text Classification

Joyce Annie George
Published in Analytics Vidhya · 6 min read · Nov 13, 2020

Machine learning problems can be divided into supervised and unsupervised learning. Classification forms a major part of supervised learning. Classification is the process of assigning data points to predefined categories based on their features. Now, let us look into the different kinds of classification problems.

Binary vs Multi-Class vs Multi-Label

Classification problems can be binary, multi-class or multi-label. In a binary classification problem, the target label has only two possible values; for example, a spam detector predicts whether a given email is spam or not. A multi-class classification problem has more than two class labels, but each instance belongs to exactly one class. A multi-label classification problem also has more than two class labels, and an instance may belong to more than one class at a time: the labels are not mutually exclusive. In other words, a multi-class classifier assigns exactly one label to an instance, whereas a multi-label classifier may assign one or more labels.

In this article, we are working with this dataset to classify research articles into different topics. Now, let us explore the data. First of all, let us import all the libraries and the dataset.

train_data.head()
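Since the loading code is not shown above, here is a minimal sketch of this step. The file name train.csv and the column names are assumptions based on the dataset description, and a toy frame stands in for the real file:

```python
import pandas as pd

# In practice the data would be loaded from a CSV, e.g.:
# train_data = pd.read_csv("train.csv")
# A toy stand-in with the same assumed structure (one 0/1 column per topic):
train_data = pd.DataFrame({
    "TITLE": ["Paper A", "Paper B"],
    "ABSTRACT": ["An abstract about graph algorithms.", "An abstract about Bayesian priors."],
    "Computer Science": [1, 0],
    "Physics": [0, 0],
    "Mathematics": [1, 0],
    "Statistics": [0, 1],
    "Quantitative Biology": [0, 0],
    "Quantitative Finance": [0, 0],
})
print(train_data.head())
```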

The dataset consists of titles and abstracts for thousands of research articles. The train data contains research articles from six topics: Computer Science, Physics, Mathematics, Statistics, Quantitative Biology and Quantitative Finance. Our objective is to tag unseen articles with these topics. Let us explore the dataset further, starting with the number of tags for the articles.

We have more than 20,000 articles in the train data. All the articles are labeled under at least one topic. There are some articles with more than one topic. As our dataset contains articles with multiple tags, we are dealing with a multi-label classification problem. Let us plot a graph to look at the class distribution.
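Under the assumed one-column-per-topic layout, the class distribution is just the column sums. A sketch with toy counts (the real data would have one row per article):

```python
import pandas as pd

# Toy label matrix; the real data has one 0/1 column per topic.
labels = pd.DataFrame({
    "Computer Science": [1, 1, 1, 0],
    "Physics":          [0, 1, 0, 0],
    "Mathematics":      [0, 0, 1, 0],
    "Statistics":       [0, 0, 0, 1],
})
class_counts = labels.sum().sort_values(ascending=False)
print(class_counts)
# class_counts.plot(kind="bar")  # bar chart of the class distribution
```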

Class distribution plot

It is clear from the plot that the most popular topic is Computer Science. The data is class-imbalanced: the number of articles under Quantitative Biology and Quantitative Finance is very low. For now, we will not balance the data. Next, let us analyze the number of tags for each article.
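The tags-per-article count is the row sum of the label matrix. A sketch on the same toy layout:

```python
import pandas as pd

# Toy label matrix standing in for the real per-topic columns.
labels = pd.DataFrame({
    "Computer Science": [1, 1, 1, 0],
    "Physics":          [0, 1, 0, 0],
    "Mathematics":      [0, 0, 1, 0],
    "Statistics":       [0, 0, 0, 1],
})
# Row sums give the number of tags for each article;
# value_counts() tallies how many articles have 1, 2, 3, ... tags.
tags_per_article = labels.sum(axis=1).value_counts().sort_index()
print(tags_per_article)
```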

Plotting tags per article

Most of the articles have only one tag, but some have two or three. Now, let us start the data preprocessing. We will combine the two columns, title and abstract, into a new column ‘Text’.

We have to remove all the stopwords and special characters in the articles. We have also used a Snowball stemmer. A stemmer reduces the different inflected forms of a word to a common root.
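A minimal sketch of this cleaning step. The exact rules used in the article are not shown, so this version makes assumptions: it keeps letters only, uses scikit-learn's built-in English stop-word list, and applies NLTK's SnowballStemmer:

```python
import re

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = SnowballStemmer("english")

def preprocess(text: str) -> str:
    # Keep letters only, lowercase, drop stop words, then stem each word.
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    words = [w for w in text.split() if w not in ENGLISH_STOP_WORDS]
    return " ".join(stemmer.stem(w) for w in words)

# The article merges the two text columns into one 'Text' field, e.g.:
# train_data["Text"] = (train_data["TITLE"] + " " + train_data["ABSTRACT"]).apply(preprocess)
print(preprocess("Running the tests"))  # → "run test"
```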

Multi-Label Classification

Now, let us move towards the classification part. Multi-label classification problems use techniques like problem transformation and algorithm adaptation. The problem transformation method transforms the multi-label problem into several single-label problems. In the algorithm adaptation method, existing algorithms are adapted to perform multi-label classification directly. In this article, we will be discussing problem transformation methods. Let us take a look at some of these methods.

  • Binary Relevance: Binary relevance transforms a multi-label classification problem with L labels into L separate single-label binary classification problems. In this approach, each classifier predicts the membership of one class. The union of the predictions of all the classifiers is taken as the multi-label output.

Let us go through an example. Let X be the input and y1, y2, y3 and y4 be the labels in the below dataset.

Image by Author

As there are 4 labels, binary relevance uses 4 separate binary classifiers, one per label in the dataset.

Image by Author

As shown in the above figure, the multi-label classification problem is transformed into simpler binary classification problems. Binary relevance is popular because of its simplicity, but its main drawback is that it ignores possible correlations between the classes. Now, let us classify our articles using this approach.
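Binary relevance corresponds to scikit-learn's one-vs-rest wrapper: one independent binary classifier per label, with the predictions stacked into a label matrix. A sketch on synthetic data standing in for the TF-IDF features of the cleaned articles:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: 200 samples, 4 labels.
X, y = make_multilabel_classification(n_samples=200, n_classes=4, random_state=0)

# One independent binary LogisticRegression per label.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)
pred = clf.predict(X)
print(pred.shape)  # one 0/1 column per label
```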

  • Classifier Chains: This technique is similar to binary relevance. But it takes label correlation into account. This approach uses a chain of classifiers where each classifier uses the predictions of all the previous classifiers as input. The total number of classifiers is equal to the number of classes.

As our sample dataset has four classes, this approach creates 4 classifiers. The first classifier is trained on the input data with the class label y1. Its prediction is appended to the input features of the next classifier, and so on down the chain. Because this method exploits label correlations, it can give better results than binary relevance.

Image by Author

Now, let us apply a classifier chain to classify our research articles.
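scikit-learn ships this technique as ClassifierChain, where each link in the chain sees the original features plus the predictions of the earlier links. A sketch on the same kind of synthetic data:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, y = make_multilabel_classification(n_samples=200, n_classes=4, random_state=0)

# Fixed label order y1 -> y2 -> y3 -> y4; each classifier's input
# includes the predictions of all previous classifiers in the chain.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2, 3], random_state=0)
chain.fit(X, y)
pred = chain.predict(X)
print(pred.shape)
```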

  • Label Powerset:

The idea behind label powerset is to transform the multi-label classification into a multi-class problem. In this approach, a classifier is trained on all the unique label combinations in the training dataset.

Image by Author

Consider the above image. The items x1 and x4 have the same set of labels, as do x2 and x6, and x3 and x5. Label powerset treats each unique label combination as a single class. So, label powerset transforms the given multi-label classification problem into a single multi-class classification problem.

Image by Author

As the number of class labels increases, the number of unique label combinations also increases, which can make this approach computationally expensive. Another disadvantage is that it can only predict label combinations that were seen in the training dataset.
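Label powerset can be sketched directly: encode each row's label combination as one class id, fit a single multi-class classifier, and decode the predicted ids back into label vectors. This is a hand-rolled sketch (libraries such as scikit-multilearn ship a ready-made LabelPowerset wrapper), and it illustrates the drawback above: only combinations seen in training can ever be predicted.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression

X, y = make_multilabel_classification(n_samples=200, n_classes=4, random_state=0)

# Map each unique label combination (a row of y) to a single class id.
combos, combo_ids = np.unique(y, axis=0, return_inverse=True)
combo_ids = combo_ids.ravel()

# One multi-class classifier over the observed combinations.
clf = LogisticRegression(max_iter=1000).fit(X, combo_ids)

# Decode: predicted class ids back to full label vectors.
pred = combos[clf.predict(X)]
print(pred.shape)
```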

Conclusion

We have discussed problem transformation methods for multi-label text classification. There are several other approaches to multi-label classification: we can use adapted algorithms like ML-kNN or ensemble approaches to build a better model. We also noticed that the data is imbalanced; we can further improve the model by applying techniques like MLSMOTE to balance the input data. It is also worth trying deep learning techniques such as LSTMs.
