Survey on Multi-Label Text Classification using NLP and Machine Learning

Mageshwaran R
Published in Technovators · 10 min read · Feb 19, 2020

In this article, we’ll look into Multi-Label Text Classification, which is the problem of mapping inputs (x) to a set of target labels (y) that are not mutually exclusive. For instance, a movie can be mapped to one or more genre(s).

https://thinkpalm.com/blogs/natural-language-processing-nlp-artificial-intelligence/

First, we’ll explore multi-label classification in general, then we’ll try various methods to build a multi-label text classifier on the Reuters dataset.

Introduction to Multi-label Classification:

Let’s have a look at the image below,

  • On the left is a binary classification problem, where our goal is to predict whether the given instance (email) is spam or not.
  • In the middle, we have a multi-class classification problem, where our goal is to predict which animal appears in the image; here it’s strictly limited to one animal per instance.
  • And on the extreme right, we have a multi-label classification problem, where one or more animals can appear in the image and our goal is to list (predict) all of them.
Type of Classification Tasks. Source: https://www.microsoft.com/en-us/research/uploads/prod/2017/12/40250.jpg

As I mentioned earlier, the difference between multi-class and multi-label classification lies in the fact that the labels in the latter are not mutually exclusive.

This poses a challenge: the traditional machine learning methods we use expect a single label for every input, and that’s not the case here.

How are we gonna solve this challenge?

  • Problem Transformation: we divide the multi-label problem into one or more conventional single-label problems, using techniques such as Binary Relevance or Label Powerset.
  • Problem Adaptation: some classification algorithms (e.g., kNN, decision trees) have been adapted to the multi-label task, without requiring any problem transformation.

Having understood multi-label classification problems and the ways to solve them, let’s start to work on one.

Reuters Dataset

For this article, we’ll use Reuters, which is a benchmark dataset for document classification. To be more precise, it is a multi-class (i.e. there are multiple classes) and multi-label (i.e. each document can belong to many classes) dataset. It has 90 classes, 7769 training documents, and 3019 testing documents. The training set has a vocabulary size of 35247. Even if you restrict it to words that appear at least 5 times and at most 12672 times in the training set, there are still 12017 words. Let’s first import the dataset and create a data frame to store text inputs and output labels.
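The DataFrame code below assumes that train_documents, train_categories, test_documents, and test_categories are already in memory. As a minimal sketch (an assumption on my part, the original notebook may load the data differently), they could be pulled from the NLTK copy of the Reuters corpus like this:

import nltk
from nltk.corpus import reuters

# Fetch the corpus if it isn't available locally.
nltk.download("reuters")

# File ids are prefixed with "training/" or "test/", giving the standard split.
train_ids = [f for f in reuters.fileids() if f.startswith("training/")]
test_ids = [f for f in reuters.fileids() if f.startswith("test/")]

train_documents = [reuters.raw(f) for f in train_ids]
train_categories = [reuters.categories(f) for f in train_ids]
test_documents = [reuters.raw(f) for f in test_ids]
test_categories = [reuters.categories(f) for f in test_ids]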

import pandas as pd

trainData = {"content": train_documents, "labels": train_categories}
testData = {"content": test_documents, "labels": test_categories}
trainDf = pd.DataFrame(trainData, columns=["content", "labels"])
testDf = pd.DataFrame(testData, columns=["content", "labels"])
trainDf.head()
Dataframe containing Text input and output labels

Pre-processing Text Data

Before we start to build models, let’s do some initial processing of the text data. The pre-processing steps below are common to most NLP tasks that feed features into machine learning models:

import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

# make sure the required NLTK data packages are available
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("stopwords")

wordnet_lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
stopwords = set(stopwords.words('english'))

def tokenize_lemma_stopwords(text):
    text = text.replace("\n", " ")
    # split string into words (tokens)
    tokens = nltk.tokenize.word_tokenize(text.lower())
    # keep tokens that contain only alphabetic characters
    tokens = [t for t in tokens if t.isalpha()]
    # put words into their base form
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]
    tokens = [stemmer.stem(t) for t in tokens]
    # remove short words, they're probably not useful
    tokens = [t for t in tokens if len(t) > 2]
    # remove stopwords
    tokens = [t for t in tokens if t not in stopwords]
    cleanedText = " ".join(tokens)
    return cleanedText

def dataCleaning(df):
    data = df.copy()
    data["content"] = data["content"].apply(tokenize_lemma_stopwords)
    return data

cleanedTrainData = dataCleaning(trainDf)
cleanedTestData = dataCleaning(testDf)

TF-IDF Vectorization

In this article, we’ll use the TF-IDF feature extraction method. You can also experiment with other feature extractors such as BoW, GloVe, Word2Vec, or ELMo.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics

vectorizer = TfidfVectorizer()
vectorised_train_documents = vectorizer.fit_transform(cleanedTrainData["content"])
vectorised_test_documents = vectorizer.transform(cleanedTestData["content"])

Let’s look into the frequency distribution of words after pre-processing using Yellowbrick.

from yellowbrick.text import FreqDistVisualizer

features = vectorizer.get_feature_names()
visualizer = FreqDistVisualizer(features=features, orient='v')
visualizer.fit(vectorised_train_documents)
visualizer.show()

And here is the plot,

We can also visualize the corpus with Uniform Manifold Approximation and Projection (UMAP), which is similar to t-SNE but better at preserving some aspects of the global structure of the data than most t-SNE implementations.

from yellowbrick.text import UMAPVisualizer

umap = UMAPVisualizer(metric="cosine")
umap.fit(vectorised_train_documents)
umap.show()

Here we have not grouped the documents by class, but one thing we can see is that there is a strong correlation between the features; we’ll discuss this further when we start building models.

Vectorize Output labels

We need to transform the output labels from lists of class names into 90-dimensional binary vectors (1s and 0s). We’ll use sklearn’s MultiLabelBinarizer for that.

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
train_labels = mlb.fit_transform(train_categories)
test_labels = mlb.transform(test_categories)

Building ML Models

As we saw earlier, multi-label classification problems can be solved with either Problem Adaptation or Problem Transformation. There are also ensemble methods, but those are out of this blog’s scope.

  • First, we’ll try problem adaptation, where the classifiers inherently handle multi-label classification. Then we’ll look at various methods of problem transformation.

Multi-Label Classifiers:

Here is a list of Multi-Label Classifiers that are available in sklearn.

Multi-Label Classifiers in sklearn. Source: https://scikit-learn.org/stable/modules/multiclass.html

We’ll try a few of these inherently multi-label classifiers in this blog post.

K-nearest Neighbor:

  • The k-nearest neighbors algorithm (kNN) is a non-parametric technique used for classification.
  • Given a test document x, the kNN algorithm finds the k nearest neighbors of x among all the documents in the training set, and scores the candidate categories based on the classes of those k neighbors.
  • The similarity between x and each neighbor can be used as the score for that neighbor’s categories.
  • Multiple neighbors may belong to the same category; in that case, the sum of their scores becomes the similarity score of that category with respect to the test document x. After sorting the scores, the algorithm assigns x to the categories with the highest scores.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier

knnClf = KNeighborsClassifier()

knnClf.fit(vectorised_train_documents, train_labels)
knnPredictions = knnClf.predict(vectorised_test_documents)

Decision Trees:

  • Tree-based methods are simple and useful for interpretation.
  • A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes).
  • The paths from the root to leaf represent classification rules.
from sklearn.tree import DecisionTreeClassifier

dtClassifier = DecisionTreeClassifier()
dtClassifier.fit(vectorised_train_documents, train_labels)
dtPreds = dtClassifier.predict(vectorised_test_documents)

Random Forests:

  • Random Forest is an improvement over decision trees.
  • It’s a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
from sklearn.ensemble import RandomForestClassifier

rfClassifier = RandomForestClassifier(n_jobs=-1)
rfClassifier.fit(vectorised_train_documents, train_labels)
rfPreds = rfClassifier.predict(vectorised_test_documents)

Problem Transformation

Now let’s look at the other way of solving multi-label classification: Problem Transformation, where we adapt conventional machine learning classifiers (binary classifiers) for multi-label classification.

For this blog post, we’ll use a few classifiers: Gradient Boosting, Bagging, Naive Bayes, and Linear SVC. I encourage you to try other classifiers as well.

In order to compare classifiers, I have used the One-vs-Rest method (explained in a later part).

Bagging Classifier

  • The decision trees suffer from high variance. This means that if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different.
  • Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.
  • It’s closely related to the Random Forests we saw earlier; Random Forests improve over bagged trees by way of a small tweak that decorrelates the trees.
from sklearn.ensemble import BaggingClassifier

bagClassifier = OneVsRestClassifier(BaggingClassifier(n_jobs=-1))
bagClassifier.fit(vectorised_train_documents, train_labels)
bagPreds = bagClassifier.predict(vectorised_test_documents)

Gradient Boosting Classifier

  • We saw that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model. Notably, each tree is built on a bootstrap data set, independent of the other trees.
  • Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set.
from sklearn.ensemble import GradientBoostingClassifier

boostClassifier = OneVsRestClassifier(GradientBoostingClassifier())
boostClassifier.fit(vectorised_train_documents, train_labels)
boostPreds = boostClassifier.predict(vectorised_test_documents)

Naive Bayes Classifier

  • Naive Bayes Classifier (NBC) is a generative model that is widely used in Information Retrieval.
  • The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification).
  • The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
  • When there is high correlation between features, NB performs poorly. In an earlier section we visualized this correlation using UMAP, so we expect NB’s performance here won’t be great.
from sklearn.naive_bayes import MultinomialNB

nbClassifier = OneVsRestClassifier(MultinomialNB())
nbClassifier.fit(vectorised_train_documents, train_labels)
nbPreds = nbClassifier.predict(vectorised_test_documents)

Linear SVC

  • The Support Vector Machine (SVM) is a discriminative model that was developed in the computer science community in the 1990s and has grown in popularity since then.
  • SVMs have been shown to perform well in a variety of settings, and are often considered one of the best “out of the box” classifiers.
  • The support vector machine is a generalization of a simple and intuitive classifier called the maximal margin classifier.
  • Here we use Linear SVC, which uses the squared hinge loss for learning and helps the model discriminate better between classes.
from sklearn.svm import LinearSVC

svmClassifier = OneVsRestClassifier(LinearSVC(), n_jobs=-1)
svmClassifier.fit(vectorised_train_documents, train_labels)

svmPreds = svmClassifier.predict(vectorised_test_documents)

Based on the results (discussed in the evaluation part), I have chosen Linear SVC (which uses the squared hinge loss) to compare the various problem transformation techniques.

Binary Relevance

  • In Binary Relevance, an ensemble of single-label binary classifiers is trained independently on the original dataset, one predicting membership of each class.
  • Example: if there are q labels, the binary relevance method creates q new datasets from the original one, one per label, and trains a single-label classifier on each of them.
from sklearn.svm import LinearSVC
from skmultilearn.problem_transform import BinaryRelevance

BinaryRelSVC = BinaryRelevance(LinearSVC())
BinaryRelSVC.fit(vectorised_train_documents, train_labels)

BinaryRelSVCPreds = BinaryRelSVC.predict(vectorised_test_documents)

One-vs-Rest

  • This strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives.
  • It’s more or less the same as Binary Relevance, except that One-vs-Rest is usually framed for mutually exclusive (multi-class) labels.
from sklearn.svm import LinearSVC

svmClassifier = OneVsRestClassifier(LinearSVC(), n_jobs=-1)
svmClassifier.fit(vectorised_train_documents, train_labels)

svmPreds = svmClassifier.predict(vectorised_test_documents)

Label Power Set

  • This approach does take possible correlations between class labels into account: it maps each distinct combination of labels to a single class and trains a single multi-class classifier on those combinations.
  • As the number of classes increases, the number of distinct label combinations can grow exponentially.
from skmultilearn.problem_transform import LabelPowerset

powerSetSVC = LabelPowerset(LinearSVC())
powerSetSVC.fit(vectorised_train_documents, train_labels)

powerSetSVCPreds = powerSetSVC.predict(vectorised_test_documents)

Model Evaluation

Finally, it’s time to evaluate the models that we have built so far. Here we’ll discuss some of the averaging-based metrics for multi-label and multi-class problems.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, hamming_loss

ModelsPerformance = {}

def metricsReport(modelName, test_labels, predictions):
    macro_f1 = f1_score(test_labels, predictions, average='macro')
    micro_f1 = f1_score(test_labels, predictions, average='micro')
    hamLoss = hamming_loss(test_labels, predictions)

    print("------ " + modelName + " model metrics ------")
    print("Macro F1: {:.4f}  Micro F1: {:.4f}  Hamming Loss: {:.4f}".format(macro_f1, micro_f1, hamLoss))

    # keep micro F1 for the final comparison
    ModelsPerformance[modelName] = micro_f1
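With the helper defined, a short usage sketch (assuming the prediction variables from the earlier sections, e.g. knnPredictions and svmPreds, are still in memory) reports and stores the score for each model:

# Report metrics for every model trained above.
metricsReport("KNN", test_labels, knnPredictions)
metricsReport("Decision Tree", test_labels, dtPreds)
metricsReport("Random Forest", test_labels, rfPreds)
metricsReport("Bagging", test_labels, bagPreds)
metricsReport("Gradient Boosting", test_labels, boostPreds)
metricsReport("Multinomial NB", test_labels, nbPreds)
metricsReport("Linear SVC (OneVsRest)", test_labels, svmPreds)
metricsReport("Linear SVC (Binary Relevance)", test_labels, BinaryRelSVCPreds)
metricsReport("Linear SVC (Label Powerset)", test_labels, powerSetSVCPreds)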

Macro-Averaging

  • Macro-average will compute the metric independently for each class and then take the average (hence treating all classes equally).

Micro-Averaging

  • Micro-average will aggregate the contributions of all classes to compute the average metric.

In a multi-class classification setup, the micro-average is preferable if you suspect there might be class imbalance (i.e., you may have many more examples of one class than of the others).

Hamming Loss

  • Hamming-Loss is the fraction of labels that are incorrectly predicted, i.e., the fraction of the wrong labels to the total number of labels.
  • The lower the value, the better the model.
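To make the difference between these metrics concrete, here is a tiny toy example (the label matrices are made up purely for illustration) on a 3-label problem:

import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Hypothetical ground truth and predictions for 4 documents and 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

# Macro: F1 per label, then an unweighted mean -> treats rare labels equally (~0.778 here).
print(f1_score(y_true, y_pred, average='macro'))
# Micro: pool all label decisions, then one F1 -> dominated by frequent labels (0.8 here).
print(f1_score(y_true, y_pred, average='micro'))
# Hamming loss: fraction of individual label bits that are wrong (2 of 12 here).
print(hamming_loss(y_true, y_pred))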

Performance Comparison

We have seen various methods of building Multi-label classifiers and also various evaluation metrics for our problem. It’s time for us to combine them and evaluate our models based on predictions from the testing set.

Below is a comparison of the models based on micro-averaged F1 score.
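One simple way to produce such a comparison (a small sketch using the ModelsPerformance dictionary that metricsReport fills in) is to print the models sorted by their micro-averaged F1 score:

# Sort models by micro-averaged F1 score, best first.
for model, score in sorted(ModelsPerformance.items(), key=lambda kv: kv[1], reverse=True):
    print("{:<35s} micro F1: {:.4f}".format(model, score))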

Final thoughts on the results:

  • Based on the results, we can see that Linear SVC has the best performance, and since the labels are not mutually exclusive, Binary Relevance and One-vs-Rest give the same result.
  • As mentioned earlier, because of the high correlation between features, Multinomial Naive Bayes performs badly, and for the same reason Random Forest fails miserably.
  • We see improved performance when using the ensemble methods over decision trees, i.e., the Bagging and Boosting classifiers.
  • The best model (Linear SVC) gives a Hamming loss of 0.0034, which is also the lowest loss score among all the models.

Hope you enjoyed this blog post, Thanks for your time :)

You can find the whole implementation in Kaggle here. Please feel free to pull this code from my GitHub.

Check out related blogs,

Happy Learning !!!
