In this article, we will take a very hands-on approach to understand the multi-label classification in Machine Learning. We will be building a movie genre prediction model using NLP.

What is Natural Language Processing (NLP)?

Natural language processing (NLP) is a field concerned about how to program computers to process and analyze large amounts of natural language data. Using NLP we can have a deep understanding of data and we can also take actions based on the outcomes of data. It has a wide range of applications like Sentiment analysis, Text Classification, and Categorization, Automated Summarization etc.

In this project, using a Kaggle problem as an example, we explore different aspects of multi-label classification.

Overview of the project:

  • What is a multi-label classification?
  • Problem definition & evaluation metrics.
  • Exploratory data analysis (EDA).
  • Data pre-processing.
  • Featurizing/Vectorizing the data
  • Multi-label classification techniques.

What is Multi-label Classification?

In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance.

In multi-label classification, the training set is composed of instances each associated with a set of labels, and the task is to predict the label sets of unseen instances through analyzing training instances with known label sets. Difference between multi-class classification & multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas for multi-label problems each label represents a different classification task, but the tasks are somehow related.

Problem definition & Evaluation metrics.

Problem Definition: The dataset we chose is MPST: Movie Plot Synopses with Tags aimed to predict the genre of the movies based on their plot summaries. Automatic tagging systems can help recommendation engines to improve the retrieval of similar movies as well as help viewers to know what to expect from a movie in advance.

Kaggle link to this problem can be found here. The dataset contains all the IMDB id, title, plot synopsis, tags for the movies. There are 14,828 movies data in total.

Evaluation Metrics:

Micro-Averaged F1-Score (Mean F Score): The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, this is the weighted average of the F1 score of each class.

‘Micro f1 score’: Calculate metrics globally by counting the total true positives, false negatives and false positives. This is a better metric when we have class imbalance.

‘Macro f1 score’: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

https://www.kaggle.com/wiki/MeanFScore
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

Hamming loss: The Hamming loss is the fraction of labels that are incorrectly predicted.
https://www.kaggle.com/wiki/HammingLoss

Exploratory data analysis (EDA)

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

Let's get started but first, we will import some libraries and load our dataset.

We can check for the number of duplicate values, null values present in our dataset, number of times a tag appeared, number of tags per question.

Data Cleaning and Pre-Processing

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. It is the most crucial step to be performed before moving on to modeling. If not done, then perhaps our model will be too dumb to predict or our loss will be high.

Text pre-processing steps:-

  1. Removing the html tags, Remove any punctuations or limited set of special characters like , or . or # etc
  2. Check if the word is made up of english letters and is not alpha-numeric and Convert the words to lowercase
  3. Remove Stopwords present in the text using the default set of stop-words that can be downloaded from NLTK library. Stop words are basically a set of commonly used words in any language, not just English. The reason why stop words are critical to many applications is that, if we remove the words that are very commonly used in a given language, we can focus on the important words instead.
  4. Stemming the data. There exist different kinds of stemming which basically transform words with roughly the same semantics to one standard form. For example, for amusing, amusement, and amused, the stem would be ‘amus’. So, it would be better to keep a single word ‘sing’ and remove the remaining. There are three different types of stemmer available in python namely Porter stemmer, Snowball Stemmer, Lancaster Stemmer.

PorterStemmer is least strict and Lancaster is strictest but really fast. Snowball stemmer is a good trade-off between speed and strictness. So, we can use Snowball Stemmer for stemming our text data.

Feature Engineering Strategies

  1. Bag of Words
  2. TF-IDF (Term Frequency * Inverse Document Frequency)
  3. Word2Vec
  4. Average Word2Vec
  5. TF-IDF weighted word2Vec

First, we will split our data into a train set and a test set and convert all our tags into vector form.

We will be focusing on the TF-IDF technique in this article to convert our train and test dataset into numerical vectors.

  • TF-IDF picks the most frequently occurring terms (words with high term frequency or tf). However, the most frequent word is a less useful metric since some words like ‘this’, ‘a’ occur very frequently across all documents.
  • Hence, we also want a measure of how unique a word is i.e. how infrequently the word occurs across all documents (inverse document frequency or idf).
  • So, the product of tf & idf (TF-IDF) of a word gives a product of how frequent this word is in the document multiplied by how unique the word is w.r.t. the entire corpus of documents.
  • Words in the document with a high tf-idf score occur frequently in the document and provide the most information about that specific document.
  • We can use N-grams to further improve our text featurizing and thus improving our model accuracy

Unigram means a single word. While converting text to vectors using TfidfVectorizer, by default we use the uni-gram method. It converts single word to a feature in the vector.

We can also use bi-grams (2 words )or tri-grams (3-words), to deal with words like: not a morning’ (tri-gram), ‘after departing’ (bi-gram).

We can also use character n_grams instead of word n_grams. It gives more depth to the vectors.

Machine Learning Models

Most traditional learning algorithms are developed for single-label classification problems. Therefore a lot of approaches in the literature transform the multi-label problem into multiple single-label problems, so that the existing single-label algorithms can be used.

1. OneVsRest

  • Traditional two-class and multi-class problems can both be cast into multi-label ones by restricting each instance to have only one label. On the other hand, the generality of multi-label problems inevitably makes it more difficult to learn. An intuitive approach to solving multi-label problem is to decompose it into multiple independent binary classification problems (one per category).
  • In a “one-to-rest” strategy, one could build multiple independent classifiers and, for an unseen instance, choose the class for which the confidence is maximized.

2. Binary Relevance

  • In this case an ensemble of single-label binary classifiers is trained, one for each class. Each classifier predicts either the membership or the non-membership of one class. The union of all classes that were predicted is taken as the multi-label output. This approach is popular because it is easy to implement, however, it also ignores the possible correlations between class labels.
  • In other words, if there’s q labels, the binary relevance method create qnew data sets from the images, one for each label and train single-label classifiers on each new data set. One classifier may answer yes/no to the question “does it contain trees?”, thus the “binary” in “binary relevance”. This is a simple approach but does not work well when there’s dependencies between the labels.

Logistic Regression and Linear SVM models with hyperparameter tuning in OnevsRest approach will be tested on our data.

We see Logistic Regression gives F1 score of 0.403 and LSVM gives the F1 score of 0.395, so we need to try some more advanced techniques to improve our F1 score to greater than 0.41

TOPIC MODELING

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of topic model and is used to classify text in a document to a particular topic. It builds a topic per document model and words per topic model, modeled as Dirichlet distributions.

Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.

LDA is a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix.LDA converts this Document-Term Matrix into two lower-dimensional matrices — M1, a document-topics matrix and M2, a topic — terms matrix.

For every topic, two probabilities p1 and p2 are calculated. P1 — p(topic t / document d) = the proportion of words in document d that are currently assigned to topic t. P2 — p(word w / topic t) = the proportion of assignments to topic t over all documents that come from this word w.

The current topic — word assignment is updated with a new topic with the probability, product of p1 and p2. After a number of iterations, a steady state is achieved where the document topic and topic term distributions are fairly good. This is the convergence point of LDA. Below is a short implementation of LDA.

After finding the topics we will be concatenating the TFIDF features and topics to get our train and test datasets and will apply Logistic Regression and Linear SVM models on them.

Here, as can be seen, both the LogReg nad Lr-SVM models give F1 score of greater than 0.41 which is our desired result.

Conclusion:

  1. Logistic Regression and Linear SVM techniques with grid search and one vs rest schema on Character n-grams and lexical features like TFIDF using weighted vectors like unigram, bigram, n-grams give decent results but they can be improved further.
  2. Using LDA (Topic modeling) along with logistic regression and linear SVM the performance is improved to greater than 0.41
  3. We can also employ Deep Learning techniques like LSTM, CNN, RNN to further improve our performance.

Thanks for reading….

References used for this problem are:

  1. https://www.appliedaicourse.com/ for all theory and deep concepts of mathematics, ML and NLP.
  2. https://www.aclweb.org/anthology/L18-1274
  3. https://www.kaggle.com/konohayui/topic-modeling-on-quora-insincere-questions/log
  4. Wikipedia, Google Search

--

--