Multi-label Text Classification with Scikit-learn and Tensorflow

Genre classification of Netflix’s content based on its description

Rodolfo Saldanha
The Startup
9 min read · May 8, 2020


Content

  • Context
  • Exploratory Data Analysis (EDA)
  • Preprocessing
  • Multi-label models
  • Conclusion

Context

Multi-label classification is a generalization of single-label classification in which a single instance can belong to more than one class. According to the scikit-learn documentation, “This can be thought of as predicting properties of a sample that are not mutually exclusive.” There is no constraint on the number of classes an instance can be assigned to in a multi-label problem. A related setting is multi-class classification; however, the key difference is that multi-class classification assigns each instance to exactly one class, whereas multi-label classification assumes the properties are not mutually exclusive.

Many real-life problems can be framed as multi-label classification, such as topic categorization of articles. The problem tackled in this article is the genre categorization of Netflix’s movies and shows, which can belong to more than one category at the same time. The dataset used is Netflix Movies and TV Shows; for any doubts about the code, please refer to my Kaggle kernel.

Exploratory Data Analysis (EDA)

The first step of any machine learning problem is EDA, to gain a better understanding of the data. The dataset contains many fields that are not relevant to this problem, and all genres are clustered within the same column.

data.head()
Figure 1 — Dataset first rows

After some data wrangling to remove the useless columns and split the genres into separate columns, this is what the data looks like.

Figure 2 — Dataset after data wrangling

Now it is worth taking a closer look at the most frequent genres among the titles and checking for categories with too few data points. To do that, we use the seaborn and matplotlib packages.
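A quick way to get these counts is sketched below; it assumes that, after the wrangling step, every column following description is a 0/1 genre indicator (the exact column layout is an assumption, not the kernel’s literal code).

import matplotlib.pyplot as plt
import seaborn as sns

# sketch only: assumes every column after 'description' is a 0/1 genre indicator
genre_counts = data[data.columns[1:]].sum().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=genre_counts.index, y=genre_counts.values)
plt.xticks(rotation=90)
plt.ylabel('Number of titles')
plt.tight_layout()
plt.show()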

Figure 3 — Most common categories

Some categories do not have enough data points, which makes their predictions less precise. Therefore, I set an arbitrary threshold of 200 titles, and the categories below the threshold are grouped together into a new category named Others, for a total of 21 genres.
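One possible way to apply that threshold, again assuming one indicator column per genre (the names below are illustrative, not the exact code from the kernel):

# sketch only: merge genres with fewer than 200 titles into a single 'Others' column
threshold = 200
genre_counts = data[data.columns[1:]].sum()
rare_genres = genre_counts[genre_counts < threshold].index

data['Others'] = data[rare_genres].max(axis=1)  # 1 if the title has any of the rare genres
data = data.drop(columns=rare_genres)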

Figure 4 — Most common categories above the threshold

Counting the number of labels per title and looking at the length of the descriptions can be helpful as well.
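Both figures can be reproduced with a few lines; the sketch below is one possibility, under the same indicator-column assumption (the second plot counts words per description, which is one way to read Figure 6).

import matplotlib.pyplot as plt

# sketch only: number of labels per title (Figure 5)
labels_per_title = data[data.columns[1:]].sum(axis=1)
labels_per_title.value_counts().sort_index().plot(kind='bar')
plt.show()

# sketch only: distribution of description lengths in words (Figure 6)
description_length = data['description'].str.split().str.len()
description_length.plot(kind='hist', bins=30)
plt.show()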

Figure 5 — Number of categories per title
Figure 6 — Distribution of the word frequency in the description attribute

Most titles belong to three genres, while the majority of the descriptions have about 150 words. The word cloud below shows the most frequent words in the description field.
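The cloud itself can be generated with the wordcloud package; a minimal sketch:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# sketch only: word cloud over all descriptions concatenated
text = ' '.join(data['description'])
cloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.figure(figsize=(12, 6))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()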

Figure 7 — Word cloud of the description field

Preprocessing

After having explored the dataset, the data preparation for modeling may begin. Good text normalization is crucial to achieving good results in NLP, and there is a variety of packages for doing that; the ones used here are nltk and the built-in re (regex) module.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

We start by converting the text to lower case, replacing word contractions with their full form, and removing punctuation and numbers. It is also important to remove stop-words, because they do not add any value to the model, and to stem the words so that only their roots are kept.

def decontract(sentence):
    # expand common English contractions
    sentence = re.sub(r"n\'t", " not", sentence)
    sentence = re.sub(r"\'re", " are", sentence)
    sentence = re.sub(r"\'s", " is", sentence)
    sentence = re.sub(r"\'d", " would", sentence)
    sentence = re.sub(r"\'ll", " will", sentence)
    sentence = re.sub(r"\'t", " not", sentence)
    sentence = re.sub(r"\'ve", " have", sentence)
    sentence = re.sub(r"\'m", " am", sentence)
    return sentence

def removePunctuation(sentence):
    # drop some punctuation marks and replace others with spaces
    sentence = re.sub(r'[?|!|\'|"|#]', r'', sentence)
    sentence = re.sub(r'[.|,|)|(|\|/]', r' ', sentence)
    sentence = sentence.strip()
    sentence = sentence.replace("\n", " ")
    return sentence

def removeNumber(sentence):
    # keep only alphabetic characters in each word
    alpha_sent = ""
    for word in sentence.split():
        alpha_word = re.sub('[^a-z A-Z]+', '', word)
        alpha_sent += alpha_word
        alpha_sent += " "
    alpha_sent = alpha_sent.strip()
    return alpha_sent

def removeStopWords(sentence):
    # drop English stop-words such as "the", "a", "of"
    stop_words = set(stopwords.words('english'))
    return " ".join(word for word in sentence.split() if word not in stop_words)

def stemming(sentence):
    # reduce each word to its stem with the Snowball stemmer
    stemmer = SnowballStemmer("english")
    stemmedSentence = ""
    for word in sentence.split():
        stem = stemmer.stem(word)
        stemmedSentence += stem
        stemmedSentence += " "
    stemmedSentence = stemmedSentence.strip()
    return stemmedSentence

Stemmers remove morphological affixes from words, leaving only the word stem.

There is a variety of stemmers, and each one acts slightly differently from the others. Two of those available in the nltk package are the Porter and the Snowball stemmers, and the one used here is the Snowball stemmer.
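With the helpers defined, the whole pipeline can be applied to the description column; a minimal sketch, following the order of steps described above:

# sketch only: chain the cleaning steps over the description column
nltk.download('stopwords')  # needed once for the stop-word list
data['description'] = data['description'].str.lower()
data['description'] = data['description'].apply(decontract)
data['description'] = data['description'].apply(removePunctuation)
data['description'] = data['description'].apply(removeNumber)
data['description'] = data['description'].apply(removeStopWords)
data['description'] = data['description'].apply(stemming)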

After splitting the data into training and test sets, we can start modeling.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data['description'], data[data.columns[1:]],
    test_size=0.3, random_state=seed, shuffle=True)

Multi-label models

There are multiple ways to transform a multi-label classification problem; I chose two approaches:

  • Binary classification transformation — This strategy divides the problem into several independent binary classification tasks, one per label, much like the one-vs-rest method. Because each classifier only sees its own label, the algorithm ignores any correlations between the labels.
  • Multi-class classification transformation — Every unique combination of labels is treated as a single class of one big multi-class problem, an approach known as the label powerset. For instance, with targets A, B, and C and 0 or 1 as outputs, A B C -> [0 1 0] becomes one class of its own, while the binary classification transformation treats it as three separate problems, A -> [0], B -> [1], C -> [0] (see the sketch below).
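For intuition, the powerset idea is available directly in the skmultilearn package as LabelPowerset; the sketch below is only illustrative (it is not one of the models benchmarked in this article) and assumes the tf-idf features built further down.

from skmultilearn.problem_transform import LabelPowerset
from sklearn.linear_model import LogisticRegression

# sketch only: every unique label combination becomes one class of a multi-class problem
classifier = LabelPowerset(LogisticRegression())
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)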

The evaluation metric used to measure the performance of the models is AUC, which stands for “Area Under the ROC Curve.” A ROC curve is a graph showing the performance of a classification model at all classification thresholds.

Figure 8 — AUC (Area Under the Curve)

This curve plots two parameters:

  • True Positive Rate

TPR = TP/(TP+FN)

  • False Positive Rate

FPR = FP/(FP + TN)

TP = True Positive; FP = False Positive; TN = True Negative; FN = False Negative

Each model’s performance is assessed by running it with 5 different random seeds and averaging the scores, to mitigate any bias from a particular split.
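In practice this means repeating the split shown above, the vectorization and the model fitting for each seed, and averaging the resulting scores; roughly like the sketch below, where the actual vectorization and models are the ones described in the following sections.

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import numpy as np

# sketch only: average the AUC over several random splits
scores = []
for seed in [1, 2, 3, 4, 5]:  # illustrative seed values
    X_train, X_test, y_train, y_test = train_test_split(
        data['description'], data[data.columns[1:]],
        test_size=0.3, random_state=seed, shuffle=True)
    # vectorize, fit and predict as shown below, then:
    # scores.append(roc_auc_score(y_test, predictions))
# np.mean(scores) is the number reported for each model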

Scikit-learn

First of all, it is necessary to vectorize the words before training the model, and here we are going to use the tf-idf vectorizer.

Tf-idf stands for term frequency-inverse document frequency, a weighting scheme often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document within a collection or corpus.
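In its simplest form, for a term t and a document d in a corpus of N documents:

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is how often t occurs in d and df(t) is the number of documents containing t; scikit-learn applies smoothing and normalization on top of this basic formula.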

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')
vectorizer.fit(X_train)
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

1. OneVsRestClassifier

The estimator used is RandomForestClassifier, and since the labels are analyzed separately, the reported result is the average of the per-category AUC scores.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
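The full training and scoring code is not reproduced in the article; a minimal sketch of how Figure 9 and the average score could be produced, assuming the tf-idf features from the previous step:

from sklearn.metrics import roc_auc_score
import scipy.sparse as sp

# sketch only: one-vs-rest random forest, scored per category and then averaged
classifier = OneVsRestClassifier(RandomForestClassifier())
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
if sp.issparse(predictions):  # predict may return a sparse indicator matrix
    predictions = predictions.toarray()

category_auc = roc_auc_score(y_test, predictions, average=None)  # one score per genre (Figure 9)
print('AUC score: {}'.format(category_auc.mean()))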
Figure 9 — AUC score per category

AUC score: 0.517097

2. BinaryRelevance

This method is very similar to one-vs-rest, but not identical. If there are x labels, the binary relevance method creates x new datasets, one for each label, and trains a single-label classifier on each of them. Each classifier answers yes or no for its label, hence “binary relevance.” This is a simple approach, but it does not work well when there are dependencies between the labels.

The estimator used is GaussianNB (Gaussian Naive Bayes).

from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score
classifier = BinaryRelevance(GaussianNB())
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

print('AUC score: {}'.format(roc_auc_score(y_test,predictions.toarray())))

AUC score: 0.544241

3. ClassifierChain

This approach keeps much of the computational efficiency of the Binary Relevance method while still taking label dependencies into account: each classifier in the chain is trained on the input features plus the predictions of the classifiers before it. Modeling the dependencies does make it somewhat more expensive than plain Binary Relevance.

The estimator used is LogisticRegression.

from skmultilearn.problem_transform import ClassifierChain
from sklearn.linear_model import LogisticRegression
classifier = ClassifierChain(LogisticRegression())
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print('AUC score: {}'.format(roc_auc_score(y_test,predictions.toarray())))

AUC score: 0.519823

4. MultiOutputClassifier

This strategy consists of fitting one classifier per target (A B C -> [0 1 0]). It is a simple way of extending classifiers that do not natively support multi-target classification.

The estimator used is KNeighborsClassifier.

from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
clf = MultiOutputClassifier(KNeighborsClassifier()).fit(X_train, y_train)
predictions = clf.predict(X_test)
print('AUC score: {}'.format(roc_auc_score(y_test,predictions)))

AUC score: 0.564452

Tensorflow

Text classification has benefited from the trend toward deep learning architectures due to their potential to reach high accuracy. There are different libraries available for deep learning; here we chose Tensorflow because, alongside PyTorch, it has become one of the most popular libraries in the field.

Unlike sparse representations such as tf-idf, word embeddings are low dimensional: they represent tokens as dense floating-point vectors and thus pack more information into fewer dimensions. This technique normally gives a performance boost in NLP tasks such as syntactic parsing and sentiment analysis. It is possible either to train the embedding layer from scratch or to use a pre-trained one through transfer learning, such as word2vec or GloVe.

For the following models, the text is vectorized with texts_to_sequences, which maps each word to an integer index, and pad_sequences ensures all the resulting sequences have the same length.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=5000, lower=True)
tokenizer.fit_on_texts(data['description'])
sequences = tokenizer.texts_to_sequences(data['description'])
x = pad_sequences(sequences, maxlen=200)

Class weights were calculated to address the imbalance problem in the categories.

most_common_cat['class_weight'] = len(most_common_cat) / most_common_cat['count']

class_weight = {}
for index, label in enumerate(categories):
    # look up the weight computed for this genre
    class_weight[index] = most_common_cat[most_common_cat['cat'] == label]['class_weight'].values[0]

most_common_cat.head()
Figure 10 — Class weights

1. DNN with WordEmbedding

We started with a simple model which consists only of an embedding layer, a global max-pooling layer to reduce the dimensionality and help prevent overfitting, and one dense layer with a sigmoid activation to produce probabilities for each of the categories that we want to predict.

from keras.models import Sequential
from keras.layers import Dense, Embedding, GlobalMaxPool1D
from keras.optimizers import Adam
import tensorflow as tf
# max_words matches the tokenizer's num_words (5000), maxlen the padding length (200), num_classes the 21 genres
model = Sequential()
model.add(Embedding(max_words, 20, input_length=maxlen))
model.add(GlobalMaxPool1D())
model.add(Dense(num_classes, activation='sigmoid'))
model.compile(optimizer=Adam(0.015), loss='binary_crossentropy', metrics=[tf.keras.metrics.AUC()])
Figure 11 — DNN architecture
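The training call itself is not shown above; it might look roughly like the sketch below, assuming the padded sequences and the genre indicator matrix have been split into x_train and y_train (those names, as well as the number of epochs and the batch size, are illustrative).

# sketch only: train with the class weights computed earlier
history = model.fit(x_train, y_train,
                    epochs=5,
                    batch_size=32,
                    class_weight=class_weight,
                    validation_split=0.1)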

AUC score: 0.890245

2. CNN with WordEmbedding

Convolutional Neural Networks recognize local patterns in a sequence by processing multiple words at the same time, and 1D convolutional networks are suitable for text processing tasks. In this case, the convolutional layer uses a window size of 3 and learns word sequences that can later be recognized in any position of a text.

from keras.layers import Dense, Activation, Embedding, Flatten, GlobalMaxPool1D, Dropout, Conv1D

filter_length = 300

model = Sequential()
model.add(Embedding(max_words, 20, input_length=maxlen))
model.add(Conv1D(filter_length, 3, padding='valid', activation='relu', strides=1))
model.add(GlobalMaxPool1D())
model.add(Dense(num_classes))
model.add(Activation('sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[tf.keras.metrics.AUC()])
Figure 12 — CNN architecture

AUC score: 0.886286

3. LSTM with GloVe WordEmbedding

In this model, we use GloVe word embeddings to convert text inputs to their numeric counterparts, which is a different approach because the embedding layer is pre-trained rather than learned from scratch. The model has one input layer, one embedding layer, one LSTM layer with 128 units, and one output layer with 21 neurons (the number of targets).
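The embedding_matrix used below has to be built from the GloVe vectors first; a minimal sketch, assuming the 100-dimensional glove.6B.100d.txt file (the path is illustrative) and the tokenizer fitted earlier:

import numpy as np

# sketch only: load GloVe vectors and build the matrix for the Embedding layer
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

embedding_matrix = np.zeros((max_words, 100))
for word, index in tokenizer.word_index.items():
    if index < max_words and word in embeddings_index:
        embedding_matrix[index] = embeddings_index[word]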

from keras.layers import Input, Flatten, LSTM
from keras.models import Model
deep_inputs = Input(shape=(maxlen,))
embedding_layer = Embedding(max_words, 100, weights=[embedding_matrix], trainable=False)(deep_inputs)
LSTM_Layer_1 = LSTM(128)(embedding_layer)
dense_layer_1 = Dense(21, activation='sigmoid')(LSTM_Layer_1)
model = Model(inputs=deep_inputs, outputs=dense_layer_1)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[tf.keras.metrics.AUC()])
Figure 13 — LSTM architecture

AUC score: 0.887574

Conclusion

In this article, we went through all the steps of a multi-label classification problem, starting with the initial data analysis, passing through the preprocessing step, and ending with modeling the data and benchmarking the classification results.

Figure 14 — AUC scores over five different seeds
Figure 15 — Mean of the AUC scores

In conclusion, based on the benchmark, the simple deep neural network showed the best AUC score, but the difference among the deep learning models is minimal, with the CNN and the LSTM reaching similar performances. On the other hand, the algorithms available in the scikit-learn package produced considerably lower scores and turned out to be less suitable for this problem.
