Multi-Label Text Classification for Beginners in less than Five (5) minutes

Deepti Goyal · Published in Analytics Vidhya · 3 min read · Feb 25, 2020

Before we dive deep into multi-label text classification, let’s first understand what it is.

According to Wikipedia, multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem, there is no constraint on how many of the classes an instance can be assigned to.

Let’s understand this with an example.

Suppose we have a dataframe (or a table) with two columns:

  1. product_name
  2. product_type

If each product can be assigned to only one product type, then this is multi-class text classification (not multi-label).

Example: multi-class text classification (each product_name maps to exactly one product_type).

If each product name can be assigned to multiple product types, then it falls under multi-label text classification (as the name suggests, you are assigning multiple labels to the input text).

Example: multi-label text classification (one product_name can carry several product_types).

Here you can see that multiple labels are assigned to a single item. One movie can be romantic as well as comedy. These kinds of problems come under multi-label text classification; a small sketch of what such data looks like is below.
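A minimal, purely illustrative example of the multi-label format (the movie titles and genre lists are made up):

```python
import pandas as pd

# Each row gets a list of labels rather than a single label
df = pd.DataFrame({
    "movie_name": ["Movie A", "Movie B", "Movie C"],
    "genres": [["romance", "comedy"], ["comedy"], ["romance", "drama"]],
})
print(df)
```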

Basic steps to follow —

Pre-processing of the input data and the output variable

1. There are many ways to go about it: removing extra white spaces, stripping extra punctuation, converting everything to lower case, removing stop words, etc. An example of removing extra spaces is sketched below.
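A minimal sketch of such a cleaning function (the function name and the regex-based approach are illustrative, not the exact original code):

```python
import re

def remove_extra_spaces(text: str) -> str:
    # Collapse runs of whitespace into a single space and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(remove_extra_spaces("  Multi-label   text   classification "))
# -> "Multi-label text classification"
```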

2. Use MultiLabelBinarizer to transform the labels into multi-label (binary indicator) format (a pretty good explanation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html). A short sketch is below.
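A minimal sketch of MultiLabelBinarizer turning lists of tags into a binary matrix (the tags here are illustrative):

```python
from sklearn.preprocessing import MultiLabelBinarizer

labels = [["romance", "comedy"], ["comedy"], ["romance", "drama"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

print(mlb.classes_)  # ['comedy' 'drama' 'romance']
print(y)             # one row per sample, one 0/1 column per label
```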

3. Look at how your target variable is distributed. If the tags are not balanced, assign weights to the tags so that rare labels still influence training; a sketch of one way to do this is below.
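One common way is to weight each tag inversely to its frequency (this is my own sketch of the idea, not necessarily the author’s exact approach):

```python
import numpy as np

# y is the binary indicator matrix from MultiLabelBinarizer (samples x tags)
y = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1]])

tag_counts = y.sum(axis=0)  # how often each tag occurs
class_weight = {i: len(y) / (len(tag_counts) * count)
                for i, count in enumerate(tag_counts)}
print(class_weight)  # rarer tags get larger weights
```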

4. Tokenization and padding (pre-processing of the input text), as sketched below.
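A minimal sketch using the Keras Tokenizer and pad_sequences (the vocabulary size, OOV token, and sequence length are illustrative choices you would tune):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["a romantic comedy", "a dark comedy", "a romantic drama"]

tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

sequences = tokenizer.texts_to_sequences(texts)               # words -> integer ids
padded = pad_sequences(sequences, maxlen=20, padding="post")  # make all inputs the same length
print(padded.shape)
```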

5. Divide your data into a training and a testing set, fit your model on the training set, and then evaluate it on the testing set (see the sketch below).
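A minimal sketch that continues from the previous snippets (padded and y). The tiny model architecture is illustrative; the important part for multi-label problems is one sigmoid output per tag with binary_crossentropy loss:

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

X_train, X_test, y_train, y_test = train_test_split(
    padded, y, test_size=0.2, random_state=42)

model = Sequential([
    Embedding(input_dim=10000, output_dim=32),
    GlobalAveragePooling1D(),
    Dense(y.shape[1], activation="sigmoid"),  # one independent probability per tag
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test), epochs=10)
model.evaluate(X_test, y_test)
```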

6. It is good to plot some accuracy curves (do plot them: training accuracy vs. validation accuracy), as sketched below.
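A minimal plotting sketch using the history object returned by model.fit (note: in older Keras versions the keys are "acc" and "val_acc" instead of "accuracy" and "val_accuracy"):

```python
import matplotlib.pyplot as plt

plt.plot(history.history["accuracy"], label="Training accuracy")
plt.plot(history.history["val_accuracy"], label="Validation accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.title("Training vs Validation")
plt.show()
```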

7. Once your model is ready, you can save the tokenizer as a *.pickle file and the model as a *.h5 file. Then, whenever you need to run predictions, you can load them back and you are ready :) A sketch of saving and loading is below.
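A minimal save/load sketch (the file names are illustrative):

```python
import pickle
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Save the fitted tokenizer and the trained model
with open("tokenizer.pickle", "wb") as handle:
    pickle.dump(tokenizer, handle)
model.save("model.h5")

# Later: load both back and run predictions on new text
with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)
model = load_model("model.h5")

new_texts = ["a funny love story"]
new_padded = pad_sequences(tokenizer.texts_to_sequences(new_texts),
                           maxlen=20, padding="post")
print(model.predict(new_padded))  # one probability per tag
```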

Link to my GitHub repository — https://github.com/DeeptiAgl?tab=repositories
