Multi-Label Text Classification for Beginners in Less Than Five Minutes
Before we dive deep into multi-label text classification, let's first understand what it is.
According to Wikipedia: "Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem, there is no constraint on how many of the classes the instance can be assigned to."
Let's understand this with an example. Suppose we have a dataframe (a table) with two columns:
- product_name
- product_type
If each product can be assigned to only one product type, then it is a multi-class text classification problem (not multi-label).
If each product name can be assigned to multiple product types, then it is a multi-label text classification problem (as the name suggests, you are assigning multiple labels to the input text).
In other words, multiple labels are assigned to a single input: one movie can be a romance as well as a comedy. These kinds of problems fall under multi-label text classification.
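To make the distinction concrete, here is a tiny sketch with made-up movie titles and genres (purely illustrative data, not from a real dataset):

```python
# Illustrative multi-label data: each movie (the input text) maps to
# one or more genres (the labels). Titles and genres are invented.
movies = {
    "A Funny Love Story": ["romance", "comedy"],
    "Space War III": ["action", "sci-fi"],
    "Quiet Tears": ["drama"],
}

# In multi-class classification each input would have exactly one label;
# here an input may carry any number of labels.
for title, genres in movies.items():
    print(title, "->", genres)
```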
Basic steps to follow —
1. Pre-processing of the input data and the output variable
- There are many ways to go about it: removing extra white space and extra punctuation, converting everything to lower case, removing stop words, etc.
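A minimal cleaning function covering those steps might look like this (the stop-word list here is a tiny made-up one for illustration; in practice you would use a library list such as NLTK's):

```python
import re
import string

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "and"}

def clean_text(text):
    """Lower-case, strip punctuation, collapse extra whitespace,
    and drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(clean_text("  The  Movie is   FUNNY, and touching!! "))
# -> "movie funny touching"
```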
2. Use MultiLabelBinarizer to transform the labels into multi-label format (there is a pretty good explanation in the scikit-learn docs: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html)
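The binarizer turns each set of tags into a fixed-length 0/1 vector with one column per tag. A short sketch with hypothetical genre labels:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical labels: each entry is the set of genres for one movie.
y = [["romance", "comedy"], ["action"], ["comedy", "action"]]

mlb = MultiLabelBinarizer()
y_bin = mlb.fit_transform(y)

print(mlb.classes_)  # columns are sorted alphabetically
print(y_bin)
# [[0 1 1]
#  [1 0 0]
#  [1 1 0]]
```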
3. Look at the distribution of your target variable: if the tags are imbalanced, assign weights to the tags.
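One simple way to derive such weights (one scheme among several) is inverse frequency over the binarized label matrix, so that rare tags get larger weights. The numbers below are made up for illustration:

```python
import numpy as np

# Binarized labels (one column per tag); tag frequencies are imbalanced.
y_bin = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
])

# Inverse-frequency weighting: n_samples / (n_tags * positives_per_tag).
counts = y_bin.sum(axis=0)  # positives per tag -> [3, 1, 1]
weights = y_bin.shape[0] / (len(counts) * counts)
print(weights)  # rarer tags (columns 2 and 3) get larger weights
```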
4. Tokenization and padding (pre-processing of the input data)
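In a Keras workflow you would use `Tokenizer` and `pad_sequences` for this; the dependency-free sketch below shows the same idea in plain Python, mapping words to integer ids and padding every sequence to one fixed length:

```python
# Minimal tokenization + padding sketch (Keras' Tokenizer and
# pad_sequences do this for you; shown in plain Python for clarity).
texts = ["funny love story", "funny war story"]

# Build a vocabulary: each word gets an integer id (0 is reserved
# for the padding token).
vocab = {}
for t in texts:
    for w in t.split():
        vocab.setdefault(w, len(vocab) + 1)

# Convert texts to id sequences, then pad each to a fixed length.
max_len = 4
sequences = [[vocab[w] for w in t.split()] for t in texts]
padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]
print(padded)
```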
5. Divide your data into training and testing sets, fit your model on the training set, and then evaluate it on the testing set.
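scikit-learn's `train_test_split` handles this in one call; here is a sketch on toy data (the 80/20 split ratio is just a common choice, not a requirement):

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with dummy labels, just to show the split.
X = list(range(10))
y = [0, 1] * 5

# 80/20 split; fit the model on the training set, evaluate on the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # -> 8 2
```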
6. It is good to have some accuracy plots (do plot them: training accuracy vs. testing accuracy).
7. Once your model is ready, save the tokenizer as a *.pickle file and the model as a *.h5 file; then, whenever you need to run predictions, load them back and you are ready :)
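The pickle round-trip looks like this (a plain dict stands in for the fitted Keras `Tokenizer` object here; the model itself would be saved separately with `model.save("model.h5")`):

```python
import pickle

# Stand-in for the fitted tokenizer (in Keras this would be a
# Tokenizer object; a dict keeps this sketch dependency-free).
tokenizer = {"funny": 1, "love": 2, "story": 3}

# Save the tokenizer to disk at training time...
with open("tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)

# ...and load it back later at prediction time.
with open("tokenizer.pickle", "rb") as f:
    loaded = pickle.load(f)

print(loaded == tokenizer)  # -> True
```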
Link to my GitHub repository — https://github.com/DeeptiAgl?tab=repositories