Multi-Class Text Classification with Extremely Small Data Set (Deep Learning!)

Ruixuan Li
8 min read · Apr 29, 2020



Surveys with open-ended questions are widely used for customer feedback, reviews, health surveys, and many other purposes. To analyze responses to open-ended questions, they must first be converted from qualitative data to quantitative data through coding.

Survey runners usually train human coders, provide them with a coding guide book, and have the coders manually assign responses to different categories. However, this procedure has a few problems. First, coding open-ended responses is time-consuming, and hiring and training coders to do it manually can be costly. Second, coders may find it difficult to follow the guide book when unintended cases appear. Third, inconsistency across coders may introduce errors and bias that are harmful to later analysis.

A multi-class text classifier can help automate this process and deliver consistent coding results. A large data set usually benefits the performance of a machine learning model; however, creating a large-scale human-coded data set is also costly. Here, we will show that, even with an extremely small human-labeled data set, we can still get somewhere on multi-class text classification by utilizing a pre-trained language model.

Data

The data used in this post came from a health survey. In this survey, one general health question was asked, in the format ‘If we ask you to evaluate your health status, what are the main factors you would consider in your evaluation?’ One example response is ‘I generally have really good health and no major health problems.’ Human coders then assign a tone label (Positive, Negative, or Neutral) and a health code label (General health statement, etc.) to each response. Since a response may contain multiple attributes, I split the responses with more than one code so that each record has only one tone label and one health code label. The final data set has 3,750 human-labeled records. (Which is small!)

Here are the distributions of the categories in our small data set.

Tone Distribution
Code Distribution (People seldom think about family history when thinking about their health condition. Hmmm…)

Note that the code distribution is imbalanced; however, in this study we are not specifically interested in the small categories, so we will go with it!

Traditional Classifier

Before getting into the exciting deep learning method, let’s try to solve this problem with a few lines of code. It never hurts to try something simple first!

First, let’s take a look at how a Naive Bayes classifier handles this problem. We will use the NaiveBayesClassifier from the nltk.classify package. Before everything else, let’s import the needed packages.

For text representation, the nltk package provides a handy function called word_tokenize(). During preprocessing, we sometimes don’t want to use only single words as features; we would also like to try combinations of tokens to capture the relationships between neighboring words. For this purpose, ngrams() is used in the text-representation step.
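The original post embeds the code; as a rough sketch, the feature-extraction step described above might look like the following, where the helper name text_features and the choice of unigrams plus bigrams are illustrative assumptions rather than the project’s exact settings.

```python
# Sketch of bag-of-n-grams feature extraction with nltk.
# nltk.download('punkt') may be needed the first time word_tokenize is used.
from nltk import word_tokenize
from nltk.util import ngrams

def text_features(text, max_n=2):
    """Build a feature dict from unigrams and bigrams of the tokenized text."""
    tokens = word_tokenize(text.lower())
    features = {}
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            features[" ".join(gram)] = True
    return features
```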

After we convert the raw text into a set of features, we can run the Naive Bayes classifier.
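Here is a minimal sketch of the training and evaluation step, reusing the text_features helper from the snippet above; the records variable (a list of (text, label) pairs) and the 80/20 split are assumptions for illustration, not the exact code from the repository.

```python
# Sketch of training the Naive Bayes classifier on labeled feature sets.
import nltk
from nltk.classify import NaiveBayesClassifier

# records is assumed to be a list of (response_text, label) pairs.
featuresets = [(text_features(text), label) for text, label in records]
split = int(0.8 * len(featuresets))
train_set, test_set = featuresets[:split], featuresets[split:]

classifier = NaiveBayesClassifier.train(train_set)
print("Accuracy:", nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)
```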

To see the full code, please go to https://github.com/LEERAYX/Multi-label-text-classification/blob/master/naive_bayes_class.py

Here is the classification accuracy of the Naive Bayes classifier.

Hmmm… Naive Bayes gives okay results on the tone classifier but terrible results on the code classifier. This is probably because the Naive Bayes classifier struggles with classification tasks that have many categories (9 in this case).

Now, let’s try a Support Vector Machine (SVM) classifier! This time, we will use the LinearSVC class from sklearn (a widely used Python machine learning package). We import the needed packages first.

An SVM finds the boundary that best separates the vectors belonging to a given category from those that do not. Term frequency–inverse document frequency (TF-IDF) was used to create text features for each record. The text was first tokenized, converted to lower case, and stripped of stop words. Then, each record was converted to a vector using the TF-IDF method. Note that the TF-IDF vectorization only included tokens present in at least 4 documents, used ‘l2’ normalization, and considered both unigrams and bigrams.
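A minimal sketch of this setup with scikit-learn, using the settings described above (tokens present in at least 4 documents, ‘l2’ normalization, unigrams and bigrams); the train/test split and variable names are illustrative assumptions, not the exact code from the repository.

```python
# Sketch of a TF-IDF + LinearSVC pipeline for the survey responses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# texts and labels are assumed to be parallel lists of responses and codes.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

vectorizer = TfidfVectorizer(
    lowercase=True,          # convert text to lower case
    stop_words="english",    # drop English stop words
    min_df=4,                # keep tokens present in at least 4 documents
    norm="l2",               # l2 normalization
    ngram_range=(1, 2),      # unigrams and bigrams
)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

svm = LinearSVC()
svm.fit(X_train_tfidf, y_train)
print("Accuracy:", accuracy_score(y_test, svm.predict(X_test_tfidf)))
```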

For the detailed code, please see https://github.com/LEERAYX/Multi-label-text-classification/blob/master/Linear_SVM_class.py

Let’s see how SVM classifier performs.

Wow! The SVM classifier actually works well on both classification tasks! SVMs usually do well in high-dimensional spaces and in cases where the number of features exceeds the number of records. However, they are not well suited to large data sets and are sensitive to noisy data. An SVM will typically not hold up well if we want to generalize the model to a larger data set and more incoming responses.

But can we go further? The answer is yes.

Deep Learning Classifier with Pre-Trained Model

A language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers, was introduced in 2019 by researchers at Google AI Language. It can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks, such as question answering and language inference. For more details, please check the BERT research paper.

The research shows that this bidirectionally pre-trained language model develops a deeper sense of language context. One main drawback of the Naive Bayes and SVM classifiers is that they cannot capture sentence context or the semantic relationships between words. Another problem is that tokens that never appeared in the training set are all treated as the same out-of-vocabulary token. This problem is more significant when we have a small training set that cannot possibly cover the large vocabulary that might appear in these responses. A pre-trained language model can alleviate this problem to some extent and deliver better performance.

To train a deep learning classifier built on top of a pre-trained language model, we first fine-tune the language model so it recognizes the language used in our data set. We used the transformers package for this part. The transformers package is a powerful NLP tool that provides incredible resources. For more details about this package, please go to the transformers GitHub.

The script we used to fine-tune our language model was modified from the original run_generation.py in the transformers documentation, and here is our code adjusted for this project. Our fine-tuned language model was trained on top of the ‘distilbert-base-uncased’ pre-trained model.
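The exact script is in the repository linked above; as a rough sketch, fine-tuning ‘distilbert-base-uncased’ as a masked language model on the survey responses with the transformers Trainer could look like the following. The file path, block size, and hyperparameters here are assumptions for illustration, not the project’s actual settings.

```python
# Sketch of masked-language-model fine-tuning with the transformers Trainer.
from transformers import (
    DistilBertTokenizer,
    DistilBertForMaskedLM,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# One survey response per line in a plain-text file (hypothetical path).
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="responses_train.txt",
    block_size=64,
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./lm-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("./lm-finetuned")
```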

Once the language model tailored to our task was ready, we could train our own deep learning classifier!

The classifier is built on ‘DistilBertForSequenceClassification’. For the detailed code, please go to my GitHub page.
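For illustration, here is a minimal sketch of that classification step, assuming the responses and integer-encoded labels are already split into Python lists (train_texts, train_labels, test_texts, test_labels); the dataset wrapper, checkpoint paths, and hyperparameters are assumptions, not the exact code from the repository.

```python
# Sketch of fine-tuning DistilBertForSequenceClassification on the coded responses.
import torch
from torch.utils.data import Dataset
from transformers import (
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

class SurveyDataset(Dataset):
    """Wraps tokenized responses and integer labels for the Trainer."""
    def __init__(self, texts, labels, tokenizer, max_length=64):
        self.encodings = tokenizer(
            texts, truncation=True, padding=True, max_length=max_length
        )
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
# Load the encoder weights from the fine-tuned language model checkpoint;
# the classification head on top is newly initialized.
model = DistilBertForSequenceClassification.from_pretrained(
    "./lm-finetuned", num_labels=9  # 9 health-code categories (3 for tone)
)

train_dataset = SurveyDataset(train_texts, train_labels, tokenizer)
test_dataset = SurveyDataset(test_texts, test_labels, tokenizer)

training_args = TrainingArguments(
    output_dir="./code-classifier",
    num_train_epochs=4,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()
print(trainer.evaluate())  # reports the evaluation loss by default
```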

Here is a comparison of the deep learning classifier and the other two traditional classifiers.

Comparison of the three models

For tone classification, the SVM classifier can solve the problem to some extent. However, the SVM is limited to recognizing the presence of words seen in the training data set and cannot capture the internal relationships within a sentence. Its failure on ‘I’ve never had a history of cancer, diabetes, hepatitis, HIV, or any other STD.’ is a good example of this limitation. In this sentence, the appearance of multiple diseases fooled the SVM into labeling it a ‘Negative’ statement, even though the negation at the beginning flips the tone completely.

The DistilBertForSequence classifier did a good job overall on tone classification, with an accuracy of 78.4%. And looking at the mislabeled examples, we have to admit that these texts are ambiguous and hard to classify.

The deep learning model captures the semantic features of the text better than the SVM and Naive Bayes models. One example is that our DistilBertForSequence classifier successfully labels ‘I’ve never had a history of cancer, diabetes, hepatitis, HIV, or any other STD.’ as Positive, while the SVM classifies it as Negative.

For some of the texts, I would even agree with the label given by our deep learning classifier. For instance, ‘It could be a little better’ was labeled as Negative by our DistilBertForSequence classifier, while the human coder labeled it as Neutral. What do you think?

For health code classification, the DistilBertForSequence classifier increased classification accuracy by about 4% over the baseline SVM model.

The small data set hurts the SVM classifier’s performance. When it comes to a general expression like ‘I can still get around’, the SVM model tends to classify it into the ‘Other’ category, since it does not understand the expression as a whole sentence.

Although the DistilBertForSequence classifier did not give a perfect answer for this multi-class text classification task, it still shows some advantages. Instead of treating the text as an unknown expression, the model tries to understand the meaning of the sentence as a whole.

Looking at the labels given by this classifier, a number of the incorrectly classified examples are ambiguous and hard even for human coders to classify. For example, the DistilBertForSequence classifier also failed to classify ‘I can still get around.’, just as the SVM did. The human coder assigned the label ‘Physical performance’, while the DistilBertForSequence classifier chose ‘Absence or presence of illness’. The SVM classifier, by contrast, tends to dump these unclear cases into the ‘Other’ category. This is a sign that our deep learning model is at least trying to learn the sentence content.

Another advantage of the pre-trained deep learning model is that it was pre-trained on a large corpus. Words, expressions, and other language features were encoded without relying on our limited training data, so the pre-trained model suffers less from words that never appear in the training set. It can therefore be generalized to a larger data set by continuing to train the language model on top of what we already have and building a better classifier.

Conclusion

A deep learning classifier (the DistilBertForSequence classifier) can give acceptable results on a multi-class text classification task with an extremely small data set. It outperforms traditional classifiers such as SVM and Naive Bayes on this task. However, it still needs further work on both the language model and the classification model before it can fully replace human coders.

So, are you excited about solving your own NLP tasks with easy-to-use pre-trained deep learning models? Tell me about it!
