Deep Learning Techniques for Text Classification
Evaluating the performance of TCN and ensemble-based models using Word2Vec against common deep learning architectures
A. Introduction
A.1. Background & Motivation
Text classification is one of the most popular tasks in NLP: it allows a program to classify free-text documents into pre-defined classes based on, for example, topic, genre, or sentiment. Today's flood of digital documents makes text classification ever more crucial, especially for companies looking to streamline their workflows or increase their profits.
Recent NLP research on text classification has reached impressive results, establishing deep learning methods as the state-of-the-art (SOTA) technology for the task.
Hence, assessing the performance of SOTA deep learning models for text classification is valuable not only for academic purposes but also for AI practitioners and professionals who need guidance and benchmarks for similar projects.
A.2. Objectives
The experiment evaluates the performance of several popular deep learning models, such as feedforward, recurrent, convolutional, and ensemble-based neural networks, on five text classification datasets. We build each model on top of two different feature extraction methods to capture information within the text.
The results show:
- the robustness of word embeddings as a feature extractor, helping all the models make better final predictions.
- the effectiveness of the ensemble-based and temporal convolutional networks, which achieve strong performance and even compete with the state-of-the-art benchmark models.
B. Experiment
B.1. Datasets
- MR. Movie Reviews — classifying a review as positive or negative [1].
- SUBJ. Subjectivity — classifying a sentence as subjective or objective [2].
- TREC. Text REtrieval Conference — classifying a question into six categories (person, location, numeric information, etc.) [3].
- CR. Customer Reviews — classifying a product review (cameras, MP3s, etc.) as positive or negative [4].
- MPQA. Multi-Perspective Question Answering — opinion polarity detection [5].
- To make things easy, we have prepared the datasets in the pickle format here.
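For illustration, loading one of the prepared pickle files could look like the sketch below. The file name and the internal layout (a dict holding train/test texts and labels) are assumptions on our part, so adjust them to match the actual files.

```python
import pickle

# Hypothetical file name and layout; adjust to the actual pickle files.
with open("mr_dataset.pkl", "rb") as f:
    data = pickle.load(f)

# Assumed layout: lists of raw sentences and integer labels.
train_texts, train_labels = data["train_texts"], data["train_labels"]
test_texts, test_labels = data["test_texts"], data["test_labels"]
print(f"{len(train_texts)} train / {len(test_texts)} test examples")
```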
B.2. The Proposed Models
B.2.1. Temporal Convolutional Network (TCN)
Shaojie Bai et al. [6] proposed the generic temporal convolutional network (TCN), a dilated-causal variant of the CNN. It is a strong alternative to recurrent architectures that can handle long input sequences without suffering from vanishing or exploding gradients. To learn more about the model blocks, refer to [6], and see [7] for an implementation.
The proposed TCN model is inspired by Christof Henkel [8], one of the grandmasters on Kaggle. The model, sketched in code after this list, consists of:
- Two stacked TCN blocks with a kernel size of 3 and dilation factors of 1, 2, and 4.
- The first TCN block contains 128 filters; the second uses 64. The input features are based on word embeddings.
- Each block outputs a sequence.
- The final sequence is passed to two different global pooling layers.
- Both pooled results are concatenated, fed into a dense layer of 16 neurons, and then passed to the output layer.
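Putting the list together, a minimal Keras sketch of this architecture might look as follows. For brevity it builds the dilated causal convolutions with plain `Conv1D` layers; the residual connections and weight normalization of the full TCN blocks in [6] and [7] are omitted, and the vocabulary size, sequence length, and class count are placeholders.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_CLASSES = 20000, 100, 300, 2  # placeholders

def tcn_block(x, filters, kernel_size=3, dilations=(1, 2, 4)):
    # Stack of dilated causal convolutions; a simplification of the
    # residual TCN block described in [6].
    for d in dilations:
        x = layers.Conv1D(filters, kernel_size, padding="causal",
                          dilation_rate=d, activation="relu")(x)
    return x

inputs = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)  # word-embedding features
x = tcn_block(x, 128)  # first TCN block: 128 filters
x = tcn_block(x, 64)   # second TCN block: 64 filters
# Summarize the final sequence with two global pooling layers.
avg_pool = layers.GlobalAveragePooling1D()(x)
max_pool = layers.GlobalMaxPooling1D()(x)
x = layers.concatenate([avg_pool, max_pool])
x = layers.Dense(16, activation="relu")(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```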
B.2.2. Ensemble CNN-GRU
K. Kowsari et al. [9] introduced Random Multimodel Deep Learning (RMDL), a novel ensemble deep learning technique that can be applied to any classification task. The figure below illustrates an architecture using a deep RNN, a deep CNN, and a deep feedforward neural network (DNN).
In this project, we implement an ensemble learning-based model by combining 1D CNN with a single Bidirectional GRU (BiGRU).
- The 1D CNN has been shown to work well on text classification with little hyperparameter tuning [10].
- BiGRU, on the other hand, handles temporal data well by using both earlier and later context in the sequence.
We will see in the experiment how this combination affects model accuracy; a code sketch follows.
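One plausible way to wire this up in Keras is sketched below: both branches share an embedding layer and their pooled features are concatenated before the classifier head. The exact combination strategy is an assumption on our part; RMDL [9] instead trains the models independently and combines their predictions by voting.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_CLASSES = 20000, 100, 300, 2  # placeholders

inputs = layers.Input(shape=(MAX_LEN,))
embed = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# 1D CNN branch: convolution + global pooling over the token sequence.
cnn = layers.Conv1D(128, kernel_size=3, activation="relu")(embed)
cnn = layers.GlobalMaxPooling1D()(cnn)

# BiGRU branch: reads the sequence in both directions.
gru = layers.Bidirectional(layers.GRU(64))(embed)

# Merge both views of the text before classifying.
merged = layers.concatenate([cnn, gru])
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(merged)
model = models.Model(inputs, outputs)
```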
B.2.3. Other Models
To compare the performance, we will also evaluate other popular models such as:
- SNN. A shallow neural network.
- edRVFL. Ensemble deep random vector functional link neural network.
- 1D CNN. Our baseline model representing a neural network with a one-dimensional convolution and pooling layers.
- (Stacked) BiGRU/BiLSTM. Bidirectional Gated Recurrent Unit / Long Short-Term Memory. The stacked version adds another bidirectional block to the network, as sketched below.
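To make the "stacked" variant concrete, the sketch below adds a second bidirectional block on top of the first; the layer sizes are placeholders. The first block must return the full sequence so the second block has a sequence to consume.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_CLASSES = 20000, 100, 300, 2  # placeholders

inputs = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
# return_sequences=True keeps per-token outputs for the next block.
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.GRU(64))(x)  # the extra "stacked" block
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(inputs, outputs)
```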
B.2.4. The Models Summary with their Feature Extractions
To sum up, we will build deep learning models using two different feature extraction methods on five text classification datasets, as follows (a code sketch of these variants comes after the list):
- WE-rand. The model uses an embedding layer whose word vectors are randomly initialized and updated during training.
- WE-static. The model uses pre-trained 300-dimensional Word2Vec word embeddings. The vectors are kept static during training; vectors for unknown words are randomly initialized from a normal distribution.
- WE-dynamic. Same as above, except the vectors are fine-tuned during training rather than kept static.
- WE-avg. The model uses the average of the pre-trained word vectors to represent the input context, so the input feature size equals the Word2Vec vector dimension, 300.
- Bag-of-Words (BoW). Represents a text by the number of word occurrences within a document before feeding it to the model. We use four word-scoring options: binary, count, freq, and TF-IDF.
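The sketch below shows how these variants differ in code: the three embedding modes reduce to the `trainable` flag and the initializer of a Keras `Embedding` layer, while BoW is a vectorizer applied before the model. Building `embedding_matrix` from the actual Word2Vec vectors is left out, so the random matrix here is only a stand-in.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras import initializers, layers

VOCAB_SIZE, EMBED_DIM = 20000, 300  # placeholders

# WE-rand: vectors start random and are learned during training.
we_rand = layers.Embedding(VOCAB_SIZE, EMBED_DIM, trainable=True)

# Stand-in for a (VOCAB_SIZE x 300) matrix built from pre-trained Word2Vec,
# with unknown words drawn from a normal distribution.
embedding_matrix = np.random.normal(size=(VOCAB_SIZE, EMBED_DIM))

# WE-static: pre-trained vectors, frozen during training.
we_static = layers.Embedding(
    VOCAB_SIZE, EMBED_DIM,
    embeddings_initializer=initializers.Constant(embedding_matrix),
    trainable=False)

# WE-dynamic: same initialization, but fine-tuned during training.
we_dynamic = layers.Embedding(
    VOCAB_SIZE, EMBED_DIM,
    embeddings_initializer=initializers.Constant(embedding_matrix),
    trainable=True)

# BoW with TF-IDF scoring (binary, count, and freq are similar switches).
bow = TfidfVectorizer(max_features=VOCAB_SIZE)
```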
The benchmarks used in this work are:
- CNN-multichannel (Yoon Kim, 2014) [10]
- SuBiLSTM (Siddhartha Brahma, 2018) [11]
- SuBiLSTM-Tied (Siddhartha Brahma, 2018) [11]
- USE_T+CNN (Cer et al., 2018) [12]
C. Evaluation
C.1. Results
We use accuracy and rank as comparison metrics. The rank is calculated from the accuracy on each dataset; in case of ties, we average the ranks.
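To make both metrics concrete, the snippet below computes per-dataset ranks with tie-averaging and the accuracy margin over the baseline using pandas; the accuracy numbers are invented purely for illustration.

```python
import pandas as pd

# Toy accuracy table (rows: models, columns: datasets); values are made up.
acc = pd.DataFrame(
    {"MR": [80.1, 81.5, 81.5], "SUBJ": [93.0, 94.2, 93.8]},
    index=["1D CNN-rand", "TCN-static", "Ensemble-static"],
)

# Rank per dataset: best accuracy gets rank 1; ties share the average rank.
ranks = acc.rank(ascending=False, method="average")
avg_rank = ranks.mean(axis=1)

# Average accuracy margin relative to the baseline (1D CNN-rand).
avg_margin = (acc - acc.loc["1D CNN-rand"]).mean(axis=1)
print(avg_rank, avg_margin, sep="\n")
```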
Table 8 shows the final comparison of model performance. We also include the SOTA benchmark models (at the bottom) for further observation. Note that for the models using bag-of-words or averaged word embeddings (SNN and edRVFL), we only include their best results.
From Table 8, we can calculate each model's average accuracy margin over the baseline (1D CNN-rand) across the five datasets as follows:
In Figure 5, the green bars represent the benchmark models, the purple bars show the top six proposed models that beat the baseline, and the red bar marks the proposed model with the lowest accuracy margin. A minus (−) sign indicates that, with the baseline as the reference, the model's accuracy is much lower across all datasets.
From there, we can calculate the average rank values and visualize the result as shown below:
C.2. Discussion
C.2.1. BoW vs. Word Embedding
The models with BoW cannot do much in this experiment, despite extensive hyperparameter tuning. A large body of text makes the BoW vocabulary extensive, so the input features become sparse, presenting little information spread over many zeros. This representation makes the model harder to train to a good result. Unless we cap the vocabulary size or work with a small corpus, BoW is not a reliable option.
On the other hand, the models perform better when using word embeddings. Just by averaging Word2Vec vectors to obtain the N-dimensional feature inputs, a model's accuracy can climb steeply, by up to 10%: on the TREC dataset, for example, edRVFL and SNN jump from 75.2 and 76.2 to 83.6 and 85.8, respectively. These results underline the importance of word embeddings as a default feature extractor.
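For reference, the averaged-embedding (WE-avg) features behind these SNN and edRVFL results can be built along the lines below; loading the vectors through gensim's downloader is our assumption about tooling, not something stated in the original setup.

```python
import numpy as np
import gensim.downloader

# Pre-trained 300-dimensional Word2Vec vectors; downloaded on first call.
w2v = gensim.downloader.load("word2vec-google-news-300")

def average_embedding(tokens, dim=300):
    """Average the Word2Vec vectors of the known tokens in a document."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

features = average_embedding("the movie was surprisingly good".split())
print(features.shape)  # (300,)
```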
C.2.2. Random vs. Static vs. Dynamic
Figure 7 illustrates the effect of the different word embedding modes on model performance. As expected, the static mode using pre-trained Word2Vec consistently performs better: it helps every model predict classes more accurately, with up to a 3% average accuracy gain over the random mode.
The dynamic mode fine-tunes the parameters initialized from the Word2Vec vectors to learn a meaningful context for each task. Ideally, this should outperform the static mode, but that is not always the case: even when a model improves, the change is not significant, and in some cases accuracy actually drops.
In Figure 7, the dynamic mode slightly lowers overall model performance on the TREC and MPQA datasets. In Table 8, although BiGRU-dynamic beats its static version on the SUBJ dataset, it performs worse on the other datasets. The reason is that the fine-tuned vectors adapt to a specific dataset, which can overfit and distort the original context derived from Word2Vec.
C.2.3. TCN vs. RNN Model
When word embeddings are used, TCN is more effective than RNN-based models like LSTM or GRU. On four of the five datasets, TCN outclasses all the RNN architectures by a clear accuracy margin; on the remaining dataset, its accuracy is still high and close to the best. TCN-static and TCN-dynamic sit at the top, followed by BiLSTM-static, BiGRU-static, and Stacked BiGRU-static.
Simply put, TCN is the best model in this experiment, not only against the RNN family but against all the other models, at capturing information to make stable predictions. The only type of model that can challenge TCN here is the ensemble-based model.
C.2.4. Ensemble vs. Single Model
As expected, the ensemble models generally outperform the single models on almost all the classification tasks; the static version of the ensemble model delivers better performance on 3 out of 5 datasets. The key to ensemble learning is that the candidate models need to be proven to work well on the given task, and 1D CNN and BiRNN are great models to combine for text classification. This result encourages us to experiment in the future with combining a potent model, such as TCN, with other strong deep learning models.
C.2.5. The Best Performing Models
Finally, Table 9 summarizes the best models in this series of experiments. Using the average accuracy margins from Figure 5 and the average rank values from Figure 6 to compare the top six performing models, we can see that the static versions of the TCN and ensemble models emerge as the best.
TCN-dynamic follows, rounding out the top three. In the end, TCN and ensemble-based models dominate the other configurations on these text classification tasks, making them the recommended architectures for future applications and research.
D. Conclusions and Future Work
D.1. Conclusions
This project has demonstrated a comprehensive experiment on building deep learning models using two different feature extraction methods across five text classification datasets. In conclusion, the following are the essential insights:
- Building a model on top of word embeddings makes it perform markedly better.
- Using a pre-trained word embedding such as Word2Vec can increase model accuracy by a wide margin.
- TCN is an excellent alternative to recurrent architecture and has been proven effective in classifying text data.
- The ensemble learning-based model can help make better predictions than a single model trained independently.
- TCN and Ensemble CNN-GRU models are the best performing algorithms we obtained in this series of text classification tasks.
D.2. Recommendation in Future Work
We suggest the following directions for future experiments:
- An ensemble-based model with TCN. Perform text classification using TCN combined with other strong models, such as 1D CNN and BiGRU, in ensemble-based learning to see if it can challenge the benchmarks even further.
- Kernel size and filters. Explore these two hyperparameters by extending the kernel sizes from 1 to 10 with more or fewer filters in CNN or TCN to see how they affect model performance.
- Deeper network. Networks with more hidden layers often, though not always, perform better. Explore deeper versions of CNN, RNN, and TCN to see how depth affects the existing performance.
- Use GloVe and FastText. Explore other pre-trained word embedding options such as GloVe and FastText in static and dynamic modes and compare the results to Word2Vec; see the sketch after this list.
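As a starting point for that last suggestion, pre-trained GloVe and FastText vectors can be pulled through gensim's downloader; the model names below come from the gensim-data catalog and should be verified (e.g., via gensim.downloader.info()) before relying on them.

```python
import gensim.downloader

# Catalog names from gensim-data; check gensim.downloader.info() first.
glove = gensim.downloader.load("glove-wiki-gigaword-300")
fasttext = gensim.downloader.load("fasttext-wiki-news-subwords-300")

print(glove.most_similar("movie", topn=3))
print(fasttext.most_similar("movie", topn=3))
```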
Thank you,
Diardano Raihan
LinkedIn Profile
Note: Everything you have seen is documented in my GitHub repository.
If you are curious about the full code, feel free to visit 👍.
References
- [1] B. Pang, L. Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales”, In Proceedings of ACL’05, 2005.
- [2] B. Pang, L. Lee, “A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts”, In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), 2004.
- [3] X. Li, D. Roth, “Learning question classifiers”, In Proceedings of COLING ’02, 2002.
- [4] M. Hu, B. Liu, “Mining and summarizing customer reviews”, In Proceedings of KDD ’04, 2004.
- [5] J. Wiebe, T. Wilson, and C. Cardie, “Annotating expressions of opinions and emotions in language”, Language Resources and Evaluation, 39(2):165–210, 2005.
- [6] S. Bai, J. Kolter, and V. Koltun, “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling”, arXiv, April, 2018.
- [7] P. Rémy, “Keras TCN”, GitHub, https://github.com/philipperemy/keras-tcn, January, 2021.
- [8] C. Henkel, “Temporal Convolutional Network”, Kaggle, https://www.kaggle.com/christofhenkel/temporal-convolutional-network, February, 2021.
- [9] K. Kowsari, M. Heidarysafa, D. E. Brown, K. J. Meimandi, and L. E. Barnes, “Random Multimodel Deep Learning for Classification”, arXiv, April, 2018.
- [10] Y. Kim, “Convolutional Neural Networks for Sentence Classification,” Association for Computational Linguistics, October, 2014.
- [11] S. Brahma, “Improved Sentence Modeling using Suffix Bidirectional LSTM”, arXiv, September, 2018.
- [12] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, R. Kurzweil, “Universal Sentence Encoder”, arXiv, April, 2018.