The NLP Playbook, Part 1: Deep Dive into Text Classification

Capital One Tech

May 2, 2018

By Mackenzie Sweeney and Bayan Bruss, Machine Learning Engineers, Capital One

Introduction

Capital One established the Center for Machine Learning (C4ML) in 2016 to catalyze the adoption of machine learning across the business; in that capacity, our team serves as an in-house, enterprise-wide consultancy and center of excellence for machine learning product delivery, innovation, education and partnership. As C4ML has rapidly expanded, we have tackled problems in a variety of domains, seeking always to balance use of state-of-the-art techniques with practical concerns for speed of development, scalability, and explainability. One domain C4ML frequently operates in is Natural Language Processing (NLP). Last year, we started distilling the latest literature to present best-in-class techniques for a variety of NLP problems to our colleagues in C4ML, and we’d like to expand our discourse to the broader community of ML practitioners through a series of blog posts here. While the structure may vary, our goal in each post is to provide a tidy comparison of the best models in some NLP problem category. At the same time, we will discuss some of the ongoing efforts at Capital One to use and advance these techniques.

In this post, we start with a canonical problem in NLP: text classification. At its simplest, this entails training a model to learn the patterns of language that distinguish between a number of known classes. This can be two classes (binary) or more (multi-class). For a good introduction to classification in general, check out this blog post by David Fumo.

The table below provides a few examples of important business applications of text classification at Capital One.

Specifically, we’ll focus on sentiment analysis and topic categorization. We’ll first explain various ways to formulate these problems. To empower you to develop your own powerful systems for these tasks, we’ll then provide a decision tree to help you select between four different methods that have emerged as best-in-class in recent years.

Applications

Sentiment Analysis & Topic Categorization

Sentiment analysis attempts to determine the sentiment of a text. You can formulate this problem in several ways, depending on your working definition of “sentiment” and “text.”

Sentiment can be binary, categorical, ordinal, or continuous. When modeled as continuous, sentiment is often called “polarity,” an analogue for positive and negative charges. The graphic below illustrates these different options and provides an example of each.

The definition of “text” is the level of composition you’re concerned with: document, sentence, entity, or aspect level. The last two are best explained by example. Imagine a review of a smartphone, discussing its merits and flaws. The reviewer may discuss the positive and negative aspects of the camera. Perhaps it has really high resolution but limited zoom distance. In this example, the phone’s camera is the entity, and sentiment is being expressed about multiple aspects of the camera: its resolution and zoom distance. While this level of sentiment understanding is often ideal, it is also the most challenging to model accurately. One way to go about this challenging task is to combine document-level sentiment analysis with another classification approach called topic categorization.

Primer on State-of-the-Art Methods

You may be wondering where to get started on your own NLP problems. Machine learning projects within Capital One typically start with a literature review of the state-of-the-art solutions; however, the experiments in the academic literature rarely match a specific business problem. We must translate from established best-in-class methods to custom solutions for our specific use cases. The decision tree below represents our typical translation process, which we provide to help jump-start your own. All it requires is an understanding of your data characteristics and your own time constraints. The remainder of this section will discuss the methods outlined in this decision tree so you can quickly get started applying those you think most applicable to your problem.

Let’s start with the simplest model first. In 2016, Joulin et al. provided empirical evidence that, for smaller, simpler datasets, a logistic regression model trained on the Term Frequency-Inverse Document Frequency (TFIDF) matrix performs as well as the best neural net models; it is also an order of magnitude faster to train and much simpler to implement. While that performance drops off precipitously as the data size increases, this model is still a good baseline which any more complicated model should be required to outperform. There are a lot of great resources out there explaining what TFIDF is, how logistic regression works, and how to implement them for text classification.
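As a rough illustration of that baseline, the sketch below computes TFIDF weights and fits a binary logistic regression with plain gradient descent, using only the Python standard library. The toy documents, the smoothed IDF variant, the learning rate, and the epoch count are all assumptions made for this example; a real project would more likely use a library implementation.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TFIDF weight of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, one common variant (an assumption, not prescribed by the post).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[t] / len(toks) * idf[t] for t in vocab])
    return vocab, rows

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(rows, labels, lr=1.0, epochs=500):
    """Fit binary logistic regression with batch gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict(x, weights, bias):
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0

# Toy data, invented for this sketch: 1 = positive, 0 = negative.
docs = [
    "great card and great rewards",
    "love the fast friendly service",
    "terrible fees and slow support",
    "awful app that keeps crashing",
]
labels = [1, 1, 0, 0]

vocab, X = tfidf_matrix(docs)
weights, bias = train_logreg(X, labels)
train_predictions = [predict(x, weights, bias) for x in X]
```

On four separable toy documents the model simply memorizes the training set; the point is only to show how little machinery the baseline needs.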

In the last decade or so, deep learning models have demonstrated cutting-edge performance on nearly every task in NLP. At the start of all these models is a common task: embedding the text representation into vectors that can be used by a neural network. Most techniques do this by factorizing the input units into a dense embedding matrix. When these units are words, this approach nicely handles the curse of dimensionality induced by the large size of most natural language vocabularies.
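To make the factorization idea concrete: multiplying a one-hot vector by an embedding matrix just selects one row of that matrix, which is why the embedding step is usually implemented as an index lookup. The vocabulary and the matrix values in this sketch are made up for illustration.

```python
# Hypothetical toy vocabulary mapping each word to a row index.
vocab = {"the": 0, "card": 1, "is": 2, "great": 3}

# Embedding matrix: one dense row of size 3 per vocabulary entry (made-up values).
embedding = [
    [0.1, 0.0, -0.2],
    [0.7, -0.3, 0.5],
    [0.0, 0.2, 0.1],
    [0.9, 0.8, -0.4],
]

def one_hot(index, size):
    """Sparse representation: all zeros except a 1 at the word's index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matmul_row(vec, matrix):
    """Multiply a row vector by a matrix."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

def embed(tokens):
    """Look up each token's dense vector directly by index."""
    return [embedding[vocab[t]] for t in tokens]

via_product = matmul_row(one_hot(vocab["great"], len(vocab)), embedding)
via_lookup = embed(["great"])[0]
```

Both routes give the same dense vector; the lookup avoids ever materializing the vocabulary-sized one-hot input.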

Character CNN

While this factorization approach is still useful at the character level, recent research has shown it is not necessary for good performance. Working at the character level offers other benefits: it eliminates the need for preprocessing, can handle out-of-vocabulary words, and is a language-agnostic approach. For instance, Zhang’s Character CNN (2015) skips the factorization step and embeds text using a quantization procedure — the first published method to do so. The GIF below demonstrates how this works.

Character CNN Architecture (reference 1)

The character CNN consists of the following stages:

1. Select your character set and perform character quantization.

2. Choose a suitable length for your text and truncate/pad all inputs to this size. Zhang et al. chose 1014, saying “It seems that 1014 characters could already capture most of the texts of interest.”

3. Feed the resulting vectors through 6 convolutional layers with max pooling between. The pooling layers are size 3 and non-overlapping. Use kernels of size 7 in the first 2 layers, followed by size 3 in the last 4.

4. Feed the output of the convolutional block through 3 fully-connected layers, applying dropout with probability 0.5 between them.

5. Feed the final output through your classification layer.
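Stages 1 to 3 above can be sketched in a few lines. The code below one-hot encodes ("quantizes") characters over a small alphabet, pads or truncates to 1014 positions, and then traces how the sequence length shrinks through the convolutional block. The alphabet here is illustrative rather than the paper's exact character set, and the trace assumes unpadded stride-1 convolutions with pooling after the first, second, and sixth layers.

```python
# Illustrative alphabet (an assumption; not the exact set used by Zhang et al.).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,!?'"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_LEN = 1014  # input length chosen by Zhang et al.

def quantize(text, max_len=MAX_LEN):
    """One-hot encode each character, then truncate or zero-pad to max_len."""
    rows = []
    for ch in text.lower()[:max_len]:
        row = [0] * len(ALPHABET)
        if ch in CHAR_INDEX:  # characters outside the alphabet stay all-zero
            row[CHAR_INDEX[ch]] = 1
        rows.append(row)
    rows.extend([0] * len(ALPHABET) for _ in range(max_len - len(rows)))
    return rows

def conv_block_length(length, kernels=(7, 7, 3, 3, 3, 3),
                      pool_after=(0, 1, 5), pool=3):
    """Trace the sequence length through the six convolutional layers.

    Assumes unpadded, stride-1 convolutions and non-overlapping pooling
    after the layers listed in pool_after (0-indexed).
    """
    for layer, kernel in enumerate(kernels):
        length = length - kernel + 1
        if layer in pool_after:
            length //= pool
    return length

encoded = quantize("Great card!")
final_length = conv_block_length(MAX_LEN)
```

Under those assumptions, 1014 input characters reduce to 34 positions before the fully-connected layers.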

Character CRNN

Xiao and Cho’s character-level CRNN (2016) replaces some convolutional layers in Zhang’s model with recurrent layers and adds back in the embedding layer. They claim this more efficiently captures lengthy dependencies in the text sequence. This approach needs between 1.35 and 90 times fewer parameters, and seems to improve performance on small datasets. The recurrent layers in this model use long short-term memory (LSTM) units.

The character CRNN consists of the following stages:

1. One-hot encode characters, retaining those in your character set of choice.

2. Embed encoded characters into dense, real-valued vectors of size 8.

3. Apply 2–3 convolutional layers; use a kernel size of 3 in the last layer and 5 in preceding layers. Use max pooling between convolutions with size 3.

4. Apply dropout with probability 0.5, then feed through a bi-directional LSTM of dimension 128; apply dropout with probability 0.5 again.

5. Take the last hidden state of both directions and concatenate them to form a single fixed-size vector, twice the dimension of one direction’s hidden state. Feed that into your classification layer.

Character CRNN Architecture (reference 2)

Very Deep CNN

Conneau et al. (2017) were also inspired by the work of Zhang to build a text classification model at the character level. Like Xiao and Cho, they identified difficulty in capturing long-term dependencies in text as a limitation in Zhang’s character CNN, but they took a very different approach to addressing this problem, arguing “that LSTMs are generic learning machines for sequence processing which are lacking task-specific structure… Texts have similar properties: characters combine to form n-grams, stems, words, phrase, sentences etc.”

Inspired by the successful application of huge stacks of convolutions wired up and trained with the latest and greatest practitioners’ tricks (Simonyan and Zisserman, 2014, He et al., 2016), Conneau and his collaborators designed their very own “Very Deep CNN” (VDCNN) for text classification. This monstrous model employs 29 convolutional layers (a five-fold increase from Zhang’s model) with factorized character-level embeddings to conquer the others. By stacking many convolutions with smaller (size 3) kernels, they claim their network can learn on its own the best way to combine these “3-gram features” into new, hierarchically constructed features. Their experiments show that their approach does indeed outperform the character-level CNN and CRNN on large datasets, but it underperforms on smaller datasets. More layers still can’t buy you a free lunch.
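One way to see how stacked size-3 kernels build up larger features: each additional stride-1 convolution of size k lets an output unit see k - 1 more input characters. The helper below is a hypothetical illustration of that arithmetic only; it ignores the pooling layers in VDCNN, which widen the receptive field further.

```python
def receptive_field(num_layers, kernel_size=3):
    """Input characters seen by one output unit after stacking stride-1 convolutions."""
    field = 1
    for _ in range(num_layers):
        field += kernel_size - 1  # each layer extends the window by k - 1
    return field

# Receptive field for a few stack depths, with size-3 kernels and no pooling.
fields = {n: receptive_field(n) for n in (1, 2, 4, 8)}
```

With size-3 kernels the window grows as 2n + 1, so one layer sees a 3-gram, two layers see 5 characters, and eight layers see 17.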

FastText (Joulin et al., 2016)

In 2016, Facebook AI Research released a model called FastText. FastText utilizes a number of known performance enhancement techniques to get near state-of-the-art accuracy with significant gains in training speed compared to the other methods discussed above. The core of FastText relies on the Continuous Bag of Words (CBOW) model introduced in Mikolov’s Word2Vec paper (2013). In the original CBOW model, several words from a sentence are passed into a single layer feed-forward neural network, and the model seeks to predict the word that should be in the middle of those words. FastText replaces the objective of predicting a word with predicting a category. These single layer models are incredibly fast to train and scale very well.
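A minimal sketch of that forward pass, assuming the text has already been mapped to token ids: average the token embeddings, apply a single linear layer, and take a softmax over categories. All parameter values and names below are invented for illustration; this is not the FastText library's code or API.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fasttext_forward(token_ids, embeddings, class_weights):
    """Average the token embeddings, apply a linear layer, then softmax."""
    dim = len(embeddings[0])
    hidden = [sum(embeddings[t][d] for t in token_ids) / len(token_ids)
              for d in range(dim)]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in class_weights]
    return softmax(scores)

# Made-up parameters: 4 vocabulary entries, embedding size 2, 2 classes.
embeddings = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
class_weights = [[2.0, -2.0], [-2.0, 2.0]]  # row 0 = class 0, row 1 = class 1

probs = fasttext_forward([0, 1, 1], embeddings, class_weights)
```

Because the hidden representation is just an average followed by one matrix product, the cost per document is small, which is consistent with the training speed described above.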

FastText Model Architecture (reference 4)

Beyond repurposing the CBOW model for a text classification task, the authors use a number of “tricks” for speed and accuracy improvements. The two main tricks are replacing the softmax over categories with a hierarchical softmax, and using n-gram features in conjunction with the dimensionality-reducing hashing trick.
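The hashing trick can be illustrated as follows: rather than keeping a dictionary of every word n-gram, each n-gram is hashed into a fixed number of buckets, and the bucket id indexes the embedding table. The function name, the bucket count, and the use of CRC32 are assumptions for this sketch; CRC32 is a deterministic stand-in and not necessarily the hash the FastText implementation uses.

```python
import zlib

def hashed_ngram_ids(tokens, n=2, num_buckets=1000):
    """Map each word n-gram to a bucket id via a stable hash (the hashing trick)."""
    ids = []
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        # Distinct n-grams may collide in the same bucket; that is the trade-off
        # accepted in exchange for a fixed-size feature table.
        ids.append(zlib.crc32(ngram.encode("utf-8")) % num_buckets)
    return ids

tokens = "the card is great".split()
bigram_ids = hashed_ngram_ids(tokens)
```

The memory cost is set by num_buckets rather than by the number of distinct n-grams in the corpus.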

FastText also performs very well on a number of other tasks. We’ll be covering those in upcoming posts. We’re generally pretty pleased with the ethos of simplicity and good engineering behind FastText. As Woody Guthrie once said, “any fool can make something complicated. It takes a genius to make it simple.”

Building the Decision Tree

So how did we make sense of these four methods to construct our decision tree? Zhang et al. were kind enough to collect, clean, and release a set of benchmark datasets to fuel effective comparisons between text classification methods. In their words:

“…most open datasets for text classification are quite small… Therefore, instead of confusing our community more by using them, we built several large-scale datasets for our experiments, ranging from hundreds of thousands to several millions of samples.”

The authors of all four methods evaluate their performance using these datasets, enabling straightforward comparison. The table below provides some general characteristics for each dataset.

While all the models use these datasets, FastText was published last, and its comparisons include all the other methods discussed here. So we’ve included the results from the FastText paper, shown below.

FastText performs very well across text classification tasks. As can be seen in these results, it consistently outperforms the character CNN and CRNN on the smaller datasets. On the larger datasets, VDCNN outperforms FastText. However, the decision is not as simple as “use FastText for small datasets and VDCNN for big ones.” VDCNN is a much more complex model, and that complexity brings with it a cost to train. The table below shows a speed comparison between the character CNN, VDCNN, and FastText. While the character CNN can take days and VDCNN can take hours, FastText trains in a matter of seconds on even the largest dataset.

Finally, we must remember that FastText is operating at the word level, while the other methods all operate at the character level. While more recent versions of FastText employ sub-word units, in the form of character n-grams (Bojanowski, 2016), to handle out-of-vocabulary words, it is still not language-agnostic. In particular, FastText’s word embeddings require users to use language-specific preprocessing to deal with often noisy data.

Conclusion

We hope this post has given you a useful playbook for tackling your own text classification problems. While we don’t consider any of these methods to have “solved” these problems, we do believe having them in your toolkit and knowing how and when to apply them will help you devise effective solutions to your own business problems. In future posts, we plan to provide a similar toolkit and playbook for other NLP problems.

References

1. Zhang, Xiang, Junbo Zhao, and Yann LeCun. 2015. “Character-Level Convolutional Networks for Text Classification.” In Advances in Neural Information Processing Systems, 649–657.

2. Xiao, Yijun, and Kyunghyun Cho. 2016. “Efficient Character-Level Document Classification by Combining Convolution and Recurrent Layers.” arXiv:1602.00367 [Cs], January. http://arxiv.org/abs/1602.00367.

3. Conneau, Alexis, Holger Schwenk, Loïc Barrault, and Yann LeCun. 2017. “Very Deep Convolutional Networks for Text Classification.” In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 1:1107–1116.

4. Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. “Bag of Tricks for Efficient Text Classification.” arXiv:1607.01759 [Cs], July. http://arxiv.org/abs/1607.01759.

5. He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Identity Mappings in Deep Residual Networks.” arXiv:1603.05027 [Cs], March. http://arxiv.org/abs/1603.05027.

6. Simonyan, Karen, and Andrew Zisserman. 2014. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv:1409.1556 [Cs], September. http://arxiv.org/abs/1409.1556.

7. Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” arXiv:1310.4546 [Cs, Stat], October. http://arxiv.org/abs/1310.4546.

8. Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. “Enriching Word Vectors with Subword Information.” arXiv:1607.04606 [Cs], July. http://arxiv.org/abs/1607.04606.

9. Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. “A Neural Probabilistic Language Model.” Journal of Machine Learning Research 3 (Feb): 1137–1155.

10. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan): 993–1022.

11. Gal, Yarin, and Zoubin Ghahramani. 2015. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” arXiv:1506.02142 [Cs, Stat], June. http://arxiv.org/abs/1506.02142.

These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2018 Capital One.