CNNs for text classification

How good are they at NLP tasks with and without RNN techniques?

Jason Benn
Paper Club
6 min read · Aug 4, 2017


Paper Club’s paper this week was Recurrent Convolutional Neural Networks for Text Classification. James Vanneman has already done an excellent job analyzing it, so I’ll keep my own analysis short and instead provide a little context. Was this paper justified in its conclusions? What performance can we achieve on NLP tasks given a reasonable CNN baseline, and how much value does the “Recurrent” part of an RCNN add?

First: text classification

Text classification tasks generally involve classifying a sentence (e.g., into one of 5 sentiments, or into one of 6 question types) and require scanning across the sentence and building up a representation of its constituent words. Because a word’s meaning is heavily influenced by its surroundings, it’s important to scan across the sentence with an aperture that considers multiple words at a time. Consider the word “window” on its own. You would probably assume that it was referring to a transparent feature of a wall, no? You might even come to the same conclusion if you saw 1 word of context on either side: “small window sizes,” though perhaps you’d be less confident. Expand the context to 3 words on either side: “claim that small window sizes lose information”, and now you realize that “window” has a completely different meaning than you had originally surmised. We call the nearby phrase “lose information” discriminative because it caused us to change our understanding of this usage of “window”.
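To make that concrete, here’s a toy Python snippet (my own illustration, not from either paper; the example sentence is one I made up around the quoted phrase) that widens the aperture around “window” and prints what a model restricted to that context would actually see:

```python
# Toy illustration: widening the context window around a target word
# gradually reveals the discriminative phrase "lose information".
sentence = "researchers claim that small window sizes lose information about context".split()
target = sentence.index("window")

for radius in (0, 1, 3):
    left = max(0, target - radius)
    right = target + radius + 1
    print(f"±{radius} words:", " ".join(sentence[left:right]))

# ±0 words: window
# ±1 words: small window sizes
# ±3 words: claim that small window sizes lose information
```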

Of the approaches for capturing this context (as presented by the RCNN paper), all have drawbacks:
1. recursive neural nets have O(n²) complexity, making them unsuitable for large tasks

Reading this, I realize that I’d been using the terms “Recursive” and “Recurrent” interchangeably, but it turns out that they’re not the same: a recurrent neural net is the all-star architecture for NLP, and a recursive neural net is an older architectural pattern for learning about tree-shaped data. Recursive nets appear to have fallen out of favor, judging by the fact that arxiv-sanity.com shows only a handful of papers in the last year, and the Wikipedia article is fairly bare compared to the one for Recurrent Neural Nets.

2. recurrent neural nets are biased, in that later words are more dominant than earlier words, and
3. convolutional neural nets that scan over long segments of text at a time (“fixed window kernels with large window sizes”) could work, but these take a long time to train.

So the authors propose that we capture the benefits of both recurrent NNs and convolutional NNs by putting a max pooling layer on top of a bidirectional recurrent NN.

I don’t really get how their proposal follows from the problems of existing architectures. They identify that the problem with CNNs is large kernel window sizes, then proceed to not use a convolutional kernel, just a max pooling layer! I must be misunderstanding the paper.
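In case a sketch helps, here’s roughly what I understand the proposal to look like, written as a minimal PyTorch model. This is my own reconstruction of the idea (arbitrary layer sizes, with a plain bidirectional RNN standing in for the paper’s left/right context recurrences), not the authors’ code:

```python
import torch
import torch.nn as nn

class RCNNish(nn.Module):
    """My rough reading of the RCNN idea: a bidirectional recurrent layer
    supplies left/right context for each word, the word plus its context is
    projected through a tanh layer, and a max pool over time (no conv kernel)
    picks out the most discriminative features."""
    def __init__(self, vocab_size=10_000, embed_dim=100, hidden_dim=100, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.birnn = nn.RNN(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(embed_dim + 2 * hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embed(token_ids)                        # (batch, seq_len, embed_dim)
        contexts, _ = self.birnn(x)                      # (batch, seq_len, 2 * hidden_dim)
        y = torch.tanh(self.proj(torch.cat([x, contexts], dim=-1)))
        pooled, _ = y.max(dim=1)                         # max pooling over time
        return self.out(pooled)                          # (batch, num_classes)

logits = RCNNish()(torch.randint(0, 10_000, (2, 12)))    # 2 toy sentences of 12 tokens
```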

Skipping right to the results of their experiments:

My main takeaway was shock that a plain CNN did as well as it did (2nd row from the bottom), and that the ordinary NLP benchmarks (the top 4 rows) did too.

Even crazier, this CNN benchmark is from a 2011 paper by Collobert et al., released 1 year before AlexNet won ImageNet and convinced everyone that neural nets were worth researching. I skimmed the paper, and it includes quaint anachronisms like explaining the basics of neural nets (“The features computed by the deep layers of the network are automatically trained by backpropagation to be relevant to the task”), which tells you something about the research community’s knowledge of NNs in 2011; it later refers to “multilayer neural networks” almost apologetically as “twenty year old technology”. In all, it seems hardly fair to use this CNN for text classification as a baseline; I think they should have compared their RCNN to an actual peer CNN. I did find such a paper, and I discuss it below, but first, a tangent about the Collobert paper!

Aside: the Collobert paper

The Collobert paper is a 47-page beast and still genuinely impressive, especially for the time it was published. It was written at a time when neural networks were fairly unknown and NLP consisted of more manual feature engineering; the authors spend the second half of the paper enumerating the results of combining various old-school NLP techniques with their neural net and often obtain improvements in classification accuracy. Here are summaries of what they tried, as best as I could understand their piles of jargon and acronyms:

  1. Suffix features. “Word suffixes in many western languages are strong predictors of the syntactic function of the word.” They exploit this by adding word features representing the last two characters of every word in their data set, which resulted in a dictionary of 455 suffixes. (How was this mixed in with the data, exactly? I sketch one guess after this list.)
  2. Gazetteers. A dictionary of 8,000 well-known named entities (locations, person names, organizations, etc) was used to improve performance on the Named Entity Recognition task (one of the tasks by which they evaluated their model). Any time a sentence chunk matched an entity in the gazetteer, they flipped “on” the features for those entities in the gazetteer (which I suppose makes them available for training?). Clear performance improvement on the NER task.
  3. Cascading. Adding features like part-of-speech tags (is this a verb, noun, participle, etc), chunk tags (chunking is grouping related words together; in this sentence, “related words” form a chunk, as “related” describes “words”), and semantic role labels (which word is the subject? Which is the verb?) results in consistent moderate improvements. (How exactly does one “add word features representing chunk tags”?)
  4. Ensembles. A common NLP technique is to train a variety of classifiers with different tagging conventions (see above) and average their results; the researchers observed that their neural net often produced different results when initialized with different parameters, so they tried an all-neural-net ensemble of 10 nets. However, the results did not improve much, certainly not enough to justify the 10x training time.
  5. Parsing. Another common NLP technique is parsing. Parsers like the Charniak parser identify the part of speech of each word in a sentence and construct a tree of those parts of speech, progressively collapsing words into larger and larger chunks until the whole sentence is covered by the tree. The authors tried combining parser information with the neural net by adding an additional lookup table for each part of speech identified by the parser (there are 40). Adding just the simplest level of parser output boosted the performance of the net by 1.5% (adding higher-level features of the parser’s output resulted in diminishing returns).
  6. Word representations. They tried using early versions of embeddings for each word and found a general boost in performance (and this was before the word2vec paper).
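To make the suffix idea (item 1) a little more concrete, here’s my guess at what “adding word features representing the last two characters of every word” might look like in code. This is purely my own sketch, not Collobert et al.’s pipeline:

```python
# My guess (not Collobert et al.'s code): each word gets a second discrete
# feature, the id of its two-character suffix, which would feed its own
# small lookup table alongside the word embedding.
def suffix(word, n=2):
    return word[-n:].lower() if len(word) >= n else word.lower()

corpus = ["The striking workers walked quietly".split(),
          "She quickly finished the remaining reports".split()]

# Build the suffix dictionary from the training data (the paper reports 455 suffixes).
suffix_to_id = {}
for sent in corpus:
    for word in sent:
        suffix_to_id.setdefault(suffix(word), len(suffix_to_id))

# Each token becomes a (word, suffix_id) pair.
features = [[(w, suffix_to_id[suffix(w)]) for w in sent] for sent in corpus]
print(features[0])
```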

However, while the authors explored the additional benefit of these techniques on top of their primitive CNN, the paper is titled “Natural Language Processing (Almost) From Scratch”, and it emphasizes how much they could achieve without manual language feature engineering. So the baseline they establish does not include any of the above techniques.

…back to CNNs for text classification

Anyway, I wish that the authors of the RCNN paper had mentioned that their results were already eclipsed by another, similar CNN approach that was more up to date than the Collobert paper. Here are the results of Convolutional Neural Networks for Sentence Classification, published in 2014 (a year before the RCNN paper) and now cited by 1,010 other papers:

Which shows a better result on the Stanford Sentiment Treebank task (48.0) than the RCNN (47.21), the only NLP task these two papers had in common, despite the RCNN research being released a year later.
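For context, the CNN in that paper really is about as simple as its title suggests: pretrained word vectors, a single convolutional layer with a few window sizes, max-over-time pooling, and a softmax. Here’s a bare-bones sketch in that spirit (my own simplification, with random embeddings instead of word2vec and arbitrary sizes, not the paper’s exact setup):

```python
import torch
import torch.nn as nn

class TinySentenceCNN(nn.Module):
    """Bare-bones, one-convolution-layer sentence classifier in the spirit of
    Kim (2014); a simplification of mine, not the paper's exact configuration."""
    def __init__(self, vocab_size=10_000, embed_dim=100, num_filters=100,
                 window_sizes=(3, 4, 5), num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # the paper initializes these with word2vec
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in window_sizes]
        )
        self.out = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, token_ids):                         # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)         # (batch, embed_dim, seq_len)
        # One convolution per window size, then max-over-time pooling.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))         # (batch, num_classes)

logits = TinySentenceCNN()(torch.randint(0, 10_000, (2, 20)))  # 2 toy sentences of 20 tokens
```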

In all, I tend to agree with the very humble conclusion of the CNN for Sentence Classification paper:

“Despite little tuning of hyperparameters, a simple CNN with one layer of convolution performs remarkably well. Our results add to the well-established evidence that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP.”

Which is surprising to me, given how amazing RNNs seem in the classic Karpathy blog post (I wish I could see how many times that’s been cited).
