Recurrent Convolutional Neural Networks for Text Classification

Overall thoughts: Research and training oversights make it hard to trust these results.

James Vanneman
Paper Club
6 min read · Aug 3, 2017


Background Summary

The development of word embeddings has allowed neural networks to make large advances in NLP-related tasks. Embeddings are superior to previous text-processing features such as Bag of Words. Recursive Neural Networks capture information about sentences via trees, but constructing those trees is inefficient, O(n²) in the sentence length. Recurrent Neural Networks capture contextual information by maintaining a state over all previous inputs. The problem with RNNs is that they’re biased and favor more recent inputs.

Is the claim that RNNs are biased toward later inputs even true? In theory it seems like the RNN should learn to recognize that a certain type of sentence was important and “remember” this until the end.

CNNs can learn important words or phrases through selection via a max-pooling layer. However, processing text with CNNs is difficult because learning an optimal kernel size is challenging.

If this is a known difficulty, the author should provide a reference to a paper that shows that.

Specific Questions

  • Can we build a model that classifies documents by category better than existing models?
  • Can we build a model that analyzes a document’s sentiment better than existing models?

Methods

The core of the model is a per-word representation, y¹, formed by concatenating the left-side context, the word embedding, and the right-side context.

The left context is constructed by a forward RNN, and the right context is constructed by a reverse RNN.

y² is the result of passing the word representation through a standard neural network layer: a weight-matrix multiplication plus a bias term, followed by a tanh activation function.

y² = tanh(W*y¹ + b)

The max-pooling operation takes, for each feature, the maximum value across all of the word representations, producing a fixed-length vector y³ for the whole text. The final layer is computed as W*y³ + b and passed through a softmax activation function for classification.
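A minimal sketch of this architecture in PyTorch may help. This is my own reconstruction, not the authors’ code: the layer sizes are placeholders, and a standard bidirectional RNN is used as an approximation of the paper’s shifted left/right contexts.

```python
import torch
import torch.nn as nn

class RCNN(nn.Module):
    """Sketch of the word-representation + max-pooling architecture described above."""

    def __init__(self, vocab_size, embed_dim, context_dim, hidden_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # A bidirectional RNN supplies a left-to-right context and a right-to-left
        # context for every word. (The paper shifts these so each context excludes
        # the word itself; this sketch skips that detail.)
        self.rnn = nn.RNN(embed_dim, context_dim, bidirectional=True, batch_first=True)
        # y² = tanh(W * [left context; embedding; right context] + b)
        self.proj = nn.Linear(2 * context_dim + embed_dim, hidden_dim)
        # Final layer: W * y³ + b, followed by a (log-)softmax for classification.
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        emb = self.embed(tokens)                    # (batch, seq_len, embed_dim)
        contexts, _ = self.rnn(emb)                 # (batch, seq_len, 2 * context_dim)
        y1 = torch.cat([contexts, emb], dim=-1)     # per-word representation
        y2 = torch.tanh(self.proj(y1))              # (batch, seq_len, hidden_dim)
        y3 = y2.max(dim=1).values                   # element-wise max over all words
        return torch.log_softmax(self.out(y3), dim=-1)
```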

The training objective is to maximize the log-likelihood of the correct class. The weights of the network are initialized from a uniform distribution whose maximum magnitude equals the square root of the fan-in.
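As one concrete, non-authoritative reading of this setup, reusing the hypothetical RCNN module sketched above:

```python
import math
import torch.nn as nn

def init_fan_in_uniform(module):
    # Uniform initialization with a bound derived from the layer's fan-in.
    # The paper's wording says the bound "equals the square root of the fan-in";
    # 1/sqrt(fan-in) is the more common convention, which is what this sketch uses.
    if isinstance(module, nn.Linear):
        bound = 1.0 / math.sqrt(module.in_features)
        nn.init.uniform_(module.weight, -bound, bound)
        nn.init.uniform_(module.bias, -bound, bound)

# Placeholder sizes, not the paper's hyperparameters.
model = RCNN(vocab_size=30000, embed_dim=50, context_dim=50,
             hidden_dim=100, num_classes=4)
model.apply(init_fan_in_uniform)

# Maximizing the log-likelihood of the correct class is equivalent to minimizing
# the negative log-likelihood of the model's log-softmax output.
criterion = nn.NLLLoss()
```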

The skip-gram model is used to pre-compute the word embeddings.

Why are they training the word embeddings themselves as opposed to using open source pre-trained word embeddings like word2vec?
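For reference, pre-computing skip-gram embeddings takes only a few lines with gensim. The corpus and hyperparameters below are placeholders, not the authors’ settings.

```python
from gensim.models import Word2Vec

# Toy corpus only; sg=1 selects the skip-gram variant of word2vec.
sentences = [["the", "movie", "was", "great"],
             ["the", "plot", "was", "thin"]]
w2v = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)

movie_vector = w2v.wv["movie"]    # 50-dimensional embedding for "movie"
```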

Results

The model was tested on several well-known datasets and compared against frequently used models as well as current state-of-the-art approaches.

Both neural network approaches clearly outperform the baselines. The authors claim that the convolutional approaches (CNN and RCNN) are better than the RecursiveNN approach because the RecursiveNN’s ability to predict the correct sentiment relies on a properly constructed tree representing that sentiment. Since tree construction is O(n²), performance is a main limiting factor. The authors go on to say that training time is 3–5 hours for the RecursiveNN and several minutes for their RCNN.

The authors really need to show some data here to back up their claim. “3–5 hours” is a huge window; a graph showing training epoch vs. time would help. It’s also unclear what exactly went on during training. How many epochs did they train for? Did they use early stopping? Was the loss plateauing for the RecursiveNN (and thus more training likely wouldn’t help), or was it just taking a long time to train (their claim)?

On all four datasets, the RCNN approach was superior to the CNN approach. To test this further, the authors tried various kernel sizes on the CNN.

Figure: kernel size vs. score

The authors say that small window sizes lose out on picking up long distance patterns, while larger window sizes suffer from data sparsity. It seems that regardless of kernel size, the RCNN approach is superior.

It’s at this point in the paper that I start to mistrust the authors’ experiment. I was happy to see them use multiple window sizes, but I thought the omission of stacked convolutional layers was a significant oversight. Their claim that small window sizes lose information about faraway patterns is true; however, this is exactly what stacking multiple convolutional layers addresses. Early layers pick up patterns that are close together, and later layers see patterns that are far apart.
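To make the stacking point concrete, here is a small sketch (mine, not from the paper) showing how two stacked kernel-size-3 convolutions give each top-layer feature an effective receptive field of five words:

```python
import torch
import torch.nn as nn

# Two stacked 1-D convolutions with kernel size 3: the second layer looks at
# three positions of the first layer's output, so each of its outputs depends
# on five consecutive words (effective receptive field 3 + (3 - 1) = 5).
stacked = nn.Sequential(
    nn.Conv1d(in_channels=50, out_channels=100, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(in_channels=100, out_channels=100, kernel_size=3, padding=1),
    nn.ReLU(),
)

embeddings = torch.randn(1, 50, 20)            # (batch, embedding_dim, sentence_length)
features = stacked(embeddings)                 # (1, 100, 20)
sentence_vector = features.max(dim=2).values   # max-pool over positions
```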

On both the 20Newsgroups and Fudan datasets, the authors achieve state-of-the-art results.

I thought their tests against a CNN were tenuous, and after five minutes of research I was able to find a paper, published a year before this one, that uses the same dataset (SST) and achieves results that outperform their RCNN (a 48.0 score).

By this point I’ve lost trust in the authors, and it’s hard for me to believe their results. It’s possible that the RCNN really is state of the art, but given the experimental and research oversights, I feel more studies in this area are necessary before any conclusions can be drawn.

Questions

  • “We compare our RCNN to well-designed feature sets in the ACL dataset.” How do you compare against feature sets?
  • “The magnitude of the maximum or minimum equals the square root of the ‘fan-in’ (Plaut and Hinton 1987)”, where the fan-in is the number of nodes in the previous layer and the learning rate for a layer is divided by its fan-in. What does this initialization mean in practice?

Viability as a project

Given the standardized datasets, this could definitely be reproduced as a project. I think the interesting part about doing this as a project would be using the 2014 paper’s CNN as a comparison against the proposed RCNN and seeing whether their results hold.

Words I don’t know

  • Data sparsity problem—Problem that arises in ML when there isn’t enough training data to adequately model the phenomenon. This occurs a lot in NLP, where a given training set is unlikely to include many of the words seen later, and the words it does include can be combined in ways that convey different meanings than they do in the training set.
  • Support Vector Machine— A popular supervised learning model used for binary classification.
  • Naive Bayes—Classification algorithm popular for document categorization. The algorithm makes the assumption that all features are independent of one another.
  • Logistic Regression—Classification method used for binary categories, where a feature set maps to one of two mutually exclusive categories. Using multinomial regression it is possible to extend this to more than two categories.
  • tf-idf—Term Frequency-Inverse Document Frequency. Weights terms by how frequently they appear in a document, offset by how often they appear in the entire corpus. This down-weights words that are common across the corpus relative to rarer, more distinctive words. Many text recommendation systems use tf-idf as a weighting mechanism (see the short sketch after this list).
  • LDA—Latent Dirichlet Allocation. A statistical model that explains sets of observations through unobserved groups; for example, documents in different categories might share many common words while differing in their more specific words.
  • Tree-kernel—??
  • Recursive NN—Neural network created by applying the same set of weights recursively to produce a structured prediction.
  • low-resource languages—languages for which not much data is available, either because the language is esoteric or because little data for it exists online.
  • Macro-F1—The F1 score (the harmonic mean of precision and recall) computed per class and then averaged, giving every class equal weight.
  • Precision/Recall—In the context of document retrieval, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved.
  • L1 Regularization—Regularization technique that adds the sum of the absolute values of the weights to the loss function. Recall that the L2 norm adds the sum of squares of the weights to the loss function.
  • dense vector—Vector that does not contain a lot of 0’s
  • log likelihood—The logarithm of the likelihood; maximizing it is equivalent to maximizing the probability of several independent events occurring, since the log turns their product into a sum.
  • Curse of dimensionality—as the dimensionality of a model grows, its ability to recognize more complex patterns increases, but the amount of data needed to train such a model grows exponentially with respect to the dimensionality.
  • skip-gram model—Model that takes a word as input and predicts its context (the words that appear before and after it) as output.
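As referenced in the tf-idf entry above, here is a minimal sketch of tf-idf weighting with scikit-learn (a toy corpus, not any of the paper’s datasets):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: words shared by both documents (e.g. "movie") receive lower
# weights than words that are specific to a single document.
docs = ["the movie was great fun", "the movie was a boring mess"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse (2, vocab_size) matrix

for word, weight in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0]):
    print(f"{word}: {weight:.3f}")
```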
