Bag of Tricks for Efficient Text Classification

Overall thoughts: Well written, reproducible, rigorous analysis compared against strong baselines.

James Vanneman
Paper Club
5 min read · Aug 17, 2017


Background Summary

Neural networks are typically a good choice for text classification problems in NLP. They tend to perform very well, but they are slow to train, so their usefulness on large datasets is limited. Linear classifiers can also do very well on text classification problems if the right features are selected. However, these classifiers (e.g. SVMs) are limited because they don’t share parameters among features and classes. When the output space is large, this prevents them from leveraging information across classes and generalizing well. Work that combines the two approaches, i.e. neural networks with simple linear connections that train quickly on large datasets, has not yet been explored.

Specific Questions

  • Can we create a neural network that trains quickly on large datasets yet still provides strong baselines comparable with state of the art results?

Model

The model is called fastText. The set of features x is made up of the N ngram features in the sentence. The advantage of using ngrams is that you capture information about local word ordering.

For example, bag of words corresponds to ngrams with N = 1 and completely disregards ordering. Ngrams with N = 2 take into account words that are adjacent to each other. Higher values of N are more computationally expensive but capture more information about ordering.
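To make this concrete, here is a minimal sketch (my own, not from the paper or the fastText codebase) of extracting unigram and bigram features from a sentence; note how the bigram 'not good' keeps a negation that a plain bag of words loses.

```python
# Minimal sketch (my own, not the paper's code) of extracting ngram features.
def extract_ngrams(tokens, n):
    """Return all contiguous n-grams from a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the movie was not good".split()

print(extract_ngrams(sentence, 1))  # ['the', 'movie', 'was', 'not', 'good']
print(extract_ngrams(sentence, 2))  # ['the movie', 'movie was', 'was not', 'not good']
# The bigram 'not good' preserves the negation that a plain bag of words (N = 1) loses.
```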

Are the ngram vectors similar to word embeddings, but with N > 1?

What is a hidden variable?

Does each ngram have the same vector space?

How do you compute ngram vectors?

Are they taking a vector for each ngram in the sentence, summing them all, and dividing by the total number of ngrams?

What do they initialize the ngram vectors to?

Is the error back-propagated to the ngrams?

The negative log likelihood is minimized with the following function
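Reproducing the objective from the paper, with f the softmax function and N the number of documents in the training set:

$$ -\frac{1}{N}\sum_{n=1}^{N} y_n \log\big(f(B A x_n)\big) $$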

Here Xn is the normalized bag of features for the nth document, A and B are weight matrices, and yn is the label.

If Xn is a bag of features, why is it multiplied by A (a lookup table over the words)?
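One way to see it (a numpy sketch of my own, with made-up dimensions): because Xn is a normalized count vector, multiplying it by A picks out and averages the columns of A for the ngrams that appear in the document, so A really does behave like a lookup table.

```python
# Numpy sketch (my own, made-up dimensions) of why multiplying the normalized
# bag-of-features vector by A is the same as looking up and averaging the
# ngram vectors stored in A's columns.
import numpy as np

vocab_size, hidden_dim, num_classes = 10, 4, 3
rng = np.random.default_rng(0)

A = rng.normal(size=(hidden_dim, vocab_size))   # ngram embedding lookup table
B = rng.normal(size=(num_classes, hidden_dim))  # classifier weights

ngram_ids = [2, 5, 5, 7]                        # ngram indices in one document

# Path 1: build the normalized bag-of-features vector and multiply by A.
x = np.zeros(vocab_size)
for i in ngram_ids:
    x[i] += 1
x /= x.sum()
hidden_via_matmul = A @ x

# Path 2: look up each ngram's column of A and average.
hidden_via_lookup = A[:, ngram_ids].mean(axis=1)

print(np.allclose(hidden_via_matmul, hidden_via_lookup))  # True

# Class scores; a softmax over these gives the predicted label distribution.
scores = B @ hidden_via_lookup
```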

The output is passed through a hierarchical softmax classifier to improve the runtime of the model. A typical softmax classifier has complexity O(kh), where k is the number of classes and h is the dimension of the text representation. This makes sense when you consider that the matrix needed to map the text representation to the class scores is of size k*h. A hierarchical softmax reduces this complexity to O(h*log2(k)).
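To make the O(h*log2(k)) claim concrete, below is a toy sketch of the idea (my own illustration over a complete binary tree; the actual fastText implementation uses a Huffman tree built from class frequencies): each prediction evaluates only about log2(k) node vectors instead of all k rows of the output matrix.

```python
# Toy sketch of a hierarchical softmax over a complete binary tree (my own
# illustration; the paper uses a Huffman tree built from class frequencies).
# Each prediction touches only ~log2(k) node vectors instead of all k rows
# of a full output matrix, which is where the O(h*log2(k)) cost comes from.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_softmax_prob(hidden, node_weights, target, num_classes):
    """P(target | hidden): a product of binary decisions along the tree path.

    Internal nodes use heap indices 1..num_classes-1 (each has its own weight
    vector); the leaf for class c sits at index num_classes + c.
    """
    prob = 1.0
    idx = num_classes + target
    while idx > 1:
        parent = idx // 2
        go_right = idx % 2                      # node 2p is the left child, 2p+1 the right
        p_right = sigmoid(node_weights[parent] @ hidden)
        prob *= p_right if go_right else (1.0 - p_right)
        idx = parent
    return prob

hidden_dim, num_classes = 4, 8
rng = np.random.default_rng(0)
node_weights = rng.normal(size=(num_classes, hidden_dim))  # rows 1..7 are the internal nodes
hidden = rng.normal(size=hidden_dim)

probs = [hierarchical_softmax_prob(hidden, node_weights, c, num_classes)
         for c in range(num_classes)]
print(sum(probs))  # ~1.0: the leaf probabilities still form a distribution
```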

Experiments

Experiment 1: Sentiment analysis

The authors follow the evaluation protocol of Zhang et al. (2015). They evaluate on 8 datasets and include results from four other architectures tested on the same datasets.

I like how they’re making the comparisons here. The authors are using a bunch of different datasets and reporting (recent!) experimental results from other authors on those same datasets. This makes it much easier for the reader to follow the authors’ conclusions.

The results indicate that fastText is competitive: it beats char-CNN and char-CRNN, and is not quite as good as VDCNN.

Why did they use ngrams of 2 above but ngrams of 5 for the Figure 3 dataset?

Comparisons against different papers on different datasets support their conclusion that fastText is competitive on performance while being computationally much more efficient.

Ngrams of 5 are used to achieve the results below.

Comparison with Tang et al. (2015)

The authors claim that the 1% difference between their model and Tang et al. (Table 3) is due to their not using pre-trained word embeddings. They should run the comparison with pre-trained word embeddings before making this assertion.

I agree with the authors that these results clearly indicate that fastText’s performance is comparable.

Comparing training speeds shows that fastText trains significantly faster than the other models.

Experiment 2: Tag prediction

The authors used 100M images with titles, captions and tags. They trained their model by using the titles and captions to predict the tags. They have also released a script that builds the data they used so you can reproduce their results.

Yay, reproducible results! Huge win. I also agree with the authors that this comparison shows a huge reduction in computational cost for their model.

The authors compare against a different model for predicting tags, called Tagspace, and show considerable improvements in accuracy when using bigrams, along with much faster performance.

Why not use ngrams larger than N = 2 to improve accuracy?

Are there other models besides Tagspace used for this type of classification? Is Tagspace state of the art?

Additional Questions

What is data augmentation in the context of text classification? “For char-CNN, we show the best reported numbers without data augmentation.”

Why can’t word2vec features be averaged together to create sentence representations? “Unlike unsupervisedly trained word vectors from word2vec, our word features can be averaged together to form good sentence representations”

Why don’t linear classifiers share parameters among features and classes? “However, linear classifiers do not share parameters among features and classes.”

What do they mean by “factorize the linear classifier into low rank matrices”?

This would be something like PCA, which takes higher-rank matrices and approximates them in lower dimensional spaces.
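A rough numpy sketch of that idea (sizes are made up): constraining the classifier weight matrix W to be the product of two thin matrices B and A caps its rank at the hidden dimension and sharply cuts the parameter count.

```python
# Sketch (made-up sizes) of the rank constraint: writing the k x V classifier
# as B @ A with a small inner dimension h both caps the rank at h and cuts
# the parameter count, a bit like a PCA-style low-rank approximation.
import numpy as np

vocab_size, num_classes, hidden_dim = 100_000, 300, 10

full_params = num_classes * vocab_size                                 # unconstrained W: 30,000,000
low_rank_params = num_classes * hidden_dim + hidden_dim * vocab_size   # B and A: 1,003,000
print(full_params, low_rank_params)

A = np.random.randn(hidden_dim, vocab_size)
B = np.random.randn(num_classes, hidden_dim)

W_slice = B @ A[:, :500]                   # a slice of W = B @ A, for a quick check
print(np.linalg.matrix_rank(W_slice))      # 10: the rank is capped by hidden_dim
```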

Words I don’t know

  • Efficient word representation—building up vector representations of words at a low computational cost
  • rank constraint—using low dimensional matrices to approximate higher dimensional spaces.
  • fast loss approximation—??
  • Hierarchical softmax — an approximation of the softmax classifier that is much faster to compute than a regular softmax when there are many classes
  • character level convolutional model—one hot encode each character, combine the encodings to form a matrix and use this matrix as the input to the convolutional layers
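To make that last bullet concrete, here is a tiny sketch (alphabet and padding length are my own choices) of turning a string into the one-hot character matrix such a model consumes.

```python
# Tiny sketch of the char-CNN input described above (alphabet and padding
# length are my own choices): one-hot encode each character and stack the
# encodings into a matrix that the convolutional layers consume.
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz "
char_to_idx = {c: i for i, c in enumerate(alphabet)}
max_len = 16

def encode(text):
    mat = np.zeros((len(alphabet), max_len))
    for pos, ch in enumerate(text.lower()[:max_len]):
        if ch in char_to_idx:               # characters outside the alphabet stay all-zero
            mat[char_to_idx[ch], pos] = 1.0
    return mat

x = encode("great movie")
print(x.shape)           # (27, 16): one row per alphabet symbol, one column per character
print(x[:, 0].argmax())  # 6, i.e. 'g'
```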
