Notes on Bag of Tricks for Efficient Text Classification
Overall impression: This paper functioned well as a bite-sized introduction to a new architecture that delivered exactly what was promised: a bag of tricks for efficient text classification. It timidly reaches for larger conclusions but stops short of anything concrete.
⁉️ Big Question
How can we improve the accuracy and speed with which computers extract meaning from text?
🏙 Background Summary
Traditional text classification methods are centered around linear classifiers, which scale to large corpora and approximate state-of-the-art results when tuned correctly. Recent approaches have been centered around neural networks, which achieve noticeable improvements in accuracy but are slower to train and test.
❓ Specific question(s)
- How can we raise the bar which neural network text classifiers are compared against?
The authors will attempt to improve on the performance of basic linear classifiers with the key features of rank constraint and fast loss approximation.
Interesting, and kinda tricky, how the word “performance” is overloaded. Based on context it could either be “how good this model is” or “how fast this model trains”.
The ngram features of the input are first looked up to find word representations, then averaged into hidden text representations, which go into a linear classifier and finally a softmax output.
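The forward pass described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's code; all names and dimensions here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM, CLASSES = 1000, 10, 4          # illustrative sizes
E = rng.normal(size=(VOCAB, DIM))          # word/ngram embedding table
W = rng.normal(size=(DIM, CLASSES))        # linear classifier weights

def forward(token_ids):
    """Average the token embeddings into a hidden text
    representation, then classify with a linear layer + softmax."""
    hidden = E[token_ids].mean(axis=0)     # averaged bag-of-features
    logits = hidden @ W
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()

probs = forward([3, 17, 256])              # probabilities over 4 classes
```

The averaging step is what makes the model so cheap: the cost per document is linear in its length, regardless of how wide the "hidden layer" is relative to a deep network.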
The classifier trains on multiple CPUs with SGD and a linearly decaying learning rate.
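A linearly decaying learning rate is simple enough to sketch directly (the schedule below is a generic sketch; the paper's exact constants are its own):

```python
def linear_lr(step, total_steps, lr0=0.1):
    """Learning rate decaying linearly from lr0 at step 0 to 0
    at the final step."""
    return lr0 * (1.0 - step / total_steps)
```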
Two refinements are employed in this architecture: the hierarchical softmax function to reduce computational cost with a large number of classes (training time per example goes from linear to logarithmic in the number of classes, since the classes are arranged into a tree), and (when using n-grams) the hashing trick to keep a fast, memory-efficient mapping from n-grams, which capture partial local word order, to feature indices.
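The hashing trick for n-grams can be sketched like this; it avoids storing an exact n-gram-to-index dictionary by hashing each n-gram into a fixed number of buckets (a toy sketch; real implementations typically use a fixed, non-salted hash such as FNV rather than Python's built-in `hash`):

```python
def bigram_ids(tokens, num_buckets=1_000_000):
    """Map each bigram to a bucket index via hashing, so the feature
    table has a fixed size no matter how many distinct bigrams occur.
    Collisions are accepted as the price of bounded memory."""
    return [hash((a, b)) % num_buckets for a, b in zip(tokens, tokens[1:])]

ids = bigram_ids(["the", "cat", "sat"])  # one bucket id per bigram
```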
This is likely my own naiveté with regards to machine learning techniques outside of neural networks, but is it standard for them to train on CPUs? Is it specifically appropriate for text problems to avoid the overhead of transferring data to and from the GPU?
The authors constructed a model with the above architecture and evaluated it on sentiment analysis and tag prediction. They compare their results to various machine learning techniques, including several neural network architectures.
What does h stand for? I don’t see them explain it earlier in the paper, and they only refer to it as “hidden units” later on. I also kinda equate hidden layers with neural networks…so is fastText just a really small neural network?
These are the results and training times for sentiment analysis tasks. The fastText model trains an order of magnitude faster than the neural network implementations, while achieving parity on accuracy.
And these are the results for the tag prediction tasks. With the large data corpora, these tasks demonstrate the scalability of fastText compared to the neural networks. There is a significant speed-up, again without much compromise in accuracy.
Appreciate the data on actual training times, hyperparameters, and dataset sizes; it helps give a much more holistic and trustworthy picture of the process.
The authors note that performance could be further improved using pre-trained word embeddings.
The results do answer the specific question, and they even go beyond to question the premise that a baseline necessarily will perform worse than the candidate(s).
The authors show that the fastText architecture provides comparable accuracy to neural networks while being orders of magnitude faster to train and test.
This served as a good reminder to me that bigger is not always better. Many of the advancements in neural networks I’ve been exposed to have centered around wider, deeper, more complex architectures. However, we have to constantly question whether that added complexity and training cost is worth the accuracy gains.
They conjecture that this demonstrates that text classification problems might not be the best domain for neural networks in practice, despite the theoretical demonstration of their higher representational power.
I think this paper functioned well as a bite-sized introduction to a new architecture that delivered exactly what was promised: a bag of tricks for efficient text classification.
I wish the authors proposed concrete next steps beyond “more research is needed to evaluate the practicality of neural networks for these tasks”. It feels like they started off with the goal of a new baseline for text classification, got better results than they expected, and only put minimal effort into trying to draw conclusions; I’m unsure if this is for the better or worse.
⏩ Viability as a Project
The fastText architecture could certainly be applied to the Personalized Medicine competition on Kaggle (https://www.kaggle.com/c/msk-redefining-cancer-treatment), but it loses a lot of the scalability benefits since the data corpus is quite manageable in this competition. I would more likely revisit this for future projects with huge datasets.
The authors released their code here and encouraged others to build on it; there’s not much interest on my end, but there are plenty of extension points for this research.
The abstract does match the paper. To nitpick, I’d say that the statistics presented in the abstract don’t mean much out of context, but this context is available later in the paper.
🗣 What do other researchers say?
- Smerity on Hacker News points out that Vowpal Wabbit, which is not mentioned in the paper, has achieved very similar performance to fastText
🤷 Words I don’t know
- rank constraint: a cap on the rank (dimension of vector space spanned by columns) of a matrix
- loss approximation: a heuristic approach to computing the loss function that trades some accuracy and precision for speed
- hierarchical softmax: activation function that reduces the computational complexity of the softmax from linear to logarithmic in the number of classes
- hashing trick: fast and space-efficient way of vectorizing features by using the hash values of the features as direct indices into the vector
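To make the hierarchical softmax definition concrete, here is a tiny sketch of how a class probability becomes a product of binary decisions along a tree path (the node layout, names, and scores are invented for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leaf_probability(scores, path):
    """P(class) = product of binary 'go left / go right' decisions
    along the root-to-leaf path. Only O(log K) internal nodes are
    touched, versus O(K) terms in a flat softmax over K classes.

    scores: maps each internal node id to its score for 'go right'
    path:   list of (node_id, went_right) pairs from root to leaf
    """
    p = 1.0
    for node, went_right in path:
        s = sigmoid(scores[node])
        p *= s if went_right else 1.0 - s
    return p

# With neutral scores (sigmoid(0) = 0.5), a depth-2 leaf gets 0.25.
p = leaf_probability({0: 0.0, 1: 0.0}, [(0, True), (1, False)])
```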