FastText — a solution for classification with a large number of labels

Rémi Denoyer
5 min read · Oct 5, 2022


Fig 0. Illustration of “choice”

Abstract

The problem — users of an online mentorship program provide text describing what they expect from the mentorship relationship. This data has been labeled: one or more labels are associated with each text. The goal is to select labels for new data points.

The study — This classification problem is challenging because there is a small number of existing data points (20,000) and 150 labels. This study compares the performance of selected machine learning algorithms on this task. The models tested are logistic regression, FastText and a custom neural network, and the classification results are shared.

The results — FastText outperforms the other models on this task. The end of the study briefly presents the selected model.

1. Objectives and data set

a. Dataset

The dataset contains 20,000 datapoints; each datapoint contains two strings: a one-liner of around a hundred characters and a description of varying length that never exceeds a few thousand characters. Each datapoint is associated with one or more of one hundred and fifty labels. Each label is a short descriptive sentence.

Fig. 1 Examples of labels

b. Objective

The goal is to associate new datapoints with labels. The model computes a score for each label and selects the labels most likely to match the input.

c. Data pipeline

Data preparation relies on nltk and other NLP Python libraries to remove English stop words.

Fig. 2: My helpers for this NLP challenge
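The actual helpers live in the figure above; as a rough sketch of what they could look like (the names tokenize and clean and the regex tokenizer are illustrative, not the article's code):

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))


def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def clean(text):
    """Drop English stop words and rejoin the remaining tokens."""
    return " ".join(token for token in tokenize(text) if token not in STOP_WORDS)


print(clean("I am struggling to scale my engineering team"))
# -> "struggling scale engineering team"
```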

d. Test procedures

The classic approach of predicting a single best label does not work here due to the large number of labels. The output of a prediction is a list of labels ordered by probability. A test is considered valid if one of the top 5 predicted labels matches one of the labels associated with the datapoint.
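This top-5 check could be implemented with a small helper along these lines (the name and signature are hypothetical, not the article's code):

```python
def top_k_hit_rate(ranked_predictions, true_labels, k=5):
    """ranked_predictions: one list of labels per datapoint, ordered by
    decreasing probability; true_labels: one set of ground-truth labels
    per datapoint. A datapoint counts as a hit if any of its top-k
    predicted labels belongs to its ground-truth set."""
    hits = sum(
        1
        for predicted, truth in zip(ranked_predictions, true_labels)
        if any(label in truth for label in predicted[:k])
    )
    return hits / len(true_labels)
```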

2. Models tested

a. Vanilla logistic regression with bag of words

As a reference for the most naive approach, we train a logistic regression on word frequency vectors. Using more helpers in the spirit of those described in Fig. 2, the training pipeline is shown in Fig. 3.

Fig. 3: Multi-class word frequency vector logistic regression

Important notes:

  • The word frequency embedding is specific to this model in this experiment, but any representation could work. It is possible to refine the model using pre-trained word embeddings such as Word2Vec or GloVe, or any custom embeddings.
  • This is a multi-class logistic regression model. Following the scikit-learn documentation, the LogisticRegression class is configured for multinomial logistic regression: multi_class is set to “multinomial” and solver to one that supports the multinomial loss, such as L-BFGS (lbfgs). See the sketch below this list.
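A minimal sketch of the Fig. 3 pipeline under these assumptions, using scikit-learn's CountVectorizer for the word frequency vectors; the toy corpus, variable names and hyper-parameters are illustrative, not the article's actual code, and the multi_class argument may be unnecessary on recent scikit-learn versions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder corpus; in the experiment these are the cleaned documents and their labels
docs = ["struggling to scale the engineering team", "hard time hiring senior engineers"]
labels = ["scaling_the_team", "hiring"]

vectorizer = CountVectorizer()            # word-frequency (bag-of-words) vectors
X = vectorizer.fit_transform(docs)

clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000)
clf.fit(X, labels)

# One probability per class, used to rank labels and keep the best ones for a new text
scores = clf.predict_proba(vectorizer.transform(["how do I grow the team"]))
print(sorted(zip(clf.classes_, scores[0]), key=lambda pair: -pair[1])[:5])
```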

All supervised learning NLP experiments will follow the same logic: get a corpus of documents of roughly the same kind, clean this corpus, prepare the data for the selected model, train the model, and play with the model.

b. FastText

Now that the NLP flow is clear, let’s play with more advanced models!

FastText is an open-source library, developed by the Facebook AI Research lab. Its main focus is on achieving scalable solutions for the tasks of text classification and representation while processing large datasets quickly and accurately.

It relies on a more advanced vectorization based on bags of words (similar to our regression above) and word n-grams. The computation is still very fast thanks to the use of a hierarchical softmax. I describe these features in more detail in part 4.

In practice, to train a fasttext model, the input must follow a specific format:

  • The input is a text file
  • Each line is made of: the document, a space, the __label__ substring directly prefixing the class name, and a line break (see Fig. 4)
Fig. 4: Fasttext training
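As a hedged sketch of this format and of supervised training with the fasttext Python package (the file name, toy samples and hyper-parameters are illustrative; wordNgrams and the "hs" loss anticipate the features discussed in part 4):

```python
import fasttext

# One document per line, followed by a space and the class name prefixed by __label__
samples = [("struggling to scale the engineering team", "scaling_the_team"),
           ("hard time hiring senior engineers", "hiring")]
with open("train.txt", "w") as f:
    for text, label in samples:
        f.write(f"{text} __label__{label}\n")

model = fasttext.train_supervised(
    input="train.txt",
    wordNgrams=2,   # bag of word bigrams on top of unigrams (see part 4)
    loss="hs",      # hierarchical softmax (see part 4)
    epoch=25,
)

# Top-5 labels and their probabilities for a new document
print(model.predict("how do I grow the team", k=5))
```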

3. Results

The vanilla bag of words approach is easy to use and beats a random baseline. However, the accuracy is low and some of the results are very far from reality: the most represented classes are returned too often.

Neural networks are known to perform best on this task, but the training required to build a custom one is out of the scope of this experiment.

FastText results are very satisfying, with an accuracy of 68% for the top 10 classes and, more importantly, intelligent and understandable results for most of the inputs.

Fig. 5: Examples of fasttext results

4. Model selected: FastText

FastText demonstrates far better accuracy than the linear approach (logistic regression), with training and testing times far smaller than a fully customized neural network approach.

The literature tells us that this efficiency and accuracy are fueled by two main methods for the text classification task.

a. N-grams: key to model accuracy

Fig. 6: n-grams illustration

An n-gram is a word together with its context, i.e. the other words surrounding it. Capturing this information is crucial in text classification, especially when the corpus is made of semantically close sentences. Here we are classifying the challenges of tech managers, so we can expect some word sequences to be extremely meaningful.

Example: the frequency of the word team is far less informative than the bigram formed by team and the verb that follows it.

FastText incorporates a bag-of-n-grams representation along with word vectors, which gives the model a richer grasp of the semantics of each document of the corpus.
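To make the team + verb example concrete, here is a toy illustration of word bigrams (a plain Python sketch, not FastText internals):

```python
def word_ngrams(tokens, n=2):
    """Return the list of n-grams (as strings) of a tokenized document."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "my team struggles to ship on time".split()
print(word_ngrams(tokens))
# ['my team', 'team struggles', 'struggles to', 'to ship', 'ship on', 'on time']
```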

More on n-grams here.

b. Hierarchical softmax: efficiency booster

Fig 7: Example of where the product calculation would occur for the word “I’m” in a hierarchical softmax. Credits to Steven Schmatz

A softmax transforms a layer of scores into a probability distribution (R^n -> [0, 1]^n, summing to 1) and is the most common activation function for multi-class classification.

Hierarchical softmax is an alternative to softmax that is faster to evaluate: it takes O(log n) time per class compared to O(n) for softmax. It uses a multi-layer binary tree where the probability of a class is calculated as the product of the probabilities on each edge of the path to its leaf (Morin and Bengio).
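Written out in the standard formulation (following Morin and Bengio; the symbols below are generic, not FastText's internal notation):

```latex
% Flat softmax over n classes: even a single class probability needs the full
% normalizing sum, hence O(n) per evaluation.
P(y = i \mid x) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}

% Hierarchical softmax: the class sits at a leaf of a binary tree; its
% probability is the product of binary (sigmoid) decisions along the
% root-to-leaf path n_1, ..., n_L, with L = O(\log n) terms.
% s_k = +1 or -1 encodes whether the path turns left or right at node n_k,
% v_{n_k} is that node's parameter vector and h the hidden representation.
P(y = i \mid x) = \prod_{k=1}^{L} \sigma\left( s_k \, v_{n_k}^{\top} h \right)
```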

Shorter paths are assigned to frequent classes and longer paths to rarer classes. FastText runs a depth-first search along the branches of the tree, so low-probability classes are discarded quickly.

For multi-class classification with a large number of classes, this approach yields a significant gain in efficiency compared to other models.

Conclusion

NLP model selection relies on building a strong pipeline with well-defined layers to avoid the usual notebook traps: use modules to get the data and to run a common pre-processing, then understand what input each model needs, and then compare results.

NLP model selection always requires keeping in mind the specificities of each dataset and of the target results: here, the goal was to produce labels close to the target, not necessarily perfect ones, and to capture the semantics of verbs, which n-grams made possible.

FastText outperforms the other models for multi-class classification with a large number of classes, and it is rather easy to feed and to use.


Rémi Denoyer

Data Lead @ Plato. Find me in SF or Paris depending on the weather.