FastText: stepping through the code
Little disclaimer: some of the information in this blog post might be incorrect. It will most probably become out-of-date very soon too. In any case, I would appreciate any feedback!
FastText is a library developed by Facebook for text classification, but it can also be used to learn word embeddings. Since being open-sourced in 2016¹, it has been widely adopted thanks to its training speed as well as its high performance.
In spite of reading the (very sparse) documentation, I realised that there were many parts of the algorithm that were obscure to me, so I decided to go through the exercise of trying to understand how the model works. I started by reading the main papers² and a lightning talk from the Stanford deep learning for NLP course³. Doing this only raised more questions than I had at first. For example: both the Stanford lecture and one of the papers talk about word n-grams, but do not mention character n-grams. Also, the command-line optional parameters include something called `bucket`, and the only explanation given is "number of buckets". How are the n-grams computed? The only way to find out was to go through the code on GitHub.
In this blog post, I am only going to explore the supervised setting.
What I learnt
- how FastText uses character n-grams as well as word n-grams,
- that FastText supports multilabel classification, which is not obvious from reading the documentation,
- the number of parameters in the model.
Introducing the model
According to the main paper, the model is a simple neural network with only one hidden layer. The bag-of-words representation of the text is first fed into a lookup layer, where the embedding of every single word is fetched. Those word embeddings are then averaged, so as to obtain a single averaged embedding for the whole text. At the hidden layer we end up with `n_words x dim` parameters, where `dim` is the size of the embeddings and `n_words` is the vocabulary size. After the averaging, we only have a single vector, which is then fed to a linear classifier: we apply the softmax over a linear transformation of the output of the input layer. The linear transformation is a `dim x n_output` matrix, where `n_output` is the number of output classes. In the original paper, the final (negative) log-likelihood objective is:
-(1/N) ∑_n y_n log(f(B A x_n))

where
- x_n is the original one-hot-encoded representation of a word (or n-gram feature),
- y_n is the label of the n-th example,
- A is the lookup matrix that retrieves the word embedding,
- B is the linear output transformation,
- f is the softmax function.
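As a rough illustrative sketch of this forward pass, here is a small pure-Python version. The toy matrices `A` and `B` and the helper name `forward` are mine for illustration, not FastText's; the real implementation is in C++ and operates on its own matrix types.

```python
import math

def forward(feature_indices, A, B):
    """Average the embeddings (rows of A) of the input features,
    then apply the linear transformation B followed by a softmax."""
    dim = len(A[0])
    # hidden layer: average of the looked-up embeddings
    hidden = [0.0] * dim
    for i in feature_indices:
        for d in range(dim):
            hidden[d] += A[i][d]
    hidden = [h / len(feature_indices) for h in hidden]
    # output layer: scores = B @ hidden, then a numerically stable softmax
    scores = [sum(b * h for b, h in zip(row, hidden)) for row in B]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# toy example: 3 features of dimension 2, 2 output classes
A = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
B = [[1.0, -1.0], [-1.0, 1.0]]
probs = forward([0, 2], A, B)
```

The returned `probs` sum to one, and the predicted class is simply the arg-max over them.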
This is all great, but how is this implemented in the code?
Stepping through the code
The input file is formatted so that each line starts with a label prefix such as `__label__0`, followed by a sentence, i.e.
__label__cat This text is about cats.
__label__dog This text is about dogs.
The input file path is passed as an argument to the `train` function. This function starts by initialising and populating a variable called `dict_`, which is an instance of the class `Dictionary`.
As we read and parse each sentence of the input file, two main vectors are populated:
- `words_`, which contains all the unique words found in the text,
- `word2int_`, which maps the hash of each of these words to its position in the `words_` vector. This is important because those indices will be used to look up the embedding matrix A.
The vector `words_` contains instances of `entry`. Each entry can be of type `word` or `label`, and has a count associated with it. There is also the `subwords` field, but we'll have to wait and see what that is.
When adding words/labels to the `word2int_` variable, collisions are resolved so that two different words never map to the same index: a while-loop probes successive slots until it finds either an empty one or the word itself.
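For illustration, here is a Python sketch of this hash-then-probe scheme. The FNV-1a constants (2166136261 and 16777619) match the ones used in the C++ `Dictionary::hash`, but the tiny table size and the helper names `find`/`add` are my own simplifications:

```python
MAX_VOCAB_SIZE = 30  # tiny table for illustration; the real constant is far larger

def fnv1a(word):
    """32-bit FNV-1a hash of a word, as in FastText's Dictionary::hash."""
    h = 2166136261
    for c in word.encode("utf-8"):
        h = ((h ^ c) * 16777619) & 0xFFFFFFFF
    return h

def find(word2int, words, w):
    """Linear probing: advance until an empty slot or the word itself."""
    i = fnv1a(w) % MAX_VOCAB_SIZE
    while word2int[i] != -1 and words[word2int[i]] != w:
        i = (i + 1) % MAX_VOCAB_SIZE
    return i

def add(word2int, words, w):
    i = find(word2int, words, w)
    if word2int[i] == -1:          # new word: append and record its index
        words.append(w)
        word2int[i] = len(words) - 1

word2int = [-1] * MAX_VOCAB_SIZE
words = []
for w in ["cats", "dogs", "cats"]:
    add(word2int, words, w)
# "cats" is only stored once; two distinct words never share a slot
```

Because the probing loop only stops on an empty slot or an exact string match, two different words can never end up at the same index, even when their hashes collide.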
Both vectors are later filtered to make sure that only the words and labels with a minimum number of occurrences are kept. Then comes the interesting part of the `readFromFile` function: `initNgrams` is called. We're about to solve the mystery of the n-grams.
The function `initNgrams` finds all the character n-grams of a word, up to length `maxn`, and adds them to the `ngrams` vector. Finally, the hash of each n-gram is computed modulo the `bucket` size and stored in the `ngrams` vector. In other words, the character n-gram indices are added "after" the word indices, so that they do not collide with the word indices. Character n-gram hashes can still collide among themselves, though!
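A simplified Python sketch of the character n-gram extraction. The `<` and `>` boundary markers are the ones FastText wraps around each word; the function name is mine, and the real code additionally walks UTF-8 byte sequences rather than Python characters:

```python
def char_ngrams(word, minn=3, maxn=6):
    """All character n-grams of lengths minn..maxn of the word,
    after wrapping it in the < and > boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

grams = char_ngrams("cats", minn=3, maxn=3)
# → ['<ca', 'cat', 'ats', 'ts>']
```

Each of these n-grams would then get an index of the form `nwords_ + hash(ngram) % bucket`, placing it in the rows of A that sit after the word rows.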
So, for each word in the text, `subwords` represents… the character n-grams! The embedding matrix A now maps:
- the first `nwords_` rows to the word embeddings of each word in the original vocabulary,
- the following `bucket` rows to the embeddings of the character n-grams.
So we have resolved the mystery of the character n-grams. Now… what about the word n-grams? Bear with me, we’re getting there!
Going back to the main `train` function, the next steps are:
This is where the embedding lookup matrix A gets initialised. If some `pretrainedVectors` are passed to the function, they are loaded. If not, the matrix is initialised with random numbers between `-1/dim` and `1/dim`, where `dim` is the embedding size. The dimension of this matrix is `(n_words + bucket) x dim`, because we are going to fit embeddings for each word as well as for the `bucket` character n-grams we saw earlier. The output layer is also initialised in this step.
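A minimal sketch of that initialisation, assuming only what the text says (uniform values in `(-1/dim, 1/dim)`, one row per word plus one row per bucket); the function name and the use of Python's `random` module are mine:

```python
import random

def init_input_matrix(n_words, bucket, dim, seed=0):
    """(n_words + bucket) x dim matrix, uniform in (-1/dim, 1/dim)."""
    rng = random.Random(seed)
    bound = 1.0 / dim
    return [[rng.uniform(-bound, bound) for _ in range(dim)]
            for _ in range(n_words + bucket)]

# toy sizes: 5 vocabulary words, 3 n-gram buckets, embeddings of size 4
A = init_input_matrix(n_words=5, bucket=3, dim=4)
```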
Finally, we get to the part where we will start training the model. This happens in threads, which we will not get into.
In the supervised part of the code, two things happen: first, the `getLine` function is called using the `dict_` variable; secondly, the `supervised` function is called.
In the function above, we read one sentence from the input file stream word by word, finding the index of each word via the `word2int_` vector. We add the subwords (remember, the character n-grams) to the `words` variable, as we saw before. And finally, we see something new: the labels are read as well and put into a vector called `labels`.
After the whole sentence is read into the appropriate vectors, we finally get to the part of the code that deals with the word n-grams: the `addWordNgrams` function!
This is where we finally disentangle the mystery. `hashes` contains the hash of each word encountered in the sentence, whereas `line` holds the sentence's word and character n-gram indices. `n` is the `wordNgrams` parameter, which specifies the maximum length of the word n-grams. Each word n-gram gets its hash computed through the recursive formula `h = h * 116049371 + hashes[j];`, a multiplicative rolling hash (similar in spirit to FNV) that combines the hashes of the individual words of the n-gram into a single integer. By doing this, each word n-gram gets mapped to a hash value that is very likely, though not guaranteed, to be distinct. Finally, this value, which can become very large, is taken modulo the number of buckets.
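The recursive formula can be sketched in Python as follows. The constant 116049371 is the one from the source; the function name mirrors the C++ `addWordNgrams`, but the toy hash values and sizes below are mine:

```python
def add_word_ngrams(line, hashes, n, nwords, bucket):
    """Append bucket indices for all word n-grams (bigrams up to n-grams)
    to `line`. `hashes` holds the hash of each word of the sentence, in order."""
    for i in range(len(hashes)):
        h = hashes[i]
        for j in range(i + 1, min(i + n, len(hashes))):
            # fold the next word's hash into the running n-gram hash
            h = h * 116049371 + hashes[j]
            # word n-gram rows live after the nwords word rows of matrix A
            line.append(nwords + h % bucket)

line = [0, 1, 2]                 # word indices of a 3-word sentence
add_word_ngrams(line, [11, 22, 33], n=2, nwords=100, bucket=50)
# with n=2, one bigram index per adjacent word pair is appended,
# each landing in the range [nwords, nwords + bucket)
```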
So that is how the word n-grams are computed: similarly to the character n-grams, but not quite, because here we hash a combination of the word hashes rather than the actual n-gram string. I wonder why this choice was made in the code…
After the sentence is read, the `supervised` function is called next. If the sentence has multiple labels assigned to it, one of them is selected at random to update the model.
Finally, inside the model update function…
The `input` variable passed to `supervised` holds the indices of all the features (words, character n-grams and word n-grams) in the sentence, and the target is the index of the output class. The `computeHidden` function looks up the embedding of each input feature in the embedding matrix `wi_` and averages them together (by summing with `addRow` and dividing by the number of features). Once the `hidden_` vector has been computed, it is passed on to the `softmax` function (the default loss function) alongside the target label.
This is where the softmax is applied using the parameters of the output layer and the input vector, and the log-loss is computed and returned by the softmax function.
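As a small illustration of that loss, here is a numerically stable sketch of the softmax log-loss (my own Python, not the library's code; `scores` stands for the output of the linear transformation):

```python
import math

def softmax_loss(scores, target):
    """-log(softmax(scores)[target]), computed via the log-sum-exp trick."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

# with equal scores over 2 classes, the loss is log(2)
loss = softmax_loss([0.0, 0.0], target=0)
```

The higher the score of the target class relative to the others, the smaller this loss becomes.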
Using this hashing trick means that the number of model parameters does not grow with the number of distinct word/character n-grams; it is capped by the number of buckets.
For example, FastText does not use word or character n-grams out of the box (`wordNgrams` is 1 and `maxn` is 0 by default, so no bucket rows are needed), but say we were to use the following parameters:
- `bucket = 2000000`, together with `wordNgrams > 1` or `maxn > 0`,
- `dim = 100`,
- `n_output = 2`,
- `n_words = 500000`.
The final number of model parameters to learn would be `(500000 + 2000000) * 100 + (2 * 100) = 250,000,200`. As we can see, even though the model is relatively simple, the number of parameters is still very large, as is typical in deep learning methods. This uses some rough assumptions, such as all `2000000` n-gram buckets actually being used, but it gives an order of magnitude. Model complexity can be reduced by tweaking some of the parameters, such as `bucket` or the sampling threshold.
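The arithmetic above can be checked with a tiny helper; the formula simply follows the matrix shapes described earlier (the function name is mine):

```python
def n_parameters(n_words, bucket, dim, n_output):
    """Rows of the input matrix A plus the output matrix B."""
    return (n_words + bucket) * dim + n_output * dim

total = n_parameters(n_words=500_000, bucket=2_000_000, dim=100, n_output=2)
# → 250000200
```

Setting `bucket = 0` (no n-grams at all) drops this to 50,000,200 parameters, which shows how much of the model's size comes from the n-gram buckets.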
Ok, that’s it for now! I hope you enjoyed this post :-)
¹https://research.fb.com/fasttext/
²https://arxiv.org/pdf/1607.01759.pdf and https://arxiv.org/pdf/1607.04606v1.pdf
³ https://www.youtube.com/watch?v=isPiE-DBagM&index=5&list=PLU40WL8Ol94IJzQtileLTqGZuXtGlLMP_ (from 47:39)