Diving Into Word Embeddings with a New Model
Traditional word embeddings solve many downstream natural language processing (NLP) problems well, such as document classification and named-entity recognition (NER). However, one of their drawbacks is the inability to handle out-of-vocabulary (OOV) words.
Facebook introduced Misspelling Oblivious Word Embeddings (MOE), which overcomes this limitation by extending the fastText architecture. This story therefore walks through fastText's training method and architecture before discussing MOE.
Skip-gram with Negative Sampling (SGNS)
fastText extends word2vec's architecture, which trains word embeddings with the skip-gram with negative sampling method. Skip-gram learns text representations (aka embeddings) by using the center word to predict its surrounding context words, while negative sampling supplies false (word, context) pairs for that training. For more detail, you can check out these posts (skip-gram and negative sampling).
The following figure shows the two training methods in word2vec. Continuous bag-of-words (CBOW) leverages the surrounding context words to predict the center word, while skip-gram uses the center word to predict its surrounding context words.
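To make the skip-gram with negative sampling objective concrete, here is a minimal sketch of the loss for a single (center, context) pair with a few negative samples. The function name, vector dimensions, and toy data are illustrative, not taken from the word2vec or fastText source.

```python
import numpy as np

def sgns_loss(center_vec, context_vec, negative_vecs):
    """Skip-gram with negative sampling loss for one (center, context) pair.

    center_vec:    input vector of the center word
    context_vec:   output vector of the observed context word
    negative_vecs: output vectors of k randomly sampled "negative" words
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Pull the true context word toward the center word ...
    positive = np.log(sigmoid(center_vec @ context_vec))
    # ... while pushing the sampled negative words away from it.
    negative = np.sum(np.log(sigmoid(-negative_vecs @ center_vec)))
    return -(positive + negative)

# Toy usage with random 5-dimensional vectors and 3 negative samples.
rng = np.random.default_rng(0)
print(sgns_loss(rng.normal(size=5), rng.normal(size=5), rng.normal(size=(3, 5))))
```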
fastText follows the SGNS idea with a small modification. One of fastText's characteristics is subwords: the n-gram method is used to split each word into subwords. For example, with character n-gram lengths between 3 and 5, banana is split into subwords such as ban, ana, nan, bana, anan, nana, banan and anana (fastText also adds < and > as word-boundary markers). The embedding of banana is then computed as the sum of these subword embeddings.
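A short sketch of this n-gram extraction is shown below. It mimics fastText's convention of wrapping the word in < and > boundary markers; the helper name and the default lengths (3 to 5) are just for illustration.

```python
def char_ngrams(word, min_n=3, max_n=5):
    """Extract fastText-style character n-grams, using < and > as word boundaries."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("banana"))
# ['<ba', 'ban', 'ana', 'nan', 'ana', 'na>', '<ban', 'bana', 'anan', ...]
```

The word vector for banana is then the sum of the vectors of these n-grams.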
For text classification, fastText's training objective is to predict the label. The model's inputs are n-gram features (i.e. x1, x2, ..., xN); these features are averaged in the hidden layer and the result is eventually fed into the output layer.
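The sketch below shows that averaging step followed by a linear output layer with a softmax over labels. The vocabulary size, embedding dimension, label count, and feature ids are all illustrative, not fastText's actual defaults.

```python
import numpy as np

vocab_size, embed_dim, num_labels = 10_000, 100, 5
rng = np.random.default_rng(0)
embedding = rng.normal(scale=0.1, size=(vocab_size, embed_dim))  # n-gram feature embeddings
output_w = rng.normal(scale=0.1, size=(embed_dim, num_labels))   # output (label) layer

def classify(ngram_ids):
    hidden = embedding[ngram_ids].mean(axis=0)   # average the n-gram features
    logits = hidden @ output_w                   # linear output layer
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                   # softmax over labels

print(classify([12, 345, 6789]))                 # toy n-gram feature ids
```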
Misspelling Oblivious Word Embeddings (MOE)
MOE further extends the fastText idea by introducing a spell correction loss. The spell correction loss aims to map the embeddings of misspelled words close to the embeddings of their correctly spelled variants. It is a typical logistic loss, computed over the dot product between the sum of the input vectors of the misspelled word's subwords and the vector of the correct word.
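The sketch below illustrates that idea: a logistic, negative-sampling-style loss that pulls the subword-sum vector of a misspelling toward the vector of its correct form, combined with the usual fastText loss. The function names, the weighting hyperparameter alpha, and its value are assumptions for illustration; the exact weighting scheme is described in the MOE paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spell_correction_loss(misspelled_subword_vecs, correct_word_vec, negative_vecs):
    """Logistic loss pulling a misspelling toward its correctly spelled variant.

    misspelled_subword_vecs: input vectors of the misspelled word's character n-grams
    correct_word_vec:        vector of the correctly spelled word
    negative_vecs:           vectors of randomly sampled negative words
    """
    v_misspelled = misspelled_subword_vecs.sum(axis=0)   # sum of subword input vectors
    positive = np.log(sigmoid(v_misspelled @ correct_word_vec))
    negative = np.sum(np.log(sigmoid(-negative_vecs @ v_misspelled)))
    return -(positive + negative)

def moe_loss(fasttext_loss, sc_loss, alpha=0.05):
    # Illustrative combination: weight the fastText loss against the spell correction loss.
    return (1 - alpha) * fasttext_loss + alpha * sc_loss
```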
The following shows that the embeddings of bird (correct word) and bwrd (misspelled word) are close together.
- Subwords are a powerful way to handle misspelled and unknown words. MOE uses character n-grams to build its subword dictionary, while other state-of-the-art NLP models (e.g. BERT, GPT-2) use statistical methods (e.g. WordPiece, Byte Pair Encoding) to build theirs.
- Handling unseen words is a critical advantage in many NLP systems. For example, a chatbot has to cope with lots of new vocabulary, whether misspelled or newly coined; the sketch after this list shows a subword model producing a vector for a word it never saw.
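As a quick demonstration of why subwords help with unseen words, the snippet below trains gensim's standard FastText implementation (not MOE) on a tiny toy corpus and then queries a misspelling that never appears in the training data. The corpus and hyperparameters are purely illustrative.

```python
from gensim.models import FastText

# A tiny toy corpus; real training data would be much larger.
sentences = [
    ["the", "bird", "sings", "in", "the", "morning"],
    ["a", "small", "bird", "sits", "on", "the", "branch"],
]

# Train a subword-aware model (character n-grams of length 3 to 5, as in fastText).
model = FastText(sentences=sentences, vector_size=32, min_n=3, max_n=5,
                 min_count=1, epochs=50)

# "bwrd" never appears in the corpus, but its vector is still built from its n-grams.
print(model.wv.similarity("bird", "bwrd"))
```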
Like to learn?
I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or Github.
- A dataset of misspellings to train MOE.
- Story about negative sampling method
- Story about fastText
- Official page for fastText.
- T. Mikolov, G. Corrado, K. Chen and J. Dean. Efficient Estimation of Word Representations in Vector Space. 2013.
- A. Joulin, E. Grave, P. Bojanowski and T. Mikolov. Bag of Tricks for Efficient Text Classification. 2016.
- B. Edizel, A. Piktus, P. Bojanowski, R. Ferreira, E. Grave and F. Silvestri. Misspelling Oblivious Word Embeddings. 2019.