Creating Semantic Representations of Out-of-Vocabulary Words for Common NLP Tasks
Why Word Embeddings?
Machine learning algorithms need numbers. When you want to train on data that includes words, you have a couple of ways to turn them into numbers. A functional but limited method is to assign a unique integer to each word in your vocabulary. However, a single number fails to capture the rich meaning and context that we recognize when we read a word. To represent this semantic information, it has become common to use word embeddings as the input to NLP machine learning tasks. A word embedding is a vector of weights learned for each word or phrase by training an algorithm like Word2Vec on a large body of text. This representation identifies the word just as the single integer does, but it gives your network far more information to train on.
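To make the contrast concrete, here is a minimal sketch. The three-word vocabulary and the two-dimensional vectors are hand-picked assumptions for illustration, not learned values; real embeddings are trained and typically have hundreds of dimensions.

```python
import math

# Integer IDs: unique, but the numbers themselves carry no meaning.
word_to_id = {"rare": 0, "scarce": 1, "common": 2}

# Toy embeddings, hand-picked for illustration (NOT learned from data).
embeddings = {
    "rare": [0.9, 0.1],
    "scarce": [0.8, 0.2],
    "common": [0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


similar = cosine_similarity(embeddings["rare"], embeddings["scarce"])
different = cosine_similarity(embeddings["rare"], embeddings["common"])
print(similar > different)  # True for these toy vectors
```

The integer IDs say nothing about which words are related, while even these toy vectors place rare closer to scarce than to common.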
This works great so long as you have an embedding for every word you encounter in your new machine learning task. However, there are usually domain-specific terms, like product names or certain technical terms, that were not in the training corpus. In that case, you have a couple of bad alternatives. You can either ignore the term, losing information that is probably relevant, or you can create a random embedding. A random embedding can be worse than ignoring the term, because it can imply semantics that don’t exist. You can limit the domain of the random embedding so that it doesn’t overlap with any of your other, real embeddings, but this is also imperfect and skews your results in undesirable ways. Another option might be to use something like NLTK’s WordNet interface to look for synonyms that have word embeddings. But this too misses many cases and adds compute time and complexity.
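The two fallbacks described above can be sketched as follows. The function name embed_tokens, its parameters, and the toy embedding table are hypothetical and exist only for this illustration.

```python
import random


def embed_tokens(tokens, embeddings, dim, strategy="ignore", seed=0):
    """Map tokens to vectors, handling unknown tokens according to `strategy`."""
    rng = random.Random(seed)
    vectors = []
    for token in tokens:
        if token in embeddings:
            vectors.append(embeddings[token])
        elif strategy == "random":
            # Fills the gap, but the vector carries no real meaning.
            vectors.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
        elif strategy != "ignore":
            raise ValueError(f"unknown strategy: {strategy}")
        # With "ignore", the unknown token is dropped and its information lost.
    return vectors


# Toy table (made-up values); "SX20" plays the out-of-vocabulary term.
known = {"the": [0.1, 0.2], "new": [0.3, 0.4]}
ignored = embed_tokens(["the", "new", "SX20"], known, dim=2)
randomized = embed_tokens(["the", "new", "SX20"], known, dim=2, strategy="random")
```

With "ignore" the sequence silently gets shorter; with "random" it keeps its length but gains a vector that means nothing.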
Ideally there would be a way to create reasonable embeddings for new, out-of-vocabulary terms in real time. Recently, Facebook Research released fastText, which can be used to learn word embeddings and perform sentence classification. Importantly, it can also produce embeddings for out-of-vocabulary words.
It’s able to do this by learning vectors for character n-grams within the word and summing those vectors to produce the final vector or embedding for the word itself. By seeing words as a sum of parts, it can predict representations for new words by simply summing the vectors for the character n-grams it knows about in the new word. This also lets it better capture the semantic similarity between words like rarity and scarceness. Loosely speaking, the n-grams that cover the stem of rarity can be learned to resemble those of scarce, and the n-grams covering -ity to resemble those covering -ness. By breaking up words into parts like this, it’s able to model the similarity between these two words based on the similarity of their sub-grams, even if the exact words themselves have never been encountered in the training set, or have not been encountered frequently enough to derive reliable representations.
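As a rough illustration of the idea, the sketch below splits a word into character n-grams and sums the vectors of the n-grams it knows. It assumes fastText’s default n-gram lengths of 3 to 6 characters and its < and > word-boundary markers; the vector table is a made-up toy. The real library also hashes n-grams into buckets and handles the full word separately, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of `word` wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def compose_vector(word, ngram_vectors, dim):
    """Sum the vectors of every known n-gram in `word`."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


# Toy n-gram table (made-up values, NOT learned from data).
toy_table = {"<wh": [1.0, 0.0], "ere": [0.0, 2.0]}

print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(compose_vector("where", toy_table, dim=2))  # [1.0, 2.0]
```

Because the lookup only needs n-grams, compose_vector returns something useful for any word that shares at least one n-gram with the table, whether or not the whole word was ever seen.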
This has useful implications for a language like German, which makes extensive use of compound nouns. Take a word like Bezirksschornsteinfegermeister, which means “head district chimney sweep”. The whole word would likely be very uncommon in a corpus, but the parts, like meister, could be very common. The character n-gram technique would excel in cases like this, where a standard word-level embedding might struggle.
This also works for something like a product name. You might have a product called the SX10 in your training set. In your test set, maybe you have another product called the SX20 that did not appear in the training set. The fastText model can learn representations for character n-grams that contain SX during training (for example <SX, where < marks the start of the word), so that it can create a run-time vector for SX20 that encodes a similarity to SX10. This happens despite the fact that the algorithm never encountered the term SX20 in the training step. Standard word-level embedding algorithms would not return a vector for SX20 at all, and so your NLP task would miss the semantic impact of the term.
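To see which pieces the two product names actually share, here is a small sketch. It assumes fastText’s default n-gram lengths (3 to 6 characters) and its < and > boundary markers; the helper name char_ngrams is our own.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word` with boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


shared = char_ngrams("SX10") & char_ngrams("SX20")
print(sorted(shared))  # ['<SX']
```

Under these defaults the only shared n-gram is <SX, so any similarity between the two vectors has to come from that prefix. A smaller minimum n-gram length would add more shared pieces.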
Roll Your Own
We trained a model using the fastText library on the Wikipedia text corpus. Training your own embeddings is as simple as cloning the repo, building it with make, downloading your text corpus, and running something like:
./fasttext skipgram -input corpus.txt -output model
Once your model has been trained, you can produce representations of new, never-before-seen words by running:
./fasttext print-word-vectors model.bin < queries.txt
Assuming queries.txt contains your list of new words, this will write a word vector for each word, one per line, to standard output.
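If you redirect that output to a file, a few lines of Python are enough to read it back. This sketch assumes each output line is the word followed by its space-separated vector components; the function name parse_word_vectors and the sample line are ours, with made-up numbers.

```python
def parse_word_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Illustrative line with made-up numbers, shaped like the tool's output.
sample_output = ["ubercodering 0.1 -0.2 0.3 \n"]
parsed = parse_word_vectors(sample_output)
print(parsed["ubercodering"])  # [0.1, -0.2, 0.3]
```

In practice you would pass an open file handle, for example parse_word_vectors(open("vectors.txt")), where vectors.txt is whatever file you redirected the output to.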
You can also use this model with the Gensim library to produce word embeddings on demand in Python. Let’s make up a word like “ubercodering” — which I guess might mean “behaving in the manner of a really good programmer”. We can get an embedding for this new word by doing something like:
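from gensim.models.wrappers import FastText
fasttext_model = FastText.load_fasttext_format('model')
# The vector for the unseen word is built from its character n-grams
ubercodering_embedding = fasttext_model['ubercodering']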
Note: the ‘model’ string is the path to the fastText model that you trained, without a file extension. For this to work, both model.bin and model.vec must be available at that path, since both files are used.
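The gensim.models.wrappers module was removed in Gensim 4.0. On newer versions, a roughly equivalent sketch uses load_facebook_vectors from gensim.models.fasttext, which reads the .bin file directly. The function name embed_oov_word is our own, and the model path is whatever you passed to -output plus the .bin extension.

```python
def embed_oov_word(model_path, word):
    """Load a fastText .bin model with Gensim and return the vector for `word`."""
    # Imported lazily so the sketch can be loaded without Gensim installed.
    from gensim.models.fasttext import load_facebook_vectors

    vectors = load_facebook_vectors(model_path)  # expects the .bin file
    # For an out-of-vocabulary word, the vector is built from its n-grams.
    return vectors[word]
```

For example, embed_oov_word("model.bin", "ubercodering") would return a NumPy array for the made-up word. If you need many lookups, load the vectors once and reuse them rather than calling this function repeatedly.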
We’re still evaluating the accuracy of these out-of-vocabulary representations, but in our early testing they seem superior to using random vectors or ignoring the word entirely. It’s certainly worth trying when you are faced with the same problem.
About Cisco Emerge
At Cisco Emerge, we are using the latest machine learning technologies to advance the future of work.