Paper Summary: Enriching Word Vectors with Subword Information

Mike Plotz Sage
4 min read · Nov 24, 2018

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/13, with better formatting.

Enriching Word Vectors with Subword Information (2016) by Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov

Here’s the summary of the fastText paper, as promised earlier today. This one builds on the second word2vec paper (summary).

Prior work on vector representations of words and phrases ignores the characters within words, which means these models miss regularities such as prefixes and suffixes, roots, and compound words. Intuitively, it should be possible to build a pretty good representation of many words from their morphology alone, even words that don't show up in the training corpus. It should also be possible to gain accuracy by exploiting similarities across verb conjugations and noun declensions, especially for morphologically rich languages like Turkish or Russian.

This paper attempts to do this by dividing each word into character n-grams (3 ≤ n ≤ 6) and building a vector representation by summing the vectorized n-grams, plus a vector for the entire word. So, for example, where is divided (for n = 3) into <wh, whe, her, ere, re>, plus the whole-word sequence <where>. Special characters mark the start and end of the word, so the her in the middle of where is not the same as the full word <her>. This approach contrasts with explicit morphological decomposition (which attempts to identify prefixes, suffixes, and roots) on the one hand, and with character-level RNNs on the other.
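Here's a minimal sketch of that decomposition in Python. This is my own illustration, not the fastText reference code, and the function name and defaults are mine:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with < and > as boundary markers,
    plus the special sequence for the whole word."""
    wrapped = f"<{word}>"
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            gram = wrapped[i:i + n]
            if gram != wrapped:  # the whole word is added separately below
                ngrams.add(gram)
    ngrams.add(wrapped)
    return ngrams

print(sorted(char_ngrams("where", 3, 3)))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```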

The model follows Mikolov et al. 2013, training a binary classifier on positive examples drawn from the central word's context and negative examples sampled from the vocabulary. This paper doesn't mention the distribution used to sample negative examples, but I'm guessing they continued with the previous paper's ¾ power of unigram frequency. Each word again has an input and an output vector, with a comparison score between words calculated as the dot product of word A's input vector and word B's output vector (the paper calls this a scalar product, which, to be fair, is just another name for the dot product). Scoring a word in its context is done by summing over its n-gram parts:

s(w, c) = Σ_{g ∈ G_w} z_g · v_c

Here G_w is the set of n-grams of word w (plus the word itself), z_g is the input vector of n-gram g, and v_c is the (output) context vector. So we're only looking at subword information on the input side, but I guess it all comes out in the wash, and this does seem more reasonable than comparing every n-gram in w to every n-gram of the context word.
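Here's what that scoring and the training objective might look like in code. This is a minimal sketch assuming the same logistic negative-sampling loss as word2vec (which matches the binary-classifier setup described above); the array names and shapes are mine:

```python
import numpy as np

# z: (num_ngrams, dim) matrix of n-gram input vectors
# v: (vocab_size, dim) matrix of output (context) vectors

def score(ngram_ids, context_id, z, v):
    """s(w, c) = sum over g in G_w of z_g . v_c"""
    w_vec = z[ngram_ids].sum(axis=0)   # subword-based representation of w
    return w_vec @ v[context_id]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(ngram_ids, context_id, negative_ids, z, v):
    """Logistic loss with one positive (true context) and k sampled negatives."""
    loss = -np.log(sigmoid(score(ngram_ids, context_id, z, v)))
    for n_id in negative_ids:
        loss += -np.log(sigmoid(-score(ngram_ids, n_id, z, v)))
    return loss
```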

There’s a note here about hashing n-grams to integers to save memory, but there isn’t much detail, so I guess if you want to know how this works maybe look at the code.
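For what it's worth, here's a minimal sketch of that kind of hashing trick. The constants are the standard 32-bit FNV-1a parameters; the 2-million bucket count is my assumption about the default in the released code, so double-check it there:

```python
# Instead of a dictionary entry per n-gram, each n-gram is hashed into one of
# a fixed number of buckets, and the bucket index selects a row in the n-gram
# embedding matrix. Collisions are simply tolerated.

NUM_BUCKETS = 2_000_000  # assumption, not from the paper

def fnv1a_32(s: str) -> int:
    h = 2166136261
    for byte in s.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # keep it a 32-bit value
    return h

def ngram_bucket(ngram: str) -> int:
    return fnv1a_32(ngram) % NUM_BUCKETS

print(ngram_bucket("<wh"))  # some index in [0, NUM_BUCKETS)
```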

One unusual thing about this paper, and something that makes me more inclined to trust its results, is that many of the results are negative, inconclusive, or otherwise less than ideal. They come right out and say, for example, that fastText is actually a bit slower than word2vec. I think this means either that the authors didn't feel the reluctance some authors might about publishing these details, or possibly that they pre-registered their predictions (if only for themselves) and published the details on principle. Either way I'm impressed.

Anyway, there are too many details in the results section to go through everything, so let’s just look at the highlights:

  • They looked at performance on out-of-vocabulary (OOV) words by training two models, one replacing OOV words with null vectors (sisg-) and one computing OOV word vectors by summing their n-grams (sisg, for Subword Information Skip Gram; see the sketch after this list). As expected, sisg outperformed sisg-, indicating that the subwords carry a useful signal.
  • The paper’s approach did no better than baseline for datasets with only common English words, but did do better on English with rare words
  • Also as expected, improvements were highest for “morphologically rich” languages and languages with many compound words (like German) (although for semantic analogies the approach actually harmed results for German and Italian — maybe this is a noisy signal)
  • Considerable improvement over baseline for small datasets, but seems to saturate quickly
  • 5- and 6-grams were most informative
  • LSTM language models were improved with sisg vectors
  • They did this clever qualitative analysis where they identified which n-grams contributed most to a word’s vector, which tended to decompose words in reasonably sensible ways (e.g. politeness → polite, ness>, eness>)
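To make the first bullet concrete, here's a sketch of how an OOV vector would be assembled from subwords, reusing the char_ngrams and ngram_bucket functions from the sketches above (ngram_embeddings is a hypothetical learned embedding matrix):

```python
import numpy as np

# Sketch of the sisg treatment of an out-of-vocabulary word: it has no word
# vector of its own, but it still has character n-grams, so its vector is
# the sum of their (hashed) n-gram vectors. The sisg- baseline would use a
# null (zero) vector instead.
def oov_vector(word, ngram_embeddings):
    grams = char_ngrams(word)  # from the earlier sketch
    rows = [ngram_embeddings[ngram_bucket(g)] for g in grams]
    return np.sum(rows, axis=0)
```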

Coming up: going beyond unsupervised learning to generate word vectors as a side effect of a sequence-to-sequence task.
