Word Embeddings

Matthew Kramer · Published in CodeX · Aug 31, 2021 · 4 min read

What is it?

A word embedding is a representation of a word as a vector, i.e. a sequence of numbers. Often these vectors encode how the word is used in conjunction with other words in a dataset. Both the encoding technique and the dataset can vary greatly, and the right choice ultimately depends on the use case.

Why is it useful?

Word embeddings are used throughout NLP and machine learning because they allow computers, and the mathematical models they run, to reason about words. A computer sees a word only as a sequence of individual characters, which says little about the word’s semantic or syntactic role in a language. Word embeddings give a computer additional information about a word and how it is used alongside other words in sentences.

Many of the big tech companies, such as Google and Facebook, have invested heavily in developing state-of-the-art techniques for creating these embeddings.

How is it done?

Words are usually encoded as vectors by building up a repository of word-to-vector mappings and then referencing this repository to get a specific word’s embedding. The repository itself is created by training some model on a training corpus.
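As a tiny illustrative sketch of that lookup step (the words and 3-dimensional vectors below are made-up placeholders, not the output of any particular model):

```python
# A trained model ultimately produces a word-to-vector lookup table like this
# (the 3-dimensional vectors are made-up placeholders for illustration).
embeddings = {
    "dogs":   [0.2, -0.1, 0.7],
    "cats":   [0.3, -0.2, 0.6],
    "better": [-0.5, 0.4, 0.1],
}

def embed(word):
    """Return the stored vector for a word, or None if it was never seen."""
    return embeddings.get(word)

print(embed("cats"))  # [0.3, -0.2, 0.6]
print(embed("boat"))  # None -- an out-of-vocabulary word
```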

Bag of Words / One-Hot Encoding Model

A simple way of representing words as vectors is to assign every possible word in the dictionary, or vocabulary, a unique identifier. For example, if our vocabulary contains 5 words: [ dogs, are, better, than, cats ] then we could assign a unique integer to each word:

  • dogs — 1
  • are — 2
  • better — 3
  • than — 4
  • cats — 5

To encode a word as a vector, construct a vector the size of the vocabulary containing 0s in every entry, then set the entry that corresponds to the word’s unique identifier to 1. So the word ‘better’ would have the vector [ 0, 0, 1, 0, 0 ]. We can then represent a sentence or even an entire document as an embedding by adding together the vectors of all the words in that document. For example, if our document is just ‘cats are better’, then the document’s vector would be [ 0, 1, 1, 0, 1 ].
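Here is a minimal Python sketch of the scheme just described (note that the 1-based identifiers above correspond to 0-based positions in the vector):

```python
vocabulary = ["dogs", "are", "better", "than", "cats"]

def one_hot(word):
    """One-hot vector: 1 at the word's position in the vocabulary, 0 elsewhere."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec

def bag_of_words(document):
    """Sum the one-hot vectors of every word in the document."""
    vec = [0] * len(vocabulary)
    for word in document.split():
        vec[vocabulary.index(word)] += 1
    return vec

print(one_hot("better"))                # [0, 0, 1, 0, 0]
print(bag_of_words("cats are better"))  # [0, 1, 1, 0, 1]
```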

Pros

  • Easy/simple to implement
  • Quick processing/training time
  • Works well for small datasets where other embedding techniques don’t have enough data to train on
  • Works well for small domain-specific datasets where other embedding models trained on large general datasets don’t capture the context of the more niche dataset

Cons

  • Does not account for ordering of words
  • Word vectors are extremely large (one entry per vocabulary word), often requiring heavy processing of document vectors to reduce dimensionality

Word2Vec Models

Word2Vec is a popular family of word-embedding algorithms released in 2013. There are two main algorithms that can power word2vec: Continuous Bag of Words and Skip-gram.
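As an illustration of how these models are typically trained in practice, here is a short sketch using the gensim library (assuming gensim 4.x; the toy corpus and parameter values are placeholders, not a recommended configuration):

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens.
sentences = [
    ["dogs", "are", "better", "than", "cats"],
    ["i", "think", "therefore", "i", "am"],
]

# sg=0 trains a Continuous Bag of Words model; sg=1 trains skip-gram.
model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the learned word vectors
    window=2,        # number of context words on each side
    min_count=1,     # keep words that appear only once (toy corpus)
    sg=0,
)

vector = model.wv["dogs"]                        # the embedding for "dogs"
similar = model.wv.most_similar("dogs", topn=3)  # nearest words by cosine similarity
print(vector.shape, similar)
```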

Production Use Cases

  • Music recommendations at Spotify and Anghami
  • Listing recommendations at Airbnb
  • Product recommendations in Yahoo Mail
  • Matching advertisements to search queries on Yahoo Search

Continuous Bag of Words (CBOW)

The continuous bag of words model uses a feed-forward neural network with a single hidden layer to predict what a word in a sentence will be based on the surrounding words. If an example input is “I think [???] I am”, then the output of the model should be “therefore”, as this is the most likely choice given the context words “I think” and “I am”.

When we train this neural network to suggest words from their context, it learns a set of weights, and these weights directly correspond to a vector for each word in the vocabulary. The vectors are then extracted from the trained model and used as the embedding for any word in the vocabulary.
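To make this concrete, the following is a minimal PyTorch sketch of a CBOW-style network; the tiny vocabulary, layer sizes, and training loop are illustrative only, not the original word2vec implementation:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # The weights of this hidden layer are the word embeddings we keep.
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_ids):
        # Average the context-word vectors, then score every vocabulary word.
        hidden = self.embeddings(context_ids).mean(dim=1)
        return self.output(hidden)

vocab = ["i", "think", "therefore", "am"]
word_to_id = {w: i for i, w in enumerate(vocab)}

model = CBOW(vocab_size=len(vocab), embedding_dim=8)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training example: context ["i", "think", "i", "am"] -> center word "therefore".
context = torch.tensor([[word_to_id[w] for w in ["i", "think", "i", "am"]]])
target = torch.tensor([word_to_id["therefore"]])

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(context), target)
    loss.backward()
    optimizer.step()

# After training, the learned weight matrix is the word-to-vector lookup table.
word_vectors = model.embeddings.weight.detach()
print(word_vectors[word_to_id["therefore"]])
```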

Pros

  • Encodes nearby usage of other words into a word’s embedding
  • Quicker to train than the skip-gram model
  • Captures syntactic usage of words well: the word ‘pickle’ would be close to ‘pickled’

Cons

  • Limited to window size for context words
  • Does not capture semantic usage of words as well: the word ‘car’ would not be as similar to ‘boat’.
  • Prone to overfit frequent words
  • Cannot embed words the model has not seen yet

Skip-Gram Model

The skip-gram algorithm can be thought of as the inverse of the CBOW model. Instead of predicting a word from its context words, skip-gram attempts to predict the context words from a single input word. For example, if the input is the word ‘therefore’ and the window covers “I think [therefore] I am”, then the output should be the context words [I, think] and [I, am]. Similar to CBOW, a feed-forward neural network is trained on this task, and the weights of the network are then extracted as the word embeddings.
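As a small sketch of how skip-gram frames its training data, the function below generates (input word, context word) pairs from a sentence; the window of two words on each side is just an example choice:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center_word, context_word) training pairs for skip-gram."""
    for i, center in enumerate(tokens):
        # Pair the center word with every word up to `window` positions away.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = ["i", "think", "therefore", "i", "am"]
for center, context in skipgram_pairs(sentence):
    if center == "therefore":
        print(center, "->", context)
# therefore -> i
# therefore -> think
# therefore -> i
# therefore -> am
```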

Pros

  • Encodes nearby usage of other words into a word’s embedding
  • Captures semantic usage of words well: the word ‘car’ would be close to ‘boat’
  • Not as prone to overfit frequent words as CBOW

Cons

  • Limited to window size for context words
  • Slower to train than CBOW
  • Does not capture syntactic usage of words as well: the word ‘pickle’ would not be as similar to ‘pickled’.
  • Cannot embed words the model has not seen yet
