Introduction to Word Embeddings and Their Applications

Intro to Word2Vec, GloVe, ELMo and fastText

Aayush Dua
CompassRed Data Blog
10 min read · Aug 5, 2020


Software Galaxies Visualization of GloVe Embeddings
Source: Software Galaxy Visualization of Word Embeddings by Andrei Kashcha

Before I give you an introduction to Word Embeddings, take a look at the following examples and ask yourself what they have in common:

  1. Using customer reviews or employee survey responses to understand sentiments towards a particular product or company.
  2. Using lyrics of songs you liked to recommend songs that are similar in contextual meaning.
  3. Using a web translation service, like Google Translate, to convert webpage articles seamlessly to a different language.

You have probably guessed it: all of these applications deal with large amounts of text. Obviously, it would be a waste of resources to handle these applications with manual labor when they involve millions of sentences or documents.

So what can we do? We could feed this text to Machine Learning and Deep Learning models and let them learn to handle these applications. But most of these models can't be fed raw text, since they can't interpret it the way humans do; generally, they require numerical representations to perform any task. This is where Word Embeddings come into use.

In this article we are going to address the following:

  1. What are Word Embeddings?
  2. Why exactly do we prefer Word Embeddings?
  3. What are the different types of Word Embeddings?
  4. What are their Applications?

What are Word Embeddings?

Word Embeddings are a numerical vector representation of the text in a corpus that maps each word in the corpus vocabulary to a real-valued vector in a pre-defined N-dimensional space.

These real-valued vector representations are learned either through supervised techniques, such as neural network models trained on tasks like sentiment analysis and document classification, or through unsupervised techniques, such as statistical analysis of the documents.

Word Embeddings try to capture the semantic, contextual and syntactic meaning of each word in the corpus vocabulary, based on how these words are used in sentences. Words that have similar semantic and contextual meaning have similar vector representations, while each word in the vocabulary still has its own unique vector.

Word2Vec Embeddings
Source: https://www.tensorflow.org/tutorials/representation/word2vec

The above image shows examples of words in the vocabulary with similar contextual, semantic and syntactic meaning being mapped close together in a 3-dimensional vector space. In the Verb Tense example, we can observe that the vector differences between the word pairs (walking, walked) and (swimming, swam) are roughly equal.

Why exactly do we prefer Word Embeddings?

Certain questions might have popped into your mind by now.

  1. What is the simplest way to represent words numerically and why isn’t that sufficient?
  2. If Word Embeddings are so complex, why do we prefer them to simpler methods?

Let's address these questions one by one.

What is the simplest way to represent words numerically and why isn’t that sufficient?

The simplest way to represent words numerically is to one-hot-encode each unique word in a corpus of text. We can understand this better with an example. Suppose my corpus has only two documents:

* The King takes his Queen for dinner.
* The husband takes his wife for dinner.

You might notice that these two documents have the same contextual meaning. When we apply one hot encoding to the documents, here’s what happens:

There are nine unique words in the document text, so each word will be represented as a vector of length 9. The vector consists of a "1" at the position corresponding to the word in the vocabulary and a "0" everywhere else. Here is what those vectors look like:
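To make this concrete, here is a minimal Python sketch of how those one-hot vectors could be built for the two example documents (the vocabulary ordering is arbitrary):

```python
# One-hot encoding the 9-word vocabulary of the two example documents.
documents = [
    "the king takes his queen for dinner",
    "the husband takes his wife for dinner",
]

# Build the vocabulary of unique words (9 in total) and fix an ordering.
vocab = sorted({word for doc in documents for word in doc.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a length-9 vector with a 1 at the word's position and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

for word in vocab:
    print(f"{word:>8}: {one_hot(word)}")
```

Note that every pair of these vectors is orthogonal, so "king" and "queen" look no more related than "king" and "dinner", which leads directly to the disadvantages below.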

Let's address some disadvantages of this method.

1. Scalability Issue - The above example contained only 2 sentences and only 9 words in the vocabulary. In a real-world scenario we will have millions of sentences and millions of words in the vocabulary, so you can imagine how the dimension of each one-hot-encoded vector explodes into the millions. This leads to scalability issues when the vectors are fed to our models, and in turn to wasted time and computational resources.

2. Sparsity Issue - Given that we have 0's everywhere except for a single 1 at the correct location, models have a very hard time learning from this data, so they may not generalize well over the test data.

3. No Context Captured - Since one-hot encoding blindly creates vectors without taking into account the shared dependencies and the context in which each word of the vocabulary is used, we lose contextual and semantic information. In our example, that means we lose the relationship between similar word pairs: "king" relates to "queen" the way "husband" relates to "wife".

If Word Embeddings are so complex, why do we prefer them to simpler methods?

There are many other simpler methods than Word Embeddings, such as Term-Frequency Matrix, TF-IDF Matrix and Co-occurrence Matrix. But even these methods face one or more issues in terms of scalability, sparsity and contextual dependency.

Therefore, we prefer Word Embeddings, since they resolve all of the issues mentioned above. The embeddings map each word to an N-dimensional space, where N typically ranges from 50 to 1000, in contrast to a million-dimensional space; this resolves the scalability issue. Since each embedding vector is densely populated, in contrast to a vector containing 0's almost everywhere, we also resolve the sparsity issue, so the model can learn better and generalize well. Finally, these vectors are learned in a way that captures the shared context and dependencies among words.

Different Types of Word Embeddings

In this section, we will be reviewing the following State-Of-The-Art (SOTA) Word Embeddings:

  1. Word2Vec
  2. GloVe
  3. ELMo

This is not an exhaustive list, but a great place to start. There are many other SOTA Word Embeddings, such as BERT (developed by Jacob Devlin and colleagues at Google) and GPT (developed at OpenAI), that have also made major breakthroughs in NLP applications.

Word2Vec

Word2Vec is an algorithm developed by Tomas Mikolov et al. at Google in 2013. It is built on the distributional hypothesis, which suggests that words occurring in similar linguistic contexts also have similar semantic meaning. Word2Vec uses this idea to map words with similar semantic meaning geometrically close to each other in an N-dimensional vector space.

Word2Vec trains a shallow, 2-layer neural network to reconstruct the linguistic context of words. It takes a large corpus of text as input and produces a vector space with a dimension on the order of hundreds. Each unique word in the corpus vocabulary is assigned a corresponding vector in this space.

It can be implemented using either of two techniques: Continuous Bag of Words (CBOW) or Skip-gram.

a) Continuous Bag of Words (CBOW)
This technique uses the shallow 2-layer neural network to predict the probability of a word given its context, where the context can be a single word or a group of words. The following diagram illustrates the concept:

Source: word2vec Parameter Learning Explained by Xin Rong

The input consists of the context words, each one-hot-encoded and fed to the network, and the output is a probability distribution over every word in the vocabulary.

b) Skip-gram
Skip-gram is the flipped version of CBOW: we feed the model a single word and it tries to predict the words surrounding it. The input is the one-hot-encoded vector of the word and the output is a series of probability distributions over the vocabulary. For example, in the sentence "I am going for a walk", given the word "going" the model would try to predict surrounding words such as "am" and "for".

To set the number of surrounding words the model tries to predict, we define a context window.

Let the input be one of the words in the vocabulary. We feed in its one-hot-encoded representation of dimension V, and the model is expected to produce C probability distributions over the vocabulary, giving an output of dimension C*V, where C is the size of the context window.
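In practice you rarely wire these networks up by hand. Below is a hedged sketch of training both variants with the gensim library (assuming gensim 4.x; the toy corpus, vector_size and window values are purely illustrative):

```python
# Training Word2Vec with gensim: sg=0 selects CBOW, sg=1 selects Skip-gram.
from gensim.models import Word2Vec

corpus = [
    ["i", "am", "going", "for", "a", "walk"],
    ["the", "king", "takes", "his", "queen", "for", "dinner"],
    ["the", "husband", "takes", "his", "wife", "for", "dinner"],
]

# window is the context window C discussed above; vector_size is N.
cbow = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["king"].shape)                     # (50,) dense vector, not one-hot
print(skipgram.wv.most_similar("king", topn=3))  # nearest neighbours in the toy space
```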

Source: word2vec Parameter Learning Explained by Xin Rong

Word2Vec was a major breakthrough in the field of Word Embeddings because it was able to capture algebraic relations between words that had never been captured before. For example, if we take words such as "King", "Queen", "man" and "woman" and map them into the vector space, we find that the vector difference between "King" and "Queen" is roughly the same as the vector difference between "man" and "woman", which allows Word2Vec to complete analogies such as king - man + woman ≈ queen.
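Here is a hedged sketch of reproducing that analogy with the pre-trained Google News vectors available through gensim's downloader (roughly a 1.6 GB download; the exact neighbours and scores depend on the pre-trained model):

```python
# The "king - man + woman ≈ queen" arithmetic with pre-trained
# Google News Word2Vec vectors loaded through gensim's downloader.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

# Start at "king", subtract the "man" direction, add the "woman" direction,
# then look up the nearest word in the embedding space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected to return something close to "queen"
```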

GloVe

While the Word2Vec technique relies on local statistics (the local context surrounding a word) to derive that word's semantics, the GloVe technique goes one step further by combining local statistics with global statistics, such as a corpus-wide word co-occurrence matrix in the spirit of Latent Semantic Analysis, to capture a word's global semantic relationships. GloVe was developed by Pennington et al. at Stanford.

For example, consider the co-occurrence probabilities for target words ice and steam with various probe words from the vocabulary. Here are some actual probabilities from a 6 billion word corpus:

GloVe: Global Vectors for Word Representation — Jeffrey Pennington

From the above table, we can observe that ice co-occurs with solid more frequently than it does with gas, while steam co-occurs with gas more frequently than with solid. Both words co-occur frequently with their shared property water, and both co-occur infrequently with the unrelated word fashion. Looking at the ratio of probabilities, non-discriminative words (water and fashion) have a ratio approximately equal to 1, whereas discriminative words (solid and gas) have either a very high or a very low ratio. In this way, the ratio of co-occurrence probabilities encodes some crude form of meaning associated with the abstract concept of thermodynamic phase. These ratios are then encoded as vector differences in the N-dimensional space.
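For reference, the GloVe paper captures this intuition in two equations: the ratio of co-occurrence probabilities is tied to vector differences, and the vectors are then fit with a weighted least-squares objective over the co-occurrence matrix X (here the w are word vectors, the w-tilde are separate context vectors, the b are bias terms and f is a weighting function):

```latex
F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}},
\qquad
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```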

Source: https://nlp.stanford.edu/projects/glove/

In the above image, we notice that the vector differences between word pairs such as man & woman and king & queen are roughly equal; the distinguishing factor within each pair is gender. Beyond this one, we can observe many other interesting patterns in the above visualization from GloVe.
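If you would like to verify this pattern yourself, here is a short hedged sketch using the pre-trained glove-wiki-gigaword-100 vectors available through gensim's downloader (the exact similarity value depends on the model used):

```python
# Checking that the "king - queen" and "man - woman" directions are close
# in a pre-trained GloVe space (glove-wiki-gigaword-100 via gensim's downloader).
import gensim.downloader as api
import numpy as np

glove = api.load("glove-wiki-gigaword-100")

diff_royal = glove["king"] - glove["queen"]
diff_people = glove["man"] - glove["woman"]

cosine = np.dot(diff_royal, diff_people) / (np.linalg.norm(diff_royal) * np.linalg.norm(diff_people))
print(f"cosine similarity of the two difference vectors: {cosine:.2f}")
```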

ELMo

Before we jump into ELMo, consider this example:

  1. Her favorite fruit to eat is a date.
  2. Joe took Alexandria out on a date.

We can observe that date has a different meaning in each context. Word embeddings such as GloVe and Word2Vec will produce the same vector for the word date in both sentences, so our models would fail to distinguish between polysemous words (words having multiple meanings and senses). These word embeddings simply cannot grasp the context in which a word is used.

ELMo resolves this issue by taking in the whole sentence as an input rather than a particular word and generating unique ELMo vectors for the same word used in different contextual sentences.

It was developed by NLP researchers at the Allen Institute for AI, building on earlier work on contextual representations (Peters et al., 2017; McCann et al., 2017) and introduced in Peters et al., 2018, the ELMo paper.

Source: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) by Jay Alammar

ELMo uses a bi-directional LSTM pre-trained on a large text corpus to produce word vectors. It is trained to predict the next word given a sequence of words, a task known as Language Modeling.

Source: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) by Jay Alammar

ELMo representations are:

  • Contextual: The ELMo vector produced for a word depends on the context of the sentence in which the word is used.
  • Character based: ELMo representations are purely character based, allowing the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training.
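As a hedged illustration, older releases of the allennlp library (roughly pre-1.0) shipped an ElmoEmbedder helper that makes this contextual behaviour easy to see on the two "date" sentences above; the indexing below assumes its usual output shape of (3 layers, number of tokens, 1024):

```python
# Comparing the ELMo vectors for "date" in two different contexts.
# Assumes an older allennlp release that still provides ElmoEmbedder;
# the pre-trained weights are downloaded on first use.
from allennlp.commands.elmo import ElmoEmbedder
from scipy.spatial.distance import cosine

elmo = ElmoEmbedder()

sent1 = ["her", "favorite", "fruit", "to", "eat", "is", "a", "date"]
sent2 = ["joe", "took", "alexandria", "out", "on", "a", "date"]

# embed_sentence returns one vector per layer per token: shape (3, n_tokens, 1024).
date1 = elmo.embed_sentence(sent1)[2][sent1.index("date")]  # top-layer vector
date2 = elmo.embed_sentence(sent2)[2][sent2.index("date")]

# Unlike Word2Vec or GloVe, the two "date" vectors are not identical.
print("cosine similarity between the two 'date' vectors:", 1 - cosine(date1, date2))
```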

Applications of Word Embeddings

Word Embeddings have played a huge role across the complete spectrum of NLP applications. The following are some of the famous applications that use Word Embeddings:

  1. Word Embeddings have been integral in improving Document Search and Information Retrieval. An intuitive approach is to calculate Word Centroid Similarity: each document is represented by the centroid of its word vectors, and since word vectors carry semantic information, one can assume that this centroid encodes the document's meaning to some extent. At query time, the centroid of the query's word vectors is computed, and the cosine similarity to the document centroids is used as a measure of relevance. This speeds up retrieval and removes the need for search keywords to exactly match the words in the document (adapted from Vec4ir by Lukas Galke); a sketch follows after this list.
  2. Word Embeddings have also improved Language Translation Systems. Facebook released multilingual fastText word embeddings, with word vectors for 157 languages trained on Wikipedia and Common Crawl. Given training data, for example a text corpus available in two languages (original language: Japanese; target language: English), we can feed the word vectors of these corpora to a Deep Learning model, say a Seq2Seq model, and let it learn accordingly. During the evaluation phase, you can feed a Japanese test corpus to the trained Seq2Seq model and evaluate the results. fastText is considered one of the most efficient SOTA baselines.
  3. Lastly, Word Embeddings have improved Text Classification accuracy in domains such as Sentiment Analysis, Spam Detection and Document Classification.
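Here is the Word Centroid Similarity sketch referenced in the first item, assuming pre-trained GloVe vectors loaded through gensim's downloader (the documents and query below are toy examples):

```python
# Word Centroid Similarity: represent each document (and the query) by the
# average of its word vectors, then rank documents by cosine similarity.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

def centroid(text):
    """Average the vectors of the in-vocabulary tokens of a text."""
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

documents = [
    "the king takes his queen for dinner",
    "stock markets fell sharply on monday",
]
query_centroid = centroid("royal couple dining together")

# Higher cosine similarity means the document is more relevant to the query.
for doc in sorted(documents, key=lambda d: cosine(centroid(d), query_centroid), reverse=True):
    print(f"{cosine(centroid(doc), query_centroid):.2f}  {doc}")
```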
