Word Embeddings: An Introduction to the NLP Landscape

Yatin Vij
Published in Analytics Vidhya
Sep 19, 2019 · 8 min read

This article explains Word Embeddings, a key concept in NLP, in a simple way: what they are, how they can be used, and why they are central to building NLP models.

Prerequisites:

— Machine Learning Concepts

— Basics of NLP

Word Embeddings:

Word Embeddings are a way to represent words in a vector space of D dimensions, where D can be chosen by you. This vector representation can be used to perform mathematical operations on words, find word analogies, perform sentiment analysis, etc.

The most basic and widely used embedding is One-Hot Encoding, which represents categorical features in vector space by dedicating one column to each word. The resulting One-Hot Encoded matrix is of size N x V, where N is the number of observations and V is the vocabulary size.

Why are other Word Embedding methods preferred over One-Hot Encoding?

One-Hot Encoded vectors can represent words in vector space, but they do not capture the meaning or context of those words. Other Word Embedding algorithms, by contrast, can be used to find similar words, i.e. words used in the same contexts or alongside similar words, with this contextual similarity measured by distance metrics such as Euclidean distance or cosine similarity.

With One-Hot Encoded vectors, exactly one column has the value 1 and the rest are 0, so the Euclidean distance between any two distinct one-hot word vectors is always sqrt(2) and every pair of words looks equally similar. Distance can be measured with the Euclidean formula (smaller => closer) or cosine similarity (larger => closer).
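As a quick illustration, here is a minimal NumPy sketch with a toy four-word vocabulary, showing that every pair of distinct one-hot vectors ends up exactly sqrt(2) apart:

```python
import numpy as np

# Toy vocabulary of 4 words; each row is the one-hot vector for one word.
vocab = ["cat", "feline", "dog", "airplane"]
one_hot = np.eye(len(vocab))

# Euclidean distance between any two distinct one-hot vectors is always sqrt(2).
d_cat_feline = np.linalg.norm(one_hot[0] - one_hot[1])
d_cat_airplane = np.linalg.norm(one_hot[0] - one_hot[3])
print(d_cat_feline, d_cat_airplane)  # 1.414... 1.414... -- no notion of similarity
```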

Word Embeddings, on the other hand, capture the contextual similarity of words: the ‘cat’ and ‘feline’ vectors are closer together than ‘cat’ and ‘airplane’. Another reason to prefer Word Embeddings over One-Hot Encoded vectors is that the dimensionality of a One-Hot Encoding grows with the number of unique categories, i.e. the vocabulary size in NLP terms, whereas with Word Embeddings we can choose the output dimensionality irrespective of the vocabulary size. Some advanced embedding models such as BERT and ELMo can also deal with negations like “not good”, unlike the bag-of-words (BOW) model.

Algorithms for creating Word Embeddings:

  • Embedding Layer
  • Word2Vec
  • GLoVe
  • FastText
  • ELMo
  • BERT

Word Embedding Algorithms

Word embedding methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text. The embeddings can either be learned as part of a neural network model trained on some task, such as document classification, or in an unsupervised way from document statistics.

1. Embedding Layer

This is a built-in Keras layer that makes it easy to use embeddings as part of a neural network. The layer creates a weight/embedding matrix based on the vocabulary size of the corpus, V, and the embedding dimensionality, D, as specified in the layer definition. This V x D weight matrix is updated at each training iteration, and the row at a word's index in the resulting matrix is that word's embedding.

This layer can be used to generate embeddings for a specific supervised learning task, or to learn contextual representations of the words tailored to your particular use case.
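A minimal Keras sketch of this idea (the vocabulary size, embedding dimensionality, and downstream task here are placeholder choices, not fixed requirements):

```python
import tensorflow as tf

V, D = 10000, 50  # vocabulary size and embedding dimensionality (chosen by you)

model = tf.keras.Sequential([
    # The Embedding layer holds the V x D weight matrix that gets learned.
    tf.keras.layers.Embedding(input_dim=V, output_dim=D),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. binary document classification
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# After training, the V x D matrix holds one embedding per word index.
embedding_matrix = model.layers[0].get_weights()[0]  # shape (V, D)
```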

2. Word2Vec

The Word2Vec algorithm builds vector representations of words based on the contexts they are used in. Google's pretrained word vectors, trained on roughly 100 billion words from a Google News dataset, cover a vocabulary of 3 million words and phrases with an embedding dimensionality of 300 and can be downloaded and used directly in your projects.
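For illustration, a minimal sketch of loading those pretrained vectors with gensim (assuming the GoogleNews-vectors-negative300.bin file has already been downloaded locally):

```python
from gensim.models import KeyedVectors

# Load the pretrained 300-dimensional Google News vectors from the local binary file.
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(wv["cat"].shape)                 # (300,)
print(wv.most_similar("cat", topn=3))  # nearest neighbours by cosine similarity
print(wv.similarity("cat", "feline"))  # higher than wv.similarity("cat", "airplane")
```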

There are two model architectures that are used to generate embeddings:

  • Continuous Bag-of-Words, or CBOW model,
  • Continuous Skip-Gram Model.

The CBOW model learns the embeddings by predicting the current word from its context words, where the context size is a hyperparameter chosen by the user.

The Continuous Skip-Gram model learns the embeddings by predicting the surrounding context words from a target word.

Word2Vec Models from “Efficient Estimation of Word Representations in Vector Space”, 2013

The context window-based methods are oblivious to the global co-occurrence statistics of the corpus and thus fail to take advantage of the vast amount of repetition in the data. For more information about Word2Vec and its inner workings, you can refer to this.
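As a quick illustration, both architectures can be trained with gensim on a toy corpus (the sentences and hyperparameters are placeholders, and gensim 4.x parameter names are assumed):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice this would be millions of sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "feline", "slept", "on", "the", "mat"],
    ["the", "airplane", "landed", "on", "the", "runway"],
]

# sg=0 trains CBOW (predict the current word from its context),
# sg=1 trains Skip-gram (predict the context words from the current word).
cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)  # (100,)
```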

3. GloVe

The GloVe algorithm uses a context-counting approach: it builds a word co-occurrence matrix and trains the word vectors to predict co-occurrence ratios based on their differences.

Before Word2Vec, matrix factorization techniques such as Latent Semantic Analysis (LSA) were used to generate word embeddings. In LSA, the matrices are of the “term-document” type, i.e. the rows correspond to words or terms and the columns correspond to different documents in the corpus. Word vectors were generated by decomposing these term-document matrices using Singular Value Decomposition. Unlike Word2Vec, the resulting embeddings could not express word analogies as simple arithmetic operations.

GloVe, on the other hand, uses local context to compute the co-occurrence matrix with a fixed window size (words are deemed to co-occur when they appear together within that window). GloVe then trains the word vectors to predict these co-occurrence ratios.

GloVe may produce better embeddings faster than Word2Vec because it uses the global co-occurrence statistics as well as the local context. For more details about the math behind GloVe, you can refer here.
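As a rough illustration of how pretrained GloVe vectors might be used (assuming the standard glove.6B.100d.txt file from the Stanford GloVe page has been downloaded):

```python
import numpy as np

# Parse the plain-text GloVe file into a word -> vector dictionary.
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["cat"], embeddings["feline"]))    # noticeably higher than...
print(cosine(embeddings["cat"], embeddings["airplane"]))  # ...this pair
```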

4. FastText

FastText splits words into character n-grams. Contrary to other popular models that learn word representations by assigning a distinct vector to each word, FastText is based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations.

This approach is a significant improvement over word2vec and GloVe for two reasons:

· The ability to infer vectors for out-of-vocabulary words. For example, ‘England’ is related to ‘Netherlands’ partly because of the n-grams shared through ‘land’, such as ‘lan’ and ‘and’.

· The robustness to spelling mistakes and typos.

For more information on the inner workings of FastText, you can refer to this.
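A minimal gensim sketch of the out-of-vocabulary behaviour (the toy corpus and hyperparameters are chosen only for illustration):

```python
from gensim.models import FastText

sentences = [
    ["england", "is", "a", "country"],
    ["netherlands", "is", "a", "country"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# min_n / max_n control the character n-gram lengths used to build word vectors.
model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

# Out-of-vocabulary and misspelled words still get vectors from their character n-grams.
print(model.wv["countryy"].shape)  # (50,) even though "countryy" never appeared
print(model.wv.similarity("england", "netherlands"))
```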

5. ELMo (Embeddings from Language Models)

ELMo representations are a function of all the internal layers of the biLM, i.e. a linear combination of the vectors stacked above each input word, learned for each end task. The ELMo architecture takes a string as input and generates raw word vectors using a character-level CNN. These raw word vectors are passed to the first pretrained bidirectional language model (biLM) layer, and the information extracted at this layer forms the first intermediate word vectors. These intermediate word vectors are passed as input to the second pretrained biLM layer, whose output forms the second intermediate word vectors.

The three sets of word vectors generated through these layers are combined with a weighted sum to form the ELMo word embeddings. Combining the internal states in this manner allows for very rich word representations, and because the biLM input is computed from characters rather than words, it also captures the inner structure of each word. Adding ELMo representations alone has been reported to reduce relative error by up to 20% on some tasks.

Representational Diagram of ELMo. Source: Analytics Vidhya Blog

ELMo can be used directly by importing the module from TensorFlow Hub. The default ELMo implementation on TF Hub takes string tokens as input; if you provide a complete sentence, it splits it on spaces, so applying standard text preprocessing first may give better results. Training and prediction with ELMo are slow because of its deep, complex architecture. For more information about ELMo, you can refer to this.
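A minimal sketch of that TF Hub usage (assuming TensorFlow 1.x and the published Google ELMo module):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load the pretrained ELMo module from TF Hub (TF 1.x style API).
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)

sentences = ["the cat sat on the mat", "dogs are great"]
# "elmo" returns the weighted sum of the three layers: one 1024-dim vector per token.
embeddings = elmo(sentences, signature="default", as_dict=True)["elmo"]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embeddings)  # shape: (2, max_tokens, 1024)
```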

To understand how to use ELMo in your project, you can refer to this.

Character-Level Embeddings

This technique uses ConvNets to extract information from character-level encodings of text, and it is the first step ELMo uses to generate the raw word vectors. It has been shown that ConvNets applied directly to character-level encodings, without any knowledge of the syntactic or semantic structure of a language, can be competitive with traditional word-based models. For more information about using ConvNets to embed text, you can refer to this paper.
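As a rough sketch of the idea (a toy character vocabulary and network shape chosen for illustration, not the ELMo architecture itself):

```python
import numpy as np
import tensorflow as tf

# Toy character vocabulary and a single encoded word ("cat") padded to length 10.
chars = "abcdefghijklmnopqrstuvwxyz"
char_to_id = {c: i + 1 for i, c in enumerate(chars)}  # 0 is reserved for padding
word = "cat"
encoded = [char_to_id[c] for c in word] + [0] * (10 - len(word))
x = np.array([encoded])  # shape (1, 10)

# A minimal character-level CNN: embed characters, convolve, max-pool over positions.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(chars) + 1, output_dim=16),
    tf.keras.layers.Conv1D(filters=32, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),  # one 32-dim "raw word vector" per word
])
word_vector = model.predict(x)  # shape (1, 32)
```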

6. BERT (Bidirectional Encoder Representations from Transformers)

BERT is another recent state-of-the-art algorithm that is widely used for NLP tasks. It uses WordPiece embeddings, which split words into subword units, e.g. writing becomes write + ing; this splitting helps reduce the vocabulary size. The BERT architecture is built on the Transformer model, which maintains attention over the sequence. More information about the Transformer model can be found here.
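A quick way to see WordPiece splitting in action is the Hugging Face transformers tokenizer (the exact splits depend on the pretrained vocabulary):

```python
from transformers import BertTokenizer

# Rare words are split into subword units; "##" marks a continuation piece.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embeddings"))  # e.g. ['em', '##bed', '##ding', '##s']
```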

Another difference is that, unlike ELMo, whose biLM feeds one LSTM the target word together with the context words before it and the other LSTM the target word together with the context words after it, BERT is given the full sequence directly, i.e. the input looks like

The forward and backward LSTM inputs for ELMo along with the combined input for BERT

Seeing the entire sentence gives the model more information to make better predictions; during pre-training, BERT masks the target word and learns to predict it from the surrounding context. BERT has been shown to outperform general word embeddings as well as ELMo. For more information and an in-depth analysis of BERT, please refer here. Google has pre-trained BERT on Wikipedia.

To know how to find word embeddings using BERT, you can refer to this article.
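As a minimal sketch, contextual token embeddings can be pulled from a pretrained BERT model with the Hugging Face transformers library (the model choice and sentence are placeholders):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("the cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per WordPiece token (including [CLS] and [SEP]).
token_embeddings = outputs.last_hidden_state  # shape: (1, num_tokens, 768)
```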

Different Ways of Using Word Embeddings

1. Learning the Embedding

The embeddings can be learned from your own corpus, but a large amount of text data is required to ensure that useful embeddings are learned. Word Embeddings can either be trained with a standalone language-model algorithm such as Word2Vec or GloVe, which is more useful when we want to reuse the embeddings across multiple models, or as part of a task-specific model such as a classifier. The main issue with the latter approach is that the learned embeddings are specific to the task at hand and cannot easily be reused.

2. Reusing Pretrained Embedding

Most of the word embeddings trained by researchers using the above-mentioned algorithms are available for download and can be used in your projects, subject to their license.

The embeddings can be reused either by keeping them non-trainable in your model, if you want them for the general tasks they were originally trained on, or by allowing them to be updated (fine-tuned), which usually gives better results for the task at hand.
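A minimal Keras sketch of both options (the embedding matrix here is random only to keep the example self-contained; in practice it would be filled from pretrained vectors such as GloVe):

```python
import numpy as np
import tensorflow as tf

V, D = 10000, 100
# Placeholder for a pretrained embedding matrix, one row per word index.
embedding_matrix = np.random.rand(V, D).astype("float32")

# trainable=False freezes the pretrained vectors; set it to True to fine-tune them.
embedding_layer = tf.keras.layers.Embedding(input_dim=V, output_dim=D, trainable=False)

model = tf.keras.Sequential([
    embedding_layer,
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.build(input_shape=(None, 20))              # build so the layer has weights to set
embedding_layer.set_weights([embedding_matrix])  # copy the pretrained vectors in
```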

Conclusion

In this article, we have gained an insight into Word Embeddings and their usage, as well as the different language models that can be used to generate them. In the subsequent article, we will look into how these word embeddings can be used to generate document embeddings.

The rapidly growing NLP landscape is an active research area where even relatively recent techniques like BERT are no longer state of the art: XLNet has since surpassed BERT on 20 NLP tasks.

Acknowledgement:

I would like to take this opportunity to thank Ankush Chandna, Data Science Consultant, for his help in writing this article.

References

Word Embeddings overview: https://machinelearningmastery.com/what-are-word-embeddings/

ELMo contextual embeddings: https://www.kdnuggets.com/2019/01/elmo-contextual-language-embedding.html

FastText under the hood: https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3

FastText paper: https://arxiv.org/abs/1607.04606

GloVe: https://mlexplained.com/2018/04/29/paper-dissected-glove-global-vectors-for-word-representation-explained/

ELMo: https://mlexplained.com/2018/06/15/paper-dissected-deep-contextualized-word-representations-explained/

Transformer: http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/

BERT: https://mlexplained.com/2019/01/07/paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/

Character-level ConvNets: https://arxiv.org/pdf/1509.01626.pdf
