GloVe: Theory and Python Implementation
GloVe: Global Vectors for Word Representation
In this post we will walk through the approach behind building a GloVe model and also implement Python code to extract the embedding for a given word.
Fundamentally, these language models all strove toward one common objective: making transfer learning possible in NLP. Different academic and commercial organizations pursued different approaches to this goal.
One prominent and well-proven approach was to build a co-occurrence matrix over the words of a huge corpus. This approach, taken up by a team of researchers at Stanford University, turned out to be a simple yet effective method for extracting word embeddings.
Contents:
- Recap on Word Embeddings
- Introduction to Co-occurrence matrix
- Cost function for optimization
- Python implementation
Recap on Word Embeddings:
Word embeddings are vector representations of words that let us capture linear substructures and process text in a form a model can understand. Typically, word embeddings are the weights of the hidden layer of a neural network, taken after the model converges on its cost function. For a deeper understanding, refer to the post below:
Theory behind Word Embeddings in Word2Vec
Introduction to Co-occurrence Matrix:
Let us start with the definition. A co-occurrence matrix primarily captures how frequently two words appear together in the corpus. Consider a matrix of the following form:
Here, X1, X2, etc. are the unique words in the corpus, and Xij represents how often Xi and Xj appear together in the whole corpus. This matrix as a whole doesn’t serve our purpose directly; it simply becomes the target the neural network is trained on. In other words, given the one-hot vector of a particular word as input (same as in Word2Vec), the model is trained to predict that word’s row of the co-occurrence matrix.
So, on the whole, predicting the co-occurrence statistics is a “fake” task, defined so that the word embeddings can be extracted once the model converges.
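Before moving to the cost function, here is a minimal sketch of how the co-occurrence counts Xij could be built from a toy corpus with a symmetric context window. The function name, the window size, and the toy sentences are illustrative, and the released GloVe code additionally down-weights distant pairs by the inverse of their distance, which is omitted here.

```python
from collections import defaultdict

def build_cooccurrence(corpus, window_size=2):
    """Count how often each pair of words appears within `window_size`
    tokens of each other, across all sentences in `corpus`."""
    counts = defaultdict(float)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # look at the neighbours to the right; record the count
            # symmetrically for (word, context) and (context, word)
            for j in range(i + 1, min(i + 1 + window_size, len(tokens))):
                context = tokens[j]
                counts[(word, context)] += 1
                counts[(context, word)] += 1
    return counts

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
X = build_cooccurrence(corpus)
print(X[("sat", "on")])   # 2.0: "sat" and "on" co-occur once in each sentence
```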
Cost Function:
For any machine learning model to converge, it needs a cost or error function to optimize. In this case the cost function, as given in the GloVe paper, is the weighted least-squares objective below (V is the size of the vocabulary):
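$$
J = \sum_{i,j=1}^{V} f\left(X_{ij}\right)\left( W_i^{\top} \tilde{W}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

with the weighting function

$$
f\left(X_{ij}\right) =
\begin{cases}
\left( \dfrac{X_{ij}}{X_{\max}} \right)^{\alpha} & \text{if } X_{ij} < X_{\max} \\
1 & \text{otherwise}
\end{cases}
$$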
Here, J is the cost function. Let us go through the terms one by one:
- Xij is the frequency of Xi and Xj appearing together in the corpus
- Wi and W̃j are the word vector and the context word vector for words i and j, respectively.
- bi and b̃j are the corresponding bias terms for words i and j.
In the second equation, Xmax is a cap on the co-occurrence frequency, a parameter defined to prevent very frequent pairs from dominating the loss and blowing up the weights of the hidden layer. So the function f(Xij) essentially acts as a weighting constraint on the model, as sketched below.
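As a direct translation, the weighting function can be written in a few lines of Python; Xmax = 100 and α = 0.75 are the defaults reported in the GloVe paper:

```python
def weighting(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(Xij): grows sub-linearly with the co-occurrence
    count and is capped at 1, so very frequent pairs cannot dominate."""
    return (x / x_max) ** alpha if x < x_max else 1.0

print(weighting(10))    # ~0.18 -> rarer pair contributes with a small weight
print(weighting(500))   # 1.0   -> capped for very frequent pairs
```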
Once the cost function is optimized, the weights of the hidden layer become the word embeddings. The embeddings from a GloVe model can be 50- or 100-dimensional vectors (or larger), depending on the model we choose. The link below lists the different pre-trained GloVe models released by Stanford University, which are available for download.
Python implementation
The higher the number of tokens and the larger the vocabulary, the better the model performance. We also need to consider the hardware at our disposal in order to pick the right model for faster computation. We will use the 100-dimensional GloVe model trained on Wikipedia data to extract word embeddings for a given word in Python.
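Here is a sketch of loading the pre-trained vectors into a dictionary. The file name glove.6B.100d.txt assumes the 100-dimensional vectors from the downloadable glove.6B archive, and the variable embedding_index mirrors the name used below:

```python
import numpy as np

GLOVE_PATH = "glove.6B.100d.txt"  # assumed local path to the downloaded file

embedding_index = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]                                   # first token is the word itself
        vector = np.asarray(values[1:], dtype="float32")   # remaining 100 floats
        embedding_index[word] = vector

print(len(embedding_index))        # size of the pre-trained vocabulary
print(embedding_index["banana"])   # 100-dimensional vector for "banana"
```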
The print(embedding_index[‘banana’]) command gives the embedding vector for the word banana, and the embedding for any other word can be extracted the same way. The snippet below shows one way to compute a cosine-similarity score for each word.
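This sketch reuses embedding_index and numpy from the previous snippet; the helper names cosine_similarity and most_similar are illustrative rather than taken from the original post:

```python
def cosine_similarity(v1, v2):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def most_similar(word, top_n=5):
    """Rank every other word in the vocabulary by cosine similarity to `word`."""
    query = embedding_index[word]
    scores = {
        other: cosine_similarity(query, vec)
        for other, vec in embedding_index.items()
        if other != word
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(most_similar("banana"))   # nearby fruit words are expected to rank highly
```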
The linear substructures discussed in my previous post can also be extracted from these embeddings.
The link below takes you to the code file for extracting word embeddings in Python from a pre-trained GloVe model.
Word Embeddings from GloVe 100D model
Follow this space for more content on embeddings, as I’m planning a series of posts leading up to BERT and its applications. Any feedback is much appreciated.