Word Embedding : Text Analysis : NLP : Part-3 : GloVe

Jaimin Mungalpara · Published in Nerd For Tech · May 9, 2021 · 5 min read

Intuition behind GloVe

As we saw in the previous article, Word2Vec learns from local statistics: the words in a small context window around each target word. "Global Vectors" (GloVe), introduced in 2014 by Jeffrey Pennington, Richard Socher, and Christopher D. Manning, additionally exploits global statistics: co-occurrence counts aggregated over the entire corpus. Count-based methods such as LSA capture these global statistics but perform poorly on analogy tasks, while Word2Vec captures analogies but never looks at the global counts directly. GloVe combines the strengths of both families, which is why ideas from both will appear in its construction below.

GloVe

We can derive semantic relationships between words with the help of a co-occurrence matrix. Given a corpus whose vocabulary has U words, the co-occurrence matrix X is of size U×U. The entry X_ij, in the ith row and jth column, counts how many times word j occurs in the context (a fixed window) of word i. A small example of such a matrix follows.

For example, we have below 2 sentences

Data is next oil.

Data is future.

Based on these 2 sentences, and a symmetric context window of size 1, we can build the co-occurrence matrix below.

          Data   is   next   oil   future
Data        0     2     0     0      0
is          2     0     1     0      1
next        0     1     0     1      0
oil         0     0     1     0      0
future      0     1     0     0      0

Finally, let P(j|i) = X_ij / X_i be the probability that word j appears in the context of word i, where X_i is the sum of row i. From the table, P(Data|is) = 2/4.
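As a quick sanity check, here is a minimal Python sketch (the variable names are mine, not from the original post) that builds this matrix with a window of size 1 and recomputes P(Data|is):

```python
from collections import defaultdict

sentences = [["Data", "is", "next", "oil"], ["Data", "is", "future"]]
window = 1  # symmetric context window of size 1

# X[i][j] = number of times word j appears within `window` words of word i
X = defaultdict(lambda: defaultdict(int))
for tokens in sentences:
    for pos, word in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                X[word][tokens[ctx]] += 1

# P(j|i) = X_ij / X_i, where X_i is the total of row i
X_is = sum(X["is"].values())          # 2 + 1 + 1 = 4
print(X["is"]["Data"] / X_is)         # 0.5, i.e. P(Data|is) = 2/4
```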

Let’s take the example used in the paper itself to show the power of the co-occurrence matrix and how it captures semantic similarity between words. The paper compares the target words ice and steam against a set of probe words k:

Probability and ratio     k = solid   k = gas    k = water   k = fashion
P(k|ice)                  1.9e-4      6.6e-5     3.0e-3      1.7e-5
P(k|steam)                2.2e-5      7.8e-4     2.2e-3      1.8e-5
P(k|ice) / P(k|steam)     8.9         8.5e-2     1.36        0.96

Source:- https://nlp.stanford.edu/pubs/glove.pdf

The quantity of interest is the ratio P_ik / P_jk, where P_ik = P(k|i) = X_ik / X_i.

From the table, the probability of solid given ice is much higher than the probability of gas given ice, so raw co-occurrence probabilities already signal which words are related. The problem is a word like water, which is related to both ice and steam (both probabilities are high), or fashion, which is related to neither (both are low): the raw probabilities alone cannot discriminate. The ratio P_ik / P_jk resolves this: it is large when k relates to i (solid: 8.9), small when k relates to j (gas: 0.085), and close to 1 when k relates to both or neither (water: 1.36, fashion: 0.96).
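The short snippet below replays this argument with the published probabilities from the table above; the small differences from the paper's printed ratios come from rounding in those probabilities:

```python
# Co-occurrence probabilities from Table 1 of the GloVe paper (see above)
p_ice   = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
p_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}

for k in p_ice:
    ratio = p_ice[k] / p_steam[k]
    # >> 1: k is related to ice; << 1: related to steam; ~1: both or neither
    print(f"P({k}|ice) / P({k}|steam) = {ratio:.2f}")
```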

However, using this co-occurrence matrix directly poses a dimensionality problem: for a realistic vocabulary it has millions of rows and columns, and it is extremely sparse. GloVe therefore learns dense, low-dimensional word vectors in the spirit of Word2Vec, but trains them to reproduce the statistics of the global co-occurrence matrix.

Starting from this insight, the most general model takes the form

F(w_i, w_j, w̃_k) = P_ik / P_jk

Source:- https://nlp.stanford.edu/pubs/glove.pdf

In this formula w_i and w_j are word vectors and w̃_k is a context word vector; passing these 3 vectors through some function F should produce the ratio of probabilities.

Let’s work on this equation.

This equation leaves several issues to deal with. First, the right-hand side is a scalar (a ratio of probabilities) while the left-hand side takes vectors, so the vectors must be reduced to a scalar. Second, we must choose a function F. Third, we have 3 vectors w_i, w_j, and w̃_k, and it is difficult to build a cost function over 3 variables. Let’s solve these issues step by step.

1. To reduce the vectors to a scalar we use the dot product, but we have three vectors here. So we first take the vector difference w_i − w_j; this choice is borrowed from Word2Vec-style vector arithmetic, where differences between word vectors capture analogies, and the transpose makes the dimensions match for the dot product with the context vector:

F((w_i − w_j)^T w̃_k) = P_ik / P_jk

2. Now, what can F be? We require F to be a homomorphism from addition to multiplication, so that subtraction of dot products on the left becomes division of probabilities on the right:

F((w_i − w_j)^T w̃_k) = F(w_i^T w̃_k) / F(w_j^T w̃_k)

This homomorphism property ensures that a subtraction inside F, F(X − Y), can equivalently be written as the division F(X) / F(Y), with the same result.

Comparing with the original equation, each factor on the right must then satisfy

F(w_i^T w̃_k) = P_ik = X_ik / X_i

Based on this calculation we can conclude that the solution is the exponential function, F = exp, since exp(a − b) = exp(a) / exp(b). Taking logarithms then gives

w_i^T w̃_k = log(P_ik) = log(X_ik) − log(X_i)

The term log(X_i) does not depend on k, so we absorb it into a bias b_i, and add a bias b̃_k for w̃_k to keep the equation symmetric in word and context vectors:

w_i^T w̃_k + b_i + b̃_k = log(X_ik)

3. Now it’s time to work on the cost function. A problem in the above equation is that log(X_ik) diverges whenever X_ik is 0. One solution is to include an additive shift in the logarithm, log(X_ik) → log(1 + X_ik); this shift (a form of Laplace smoothing, not a Laplace transform) maintains the sparsity of X while avoiding the divergence. Factorizing the log of the co-occurrence matrix in this way is closely related to LSA, which is the sense in which GloVe combines LSA-style global statistics with Word2Vec-style vector learning.
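A quick numeric illustration of why the shift keeps X sparse: log(1 + 0) = 0, so empty cells stay exactly zero while nonzero counts are compressed. A minimal sketch:

```python
import numpy as np

X = np.array([[0, 2, 0], [2, 0, 1], [0, 1, 0]], dtype=float)
print(np.log1p(X))  # zero entries remain zero; counts are log-compressed
```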

Even after this fix there is another problem: rare co-occurrences, which are noisy, are weighted equally with frequent ones. To resolve this, GloVe casts the objective as a weighted least-squares regression, adding a weighting function f(X_ij) to the cost. The cost function then reads

J = Σ_{i,j = 1..V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log(X_ij))²

with

f(x) = (x / x_max)^α if x < x_max, and 1 otherwise,

where the paper chooses x_max = 100 and α = 3/4.
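A minimal sketch of f and the resulting cost, assuming dense NumPy arrays for the factors (the function names here are mine):

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): down-weights rare pairs, caps frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(W, W_ctx, b, b_ctx, X):
    """J = sum over X_ij > 0 of f(X_ij) * (w_i.w~_j + b_i + b~_j - log X_ij)^2."""
    i, j = np.nonzero(X)                                 # skip X_ij = 0 pairs
    err = (W[i] * W_ctx[j]).sum(axis=1) + b[i] + b_ctx[j] - np.log(X[i, j])
    return np.sum(weight(X[i, j]) * err ** 2)
```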

Notice that the third index k from the earlier formulas has disappeared while defining F: the final cost only compares pairs of words, so here j plays the role of the context vector.

Implementation of GloVe model
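A self-contained NumPy sketch of the full model follows; x_max = 100 and α = 3/4 come from the paper, while the window size, dimensionality, learning rate, epoch count, and helper names are illustrative choices of mine. It trains by plain batch gradient descent on the cost above:

```python
import numpy as np

def build_cooccurrence(sentences, window=2):
    """Count X_ij = how often word j appears within `window` words of word i."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: k for k, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for pos, w in enumerate(s):
            for c in range(max(0, pos - window), min(len(s), pos + window + 1)):
                if c != pos:
                    X[idx[w], idx[s[c]]] += 1
    return X, vocab

def train_glove(X, dim=10, epochs=500, lr=0.05, x_max=100.0, alpha=0.75, seed=0):
    """Batch gradient descent on J = sum f(X_ij)(w_i.w~_j + b_i + b~_j - log X_ij)^2."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    W = rng.normal(scale=0.1, size=(V, dim))       # word vectors w_i
    W_ctx = rng.normal(scale=0.1, size=(V, dim))   # context vectors w~_j
    b, b_ctx = np.zeros(V), np.zeros(V)            # biases b_i and b~_j
    ii, jj = np.nonzero(X)                         # only pairs with X_ij > 0
    logx = np.log(X[ii, jj])
    f = np.where(X[ii, jj] < x_max, (X[ii, jj] / x_max) ** alpha, 1.0)
    for _ in range(epochs):
        err = (W[ii] * W_ctx[jj]).sum(axis=1) + b[ii] + b_ctx[jj] - logx
        g = 2.0 * f * err                          # per-pair gradient factor
        gW = g[:, None] * W_ctx[jj]                # dJ/dw_i contributions
        gC = g[:, None] * W[ii]                    # dJ/dw~_j contributions
        np.add.at(W, ii, -lr * gW)                 # accumulate repeated indices
        np.add.at(W_ctx, jj, -lr * gC)
        np.add.at(b, ii, -lr * g)
        np.add.at(b_ctx, jj, -lr * g)
    return W + W_ctx                               # paper uses W + W_ctx as output

sentences = [["data", "is", "next", "oil"], ["data", "is", "future"]]
X, vocab = build_cooccurrence(sentences, window=1)
vectors = train_glove(X, dim=5)
print(vocab, vectors.shape)
```

The paper itself trains with AdaGrad, stochastically sampling the nonzero entries of X; plain batch gradient descent is used here only to keep the sketch short.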

Next, we will take a look at FastText for embedding. Suggestions are heartily welcome.

References

  1. https://medium.com/analytics-vidhya/glove-theory-and-python-implementation-b706aea28ac1
  2. https://nlp.stanford.edu/projects/glove/
