NLP Zero to One: Count-based Embeddings, GloVe (Part 6/30)

Co-occurrence Based Models and Global Log-Bilinear Regression.

Kowshik chilamkurthy
Mar 2

Introduction..

Skip-Gram..

The neural architecture and training of skip-gram are very similar to CBOW, so we will keep the discussion of its architecture brief. The objective of the skip-gram model is to maximise the average log probability:
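In a standard formulation (reproduced here since the original equation image is not shown), with a training corpus of T words and a context window of size c, this objective is:

\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \neq 0} \log p(w_{t+j} \mid w_t)

where p(w_{t+j} | w_t) is the probability of a context word given the centre word w_t.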

Drawbacks of Word2Vec..

  1. Word2Vec embeddings are local-context based and generally perform poorly at capturing the global statistics of the corpus.
  2. Inability to handle unknown or out-of-vocabulary (OOV) words: if the model hasn’t encountered a word before, it has no way to interpret it or to build a vector for it.

Co-occurrence Based Models..

PMI
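For a pair of words w1 and w2, pointwise mutual information is commonly defined as (a standard formulation, shown here since the original equation image is not reproduced):

\mathrm{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}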

p(w) is the probability of the word occurring, and p(w1, w2) is the joint probability of the two words occurring together. A high PMI indicates a strong association between the words.

Co-occurrence methods are usually very high dimensional and require a lot of storage, so NLP engineers typically apply dimensionality reduction techniques to make the data manageable. Although global co-occurrence based models succeed in capturing global statistics, their huge storage requirements mean they have not been able to replace the static Word2Vec embeddings.
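A minimal sketch of this pipeline, assuming a small tokenised corpus: count co-occurrences within a window, weight them with positive PMI, then reduce dimensionality with truncated SVD. The corpus, window size, and target dimensionality below are hypothetical, purely for illustration.

```python
import numpy as np

# Hypothetical toy corpus, already tokenised.
corpus = [["ice", "is", "solid", "water"],
          ["steam", "is", "gaseous", "water"],
          ["water", "is", "a", "liquid"]]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
window = 2  # context window size (assumption)

# 1. Word-word co-occurrence counts within the window.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                X[idx[w], idx[sent[j]]] += 1

# 2. Positive PMI: log p(w1, w2) / (p(w1) p(w2)), with negative values clipped to 0.
total = X.sum()
p_w = X.sum(axis=1) / total          # marginal word probabilities
joint = X / total                    # joint co-occurrence probabilities
with np.errstate(divide="ignore"):   # log(0) for pairs that never co-occur
    pmi = np.log(joint / np.outer(p_w, p_w))
ppmi = np.maximum(pmi, 0)

# 3. Truncated SVD gives dense, low-dimensional word vectors.
U, S, Vt = np.linalg.svd(ppmi)
k = 2  # target dimensionality (assumption)
word_vectors = U[:, :k] * S[:k]
print(dict(zip(vocab, word_vectors.round(2))))
```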

GloVe (GLObal VECtors)..


Word-word co-occurrence matrix: a matrix X in which each cell Xij records how often word Wi appears in the context of word Wj, i.e. the number of times Wi and Wj co-occur in the corpus.
Ratios of probabilities: GloVe is built on ratios of probabilities taken from the word-word co-occurrence matrix; this is the starting point. Let’s look at an example to understand the intuition behind these ratios.
Let P(k|w) be the probability that word k appears in the context of word w. The words {“water”, “ice”} occur together, so P(“ice”|“water”) will be high. The words {“water”, “steam”} also occur together, so P(“steam”|“water”) will be high as well.
Ratio: P(“ice”|“water”) ÷ P(“steam”|“water”); since both the numerator and the denominator are high, the ratio will be close to 1. A ratio close to 1 tells us that “water” (also called the probe word) is closely related to both “ice” and “steam”, as it co-occurs with each of them. This ratio gives us hints about the relations between three different words, and we will leverage this idea to build vectors.
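A tiny numerical sketch of this ratio, using made-up co-occurrence counts (the counts below are hypothetical, purely for illustration):

```python
# Hypothetical counts of how often each word appears in the context of "water".
water_context_counts = {"ice": 45, "steam": 40, "fashion": 2, "solid": 30, "gas": 25}

total = sum(water_context_counts.values())

# Conditional probabilities P(k | "water")
p_ice_given_water = water_context_counts["ice"] / total
p_steam_given_water = water_context_counts["steam"] / total

# Both probabilities are high, so the ratio is close to 1:
# the probe word "water" is related to both "ice" and "steam".
ratio = p_ice_given_water / p_steam_given_water
print(f"P(ice|water)={p_ice_given_water:.2f}, "
      f"P(steam|water)={p_steam_given_water:.2f}, ratio={ratio:.2f}")
```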

GloVe Training..

For every word we will learn two vectors, a target vector “U” and a context vector “V”, using a soft constraint. We will find these vectors by minimizing an objective function J:

Objective function
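A standard form of this objective (reconstructed here, as the original figure is not shown), writing u_i for a target-word vector, v_j for a context-word vector, and b_i, c_j for their biases, is:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( u_i^{\top} v_j + b_i + c_j - \log X_{ij} \right)^2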

where V is the size of the vocabulary and X is the word-word co-occurrence matrix.

f(·) is a weighting function with a clipped power-law form that handles low co-occurrence counts, which carry less information than frequent ones. In other words, the loss terms corresponding to these rare co-occurrences receive less weight in the objective function.
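In the original GloVe paper this weighting function takes the form (with x_max = 100 and α = 3/4 as the commonly used values):

f(x) = \begin{cases} \left( x / x_{\max} \right)^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}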

GloVe embeddings can express semantic and syntactic relationships through vector addition and subtraction. GloVe often performs even better than Word2Vec on many NLP tasks, as it also captures global co-occurrence statistics.
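A quick way to see this vector arithmetic is with pre-trained GloVe vectors loaded through Gensim’s downloader. This is a sketch rather than part of the original article; the model name below is one of the standard pre-trained sets shipped with gensim-data.

```python
import gensim.downloader as api

# Pre-trained 50-dimensional GloVe vectors (Wikipedia + Gigaword).
# Downloads the model on first use.
glove = api.load("glove-wiki-gigaword-50")

# Classic analogy via vector arithmetic: king - man + woman ≈ queen
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# "water" is close to both "ice" and "steam", as in the ratio example above.
print(glove.similarity("ice", "water"), glove.similarity("steam", "water"))
```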

Note..

Next: NLP Zero to One: Training Embeddings using Gensim and Visualisation (Part 7/30)
Previous: NLP Zero to One: Dense Representations, Word2Vec (Part 5/30)
