Intuition Behind Word Embeddings in NLP For Beginners?

Understanding Word2Vec, CBOW, Skip-gram model.

@pramodchandrayan
Predict
9 min read · Aug 19, 2020


“A word means nothing until it is combined with context and the right human emotion for the given situation.”

Welcome back, folks,

This is my 5th article in the NLP series, building on what we covered earlier.

So, before we get into how words are represented as vectors using Word2Vec, it is important to understand:

What Is the Meaning of Any Given Word?

If we humans have to express an idea, we rely on giving it a textual label, which becomes a word, or we create a phrase, or we use some visual tool to do so. This representation of our idea is what we can call word meaning.

But in the field of textual computation, the meaning of words alone doesn’t help much, as the surrounding nuances of context are often missing from those meanings. Humans have, however, come up with technology to make word meanings useful, and one of the most important tools has been the taxonomy of words, which can help us answer some fundamental questions like:

  • Word synonyms
  • Word antonyms
  • Word hypernyms

For example, the WordNet corpus, available in Python through the NLTK package, helps us meet some of the above-mentioned objectives in a word computation system.
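Here is a minimal sketch of those lookups (assuming NLTK is installed and the WordNet corpus has been downloaded); the example words are chosen only for illustration:

```python
# Minimal sketch: synonyms, antonyms and hypernyms via NLTK's WordNet interface.
# Assumes `pip install nltk`; the WordNet corpus is downloaded on first use.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

synonyms, antonyms = set(), set()
for synset in wn.synsets("beautiful"):
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())
        antonyms.update(a.name() for a in lemma.antonyms())

print("Synonyms:", sorted(synonyms))
print("Antonyms:", sorted(antonyms))

# Hypernyms ("is-a" parents) are defined on synsets, e.g. for the noun "dog":
dog = wn.synsets("dog")[0]
print("Hypernyms of 'dog':", [h.name() for h in dog.hypernyms()])
```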

Problem With Discrete Representation of Words:

Though WordNet is extremely useful in terms of giving multiple related words for any given word, for example synonyms, the idea behind it has some fundamental issues:

Beautiful: alluring, appealing, charming, cute, dazzling, good-looking

Here the taxonomical representation of beautiful gives us a bunch of related words, which can actually have different meanings in different contexts and situations. The required situational nuances are missing, and the actual similarity between the words is not effectively represented, which is problematic when we are trying to make sense of words or phrases.

What Is The Solution To Discrete Word Representation?

As we discussed above, one can completely misread the actual meaning of a word if the accompanying nuances, like situation and emotion, are not taken into consideration while establishing the relationship between a given set of words.

Discrete word representation is extremely subjective: it fails to compute the similarity between given words and sometimes misses the nuances completely.

To get rid of such issues, computational linguists came up with the idea of the “distributed representation of words”.

Idea Behind Distributed Word Representation

“If there is no problem, nothing new can be invented or discovered.”

So the above problem of discrete word representation, which generally fails to capture and effectively represent the similarity between words, is what led to the idea of distributed word representation,

where we represent any given word based on the meanings of its neighboring words and end up extracting the real value out of it.

For Example: Read the below para,

“ The world of online education will become extremely relevant and significant for students who are looking to acquire new skills and get a world-class education from world-class teachers. ”

So if we carefully process the above paragraph, we can easily see that the word education is represented by its neighboring words like online, students, teachers, skills, etc.

This idea of representing words has been extremely useful for computational linguistics and is the core concept behind the world of word embeddings in NLP.

Word Embeddings In NLP?

So if a given word is to be represented by its accompanying neighboring words, the best kind of representation is a vector representation of those words.

What Are Word Embeddings?

As per wiki:

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers

In simpler terms:

Word embeddings are a type of word representation, in the form of a vector of real numbers, that allows words with similar meaning to have a similar representation.

The idea was to come up with a dense representation of each word as a vector, to overcome the limitations of the discrete one-hot encoding representation, which is expensive, suffers from dimensionality issues, and is extremely sparse.
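To make the contrast concrete, here is a small sketch (with a made-up five-word vocabulary and made-up embedding values) of a sparse one-hot vector versus a dense embedding vector for the same word:

```python
# Sketch: one-hot (sparse, vocabulary-sized) vs. dense embedding (short,
# real-valued) representation of a word. Vocabulary and values are made up.
import numpy as np

vocab = ["the", "black", "monkey", "went", "mad"]
word_to_index = {w: i for i, w in enumerate(vocab)}

one_hot = np.zeros(len(vocab))
one_hot[word_to_index["monkey"]] = 1.0
print(one_hot)  # [0. 0. 1. 0. 0.] -- grows with the vocabulary size

# A dense embedding is short and real-valued; in practice its values are
# learned during training rather than hand-picked like these.
embedding = np.array([0.12, -0.48, 0.90, 0.33])
print(embedding)
```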

Conceptually, word embeddings involve dimensionality reduction over correlated words and probabilistic language models that capture the context in which words are used, so that this context can be exploited in a neural network architecture.

To sum up the idea of word embeddings, it would be apt to quote the famous lines said by John Firth:

You shall know a word by the company it keeps!

As per Tensorflow:

An embedding is a dense vector of floating-point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer).

Below is a 4-dimensional vector representation of the words cat, mat, and on:

[Image: 4-dimensional embedding vectors for “cat”, “mat”, and “on” (source)]
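A minimal sketch of such a trainable embedding layer, along the lines of the TensorFlow description quoted above (the vocabulary and dimensions here are chosen purely for illustration):

```python
# Sketch: a trainable Keras Embedding layer mapping word indices to
# 4-dimensional float vectors. The vectors start out random and are
# learned as model weights during training.
import tensorflow as tf

vocab = ["cat", "mat", "on"]
embedding_layer = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=4)

indices = tf.constant([0, 1, 2])    # cat, mat, on
vectors = embedding_layer(indices)  # shape (3, 4)
print(vectors.numpy())
```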

What Are Some Of The Most Used Word Embeddings Techniques in NLP?

  1. Word2Vec
  2. GloVe

We will cover the Word2Vec word embedding technique in detail and will look into GloVe embeddings in the next part of the NLP series.

Word2Vec Embeddings:

It is one of the most established methods of generating word embeddings.

Word2Vec was developed by Tomas Mikolov and his team at Google in 2013, with the objective of making neural-network-based embedding learning more efficient. It has since become a de facto standard for developing pre-trained word embeddings.

Word2Vec represents each distinct word with a particular list of numbers called a vector. It helps us efficiently learn word embeddings from a text corpus, using a neural network that learns word associations from a large body of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.

Word2Vec uses the math of cosine similarity between the learned vectors, in such a manner that words represented by similar vectors are semantically similar.
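For reference, cosine similarity between two word vectors can be computed as in the small sketch below (the vectors themselves are made up for illustration):

```python
# Sketch: cosine similarity between two word vectors (illustrative values).
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v_king = np.array([0.5, 0.8, -0.1])
v_queen = np.array([0.45, 0.75, -0.05])
print(cosine_similarity(v_king, v_queen))  # close to 1.0 => semantically similar
```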

Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space, such that words that share common contexts in the corpus are located close to one another in the space.
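Here is a hedged sketch of training such a model with the gensim library (parameter names follow gensim 4.x; the toy corpus is made up, so the resulting vectors will not be meaningful):

```python
# Sketch: training a Word2Vec model with gensim on a tiny toy corpus.
# sg=0 selects the CBOW architecture, sg=1 the skip-gram architecture
# (both are described below).
from gensim.models import Word2Vec

sentences = [
    ["the", "black", "monkey", "went", "mad"],
    ["online", "education", "helps", "students", "and", "teachers"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["monkey"])                  # the learned 50-dimensional vector
print(model.wv.most_similar("education"))  # nearest words by cosine similarity
```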

A simple Word2Vec NN architecture (sketched in code after this list) has:

  • A single hidden layer
  • A fully connected neural network
  • Linear neurons in the hidden layer
  • An input layer with as many neurons as there are words in the training vocabulary
  • A hidden layer whose size is set to the dimensionality of the resulting word vectors, and an output layer of the same size as the input layer

[Image: Word2Vec architecture]
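As a sketch of the architecture just described (the vocabulary size and embedding dimension are assumed, and real Word2Vec implementations use more efficient training tricks such as negative sampling):

```python
# Sketch: a simple Word2Vec-style network -- a linear (no activation)
# hidden layer of the embedding size and a softmax output over the vocabulary.
import tensorflow as tf

vocab_size = 10000   # number of words in the training vocabulary (assumed)
embedding_dim = 300  # dimensionality of the resulting word vectors (assumed)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(vocab_size,)),               # one-hot encoded input word
    tf.keras.layers.Dense(embedding_dim, use_bias=False),     # linear hidden layer
    tf.keras.layers.Dense(vocab_size, activation="softmax"),  # probability of each output word
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```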

Types Of Model Architecture Proposed By Word2Vec Embeddings:

  • CBOW: Continuous Bag Of Words Model
  • Continuous Skip-gram model

CBOW: Continuous Bag Of Words Architecture:

The CBOW model learns the embedding by predicting the target word based on its context, that is, based on the surrounding words.

The context here is represented by multiple words for a given target word. The objective of the CBOW model is to learn to predict a missing word given its neighboring words.

Intuition Behind CBOW:

Let’s understand the main intuition behind the CBOW model using a simple example. Suppose the phrase is: “The black monkey went mad”.

Now CBOW has to predict the target word monkey based on the surrounding words {The, black, went, mad}, which may look like the below combinations of words:

(The → monkey), (black → monkey), (went → monkey), (mad → monkey)

The CBOW visualization and a simple CBOW architecture for the above example would look like this:

[Image: CBOW architecture (source)]

Where,

  • The input layer holds the possible surrounding context words
  • The output layer holds the current (target) word
  • The hidden layer contains the number of dimensions in which we want to represent the target word present at the output layer

CBOW uses the n words both before and after the target word w(t) to make its prediction, as sketched below.
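A small sketch of how such (context → target) training pairs could be generated from the example phrase (the helper function here is illustrative, not taken from any library):

```python
# Sketch: CBOW-style (context -> target) pairs from the example phrase,
# with a window of 2 words on each side. Illustrative helper function.
sentence = "the black monkey went mad".split()
window = 2

def cbow_pairs(tokens, window):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

for context, target in cbow_pairs(sentence, window):
    print(context, "->", target)
# e.g. ['the', 'black', 'went', 'mad'] -> 'monkey'
```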

Continuous Skip-Gram Model:

It functions exactly opposite to CBOW. In the skip-gram model, instead of using the surrounding words to predict the missing/target word, we use the target word to predict the surrounding words, or context.

The continuous skip-gram model learns by predicting the surrounding words given a current center word.

So, for the same phrase “The black monkey went mad”, the skip-gram model would look like this:

Where,

  • We are trying to predict the context of the given target word monkey
  • Here the representation is:
  • (monkey → The), (monkey → black), (monkey → went), (monkey → mad)

The skip-gram model thus reverses the roles of the target and context words, as in the sketch below.
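Mirroring the CBOW sketch above, skip-gram-style (target → context) pairs for the same phrase could be generated like this (again an illustrative helper, not a library function):

```python
# Sketch: skip-gram-style (target -> context) pairs from the example phrase,
# with a window of 2 words on each side.
sentence = "the black monkey went mad".split()
window = 2

def skipgram_pairs(tokens, window):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

print(skipgram_pairs(sentence, window))
# includes ('monkey', 'the'), ('monkey', 'black'), ('monkey', 'went'), ('monkey', 'mad')
```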

Simple Architecture of the Skip-gram Model:

[Image: skip-gram architecture]
  • The target word is fed to the input layer as w(t)
  • The hidden layer remains the same
  • The output layer of the neural network is replicated multiple times to accommodate the chosen number of context words, as shown in the image above

The skip-gram objective thus sums the log probabilities of the surrounding n words to the left and to the right of the target word w(t) to achieve the objective of finding the context.
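Written out (this is the standard formulation from the Word2Vec literature, not a formula given explicitly in this post), for a corpus of T words and a window of n words on each side, the skip-gram model maximizes the average log probability:

$$
J(\theta) = \frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-n \le j \le n \\ j \ne 0}} \log p\left(w_{t+j} \mid w_{t}\right)
$$

where p(w_{t+j} | w_t) is typically a softmax over the vocabulary computed from the input and output word vectors.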

Word2Vec Applications In Real World:

Analyzing Verbatim Comments:

Many organizations, big and small, use the power of Word2Vec embeddings to analyze verbatim customer comments. When you’re analyzing text data, an important use case is analyzing such open-ended comments, and data engineers are often given the task of coming up with an algorithm that can mine customers’ comments or reviews.

Building Product/Movie/Music Recommendation:

A returning or new customer coming to an eCommerce or media content portal gets personalized recommendations based not only on other customers whose browsing behavior is similar to theirs, but also on what kind of content or products customers experience together in a given situation. This information adds value by enabling better customer offerings and an awesome user experience.

Some more common applications are

  • Sentiment analysis
  • Document classification

and many more.

What’s Next In The NLP Series?

Well, we will go hands-on to understand how one can implement the CBOW and Skip-gram Word2Vec models using Python and the gensim library. We will also cover GloVe embeddings.

Signing-off with food for thought:

“Our existence as human beings is nothing without the nature in which we exist. Our life will become lifeless if we do not take good care of the surroundings which Mother Nature has bestowed on us. The same goes for linguistic computation: words have no meaning until the surrounding words come together to extract a meaningful relationship, which eventually helps us make sensible and meaningful predictions.”

Thanks a lot, and I look forward to seeing you all in Part 6 of the NLP series.


Building @krishaq: an Agritech startup committed to reviving farming, farmers and our ecology | Writes often about agriculture, climate change & technology