Word2Vec in Practice for Natural Language Processing

Roshan Nayak · Published in The Startup · May 29, 2020

In this post I will take you through:

  1. What are Word Embeddings?
  2. Data Preprocessing for Word2Vec.
  3. Training a Word2Vec model.
  4. Word similarity.

Before diving into Word2Vec, we need to understand what word embeddings actually are and why they are required. So let's get started.

What are Word Embeddings?

Word embeddings plotted on a 2D plane. Source of the image: shorturl.at/FHQSU

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. Since we cannot feed words directly into neural networks, we have to represent them as vectors of some length. The length of these vectors is basically a hyperparameter. There are a number of algorithms we can use to map words to vectors, and Word2Vec is one of them. It performed well when compared with the algorithms that already existed and were in use before it.

Now, why is Word2Vec better than Bag of Words and TF-IDF? If we look at how Bag of Words and TF-IDF actually work, we can clearly see that they give no sense of where a word is used or in which context it is used. They are just vectors of numbers that do not carry much meaning. The embeddings produced by Word2Vec, on the other hand, are such that words with similar meanings have almost the same embedding vector. This gives us a sense of where a particular word would actually be used. Below you can see what the Word2Vec embeddings of some words look like. The values are color-coded; each color represents some value.

The word embeddings of "king" and "queen" are almost similar, as are those of "man" and "woman". Source of the image: Jay Alammar's blog: http://jalammar.github.io/images/word2vec/queen-woman-girl-embeddings.png

Now you can do some arithmetic like King - Man + Woman ≈ Queen.

This can be explained as follows: the word King carries the notions of royalty and of male gender. If we subtract Man from it, the gender information is lost but the royalty remains. When we then add Woman, the embedding holds the information of royalty together with female gender, which is basically Queen. In this way, each word vector holds some information that gives a sense of the word's meaning, so words with similar meanings end up with almost the same embedding.
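You can try this arithmetic yourself with Gensim's most_similar() method, which accepts positive and negative word lists. Here is a minimal sketch, assuming a model trained on a large corpus (a pretrained GloVe model fetched through gensim.downloader; the small dataset used later in this post is nowhere near big enough for this to work):

import gensim.downloader as api

# load pretrained word vectors (downloads them on first use)
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman -> the closest remaining word is typically 'queen'
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))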

If you would like to learn more about what goes on under the hood when learning these embeddings with Word2Vec, you can read the following post:

http://jalammar.github.io/illustrated-word2vec/

Now that we have a sense of what word embeddings are, let's look at how we actually embed words using Word2Vec.

First, you need to install the Gensim library on your machine if you do not already have it. This can be done with a simple pip or conda command.
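For example, either of the standard install commands below should work:

pip install gensim
# or, if you use Anaconda:
conda install -c conda-forge gensim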

Now let's look at the data preprocessing part. The dataset I am using contains some simple sentences. I will provide the GitHub link to the dataset later in the post.

Data Preprocessing for Word2Vec

Gensim's Word2Vec requires the training data in a 'list of lists' format, where every document is a list containing the tokens of that document. So, before training the model, we need to convert our sentences into this format. To be more specific, it is a list of lists of words, something like this: [['I', 'am', 'here'], ['yes', "it's", 'correct']]. The code to generate the list of lists is given below:

#import the libraries
import pandas as pd
from nltk.tokenize import word_tokenize  #you may need to run nltk.download('punkt') once
#read the data from the excel/csv file into a single 'sentences' column.
data = pd.read_excel(file_path, names=['sentences'])
sentences = list(data['sentences'])
sentences = [sentence.lower() for sentence in sentences]  #lowercase all the characters.
list_of_list = [word_tokenize(sentence) for sentence in sentences]  #create the required list of lists format.
Output: the first two elements of the list look something like this:
>>>list_of_list[0:2]
[['drunk', 'bragging', 'staffer', 'started', 'russian', 'collusion', 'investigation'], ['sheriff', 'david', 'clarke', 'becomes', 'an', 'internet', 'joke', 'for', 'threatening', 'to', 'poke', 'people']]

So basically, a list of lists is nothing but a list containing lists of words.

Training a Word2Vec model

Now let's look into the parameters of the word2vec model.

  1. list_of_list: The training data, the same list of lists that we generated during the preprocessing phase.
  2. min_count: The minimum number of occurrences a word must have to be considered during training; words that occur fewer times are ignored. The default for min_count is 5.
  3. size: The size of the embedding vector for each word; the default is 100.
  4. workers: The number of worker threads used to train the model; the default is 3.
  5. window: The maximum distance between a target word and the words around it; the default window is 5.

Now, the code to train the model is given below:

>>> from gensim.models import Word2Vec
>>> model = Word2Vec(list_of_list, min_count=1, size=50, workers=3, window=4)
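A quick note: on Gensim 4 and later the size parameter was renamed to vector_size, so the equivalent call there looks like this:

>>> from gensim.models import Word2Vec
>>> model = Word2Vec(list_of_list, min_count=1, vector_size=50, workers=3, window=4)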

Now, to have a look at the word embedding of any word in the vocabulary (with min_count=1 every word is kept), you can simply index the model with that word. An example is provided below.

>>> model['drunk']
array([ 0.00175025, -0.00631893, -0.00175 , 0.00838332, 0.00280083, 0.00377184, 0.00183425, -0.00615874, 0.00777504, 0.00380259, -0.00834231, 0.00379236, 0.00138776, -0.0038057 , 0.00721396, 0.00163374, 0.00017319, 0.00034911, 0.00850956, -0.00258099, -0.00646694, 0.00314683, 0.00375721, -0.00179866, 0.00040652, 0.00085937, -0.00172446, 0.00848798, -0.00937741, -0.00678314, 0.00155409, 0.00540033, -0.00324092, 0.0082706 , 0.00548601, -0.009266 , 0.00593508, 0.00584892, 0.00901981, 0.00890178, -0.00093049, 0.00156764, 0.00205584, -0.00097351, 0.00111838, 0.00421014, -0.00288082, 0.00719491, -0.00207634, -0.00019168], dtype=float32)
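If you want to check which words made it into the model's vocabulary, here is a quick sketch, assuming Gensim 3.x as used throughout this post:

# check the model's vocabulary (on Gensim 4+ use model.wv.key_to_index instead)
print(len(model.wv.vocab))        # number of words kept in the vocabulary
print('drunk' in model.wv.vocab)  # True, since min_count=1 kept every word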

Word similarity

Before looking at the similarity between two words, let's see how the similarity is actually calculated. Since the words are represented as vectors, we calculate the similarity between these vectors.

How do we check the similarity between two vectors? Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.

visualizing cosine similarity calculation. Source of the image: shorturl.at/tJKSZ

And the formula to calculate the similarity is given by:

similarity(A, B) = cos(θ) = (A · B) / (||A|| ||B||)

Mathematical expression to calculate cosine similarity. Source of the image: shorturl.at/fmMQ7
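To make the formula concrete, here is a small sketch of the same calculation in NumPy (the cosine_similarity helper below is written just for illustration and is not part of Gensim; it uses the model trained above and assumes 'he' and 'robert' are in its vocabulary):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vec_he = model['he']            # on Gensim 4+, index model.wv instead
vec_robert = model['robert']
print(cosine_similarity(vec_he, vec_robert))  # should match model.similarity('he', 'robert')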

Now that we have a trained model, its similarity() method calculates the cosine similarity between two words. Let's have a look at the similarity between some of them:

>>> model.similarity('he', 'robert')
0.31170955300331116
>>> model.similarity('he', 'handcuffs')
0.13632124662399292

This shows that 'robert' is more similar to 'he' than 'handcuffs' is.

The most_similar() method gives the top ten words that are most similar to the word under consideration.

>>> model.most_similar('republican')
[('investigation', 0.3500652015209198),
('going', 0.2938186824321747),
('racism', 0.21984194219112396),
('mocks', 0.2088545560836792),
('out', 0.20704494416713715),
('mueller', 0.19829972088336945),
('brutally', 0.18746031820774078),
('fbi', 0.1826065480709076),
('for', 0.17787611484527588),
('staffer', 0.17109911143779755)]

Summary

We worked on a very small dataset so that you can understand the process better, instead of starting with a huge dataset. I would encourage you to try this on a larger dataset to get a better sense of the word embeddings and of the cosine similarity between words.

That's it for this post. Thank you for reading. Have a great day ahead :) Do upvote if you liked it and learned something from it.

GitHub link for the dataset and the code is given below:

References:

  1. Word2Vec from Jay Alammar: http://jalammar.github.io/illustrated-word2vec/
  2. A blog from Jason Brownlee: https://machinelearningmastery.com/what-are-word-embeddings/

Do connect with me on LinkedIn. My LinkedIn profile link is given below:

If you haven’t read my previous blog on Data Augmentation in Natural Language Processing, do read it.
