Chapter 9.2: NLP - Code for the Word2Vec neural network (TensorFlow).

Madhu Sanjeevi ( Mady )
Deep Math Machine learning.ai
4 min read · Oct 28, 2017

Last story we talked about word vectors; in this story we write the code to build the word2vec model using TensorFlow. Let’s get started!!!

Overview

Let’s first take a data set (unstructured data). Here I take the subtitles of the How I Met Your Mother series, but you can take any data (it does not matter).

The sentence column is the actual raw data, so we need to normalize it (remove symbols, extra spaces, etc.).
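A rough sketch of that normalization step (the names here are illustrative, not the exact code from my Github; `raw_sentences` stands in for the sentence column):

```python
import re

# `raw_sentences` stands in for the sentence column of the subtitles data
raw_sentences = ["Kids, I'm going to tell you an incredible story!",
                 "The story of how I met your mother."]

def normalize(sentence):
    # lowercase, drop everything except letters/digits/spaces, collapse whitespace
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-z0-9\s]", " ", sentence)
    return re.sub(r"\s+", " ", sentence).strip()

sentences = [normalize(s) for s in raw_sentences]
print(sentences)
# ['kids i m going to tell you an incredible story', 'the story of how i met your mother']
```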

So now we have the clean data with us. Let’s create a dictionary out of these sentences (here I take unigrams, i.e. single words, but you can take bi-grams also).

Here we first identified the unique words by counting them, then created the dictionary (the word order depends on the count).
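Roughly, the dictionary step looks like this (continuing the sketch above; `collections.Counter` does the counting for us):

```python
import collections

# every unigram in the cleaned sentences
words = " ".join(sentences).split()

# count the unique words, most frequent first
count = collections.Counter(words).most_common()

# word -> index (index order follows the counts) and the reverse mapping
word2index = {word: i for i, (word, _) in enumerate(count)}
index2word = {i: word for word, i in word2index.items()}
voc_size = len(word2index)
```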

Let me take the first sentence in our data set to see how it looks.

We just replaced the words with numbers. Remember, these numbers are not the vectors; they are just the indexes in our dictionary.
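Encoding that first sentence could look like this (a small sketch using the `word2index` dictionary from above):

```python
first_sentence = sentences[0]
encoded = [word2index[w] for w in first_sentence.split()]
print(first_sentence)
print(encoded)  # dictionary indexes for each word, not vectors
```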

Now let’s create the word embeddings (word2vec).

To build the model we can use skip-gram (as we discussed in the last story).

Continuous Bag-of-Words model (CBOW)

It predicts one word based on the surrounding words (it takes an entire context as an observation, or a window-sized context as an observation).

Ex: Text = “Mady goes crazy about machine learning” and window size is 3

It takes 3 words at a time and predicts the center word from the surrounding words → [[“Mady”, “crazy”], “goes”] → “goes” is the target word, and the other two are inputs.

Skip-Gram model

It takes one word as input and tries to predict the surrounding (neighboring) words,

[“goes”, “Mady”], [“goes”, “crazy”] → “goes” is the input word and “Mady” and “crazy” are the surrounding words (output probabilities).

Now we have the skip-gram pairs as X and y values, so let’s create a function that gives us the pairs batch-wise.
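A sketch of that pairing and batching step (window of 1 word on each side here; the window size and function names are illustrative choices):

```python
import numpy as np

# (input word, neighbouring word) index pairs, window of 1 on each side
skip_gram_pairs = []
for sent in sentences:
    idx = [word2index[w] for w in sent.split()]
    for i in range(1, len(idx) - 1):
        skip_gram_pairs.append([idx[i], idx[i - 1]])  # centre -> left neighbour
        skip_gram_pairs.append([idx[i], idx[i + 1]])  # centre -> right neighbour

def get_batch(size):
    # sample `size` random pairs; X holds the centre words, y the context words
    picks = np.random.choice(len(skip_gram_pairs), size, replace=False)
    X = [skip_gram_pairs[p][0] for p in picks]
    y = [[skip_gram_pairs[p][1]] for p in picks]  # nce_loss wants shape [batch, 1]
    return X, y
```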

Now we have the batch inputs to feed to the neural network, so let’s build the neural network using TensorFlow.

As we discussed in the last story, the word2vec model is a 3-layer neural network (input, hidden and output).

Here the hidden layer is just the dot product of the inputs and the weights (no activation function here).

The output layer is trained with noise-contrastive estimation (NCE) as the loss. We could use a full softmax here as well, but NCE scales better; you can read about it in the paper (NCE approximates the softmax by using a few sampled negative examples).

And we can define the size of the hidden layer (here I took [voc_size, embedding_size(2)], but you can choose whatever you wish).
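Putting the pieces together, a minimal TensorFlow 1.x graph could look like this (hyperparameters such as `batch_size` and `num_sampled` are illustrative values, not the exact ones from my Github):

```python
import tensorflow as tf  # TF 1.x style graph code

batch_size = 20
embedding_size = 2      # 2-D so we can plot the word vectors directly
num_sampled = 10        # negative samples for the NCE loss (illustrative value)

train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

# "hidden layer": look up the input word's row in the weight matrix, which is
# the same as a dot product with a one-hot input -- no activation function
embeddings = tf.Variable(tf.random_uniform([voc_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# output-layer weights and biases used by the NCE loss
nce_weights = tf.Variable(tf.random_uniform([voc_size, embedding_size], -1.0, 1.0))
nce_biases = tf.Variable(tf.zeros([voc_size]))

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=voc_size))

train_op = tf.train.AdamOptimizer(0.1).minimize(loss)
```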

Okay, let’s train the model for 10,000 iterations.
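A sketch of the training loop (TF 1.x session style; the print interval is arbitrary):

```python
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10000):
        batch_inputs, batch_labels = get_batch(batch_size)
        _, loss_val = sess.run([train_op, loss],
                               feed_dict={train_inputs: batch_inputs,
                                          train_labels: batch_labels})
        if step % 1000 == 0:
            print("step", step, "loss", loss_val)
    # the learned embedding matrix is the final word-vector table
    trained_embeddings = embeddings.eval()
```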

So the error decreased from about 66 to 5, which is okay. The model has now learned the weights, so we have the final embeddings.

Let’s visualize them
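A minimal plotting sketch with matplotlib (assuming `trained_embeddings` and `index2word` from the sketches above):

```python
import matplotlib.pyplot as plt

# scatter the first 100 words in the 2-D embedding space and label each point
for i in range(min(100, voc_size)):
    x, y = trained_embeddings[i]
    plt.scatter(x, y)
    plt.annotate(index2word[i], xy=(x, y))
plt.show()
```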

So here are the first 100 words in vector space

The main theme of word2vec is that we get similar vectors for words like “India”, “China”, “America”, “Germany”, etc. from a big corpus, even though we never label them or tell the model that those are country names.

So if I give a text like “I live in ____”, we can get predictions like “India”, “China”, “America”, “Germany”, etc.

If we try with different data, we get this:

Based on the data, it learns which words are similar (the idea is that similar words appear in similar contexts).

The more data, the better results.

Well, that’s it for this story.

In the next story I will cover another interesting NLP/deep learning concept. Until then, see ya!

Full code is available on my Github.
