Implementing Word2Vec in Tensorflow

Saurabh Pal
Published in Analytics Vidhya
6 min read · Jun 22, 2019


According to Wikipedia, “Word2vec is a group of related models that are used to produce word embeddings”. Word2vec is a powerful model released by Google that represents words in a feature space while preserving their contextual relationships.

“A man can be identified by the company he keeps.” Similarly, a word can be identified by the group of words that are frequently used with it. This is the idea that Word2Vec is based upon.

This article requires a thorough understanding of neural networks. Use the following articles for a quick recap.

An Introductory Guide to Deep Learning and Neural Networks

Understanding and coding Neural Networks From Scratch in Python and R

An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec

Table of contents:

  1. What are word embeddings?
  2. Continuous Bag of Words Model
  3. Skip Gram Model
  4. Implementation
  5. Visualization
  6. Conclusion

What are word embeddings?

Any algorithm that works on text data needs some numerical representation of words, since computers don’t understand text directly (as of now). The input words therefore need to be converted into a form the algorithm can understand. One of the most popular ways is one-hot encoding, where every word is represented as a vector containing a 1 at its position in the vocabulary and 0 everywhere else.

For example, consider our corpus to be the single sentence “The fox is too lazy”. Our vocabulary is [‘the’, ‘fox’, ‘is’, ‘too’, ‘lazy’]. The one-hot encodings for the respective words are:

the -> [1,0,0,0,0]
fox -> [0,1,0,0,0]
is -> [0,0,1,0,0]
too -> [0,0,0,1,0]
lazy -> [0,0,0,0,1]

The problem with such an encoding is that it cannot capture the relationship between different words, because all the vectors are orthogonal to each other: the cosine similarity between the one-hot vectors of any two different words is always 0. One-hot encoding can also significantly increase the dimensionality of the dataset, as every word of the vocabulary is treated as an individual feature.

We therefore need a representation in which similar words have similar vectors, and this is where Word2Vec comes into the picture. Intuitively, Word2Vec tries to learn the representation of every word from the other words that generally occur in its vicinity, and so it is able to capture the contextual relationships between words.
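The zero-similarity property of one-hot vectors is easy to check by hand. The snippet below is a minimal sketch using the five-word vocabulary from the example above; the helper names one_hot and cosine are chosen here for illustration and are not part of the article's code.

```python
import math

# Toy vocabulary from the example sentence "The fox is too lazy".
vocab = ["the", "fox", "is", "too", "lazy"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with 1 at the word's vocabulary position and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(one_hot("fox"))                           # [0, 1, 0, 0, 0]
print(cosine(one_hot("fox"), one_hot("lazy")))  # 0.0
print(cosine(one_hot("fox"), one_hot("fox")))   # 1.0
```

Every pair of different words scores 0, so the encoding says nothing about which words are related.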

The famous King, Queen, Man example: if we take the difference of the vector representations of King and Queen produced by Word2Vec, the resulting vector is very similar to the difference of Man and Woman, which means that the embedding contains information about gender.

Word2Vec has two variants: one based on the Skip Gram model and the other based on the Continuous Bag of Words model.
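As a rough illustration of what comparing difference vectors means, here is a sketch with hand-picked two-dimensional toy vectors. The numbers and the meaning of the two axes are assumptions made up for this example. They are not output from a trained Word2Vec model, where the two differences would only be approximately equal.

```python
# Hand-picked 2-D toy vectors (NOT learned embeddings):
# first axis ~ "royalty", second axis ~ "gender".
toy = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
}

def subtract(u, v):
    """Element-wise difference of two vectors."""
    return tuple(a - b for a, b in zip(u, v))

royal_offset = subtract(toy["king"], toy["queen"])
common_offset = subtract(toy["man"], toy["woman"])

# Both differences point along the "gender" axis only.
print(royal_offset, common_offset)  # (0.0, 2.0) (0.0, 2.0)
```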

Continuous Bag of Words Model

In the continuous bag of words model we try to predict a word from its surrounding words (the context words). The input to the model is the one-hot encoded vector of the context words within the window size. The window size is a hyperparameter: it is the number of context words on either side of the current word (that is, occurring before and after it) that are used to predict it.

Window Size

Let’s take an example.

“The fox is too lazy to do anything.” Let’s say the word under consideration is ‘lazy’. For a window size of 2, the input vector will have ones at the positions corresponding to the words ‘is’, ‘too’, ‘to’ and ‘do’.
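The window example above can be reproduced in a few lines. This is a minimal sketch: the function name context_words, the simple tokenisation and the alphabetical vocabulary order are assumptions made for illustration.

```python
sentence = "The fox is too lazy to do anything."
# Lowercase and strip the trailing full stop so tokens match the vocabulary.
tokens = [t.strip(".").lower() for t in sentence.split()]

def context_words(tokens, position, window_size):
    """Return the words within window_size positions on either side."""
    start = max(0, position - window_size)
    end = min(len(tokens), position + window_size + 1)
    return [tokens[j] for j in range(start, end) if j != position]

vocab = sorted(set(tokens))
target = tokens.index("lazy")
context = context_words(tokens, target, window_size=2)

# CBOW input: a vector with ones at the positions of the context words.
cbow_input = [1 if word in context else 0 for word in vocab]

print(context)     # ['is', 'too', 'to', 'do']
print(cbow_input)  # [0, 1, 0, 1, 0, 0, 1, 1]
```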

Skip Gram Model

In the skip gram model we try to predict the context words that fall within the window size of the word under consideration. We will implement the skip gram model in the following section.

The skip gram model proceeds as follows:

  1. Implement a 3-layer neural network (input, hidden and output).
  2. The training input is the one-hot encoded vector of the word under consideration, and the training output is the one-hot encoded vector of a word that falls within the window size. For example, let’s say our corpus is the sentence “The quick fox jumped over the lazy dog.” Then, with a window size of 2, the training pairs for the skip gram model are [‘fox’, ‘quick’], [‘fox’, ‘jumped’], [‘fox’, ‘the’], [‘fox’, ‘over’] and so on.
  3. After training, to get the representation of a word, feed its one-hot encoded vector to the network as input; the representation is the output of the hidden layer.

The same model can be used to train both the skip gram and the continuous bag of words models; only the input and output training vectors get swapped.
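The pair generation described in step 2 can be sketched as follows. The function name skip_gram_pairs and the window size of 2 are assumptions for illustration; the article's own implementation appears in the next section.

```python
sentence = "The quick fox jumped over the lazy dog."
tokens = [t.strip(".").lower() for t in sentence.split()]

def skip_gram_pairs(tokens, window_size):
    """Return (centre word, context word) pairs for every position."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

pairs = skip_gram_pairs(tokens, window_size=2)
fox_pairs = [p for p in pairs if p[0] == "fox"]
print(fox_pairs)
# [('fox', 'the'), ('fox', 'quick'), ('fox', 'jumped'), ('fox', 'over')]

# Swapping each pair gives (context word, centre word), the CBOW direction.
cbow_style = [(context, centre) for centre, context in pairs]
```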


Implementation

First, let’s import the necessary libraries.

import numpy as np
import tensorflow as tf  # the code below uses the TensorFlow 1.x API (tf.placeholder, tf.Session)
import re
import nltk
import sys
from collections import OrderedDict
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline

Now, let’s read the file to be used as our training corpus. Here we have used the book Harry Potter and the Sorcerer’s Stone.

file = open("J. K. Rowling - Harry Potter 1 - Sorcerer's Stone.txt",'r')
raw_data_1 = file.read()  # load the whole book as a single string
file.close()

Now we will split the corpus into words and perform some cleaning steps to remove numbers, punctuation, etc.

words = raw_data_1.split()
words = [ word.lower() for word in words if len(word)>1 and word.isalpha()]
vocab = set(words)
char_to_int = dict((c,i) for i,c in enumerate(vocab))
int_to_char = dict((i,c) for i,c in enumerate(vocab))

char_to_int is a dictionary that maps every word of our vocabulary to a unique number and int_to_char implements the inverse mapping.
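A small round trip shows how the two mappings fit together. This is a sketch on a toy word list, with the names word_to_int and int_to_word chosen here for readability. It sorts the vocabulary first, which is an assumption that differs from the code above: the order of a plain set of strings can change between Python runs, so a sorted vocabulary keeps the numbering stable if saved weights are reloaded later.

```python
words = ["the", "fox", "is", "too", "lazy", "the", "fox"]

# sorted() makes the numbering reproducible between runs; iterating a plain
# set of strings can give a different order each time Python starts.
vocab = sorted(set(words))
word_to_int = {w: i for i, w in enumerate(vocab)}
int_to_word = {i: w for w, i in word_to_int.items()}

encoded = [word_to_int[w] for w in words]
decoded = [int_to_word[i] for i in encoded]

print(encoded)           # [3, 0, 1, 4, 2, 3, 0]
print(decoded == words)  # True
```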

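X = []
Y = []
temp_dict = []
window_size = 10
# collect (current word, context word) pairs
for i in range(len(words)):
    a = i-window_size
    b = i+window_size
    curr_word = words[i]
    for z in range(a,i):
        if z >=0:
            temp_dict.append((curr_word,words[z]))
    for z in range(i+1,b+1):  # b+1 so that both sides use window_size words
        if z<len(words):  # bound by the corpus length, not the vocabulary size
            temp_dict.append((curr_word,words[z]))
# turn every pair into a one hot encoded input and output vector
for pair in temp_dict:
    tempx = np.zeros(len(vocab))
    tempy = np.zeros(len(vocab))
    tempx[char_to_int[pair[0]]] = 1
    tempy[char_to_int[pair[1]]] = 1
    X.append(tempx)
    Y.append(tempy)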

X and Y hold our training input and output vectors respectively. Every vector in X has a 1 at the position of the word under consideration, and the matching vector in Y has a 1 at the position of one of its context words. window_size controls the number of context words used on each side.
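Storing a full one-hot vector for every pair can use a lot of memory on a book-sized corpus. One possible alternative, sketched below, keeps only integer index pairs and builds the one-hot arrays one batch at a time. The function name one_hot_batches and the example pairs are hypothetical and are not part of the article's code.

```python
import numpy as np

def one_hot_batches(index_pairs, vocab_size, batch_size):
    """Yield (batch_x, batch_y) one-hot arrays built from integer index pairs."""
    for start in range(0, len(index_pairs), batch_size):
        chunk = index_pairs[start:start + batch_size]
        batch_x = np.zeros((len(chunk), vocab_size), dtype=np.float32)
        batch_y = np.zeros((len(chunk), vocab_size), dtype=np.float32)
        rows = np.arange(len(chunk))
        batch_x[rows, [centre for centre, _ in chunk]] = 1.0
        batch_y[rows, [context for _, context in chunk]] = 1.0
        yield batch_x, batch_y

# Hypothetical (centre index, context index) pairs for a 5-word vocabulary.
index_pairs = [(1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
batches = list(one_hot_batches(index_pairs, vocab_size=5, batch_size=2))
print(len(batches), batches[0][0].shape)  # 3 (2, 5)
```

Each yielded pair has the same shape as a slice of X and Y, so it could be fed to the placeholders in the same way.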

embedding_size = 1000
batch_size = 64
epochs = 32
n_batches = int(len(X)/batch_size)
learning_rate= 0.001
x = tf.placeholder(tf.float32,shape = (None,len(vocab)))
y = tf.placeholder(tf.float32,shape = (None,len(vocab)))
w1 = tf.Variable(tf.random_normal([len(vocab),embedding_size]),dtype = tf.float32)
b1 = tf.Variable(tf.random_normal([embedding_size]),dtype = tf.float32)
w2 = tf.Variable(tf.random_normal([embedding_size,len(vocab)]),dtype = tf.float32)
b2 = tf.Variable(tf.random_normal([len(vocab)]),dtype = tf.float32)
hidden_y = tf.matmul(x,w1) + b1
_y = tf.matmul(hidden_y,w2) + b2
#print(b.dtype)
#_y = tf.matmul(x,w)
cost = tf.reduce_mean(tf.losses.mean_squared_error(_y,y))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
init = tf.global_variables_initializer()
init_l = tf.local_variables_initializer()
saver = tf.train.Saver()
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction = 0.33)
sess = tf.Session(config = tf.ConfigProto(gpu_options = gpu_options))

Our embedding length here is 1000, i.e. every word will be represented as a vector of length 1000. We have defined a simple model with a single hidden layer.
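To make the shapes concrete, here is a NumPy sketch of the same forward pass with a tiny vocabulary. The sizes (5 words, 3 dimensions) and the random weights are stand-ins chosen for illustration, not values from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 5, 3        # small stand-ins for len(vocab) and 1000

w1 = rng.normal(size=(vocab_size, embedding_size))
b1 = rng.normal(size=embedding_size)
w2 = rng.normal(size=(embedding_size, vocab_size))
b2 = rng.normal(size=vocab_size)

x = np.zeros((1, vocab_size))
x[0, 2] = 1.0                            # one-hot input for the word with index 2

hidden = x @ w1 + b1                     # shape (1, embedding_size)
output = hidden @ w2 + b2                # shape (1, vocab_size)

# With a one-hot input the matrix product just selects one row of w1.
print(np.allclose(hidden[0], w1[2] + b1))  # True
print(hidden.shape, output.shape)          # (1, 3) (1, 5)
```

This is why the hidden layer output can serve as the word's representation: for a one-hot input it is simply that word's row of w1, shifted by the bias.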

sess.run(init)  # variables must be initialized before the first training step
for epoch in range(5):
    avg_cost = 0
    for i in range(n_batches-1):
        batch_x = X[i*batch_size:(i+1)*batch_size]
        batch_y = Y[i*batch_size:(i+1)*batch_size]
        # run one optimization step and fetch the batch loss
        _,c = sess.run([optimizer,cost],feed_dict = {x:batch_x,y:batch_y})
        avg_cost += c/n_batches
    print('Epoch',epoch,' - ',avg_cost)
save_path = saver.save(sess,'/home/temp/w2v/word2vec_weights_all.ckpt')

The model is trained for 5 epochs and the corresponding weights are also saved.

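embeddings = dict()
for i in vocab:
    temp_a = np.zeros([1,len(vocab)])
    temp_a[0][char_to_int[i]] = 1
    # note: _y is the output layer, so each vector has length len(vocab);
    # fetch hidden_y and reshape to [embedding_size] for the hidden layer output
    temp_emb = sess.run([_y],feed_dict = {x:temp_a})
    temp_emb = np.array(temp_emb)
    embeddings[i] = temp_emb.reshape([len(vocab)])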

Now the embeddings dictionary maps every word of the vocabulary to its vector representation.

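# assumed helper: the article does not show where cosine_similarity comes from
def cosine_similarity(a,b):
    return np.dot(a,b)/(np.linalg.norm(a)*np.linalg.norm(b))

def closest(word,n):
    distances = dict()
    for w in embeddings.keys():
        distances[w] = cosine_similarity(embeddings[w],embeddings[word])
    # sort by similarity, most similar first
    d_sorted = OrderedDict(sorted(distances.items(),key = lambda x:x[1] ,reverse = True))
    s_words = d_sorted.keys()
    return list(s_words)[:n]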

The function closest takes any word and n as input and finds n most similar words.
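The ranking logic can be tried without a trained model. The sketch below is a self-contained variant that takes the embeddings as an argument and skips the query word itself. The two-dimensional vectors are made up for illustration and are not trained embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest(word, n, embeddings):
    """Return the n words most similar to `word`, excluding the word itself."""
    scores = {
        other: cosine_similarity(vector, embeddings[word])
        for other, vector in embeddings.items()
        if other != word
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Made-up 2-D vectors for illustration only; these are not trained embeddings.
toy_embeddings = {
    "harry": np.array([1.0, 0.1]),
    "potter": np.array([0.9, 0.2]),
    "wand": np.array([0.1, 1.0]),
    "broom": np.array([0.2, 0.9]),
}
print(closest("harry", 2, toy_embeddings))  # ['potter', 'broom']
```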


Visualization

Now, let’s try to visualize our representations using t-SNE.

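labels = []
tokens = []
for w in embeddings.keys():
    labels.append(w)
    tokens.append(embeddings[w])
# project the embeddings down to 2 dimensions
tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
new_values = tsne_model.fit_transform(tokens)
x = []
y = []
for value in new_values:
    x.append(value[0])
    y.append(value[1])

plt.figure(figsize=(16, 16))
for i in range(len(x)):
    plt.scatter(x[i],y[i])
    # label every point with its word
    plt.annotate(labels[i],
                 xy=(x[i], y[i]),
                 xytext=(5, 2),
                 textcoords='offset points')
plt.show()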

This will plot our words in a feature space with corresponding distances between them.

t-SNE visualization of the vocabulary


Conclusion

The aim of this article is to provide an intuitive understanding of word embeddings and the models used to generate them. I hope that you understand word embeddings after reading this article. You can use your own corpus to create embeddings, experiment with different values of embedding length and window size, and use the embeddings for tasks such as sentiment classification.

Do read the Word2Vec paper and head over to the GitHub repo for the complete code.

Paper :-