Word Embedding: Word2Vec With Gensim, NLTK, and t-SNE Visualization

Afaf Athar · The Startup · 9 min read · Aug 16, 2020

What are Word Embeddings?

In extremely simplified terms, word embeddings are texts converted into numbers, and there may be different numerical representations of the same text. But before we dive into the details of word embeddings, the following question should be asked: why do we need word embeddings at all?

As it turns out, many machine learning algorithms and almost all deep learning architectures are incapable of processing strings or raw text in their original form. They require numbers as inputs to perform any job, be it classification, regression, and so on in broader terms. And with the huge amount of data that is present in text format, it is essential to extract knowledge from it and build applications.

Some live applications of text analytics are sentiment analysis of reviews by Myntra, Amazon, and so on, and document or news classification or clustering by Google, etc. Several word embedding approaches currently exist, and all of them have their pros and cons. We will discuss one of them here: Word2Vec.

For instance, consider our corpus to be a single sentence, “The quick brown fox jumps over the lazy dog”. Our tokenized sentence is [‘the’, ‘quick’, ‘brown’, ‘fox’, ‘jumps’, ‘over’, ‘the’, ‘lazy’, ‘dog’]. Now the one-hot encodings for the respective words are:

the -> [1,0,0,0,0,0,0,0,0], quick -> [0,1,0,0,0,0,0,0,0], brown -> [0,0,1,0,0,0,0,0,0], fox -> [0,0,0,1,0,0,0,0,0], jumps -> [0,0,0,0,1,0,0,0,0], over -> [0,0,0,0,0,1,0,0,0], the -> [0,0,0,0,0,0,1,0,0], lazy -> [0,0,0,0,0,0,0,1,0], dog -> [0,0,0,0,0,0,0,0,1]
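To make this concrete, here is a minimal Python sketch (just an illustration, not part of the pipeline used later) of the position-based one-hot encoding above. Note that in a real vocabulary the two occurrences of “the” would share one index, giving 8-dimensional vectors instead of 9.

# One-hot vectors for the nine token positions in the example sentence.
tokens = "the quick brown fox jumps over the lazy dog".split()

one_hot = [[1 if j == i else 0 for j in range(len(tokens))]
           for i in range(len(tokens))]

for token, vec in zip(tokens, one_hot):
    print(f"{token:>5} -> {vec}")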

Word2Vec Example

Word2Vec:

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.

Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.

First: an input layer, middle: a hidden layer, last: an output layer

“A man is known by the company he keeps”; similarly, a word can be characterized by the group of words that are frequently used with it. This is the idea that Word2Vec is based on. Word2Vec has two variants, one based on the Skip-Gram model and the other based on the Continuous Bag-of-Words (CBOW) model.

Skip Gram Model:

For the Skip-Gram model, the task of the simple neural network is: given an input word in a sentence, the network predicts how likely each word in the vocabulary is to be a nearby word of that input word. The training examples fed to the neural network are word pairs consisting of the input word and its nearby words.

For instance, consider the sentence “The quick brown fox jumps over the lazy dog.” and a window size of 2. The training examples are then pairs such as (quick, the), (quick, brown), and (quick, fox). In order for these examples to be fed to the neural network, we need to represent the words in some numerical form. We use one-hot vectors, in which the position of the input word is “1” and every other position is “0”. So the inputs to the neural network are simply one-hot vectors, and the output is also a vector with the dimension of the one-hot vector, containing, for each word in the vocabulary, the probability that a randomly chosen nearby word is that vocabulary word.
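Here is a small sketch (plain Python, not gensim's internal implementation) that generates such (input word, nearby word) pairs for a window size of 2:

# Generate Skip-Gram style (input word, context word) training pairs.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:6])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'),
#  ('quick', 'brown'), ('quick', 'fox'), ('brown', 'the')]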

Now let us take a look at the architecture of the neural network. For instance, assume we use a vocabulary of size V and a hidden layer of size N; the following diagram shows the network's architecture:

Skip-Gram Model Working

Continuous Bag of Words Model:

The Continuous Bag-of-Words (CBOW) model is just the opposite of Skip-Gram. For the CBOW model, the task of the simple neural network is: given the context words surrounding a position in a sentence, the network predicts how likely each word in the vocabulary is to be the word at that position.

In the Continuous Bag-of-Words model, we attempt to predict a word using its surrounding words (context words). The input to the model is the one-hot encoded vectors of the context words within the window size; the window size is a hyperparameter and refers to the number of context words on either side (words occurring before and after the current word) that are used to predict it.

Take “The quick brown fox jumps over the lazy dog.” again. Suppose the word under consideration is ‘lazy’; for a window size of 2, the input vector will have ones at the positions corresponding to the context words ‘over’, ‘the’, and ‘dog’.
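For comparison, a minimal sketch of how the CBOW context for a chosen centre word could be collected (again plain Python, just to illustrate the windowing):

# Collect the CBOW context words for the centre word "lazy" with window size 2.
# The one-hot vectors of these context words would form the input to the CBOW network.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2
centre = sentence.index("lazy")

context = [sentence[j]
           for j in range(max(0, centre - window),
                          min(len(sentence), centre + window + 1))
           if j != centre]
print(context)   # ['over', 'the', 'dog']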

CBOW Model Working

Implementation:

Below I describe the five parameters that are used to define a Word2Vec model:

·size: The dimensionality of the word vectors, i.e., how many components are used to represent each word. For example, take a look at the picture above: the size would be equal to 4 in that example, with each input word represented by a 4-dimensional vector (King, Queen, Woman, Princess). Rule of thumb: if the dataset is small, then the size should be small too; if the dataset is large, then the size can be larger. It is a question of tuning.

·window: The maximum distance between the target word and its neighboring word. For example, take the phrase “agama is a reptile” with 4 words (suppose that we do not exclude the stop words). If the window size is 2, then the vector of the word “agama” is directly affected by the words “is” and “a”. Rule of thumb: a smaller window should provide terms that are more closely related (of course, the exclusion of stop words should be considered).

·min_count: Ignores all words with a total frequency lower than this. For example, if a word's frequency is extremely low, then that word might be considered unimportant.

·sg: Selects the training algorithm: 1 for Skip-Gram; 0 for CBOW (Continuous Bag of Words).

·workers: The number of worker threads used to train the model.
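Putting the parameters together, here is a hedged sketch of how they might be passed to gensim's Word2Vec constructor (parameter names as in gensim 3.x, which this article uses; the toy corpus is a made-up example, and the actual hotel-reviews model is built further below):

from gensim.models import word2vec

# Hypothetical toy corpus: a list of tokenised sentences.
toy_corpus = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "lazy", "dog", "sleeps"],
]

toy_model = word2vec.Word2Vec(
    toy_corpus,
    size=10,       # dimensionality of the word vectors
    window=2,      # maximum distance between target and context word
    min_count=1,   # ignore words rarer than this
    sg=1,          # 1 = Skip-Gram, 0 = CBOW
    workers=2,     # number of worker threads
)
print(toy_model.wv["fox"])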

Building the model:

I used the hotel-reviews dataset from the Kaggle repository. Click here for the dataset.

Steps-

  1. Clean the data
  2. Build a corpus
  3. Train a Word2Vec Model
  4. Visualize t-SNE representations of the most common words
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import re
import nltk
import gensim
from gensim.models import word2vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline
nltk.download('stopwords')

Loading the hotel-reviews dataset into data and viewing its top 5 rows.

data = pd.read_csv('/content/hotel-reviews.csv', sep=',', encoding='utf-8', error_bad_lines=False)
data.head()
Top 5 Rows of the dataset

Viewing the Columns of the dataset

data.columns
5 columns of the data

To remove all the stop words

STOP_WORDS = nltk.corpus.stopwords.words()
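Note that calling stopwords.words() with no argument returns the combined stop-word lists for every language shipped with NLTK. If only English reviews are expected, an optional variant (not what the original code does) is to restrict the list to English, which is faster and avoids filtering out words that merely happen to be stop words in another language:

# Optional variant: use only the English stop-word list.
STOP_WORDS = set(nltk.corpus.stopwords.words("english"))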

Define a clean_sentence function to clean each description in the dataset:

def clean_sentence(val):
    "remove chars that are not letters or numbers, downcase, then remove stop words"
    regex = re.compile(r'([^\s\w]|_)+')
    sentence = regex.sub('', val).lower()
    sentence = sentence.split(" ")
    for word in list(sentence):
        if word in STOP_WORDS:
            sentence.remove(word)
    sentence = " ".join(sentence)
    return sentence
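A quick, hypothetical check of the function (the exact output depends on the stop-word list loaded above):

# Punctuation is stripped, text is lower-cased, and stop words are removed.
print(clean_sentence("The room was AMAZING, and the staff were friendly!"))
# expected output along the lines of: "room amazing staff friendly"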

Drop NaNs, then apply the clean_sentence function to the Description column:

def clean_dataframe(data):
    "drop nans, then apply 'clean_sentence' function to Description"
    data = data.dropna(how="any")
    for col in ['Description']:
        data[col] = data[col].apply(clean_sentence)
    return data

Clean the data:

data = clean_dataframe(data)
data.head(5)
Clean Data- Description

Building the corpus from the dataset: create a list of lists containing the words from each sentence.

def build_corpus(data):
    "Creates a list of lists containing words from each sentence"
    corpus = []
    for col in ['Description']:
        for sentence in data[col].iteritems():
            word_list = sentence[1].split(" ")
            corpus.append(word_list)
    return corpus

View the built corpus:

corpus = build_corpus(data)
corpus[0:10]
Corpus from the dataset

Importing word2vec from gensim, training the model on the corpus, and looking up the word vector of a word:

model = word2vec.Word2Vec(corpus, size=100, window=20, min_count=2, workers=4)
model.wv['luxurious']
Word Vector of luxurious

t-SNE: t-Distributed Stochastic Neighbor Embedding:

t-Distributed Stochastic Neighbor Embedding is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. It maps multi-dimensional data down to two or three dimensions suitable for human observation.

How does t-SNE work?

Here is the intuition of what t-SNE does and how it works.

Suppose you have a 50-dimensional dataset; it is practically impossible for us to visualize it and get a sense of it directly. We have to convert that 50-D dataset into something we can visualize or play around with. This is where t-SNE comes into the picture: it converts the higher-dimensional data into lower-dimensional data through the following steps:

  1. It measures the similarity between two data points, and it does this for every pair. Similar data points get a high similarity value and dissimilar data points get a low one.
  2. Then it converts those similarity distances into probabilities (joint probabilities) according to a normal distribution.
  3. As said in the first point, it does this similarity check for every point, so it ends up with a similarity matrix `S1` over all points. This is all the calculation it does for the data points that lie in the higher-dimensional space.
  4. Now, t-SNE arranges all of the data points randomly in the required lower-dimensional space (let's suppose 2-D).
  5. It then does the same calculation for the lower-dimensional data points as it did for the higher-dimensional ones, calculating similarity distances, but with one major difference: it assigns probabilities according to a t-distribution instead of a normal distribution, which is why it is called t-SNE and not simply SNE.
  6. Now we also have a similarity matrix for the lower-dimensional data points. Let's call it `S2`.
  7. What t-SNE does next is compare matrices `S1` and `S2` and try to make the difference between them as small as possible by doing some complex mathematics (minimizing the Kullback-Leibler divergence between them).
  8. In the end, we have lower-dimensional data points that try to capture even complex relationships at which PCA fails.
  9. So, on a very high level, this is how t-SNE works; a simplified sketch of the similarity computation in steps 1-3 follows this list.
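Here is a simplified NumPy sketch of steps 1-3: turning pairwise distances into Gaussian similarities in the high-dimensional space. Real t-SNE additionally tunes a per-point bandwidth to match the chosen perplexity and uses a Student-t kernel in the low-dimensional space; the tsne_plot function below relies on scikit-learn's full implementation instead.

import numpy as np

def gaussian_similarities(X, sigma=1.0):
    "Toy version of steps 1-3: a joint-probability-like similarity matrix S1 (fixed bandwidth for simplicity)."
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)   # a point is not its own neighbour
    return P / P.sum()         # normalise so the entries sum to 1

X = np.random.rand(5, 50)      # 5 points in a 50-dimensional space
print(gaussian_similarities(X).round(3))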
def tsne_plot(model):
    "Creates a TSNE model and plots it"
    labels = []
    tokens = []
    for word in model.wv.vocab:
        tokens.append(model.wv[word])
        labels.append(word)
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)
    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
    plt.figure(figsize=(16, 16))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

tsne_plot(model)

Now, let's look at a more selective model:

# A more selective model
model1 = word2vec.Word2Vec(corpus, size=100, window=20, min_count=3, workers=4)
tsne_plot(model1)
Selective plot for the dataset

Finding the words most similar to a target word:

model.wv.most_similar('walking')
Words similar to Walking
model.wv.most_similar('pretty')
Words similar to Pretty

Further improvements:

Training word2vec is a very computationally expensive process. With millions of words, training may take a lot of time. Two methods to counter this are negative sampling and hierarchical softmax. A good link to understand both can be found here.
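In gensim these options are exposed as constructor arguments, so a sketch of enabling each one on the same corpus might look like this (gensim 3.x parameter names; the values are illustrative rather than tuned):

# Negative sampling: hs=0 (the default) with negative > 0 noise words per positive example.
model_ns = word2vec.Word2Vec(corpus, size=100, window=20, min_count=2, workers=4, hs=0, negative=10)

# Hierarchical softmax: hs=1 and negative=0.
model_hs = word2vec.Word2Vec(corpus, size=100, window=20, min_count=2, workers=4, hs=1, negative=0)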

Hope this helps :)

Follow if you like my posts.

For more help, check my GitHub: https://github.com/Afaf-Athar/Word2Vec

Additional resources I found useful:
1. https://www.kaggle.com/harmanpreet93/train-word2vec-on-hotel-reviews-dataset
2. https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1
3. https://github.com/nltk/nltk/blob/develop/nltk/test/gensim.doctest
4. Kullback-Leibler divergence: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
5. Good hyperparameter Information: https://distill.pub/2016/misread-tsne/
6. L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605, 2008.

Please leave comments for any clarifications or questions.

Happy learning 😃
