Word Embedding using Python

Jeril Kuriakose
Published in Analytics Vidhya · Feb 18, 2020

In this post we will see how to generate word embeddings and plot them on a 2D chart, with each point labelled by its corresponding word.

Dependencies

- NLTK
- Sklearn
- Gensim
- Plotly
- Pandas
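
All of these can be installed with pip; note that Sklearn ships as the scikit-learn package (assuming a standard Python environment):

pip install nltk scikit-learn gensim plotly pandas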

First we need a paragraph or some text to find the embeddings for. I took the paragraph below from here for this post.

paragraph = '''Jupiter is the fifth planet from the Sun and the largest in the Solar System. 
It is a gas giant with a mass one-thousandth that of the Sun,
but two-and-a-half times that of all the other planets in the Solar System combined.
Jupiter is one of the brightest objects visible to the naked eye in the night sky,
and has been known to ancient civilizations since before recorded history.
It is named after the Roman god Jupiter. When viewed from Earth,
Jupiter can be bright enough for its reflected light to cast shadows,
and is on average the third-brightest natural object in the night sky after the Moon and Venus.'''

Next we need to tokenize the text; we use the NLTK library for this:

import nltk
# download the tokenizer models (only needed once)
nltk.download('punkt')
# tokenizing the paragraph
sent_text = nltk.sent_tokenize(paragraph)
word_text = [nltk.word_tokenize(sent) for sent in sent_text]
print(word_text)

Tokenizing gives us a list of lists (one inner list of tokens per sentence), which looks like the following:

[['Jupiter', 'is', 'the', 'fifth', 'planet', 'from', 'the', 'Sun', 'and', 'the', 'largest', 'in', 'the', 'Solar', 'System', '.'], ['It', 'is', 'a', 'gas', 'giant', 'with', 'a', 'mass', 'one-thousandth', 'that', 'of', 'the', 'Sun', ',', 'but', 'two-and-a-half', 'times', 'that', 'of', 'all', 'the', 'other', 'planets', 'in', 'the', 'Solar', 'System', 'combined', '.'], ['Jupiter', 'is', 'one', 'of', 'the', 'brightest', 'objects', 'visible', 'to', 'the', 'naked', 'eye', 'in', 'the', 'night', 'sky', ',', 'and', 'has', 'been', 'known', 'to', 'ancient', 'civilizations', 'since', 'before', 'recorded', 'history', '.'], ['It', 'is', 'named', 'after', 'the', 'Roman', 'god', 'Jupiter', '.'], ['When', 'viewed', 'from', 'Earth', ',', 'Jupiter', 'can', 'be', 'bright', 'enough', 'for', 'its', 'reflected', 'light', 'to', 'cast', 'shadows', ',', 'and', 'is', 'on', 'average', 'the', 'third-brightest', 'natural', 'object', 'in', 'the', 'night', 'sky', 'after', 'the', 'Moon', 'and', 'Venus', '.']]

Now we will use the Gensim package to train a Word2Vec model and get the word embeddings.

from gensim.models import Word2Vec
# train the model to get the embeddings
# min_count=1 keeps every word, even those that appear only once
model = Word2Vec(word_text, min_count=1)
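
Before going further, it can help to sanity-check the trained model. A minimal sketch (the numbers vary between runs, and on a corpus this small the nearest neighbours are not very meaningful):

# each word now has a 100-dimensional vector (Gensim's default size)
print(model.wv['Jupiter'].shape)
# nearest neighbours by cosine similarity
print(model.wv.most_similar('Jupiter', topn=3))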

To plot the word embeddings we first need to reduce the multi-dimensional vectors to two dimensions. We use PCA from Sklearn for this:

from sklearn.decomposition import PCA
# getting the embedding vectors
X = model.wv[model.wv.vocab]
# dimensionality reduction using PCA
pca = PCA(n_components=2)
# running the transformation
result = pca.fit_transform(X)
# getting the corresponding words
words = list(model.wv.vocab)
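
Since PCA throws away most of the dimensions, it is worth checking how much of the variance the two components actually retain. A quick sketch (the exact ratios vary from run to run):

# fraction of the total variance kept by each of the two components
print(pca.explained_variance_ratio_)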

We then need a little processing to convert the PCA results into a dataframe:

import pandas as pd
# creating a dataframe from the results
df = pd.DataFrame(result, columns=list('XY'))
# adding a column for the corresponding words
df['Words'] = words
# converting the lower-case text to title case
df['Words'] = df['Words'].str.title()
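
A quick check that the dataframe has the expected shape, one row per vocabulary word with X, Y and Words columns:

print(df.shape)
print(df.head())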

After getting the required arrays we can plot the chart using Plotly:

import plotly.express as px
# plotting a scatter plot (PCA coordinates can be negative, so no log scale)
fig = px.scatter(df, x="X", y="Y", text="Words")
# adjusting the text position
fig.update_traces(textposition='top center')
# setting the height and title
fig.update_layout(
    height=600,
    title_text='Word embedding chart'
)
# displaying the figure
fig.show()
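
If you want to keep the chart, a Plotly figure can also be saved as a standalone HTML file (the filename here is just an example):

# saving the interactive chart to disk
fig.write_html('word_embedding_chart.html')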

Now the word embedding chart will look like the following:

[Figure: word embedding chart — scatter plot of the PCA-reduced word vectors]

The entire code is as follows:

import nltk
import pandas as pd
import plotly.express as px
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

# download the tokenizer models (only needed once)
nltk.download('punkt')

paragraph = '''Jupiter is the fifth planet from the Sun and the largest in the Solar System.
It is a gas giant with a mass one-thousandth that of the Sun,
but two-and-a-half times that of all the other planets in the Solar System combined.
Jupiter is one of the brightest objects visible to the naked eye in the night sky,
and has been known to ancient civilizations since before recorded history.
It is named after the Roman god Jupiter. When viewed from Earth,
Jupiter can be bright enough for its reflected light to cast shadows,
and is on average the third-brightest natural object in the night sky after the Moon and Venus.'''

# tokenizing the paragraph
sent_text = nltk.sent_tokenize(paragraph)
word_text = [nltk.word_tokenize(sent) for sent in sent_text]

# train the model to get the embeddings
model = Word2Vec(word_text, min_count=1)

# getting the embedding vectors
X = model.wv[model.wv.vocab]
# dimensionality reduction using PCA
pca = PCA(n_components=2)
# running the transformation
result = pca.fit_transform(X)
# getting the corresponding words
words = list(model.wv.vocab)

# creating a dataframe from the results
df = pd.DataFrame(result, columns=list('XY'))
# adding a column for the corresponding words
df['Words'] = words
# converting the lower-case text to title case
df['Words'] = df['Words'].str.title()

# plotting a scatter plot
fig = px.scatter(df, x="X", y="Y", text="Words")
# adjusting the text position
fig.update_traces(textposition='top center')
# setting the height and title
fig.update_layout(
    height=600,
    title_text='Word embedding chart'
)
# displaying the figure
fig.show()

Happy Coding!!!
