What is GloVe?

GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations of words. It was developed at Stanford. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

Examples of linear substructures are word pairs such as man/woman and king/queen, whose vector differences are roughly equal.
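To make this concrete, the sketch below uses made-up 2-dimensional vectors (not real GloVe output) in which the offset from man to king equals the offset from woman to queen, so that king - man + woman lands on queen. The vectors, the extra word "apple" and the helper names are all assumptions chosen for illustration.

```python
# Toy 2-d vectors, invented for illustration (not real GloVe embeddings).
# Axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.5, 0.5],
}

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def nearest(target, exclude):
    """Return the word whose vector has the smallest squared distance to target."""
    def dist(word):
        return sum((x - y) ** 2 for x, y in zip(vectors[word], target))
    return min((w for w in vectors if w not in exclude), key=dist)

# king - man + woman should land near queen
analogy = add(subtract(vectors["king"], vectors["man"]), vectors["woman"])
result = nearest(analogy, exclude={"king", "man", "woman"})
print(result)
```

With real GloVe vectors the match is only approximate, which is why the nearest neighbour is looked up instead of testing for equality.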

In this blog I am going to generate word vectors on Amazon food reviews using GloVe and plot the word cloud for the word vectors.

Installing Glove-python:

Install the glove library using the pip command:

pip install glove_python

Loading the data:

# using the SQLite table to read data.
import sqlite3
import pandas as pd

con = sqlite3.connect('database.sqlite')
# filtering only positive and negative reviews, i.e.
# not taking into consideration those reviews with Score=3
# (the SELECT line was missing from this listing; SELECT * is assumed here)
timestamp_data = pd.read_sql_query("""
SELECT *
FROM Reviews WHERE Score != 3
""", con)

# Give reviews with Score>3 a 1 rating, and reviews with a score<3 a 0 rating.
def partition(x):
    if x < 3:
        return 0
    return 1

# changing reviews with score less than 3 to negative (0) and the rest to positive (1)
actualScore = timestamp_data['Score']
positiveNegative = actualScore.map(partition)
timestamp_data['Score'] = positiveNegative

I used 250,000 reviews out of the 568,454 reviews in the Amazon data set.

#creating subset dataframe of 250000 points
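One way to create the subset is sketched below. It assumes the subset is simply the first rows of the dataframe; the helper name `take_subset` and the use of `head` are assumptions, and a random `sample` would work just as well. A tiny stand-in dataframe replaces the real reviews table so the snippet runs on its own.

```python
import pandas as pd

def take_subset(df, n):
    """Return the first n rows with a fresh 0..n-1 index."""
    return df.head(n).reset_index(drop=True)

# Tiny stand-in for the reviews table, only to show the call.
toy = pd.DataFrame({"Score": [1, 0, 1, 1], "Text": ["a", "b", "c", "d"]})
sub_data = take_subset(toy, 2)
print(sub_data.shape)  # (2, 2)
```

On the real data the call would be `take_subset(timestamp_data, 250000)`.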

Pre-processing Data and Storing in CSV File:

I pre-processed the data, i.e. removed stop words and HTML tags, replaced special characters with spaces and converted all characters to lower case.

#importing necessary libraries to clean the data
import re  # for regular expressions
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import time
import warnings
# BeautifulSoup is used to remove html tags from the text
from bs4 import BeautifulSoup

# loading stop words from nltk library
stop_words = set(stopwords.words('english'))

def nlp_preprocessing(total_text, index, column):
    if type(total_text) is not int:
        cleaned = ""
        # 1. Removing html tags
        total_text = BeautifulSoup(total_text, 'html.parser').get_text()
        # replace every special char with space
        total_text = re.sub(r'[^a-zA-Z0-9\n]', ' ', total_text)
        # replace multiple spaces with single space
        total_text = re.sub(r'\s+', ' ', total_text)
        # converting all the chars into lower-case.
        total_text = total_text.lower()

        for word in total_text.split():
            # if the word is not a stop word then retain that word from the data
            if word not in stop_words:
                cleaned += word + " "

        # .loc writes into sub_data directly and avoids chained assignment
        sub_data.loc[index, column] = cleaned

Cleaning the data:

#text processing stage.
start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
for index, row in sub_data.iterrows():
    nlp_preprocessing(row['Text'], index, 'Text')
print('Time took for preprocessing the text :', time.perf_counter() - start_time, "seconds")
Time took for preprocessing the text : 19560.216840155324 seconds
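The row-by-row loop above took about 5.4 hours (19,560 seconds). A variant that is likely faster, sketched here under assumptions, is to write the cleaning step as a pure function and hand it to `Series.apply`, so pandas does not have to write into the dataframe once per row. The function name `clean_text`, the small stop-word set and the regex-based tag removal are stand-ins so the sketch runs without NLTK data or BeautifulSoup.

```python
import re
import pandas as pd

# Small stand-in stop-word list so the sketch runs without NLTK data;
# in the post the set comes from stopwords.words('english').
STOP_WORDS = {"the", "is", "a", "this", "and"}

def clean_text(text):
    """Strip tags, drop special characters, lower-case, remove stop words."""
    text = re.sub(r"<[^>]+>", " ", str(text))    # crude tag removal (stand-in for BeautifulSoup)
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", text)  # special characters -> space
    words = text.lower().split()                 # split() also collapses repeated whitespace
    return " ".join(w for w in words if w not in STOP_WORDS)

reviews = pd.DataFrame({"Text": ["This is a <b>GREAT</b> snack!",
                                 "Tasty, and the price is fair."]})
reviews["Text"] = reviews["Text"].apply(clean_text)
print(reviews["Text"].tolist())
```

I have not timed this variant on the full 250,000 reviews, so treat the speed-up as an expectation, not a measurement.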

Storing cleaned data in CSV file:

I stored the cleaned data in a separate CSV file so that I can reuse it later and save time on preprocessing.


Loading the cleaned data into dataframe:
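Saving and reloading can be sketched as a round trip with `to_csv` and `read_csv`. The file name `cleaned_reviews.csv` and the toy dataframe are assumptions. One detail worth handling: a review that became empty after cleaning is read back as NaN, so it is filled with an empty string to keep every entry a string.

```python
import os
import tempfile
import pandas as pd

cleaned = pd.DataFrame({"Score": [1, 0], "Text": ["great snack", ""]})

# Write to a temporary folder here; in the post this would be a fixed file name.
path = os.path.join(tempfile.mkdtemp(), "cleaned_reviews.csv")  # hypothetical file name
cleaned.to_csv(path, index=False)

# An empty review is read back as NaN, so fill it to keep every entry a string.
sub_data = pd.read_csv(path)
sub_data["Text"] = sub_data["Text"].fillna("")
print(sub_data["Text"].tolist())
```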


Preparing input:

We need to convert our reviews (our input to GloVe) into a list of word lists. The following code splits each review into a list of words and converts uppercase letters to lowercase.

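#converting reviews into lists of words, i.e. for each review a list of words will be created
list_of_sent = []  # variable name assumed
for sent in sub_data['Text'].values:
    # split the review on whitespace and lowercase each word
    list_of_sent.append([w.lower() for w in sent.split()])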

Building the GloVe Model:

First we need to generate the co-occurrence matrix from the reviews.

#Using Corpus to construct co-occurrence matrix
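In glove-python the matrix is built by the Corpus class. The plain-Python sketch below is not that implementation; it only illustrates what such a matrix contains. It assumes a symmetric window and the 1/distance weighting described in the GloVe paper, and the function name `cooccurrence` and the toy reviews are made up for the example.

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Count symmetric co-occurrences; a pair d words apart adds 1/d."""
    counts = defaultdict(float)
    for words in sentences:
        for i, left in enumerate(words):
            # look only to the right, then store the pair in sorted order
            for j in range(i + 1, min(i + window + 1, len(words))):
                right = words[j]
                if left == right:
                    continue
                pair = tuple(sorted((left, right)))
                counts[pair] += 1.0 / (j - i)
    return dict(counts)

toy_reviews = [["tasty", "spicy", "pizza"], ["spicy", "pizza"]]
matrix = cooccurrence(toy_reviews, window=2)
print(matrix)
```

Here "spicy" and "pizza" are neighbours in both reviews, so their entry is 2.0, while "tasty" and "pizza" are two words apart once and get 0.5.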

For a better understanding of the co-occurrence matrix, please go through this Vector representations blog.

Next we need to fit our model using the co-occurrence matrix.

#creating a Glove object which will use the matrix created in the above lines to create embeddings
#We can set the number of components and, since it is trained with gradient descent, the learning rate
from glove import Glove

glove = Glove(no_components=5, learning_rate=0.05)

glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
#After training, the glove object has the word vectors for the lines we have provided, but the dictionary still resides in the corpus object.
#We need to add the dictionary to the glove object to make it complete.
glove.add_dictionary(corpus.dictionary)
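For reference, the objective that GloVe training minimises, as given in the GloVe paper, is a weighted least-squares fit to the logarithm of the co-occurrence counts. Here $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors (their dimension is what no_components sets), $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size. The paper uses $x_{\max} = 100$ and $\alpha = 3/4$; I am not asserting that the library uses exactly these values.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

The weighting function $f$ keeps very frequent pairs from dominating the loss and gives zero weight to pairs that never co-occur.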

Plotting the word cloud cluster:

Now that we have constructed the word vectors, let's plot word clouds for a few words to see which words are most similar to them.

Install wordcloud library:

pip install wordcloud

Import wordcloud to your python notebook:

#importing word cloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

For reusability, let's define a function called cloud() which plots the word cloud for the words similar to a given word whenever we call it.

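#function to print wordcloud for different words
import matplotlib.pyplot as plt

def cloud(c):
    # c maps each word to a weight (here, its similarity score)
    wordcloud = WordCloud(max_font_size=50, max_words=100, collocations=False).generate_from_frequencies(c)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()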

Now let's look at the word clouds for a few words.

As our dataset contains reviews about food items, I plotted word clouds for words we use to describe food, to find the words that are similar to a given word.

For wordcloud documentation, please refer to this github blog.

Plotting the word cloud for yummy.

[('frutis', 0.996492443083793),
 ('tastes', 0.9944467571247136),
 ('delicious', 0.9935245182171553),
 ('vinger', 0.9925299089629726),
 ('rather', 0.9910518457238957),
 ('bit', 0.9905045859027032),
 ('melancholy', 0.9889789964338971),
 ('importantly', 0.9885956397653726),
 ('little', 0.9873806119969238)]

These are the words most similar to yummy according to the word vectors we constructed.
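The scores in the list are similarity values between word vectors. Assuming they are cosine similarities (which is my understanding of what most_similar() computes), the ranking can be sketched in plain Python as follows. The vectors and the helper name `rank_by_cosine` are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(word, vectors, number=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:number]

# Made-up 3-d vectors, only for illustration.
toy_vectors = {
    "yummy":     [1.0, 0.9, 0.1],
    "delicious": [0.9, 1.0, 0.1],
    "tasty":     [0.8, 0.7, 0.3],
    "battery":   [0.0, 0.1, 1.0],
}
ranked = rank_by_cosine("yummy", toy_vectors)
print([w for w, _ in ranked])
```

Note that with only 5 components, as used in this post, many unrelated words end up with scores close to 1, which may explain entries such as "melancholy" in the list.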

Word Cloud:

#wordcloud for Yummy
Word cloud for yummy, i.e. the words closest to yummy in the vector space
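cloud() passes its argument to generate_from_frequencies(), which expects a mapping from word to weight, while the similarity output is a list of (word, score) pairs. A minimal sketch of the glue step follows; `to_frequencies` is a hypothetical helper name, and the call to cloud() is left as a comment so the snippet runs without the plotting libraries.

```python
def to_frequencies(pairs):
    """Turn [(word, score), ...] into the {word: score} dict that cloud() expects."""
    return {word: score for word, score in pairs}

# First three pairs from the yummy list above.
similar = [("frutis", 0.996492443083793),
           ("tastes", 0.9944467571247136),
           ("delicious", 0.9935245182171553)]
frequencies = to_frequencies(similar)
print(frequencies["delicious"])

# In the notebook this dict would then be passed on:
# cloud(frequencies)
```

Because the scores are all very close to each other, the words in the resulting cloud are drawn at nearly the same size.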

Word cloud for tasty:

As with yummy, I first found the words similar to tasty by using the function most_similar(). I am considering the top 10 words that are similar to tasty.

Word Cloud for tasty

Similarly, I have plotted word clouds for a few more words: spicy, biryani (an Indian dish), healthy and pizza.

Word cloud for spicy
Word cloud for biryani
Word Cloud for pizza
Word Cloud for healthy


These are the word clouds I plotted for some of the words with the help of GloVe. I am linking my github code here. Please feel free to comment and help me improve my blog. I am a beginner in Machine Learning.

Happy Hacking!!


