What is GloVe?

GloVe is an unsupervised learning algorithm, developed at Stanford, for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
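Concretely, GloVe fits word vectors so that their dot products match the logarithms of the global co-occurrence counts, by minimizing the weighted least-squares objective from the Stanford GloVe paper:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that down-weights very rare and very frequent co-occurrences.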

A classic example of these linear substructures is the analogy king - man + woman ≈ queen: the vector offsets between related pairs of words are roughly parallel.
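As a toy illustration of this analogy arithmetic (with hand-picked 2-d vectors purely for demonstration, not real GloVe embeddings), plain vector addition and subtraction recovers the analogous word:

```python
# Hand-picked toy 2-d vectors, purely for illustration -- real GloVe
# vectors are learned from co-occurrence statistics, not chosen by hand.
vectors = {
    "king":  (1.0, 1.0),
    "man":   (1.0, 0.0),
    "woman": (0.0, 0.0),
    "queen": (0.0, 1.0),
    "apple": (5.0, 5.0),
}

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    va, vb, vc = vectors[a], vectors[b], vectors[c]
    target = (va[0] - vb[0] + vc[0], va[1] - vb[1] + vc[1])
    # nearest neighbour by squared Euclidean distance, excluding the query words
    return min(
        (w for w in vectors if w not in (a, b, c)),
        key=lambda w: (vectors[w][0] - target[0]) ** 2
                      + (vectors[w][1] - target[1]) ** 2,
    )

print(analogy("king", "man", "woman"))  # queen
```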

In this blog I am going to generate word vectors on Amazon food reviews using GloVe and plot word clouds for the word vectors.

Installing glove-python:

Install the glove-python library using pip:

pip install glove_python

Loading the data:

# using the SQLite table to read the data
import sqlite3
import pandas as pd

con = sqlite3.connect('database.sqlite')
# sorting the data by timestamp and keeping only positive and negative reviews,
# i.e. leaving out reviews with Score = 3
timestamp_data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
ORDER BY Time
""", con)

# Give reviews with Score > 3 a rating of 1, and reviews with Score < 3 a rating of 0.
def partition(x):
    if x < 3:
        return 0
    return 1

# mapping scores below 3 to 0 (negative) and scores above 3 to 1 (positive)
actualScore = timestamp_data['Score']
positiveNegative = actualScore.map(partition)
timestamp_data['Score'] = positiveNegative
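The binarization can be sanity-checked on a few scores (1 and 2 become 0, i.e. negative; 4 and 5 become 1, i.e. positive; score 3 never reaches the function because the SQL query filters it out):

```python
def partition(x):
    """Binarize a review score: below 3 -> 0 (negative), otherwise 1 (positive)."""
    return 0 if x < 3 else 1

# Scores of 3 never reach this function (filtered out by the SQL query above).
print([partition(s) for s in [1, 2, 4, 5]])  # [0, 0, 1, 1]
```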

I used 250,000 of the 568,454 points (reviews) in the Amazon dataset.

# creating a subset dataframe of 250,000 points
sub_data = timestamp_data[0:250000]
sub_data.head()

Pre-processing Data and Storing in CSV File:

I pre-processed the data, i.e. removed stop words and HTML tags and converted all characters to lower case.

# importing the necessary libraries to clean the data
import re
import time
import warnings
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
# BeautifulSoup is used to remove HTML tags from the text
from bs4 import BeautifulSoup

# loading stop words from the NLTK library
stop_words = set(stopwords.words('english'))

def nlp_preprocessing(total_text, index, column):
    if type(total_text) is not int:
        cleaned = ""
        # 1. remove HTML tags
        total_text = BeautifulSoup(total_text, 'html.parser').get_text()
        # replace every special char with a space
        total_text = re.sub('[^a-zA-Z0-9\n]', ' ', total_text)
        # replace multiple spaces with a single space
        total_text = re.sub(r'\s+', ' ', total_text)
        # convert all the chars to lower case
        total_text = total_text.lower()

        for word in total_text.split():
            # retain the word only if it is not a stop word
            if word not in stop_words:
                cleaned += word + " "

        sub_data.at[index, column] = cleaned
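The same cleaning steps can be sketched on a single string, using a tiny hand-rolled stop-word set standing in for NLTK's full English list and a hypothetical `clean_text` helper, so the snippet runs without NLTK or BeautifulSoup:

```python
import re

# A tiny stand-in stop-word set; the post uses the full NLTK English list.
STOP_WORDS = {"the", "is", "a", "and", "this", "of", "i"}

def clean_text(text):
    """Strip tags, drop special characters, lower-case, and remove stop words."""
    text = re.sub(r"<[^>]+>", " ", text)          # crude HTML-tag removal
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", text)   # special chars -> spaces
    text = re.sub(r"\s+", " ", text).lower()      # collapse whitespace, lower-case
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(clean_text("<br>This is a GREAT snack!"))  # great snack
```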

Cleaning the data:

# text processing stage
start_time = time.time()
for index, row in sub_data.iterrows():
    nlp_preprocessing(row['Text'], index, 'Text')
print('Time taken for preprocessing the text:', time.time() - start_time, 'seconds')

Time taken for preprocessing the text: 19560.216840155324 seconds

Storing cleaned data in CSV file:

I stored the cleaned data in a separate CSV file so that I can reuse it in the future and save time on preprocessing.

sub_data.to_csv('cleanedreviews.csv')

Loading the cleaned data into dataframe:

df = pd.read_csv('cleanedreviews.csv')
df.head()

Preparing input:

We need to convert our reviews (the input to GloVe) into lists of words. The following code converts each review into a list of words and converts upper-case letters to lower case.

# converting reviews into lists of words, i.e. a list of words is created for each review
list_of_sent = []
for sent in sub_data['Text'].values:
    filtered_sentence = []
    for w in sent.split():
        # keep only purely alphabetic tokens, lower-cased
        if w.isalpha():
            filtered_sentence.append(w.lower())
    list_of_sent.append(filtered_sentence)
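The loop above can be sketched on a couple of sample reviews (note that a token containing digits, such as "10", is dropped by the `isalpha()` check):

```python
reviews = ["great taste 10", "best chips ever"]

list_of_sent = []
for sent in reviews:
    # keep only purely alphabetic tokens, lower-cased
    list_of_sent.append([w.lower() for w in sent.split() if w.isalpha()])

print(list_of_sent)  # [['great', 'taste'], ['best', 'chips', 'ever']]
```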

Building the GloVe Model:

First we need to generate the co-occurrence matrix from the reviews.

# using Corpus to construct the co-occurrence matrix
from glove import Corpus, Glove

corpus = Corpus()
corpus.fit(list_of_sent, window=5)

For a better understanding of the co-occurrence matrix, please go through this vector representations blog.
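To make the idea concrete, here is a minimal sketch (not glove_python's actual implementation) of counting symmetric co-occurrences within a context window:

```python
from collections import Counter

def cooccurrences(sentences, window=5):
    """Count how often each ordered word pair appears within `window` words."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            # look at every position within the window around position i
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    counts[(word, sent[j])] += 1
    return counts

counts = cooccurrences([["tasty", "and", "yummy", "snack"]], window=2)
print(counts[("tasty", "yummy")])  # 1
```

glove_python additionally weights each pair by the inverse of the distance between the two words; this sketch just counts raw occurrences to show the windowing.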

Next we need to fit our model using the co-occurrence matrix.

# creating a Glove object which will use the matrix created above to create the embeddings;
# we can set the number of components and the learning rate (it uses gradient descent)
glove = Glove(no_components=5, learning_rate=0.05)

glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
# After training, the Glove object has the word vectors for the reviews we provided,
# but the dictionary still resides in the corpus object. We need to add the dictionary
# to the Glove object to make it complete.
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')
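Under the hood, a query like most_similar ranks words by the similarity of their vectors. A minimal cosine-similarity version over a toy embedding dictionary (illustrative vectors only; in practice they come from the trained model) might look like:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy illustrative vectors -- in practice these come from the trained model.
embeddings = {
    "yummy":     [0.9, 0.1],
    "delicious": [0.8, 0.2],
    "stale":     [-0.7, 0.6],
}

def most_similar(word, number=2):
    """Return the `number` words whose vectors are closest to `word`'s vector."""
    others = [w for w in embeddings if w != word]
    ranked = sorted(others,
                    key=lambda w: cosine(embeddings[word], embeddings[w]),
                    reverse=True)
    return ranked[:number]

print(most_similar("yummy", number=1))  # ['delicious']
```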

Plotting the word cloud cluster:

Now that we have constructed the word vectors, let's plot word clouds for a few words to see which words are most similar to them.

Install the wordcloud library:

pip install wordcloud

Import wordcloud in your Python notebook:

# importing word cloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

For reusability, let's define a function called cloud() that plots the word cloud for a given dictionary of word frequencies whenever we call it.

# function to plot a word cloud for different words
import matplotlib.pyplot as plt

def cloud(c):
    wordcloud = WordCloud(max_font_size=50, max_words=100,
                          collocations=False).generate_from_frequencies(c)
    plt.figure()
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

Now let's look at the word clouds for a few words.

Since our dataset contains reviews about food items, I plotted word clouds using words that describe food, to find the words most similar to a given word.

For the wordcloud documentation, please refer to this GitHub repository.

Plotting the word cloud for yummy:

similar = glove.most_similar('yummy', number=10)
similar
[('frutis', 0.996492443083793),
 ('tastes', 0.9944467571247136),
 ('delicious', 0.9935245182171553),
 ('vinger', 0.9925299089629726),
 ('rather', 0.9910518457238957),
 ('bit', 0.9905045859027032),
 ('melancholy', 0.9889789964338971),
 ('importantly', 0.9885956397653726),
 ('little', 0.9873806119969238)]

These are the words similar to yummy, obtained from the constructed word vectors.

Word Cloud:

# word cloud for yummy
words = dict(similar)
cloud(words)
Word cloud for yummy, i.e. the words similar in meaning to yummy

Word cloud for tasty:

As with yummy, I first found the words similar to tasty using the function most_similar(), considering the top 10 words most similar to tasty.

# tasty
words = glove.most_similar('tasty', number=10)
words = dict(words)
cloud(words)
Word Cloud for tasty

Similarly, I plotted word clouds for a few more words: spicy, biryani (an Indian dish), pizza, and healthy.

# spicy
words = glove.most_similar('spicy', number=10)
words = dict(words)
cloud(words)
Word cloud for spicy

# biryani
words = glove.most_similar('biryani', number=10)
words = dict(words)
cloud(words)
Word cloud for biryani

# pizza
words = glove.most_similar('pizza', number=10)
words = dict(words)
cloud(words)
Word cloud for pizza

# healthy
words = glove.most_similar('healthy', number=10)
words = dict(words)
cloud(words)
Word cloud for healthy

Conclusion:

These are the word clouds I plotted for some of the words with the help of GloVe. I am linking my GitHub code here. Please feel free to comment and help me improve this blog, as I am a beginner in machine learning.

Happy Hacking!!

References:

These are the sources I have used to work on GloVe :

  1. https://nlp.stanford.edu/projects/glove/
  2. https://medium.com/@japneet121/word-vectorization-using-glove-76919685ee0b
  3. https://medium.com/ai-society/jkljlj-7d6e699895c4
  4. https://github.com/stanfordnlp/GloVe/tree/master/src
  5. https://pypi.org/project/glove/
  6. https://textminingonline.com/getting-started-with-word2vec-and-glove-in-python
  7. https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
  8. https://www.appliedaicourse.com/