What is GloVe?
GloVe is an unsupervised learning algorithm, developed at Stanford, for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
A classic example of such linear substructure is the analogy king − man + woman ≈ queen: the offset between related word pairs stays roughly constant.
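A toy illustration of this kind of vector arithmetic (the 2-D vectors below are made up for illustration, not real GloVe embeddings):

```python
import numpy as np

# made-up 2-D "embeddings", chosen so the gender offset is constant
vectors = {
    "man":   np.array([2.0, 0.0]),
    "woman": np.array([2.0, 1.0]),
    "king":  np.array([5.0, 0.0]),
    "queen": np.array([5.0, 1.0]),
}

def nearest(query, exclude):
    # rank candidate words by cosine similarity to the query vector
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], query))

analogy = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(analogy, exclude={"king", "man", "woman"}))  # -> queen
```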
GloVe: Global Vectors for Word Representation (nlp.stanford.edu)
In this blog I am going to generate word vectors for Amazon food reviews using GloVe and plot word clouds for the word vectors.
Install the glove library using pip:
pip install glove_python
Loading the data:
# using the SQLite table to read the data
import sqlite3
import pandas as pd

con = sqlite3.connect('database.sqlite')
# sorting the data by timestamp and keeping only positive and negative reviews,
# i.e. not taking into consideration reviews with Score = 3
# (the SELECT clause was truncated in the original; SELECT * assumed)
timestamp_data = pd.read_sql_query("""
SELECT * FROM Reviews WHERE Score != 3
ORDER BY Time
""", con)
# Give reviews with Score > 3 a rating of 1 (positive), and reviews with Score < 3 a rating of 0 (negative).
def partition(x):
    if x < 3:
        return 0
    return 1

# mapping raw scores to binary positive/negative labels
actualScore = timestamp_data['Score']
positiveNegative = actualScore.map(partition)
timestamp_data['Score'] = positiveNegative
I used 250,000 points out of the 568,454 points (reviews) in the Amazon dataset.
# creating a subset dataframe of 250,000 points
# (assuming the first 250,000 rows; the exact subsetting code was not shown)
sub_data = timestamp_data[:250000]
Pre-processing Data and Storing in CSV File:
I pre-processed the data, i.e. removed stop words and HTML tags and converted all characters to lower case.
#importing necessary libraries to clean the data
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
# loading stop words from nltk library
stop_words = set(stopwords.words('english'))
# BeautifulSoup is used to remove html tags from the text
from bs4 import BeautifulSoup
import re # For regular expressions
def nlp_preprocessing(total_text, index, column):
    if type(total_text) is not int:
        string = ""
        # 1. removing html tags
        total_text = BeautifulSoup(total_text, 'html.parser').get_text()
        # 2. replacing every special character with a space
        total_text = re.sub(r'[^a-zA-Z0-9\n]', ' ', total_text)
        # 3. replacing multiple spaces with a single space
        total_text = re.sub(r'\s+', ' ', total_text)
        # 4. converting all characters to lower case
        total_text = total_text.lower()
        for word in total_text.split():
            # retain the word only if it is not a stop word
            if word not in stop_words:
                string += word + " "
        sub_data[column][index] = string
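As a quick sanity check, here is the same sequence of cleaning steps applied to one made-up review string (a standalone sketch: a regex stands in for BeautifulSoup and a tiny hand-picked set stands in for NLTK's stop word list):

```python
import re

# one made-up review string
sample = "<br />This is SO tasty!! Best   snack ever :)"
demo_stop_words = {"this", "is", "so", "ever"}  # tiny stand-in for NLTK's stop words

text = re.sub(r"<[^>]+>", " ", sample)       # strip html tags (regex stand-in for BeautifulSoup)
text = re.sub(r"[^a-zA-Z0-9\n]", " ", text)  # replace every special char with a space
text = re.sub(r"\s+", " ", text)             # collapse multiple spaces into one
text = text.lower()                          # lower-case everything
cleaned = " ".join(w for w in text.split() if w not in demo_stop_words)
print(cleaned)  # -> tasty best snack
```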
Cleaning the data:
# text processing stage
import time

start_time = time.time()
for index, row in sub_data.iterrows():
    nlp_preprocessing(row['Text'], index, 'Text')
print('Time taken for preprocessing the text :', time.time() - start_time, 'seconds')
Time taken for preprocessing the text : 19560.216840155324 seconds
Storing cleaned data in CSV file:
I stored the cleaned data in a separate CSV file so that I can reuse it later and save time on preprocessing.
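The save/reload step is a plain pandas round trip; a minimal sketch (the file name, the toy data, and index=False are my assumptions, not necessarily what the original notebook used):

```python
import pandas as pd

# toy stand-in for the cleaned dataframe
sub_data = pd.DataFrame({"Text": ["tasty best snack", "stale chips"],
                         "Score": [1, 0]})

# save the cleaned reviews so preprocessing never has to be repeated
sub_data.to_csv("cleaned_reviews.csv", index=False)

# later, or in another session: reload and continue from here
sub_data = pd.read_csv("cleaned_reviews.csv")
print(sub_data.shape)  # -> (2, 2)
```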
Loading the cleaned data into dataframe:
We need to convert our reviews (the input to GloVe) into lists of words. The following code converts each review into a list of lowercase words.
# converting reviews into lists of words, i.e. for each review a list of words is created
list_of_sent = []  # accumulator name assumed; the original lines were truncated
for sent in sub_data['Text'].values:
    filtered_sentence = []
    for w in sent.split():
        for cleaned_words in w.split():
            filtered_sentence.append(cleaned_words.lower())
    list_of_sent.append(filtered_sentence)
Building the GloVe Model:
First we need to generate co-occurrence matrix using the reviews.
# using Corpus to construct the co-occurrence matrix
from glove import Corpus

corpus = Corpus()
# window controls how many neighbouring words count as co-occurring (10 assumed; the original call was not shown)
corpus.fit(list_of_sent, window=10)
For a better understanding of the co-occurrence matrix, please go through this Vector representations blog.
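To make the idea concrete, here is a tiny hand-rolled co-occurrence count in pure Python with a symmetric window of 2 (glove's Corpus does the same bookkeeping in sparse-matrix form):

```python
from collections import defaultdict

sentences = [["the", "pizza", "was", "tasty"],
             ["the", "pizza", "was", "spicy"]]

window = 2
cooc = defaultdict(int)
for sent in sentences:
    for i, word in enumerate(sent):
        # count every word within `window` positions of `word`
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[(word, sent[j])] += 1

print(cooc[("pizza", "was")])    # -> 2  ("was" sits next to "pizza" in both sentences)
print(cooc[("tasty", "spicy")])  # -> 0  (they never share a window)
```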
Next we need to fit our model using the co-occurrence matrix.
# creating a Glove object which will use the matrix created above to create embeddings
# we can set the learning rate (it uses gradient descent) and the number of components
from glove import Glove

glove = Glove(no_components=5, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
# After training, the Glove object holds the word vectors for the reviews we provided,
# but the word-to-index dictionary still resides in the corpus object.
# We need to add the dictionary to the Glove object to make it complete:
glove.add_dictionary(corpus.dictionary)
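Conceptually, the dictionary is just a word-to-row-index map over the embedding matrix. A numpy stand-in for that structure (random, untrained values; the names here are illustrative, not glove_python internals):

```python
import numpy as np

# stand-ins for what the trained model holds
dictionary = {"yummy": 0, "tasty": 1, "spicy": 2}  # word -> row index
word_vectors = np.random.rand(len(dictionary), 5)  # one 5-dim row per word

def vector_for(word):
    # the dictionary turns a word into a row of the embedding matrix
    return word_vectors[dictionary[word]]

print(vector_for("tasty").shape)  # -> (5,)
```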
Plotting the word cloud cluster:
Now that we have constructed word vectors, let's plot word clouds for a few words to see which words are most similar.
Install wordcloud library:
pip install wordcloud
Import wordcloud to your python notebook:
#importing word cloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
For reusability, let's define a function called cloud() which plots the word cloud for a given word whenever we call it.
# function to print the word cloud for different words
import matplotlib.pyplot as plt

def cloud(word):
    # word -> similarity scores for the most similar words (body reconstructed; only the WordCloud call survived)
    c = dict(glove.most_similar(word, number=10))
    wordcloud = WordCloud(max_font_size=50, max_words=100, collocations=False).generate_from_frequencies(c)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()
Now let's look at word clouds for a few words.
Since our dataset contains reviews about food items, I plotted word clouds for the words we use to describe food, to find the words that are similar to a given word.
For wordcloud documentation, please refer to this GitHub repo.
Plotting the word cloud for yummy:
These are the words similar to yummy, obtained from the constructed word vectors.
# word cloud for "yummy"
cloud('yummy')
Word cloud for tasty:
As with yummy, I first found the words similar to tasty using the function most_similar(), considering the top 10 most similar words.
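Under the hood, most_similar() is nothing exotic: it ranks every other word by cosine similarity to the query word's vector. A self-contained sketch over invented toy vectors (not real trained embeddings):

```python
import numpy as np

# invented toy vectors, purely for illustration
vectors = {
    "yummy": np.array([0.9, 0.1, 0.0]),
    "tasty": np.array([0.8, 0.2, 0.1]),
    "stale": np.array([-0.7, 0.1, 0.5]),
}

def most_similar(word, number=2):
    # cosine similarity between the query vector and every other word's vector
    q = vectors[word]
    def cos(v):
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(((w, cos(v)) for w, v in vectors.items() if w != word),
                    key=lambda p: p[1], reverse=True)
    return ranked[:number]

print(most_similar("yummy"))  # "tasty" ranks above "stale"
```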
Similarly, I plotted word clouds for a few more words: spicy, biryani (an Indian dish), healthy, and pizza.
These are the word clouds I plotted for some of the words with the help of GloVe. I am linking my GitHub code here. Please feel free to comment and help me improve this blog; I am a beginner in machine learning.
These are the sources I used while working on GloVe: