Topic Modeling with Amazon Reviews

Anjali Sunil Khushalani
Analytics Vidhya


What is Topic Modeling?

Topic modeling is a method for discovering the topics that best represent the information in a collection of documents. With this approach, you can uncover hidden patterns, annotate documents, and summarize large sets of documents.

The Data

The dataset we’ll use is Amazon reviews of electronics products. It contains product reviews spanning May 1996 to June 2014 and includes the review text, ratings, and helpfulness votes. For the following approach, we will use only the review text.

Data pre-processing is an important task when dealing with text data. The following steps process the data and then create a bag of words to fit a Latent Dirichlet Allocation (LDA) model.
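The snippets below assume roughly the following imports (a minimal sketch; adjust module paths to your NLTK and Gensim versions, and note that the NLTK stop word list needs a one-time download):

import gzip
import operator

import pandas as pd
import gensim
from gensim import corpora
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# One-time download of the NLTK stop word list
# nltk.download('stopwords')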

Step 1: Extract the data - This is the easiest step; I am using the code provided by Julian McAuley, UCSD, on the webpage: Amazon product data

def parse(path):
    # Stream the gzipped JSON file one review (line) at a time
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)

def getDF(path):
    # Build a DataFrame with one row per review
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF(r'NLP\reviews_Electronics_5.json.gz')
df.head(10)
df.columns

Step 2: Tokenize, remove stop words & lowercase the documents - We use RegexpTokenizer to split the sentences into words, lowercase them, and remove punctuation. We select one document to preview the results.

#Regular expression tokenizer
tokenizer = RegexpTokenizer(r'\w+')
doc_1 = df.reviewText[0]
# Using one review
tokens = tokenizer.tokenize(doc_1.lower())
print('{} characters in string vs {} words in a list'.format(len(doc_1), len(tokens)))
print(tokens[:10])
nltk_stpwd = stopwords.words('english')
print(len(set(nltk_stpwd)))
print(nltk_stpwd[:10])
stopped_tokens = [token for token in tokens if not token in nltk_stpwd]
print(stopped_tokens[:10])

Step 3: Stemming using the Snowball stemmer - Notice how the words change after the Snowball stemmer is applied; to learn more about the Snowball stemmer, check out its website.

sb_stemmer = SnowballStemmer('english')
stemmed_tokens = [sb_stemmer.stem(token) for token in stopped_tokens]
print(stemmed_tokens)
"""
This is how results looked:
'normal', 'receiv', 'review', 'sampl', 'thorough', 'evalu', 'write', 'review', 'within', 'two', 'day', 'one', 'took', 'longer', 'reason', 'took', 'hear', 'differ', 'model', 'versus', 'thebrainwavz', 'ear', 'headphon', 'also', 'impress', 'also', 'go', 'pile', 'album', 'tri', 'understand', 'product', 'descript', 'meant', 'smoother', 'bass', 'final', 'found', 'album', 'allow', 'hear', 'portrait', 'jazz', 'scott', 'lafaro', 'bass', 'came', 'autumn', 'leav', 'start', 'switch', 'back', 'forth','also', 'haul', 'studio'"""

Step 3A: Apply all the above steps to all of the data - In order to create an LDA model, we need to put the pre-processing steps above together to create a list of documents (a list of lists) and then generate a document-term matrix (unique terms as rows, reviews as columns). This matrix tells us how frequently each term occurs in each individual document.

num_reviews = df.shape[0]
doc_set = [df.reviewText[i] for i in range(num_reviews)]
texts = []
for doc in doc_set:
    tokens = tokenizer.tokenize(doc.lower())
    stopped_tokens = [token for token in tokens if not token in nltk_stpwd]
    stemmed_tokens = [sb_stemmer.stem(token) for token in stopped_tokens]
    texts.append(stemmed_tokens)  # Adds tokens to new list "texts"

print(texts[1])

Step 4: Create a dictionary using corpora

Gensim’s Dictionary class encapsulates the mapping between normalized words and their integer ids. Note the step in the following code where we inspect the mapping between words and their ids; for this we use the token2id attribute.

texts_dict = corpora.Dictionary(texts)
texts_dict.save('elec_review.dict')
print(texts_dict)
# To assess the mapping between words and their ids we use the token2id attribute:
print("IDs 1 through 10: {}".format(sorted(texts_dict.token2id.items(), key=operator.itemgetter(1), reverse = False)[:10]))
# Here we assess how many reviews contain the word "complaint"
complaints = df.reviewText.str.contains("complaint").value_counts()
ax = complaints.plot.bar(rot=0)
"""
Attempting to see what happens if we ignore tokens that appear in less
than 30 documents or more than 20% documents.
"""
texts_dict.filter_extremes(no_below=20, no_above=0.10)
print(sorted(texts_dict.token2id.items(), key=operator.itemgetter(1), reverse = False)[:10])
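As a quick sanity check (not in the original post), you can see how much the filtering shrinks the vocabulary, since Gensim's Dictionary supports len():

# Remaining vocabulary after dropping rare and overly common tokens
print('{} unique tokens remain after filter_extremes'.format(len(texts_dict)))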

Step 5: Convert the dictionary to a bag of words, which we call the corpus.

The bag-of-words format is a list of (token_id, token_count) tuples per document. The length of the corpus is 1689188.

# Step 5: Converting the dictionary to bag of words calling it corpus here
corpus = [texts_dict.doc2bow(text) for text in texts]
len(corpus)
# Save the corpus to disk in the sparse coordinate Matrix Market format, serialized so it can be streamed back later
gensim.corpora.MmCorpus.serialize('amzn_elec_review.mm', corpus)
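As a usage note (not in the original post), the serialized corpus can be streamed back from disk later without re-running the pre-processing, for example:

# Reload the serialized corpus; MmCorpus streams documents lazily from disk
mm_corpus = gensim.corpora.MmCorpus('amzn_elec_review.mm')
print(mm_corpus)          # e.g. MmCorpus(num_docs x num_terms matrix)
print(mm_corpus[0][:5])   # first few (token_id, count) pairs of the first review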

The number of topics is a free parameter; one way to choose it is to look at the categories in which Amazon typically places its products (a coherence-based alternative is sketched just after this list):
1. Computer — Accessories
2. TV & Video
3. Cell Phones & Accessories
4. Photography & Videography
5. Home Audio
6. Amazon devices
7. Headphones
8. Office Electronics
9. Office supplies
10. Smart Home
11. Musical Instruments
12. Video Games
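If you prefer a more data-driven choice than eyeballing categories, one common alternative (not used in the original post) is to compare candidate topic counts with a coherence score, for example via Gensim's CoherenceModel:

from gensim.models import CoherenceModel

# Compare a few candidate topic counts by c_v coherence (higher is generally better).
# On the full corpus this is slow; a random sample of texts/corpus gives a rough comparison.
for k in [5, 8, 12]:
    candidate = gensim.models.LdaModel(corpus, num_topics=k, id2word=texts_dict, passes=5)
    cm = CoherenceModel(model=candidate, texts=texts, dictionary=texts_dict, coherence='c_v')
    print('num_topics={} coherence={:.3f}'.format(k, cm.get_coherence()))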

Step 6: Fitting the LDA model to assess the topics

We understand that LDA is an unsupervised machine learning approach, but what exactly is it?

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document.

#Step 6: Fit LDA model
lda_model = gensim.models.LdaModel(corpus, alpha='auto', num_topics=5, id2word=texts_dict, passes=20)
#Choosing the number of topics based on various categories of electronics on Amazon
lda_model.show_topics(num_topics=5,num_words=5)
raw_query = 'portable speaker'
query_words = raw_query.split()
query = []
for word in query_words:
    # Ad-hoc reuse of the pre-processing steps from above
    q_tokens = tokenizer.tokenize(word.lower())
    q_stopped_tokens = [word for word in q_tokens if not word in nltk_stpwd]
    q_stemmed_tokens = [sb_stemmer.stem(word) for word in q_stopped_tokens]
    query.append(q_stemmed_tokens[0])

print(query)
# Words in query will be converted to ids and frequencies
id2word = gensim.corpora.Dictionary()
_ = id2word.merge_with(texts_dict)  # discard the returned transformer; we only need the merged dictionary
# Convert this document into (word, frequency) pairs
query = id2word.doc2bow(query)
print(query)
#Create a sorted list
sorted_list = list(sorted(lda_model[query], key=lambda x: x[1]))
sorted_list
#Assessing the least related topic
lda_model.print_topic(sorted_list[0][0])  # least related
#Assessing the most related topic
lda_model.print_topic(sorted_list[-1][0])  # most related
"""'0.025*"speaker" + 0.015*"headphon" + 0.013*"music" + 0.013*"bluetooth" + 0.009*"phone" + 0.009*"ear" + 0.009*"8217" + 0.009*"volum" + 0.009*"audio" + 0.008*"pair"'"""

Final step: Above are the top words associated with one topic. The float next to each word is the weight showing how much the given word influences this specific topic. We can interpret that this topic is likely close to Amazon’s headphones category, which has various sub-categories: “In Ear Earbud Headphones, Over-Ear Headphones, On-Ear Headphones, Bluetooth Headphones, Sports and Fitness Headphones, Noise-cancelling Headphones”.
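Beyond the ad-hoc query above, you can also inspect the topic mixture of any individual review (a small sketch, not in the original post, using Gensim's get_document_topics and the variables defined earlier):

# Topic distribution for the first review: a list of (topic_id, probability) pairs
doc_topics = lda_model.get_document_topics(corpus[0])
print(doc_topics)

# The dominant topic and its top words
top_topic = max(doc_topics, key=lambda x: x[1])
print(lda_model.print_topic(top_topic[0]))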

Source code can be found on Github.
