Text Analysis on the reviews data of Indian products in Amazon

Ruma Sinha
Published in Analytics Vidhya
10 min read · Nov 11, 2020

The objective of this article is to explore and analyze a dataset of reviews of Indian products on Amazon with different NLP libraries such as NLTK and spaCy, and to touch upon sentiment analysis with NLTK VADER and TextBlob.

Sentiment analysis quantifies the emotional intensity of words and phrases within a text. Sentiment analysis tools process a unit of text and output quantitative scores that indicate positive or negative sentiment. The NLTK VADER sentiment analyzer generates positive, negative, and neutral sentiment scores for a given input. Sentiment analysis is essential for businesses to gauge customer response.

Text data is unstructured, but with various Python libraries it has become efficient to explore and analyze it in depth and extract meaningful insights for business decisions.

The dataset is available on Kaggle: https://www.kaggle.com/nehaprabhavalkar/indian-products-on-amazon

As part of the NLP analysis process, the typical pipeline is Tokenization ===> Cleaning the data ===> Removing the stop words ===> BoW ===> Classification model training.

As the first step, we load all the required Python libraries. Next, we read the dataset into a DataFrame:
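A minimal sketch of the setup, assuming the Kaggle CSV has been downloaded locally (the file name below is a placeholder, not the exact path used in the article):

import re
import string

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download('stopwords')  # uncomment on the first run

# placeholder path; point it at wherever the Kaggle CSV was saved
reviews_df = pd.read_csv("amazon_indian_product_reviews.csv")
reviews_df.head()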

There are five columns: asin, name, date, rating, and review. We check the dimensionality of the DataFrame and look for null values. Four rows have a null review, so we drop those rows and reset the index:
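A sketch of that check and cleanup:

print(reviews_df.shape)            # dimensionality of the dataframe
print(reviews_df.isnull().sum())   # null counts per column

# drop the rows with a missing review and reset the index
reviews_df = reviews_df.dropna(subset=["review"]).reset_index(drop=True)
print(reviews_df.shape)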

As part of the preprocessing step, we remove all characters that are not letters or digits from the review field.

# remove every character that is not a letter or a digit
def cleanText(input_string):
    modified_string = re.sub('[^A-Za-z0-9]+', ' ', input_string)
    return modified_string

reviews_df['review'] = reviews_df.review.apply(cleanText)
reviews_df['review'][150]

output:

‘I am writing this review after using it around 20 days It seems very natural and chemical free and is very gentle on skin But it does its job of cleaning the skin properly It contains tea tree which is one of my favourite ingredients for skin care Give it a try its definitely better than all other chemicals containing face washes And its even affordable as compared to other natural brands available in market ‘

From the name field we can extract the brand name (the first word of the product name):

# take the first word of the product name as the brand
reviews_df['brandName'] = reviews_df['name'].str.split().str[0].str.title()
reviews_df.brandName.unique()

array(['Mamaearth', 'Godrej', 'Titan', 'Maaza', 'Paper', 'Indiana',
'Coca', 'Natural', 'Maggi', 'Glucon', 'Amul', 'Patanjali',
'Dettol', 'Savlon', 'Cinthol', 'Britannia', 'Nutrichoice',
'Streax', 'Himalaya', 'Society', 'Tata', 'Fastrack', 'Reflex',
'Mysore'], dtype=object)

Distribution of the rating column: we can see that most reviews have a rating of 5.

# distribution of rating
sns.countplot(x='rating', data=reviews_df)
plt.show()

Review counts and brands

We will do the text analysis with NLTK and the VADER sentiment analyzer.

VADER gives a sentiment intensity score to each review. The result of polarity_scores gives us numerical values for negative, neutral, and positive sentiment. The compound value reflects the overall sentiment, ranging from -1 (very negative) to +1 (very positive).

Text Preprocessing

# converting to lower case
reviews_df['clean_review_text'] = reviews_df['review'].str.lower()

# removing punctuation
reviews_df['clean_review_text'] = reviews_df['clean_review_text'].str.translate(str.maketrans('', '', string.punctuation))

# removing stop words
stopWords = stopwords.words('english') + ['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from']

def removeStopWords(stopWords, rvw_txt):
    newtxt = ' '.join([word for word in rvw_txt.split() if word not in stopWords])
    return newtxt

reviews_df['clean_review_text'] = [removeStopWords(stopWords, x) for x in reviews_df['clean_review_text']]

nltk.download('vader_lexicon')
sentiment_model = SentimentIntensityAnalyzer()

sentiment_scores = []
sentiment_score_flag = []
for text in reviews_df['clean_review_text']:
    sentimentResults = sentiment_model.polarity_scores(text)
    # the compound value reflects the overall sentiment,
    # from -1 (very negative) to +1 (very positive)
    sentiment_score = sentimentResults["compound"]
    sentiment_scores.append(sentiment_score)
    # marking the sentiment as positive, negative or neutral
    if sentimentResults['compound'] >= 0.05:
        sentiment_score_flag.append('positive')
    elif sentimentResults['compound'] <= -0.05:
        sentiment_score_flag.append('negative')
    else:
        sentiment_score_flag.append('neutral')

reviews_df['scores'] = sentiment_scores
reviews_df['scoreStatus'] = sentiment_score_flag

reviews_df.head()

The word cloud over all reviews displays top words such as hair, oil, product, healthy, etc.

The word cloud for positive reviews shows words like best, soothing, good, etc.

The word cloud for negative reviews displays words like worst, fake, bad, etc.
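The word-cloud code itself is not shown above; a minimal sketch with the wordcloud package, assuming the positive and negative clouds reuse the same pattern on the filtered subsets:

from wordcloud import WordCloud

# word cloud over all cleaned reviews; filter on scoreStatus for the
# positive-only and negative-only versions
all_text = " ".join(reviews_df["clean_review_text"])
cloud = WordCloud(width=800, height=400, background_color="white").generate(all_text)
plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()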

Brand-wise total positive and negative reviews: only for the brand Indiana do we see no negative reviews.
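One possible way to get these brand-wise counts (a sketch, not the author's exact code):

# positive / negative / neutral review counts per brand
brand_sentiment = (reviews_df
                   .groupby(["brandName", "scoreStatus"])
                   .size()
                   .unstack(fill_value=0))
print(brand_sentiment)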

The 6 reviews for the "Indiana" brand can be listed as below, and we see that all of them are positive.
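A one-line filter to pull those reviews up:

# all reviews for the Indiana brand together with their sentiment flag
reviews_df[reviews_df["brandName"] == "Indiana"][["review", "scoreStatus"]]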

TF, TF-IDF and Bag of Words

For a machine learning model the input has to be numeric, hence to represent our text numerically we use Bag of Words models such as TF and TF-IDF.

TF-IDF aims to quantify the importance of a given word relative to other words in the document and in the corpus.

One tool we can use for doing this is called Bag of Words (BoW). BoW is a classical text representation technique that treats the text under consideration as a collection of words while ignoring order and context. BoW converts text into a matrix of word occurrences within a given document. It focuses on whether given words occur in the document or not, and it generates a matrix that we might see referred to as a BoW matrix or a document-term matrix.

Tokenization is the first step in text analytics. The process of breaking down a text paragraph into smaller chunks such as words or sentences is called tokenization. A token is a single entity that serves as a building block of a sentence or paragraph.

Stemming normalizes words into their root form.

Lemmatization is similar to stemming: we fetch the base or root form of a word.

POS tagging is the process of classifying words into their parts of speech and labeling them accordingly.

CountVectorizer implements BoW by converting a collection of text documents into a matrix of token counts. "fit" applies CountVectorizer to the training data, learning the vocabulary from the collection of text documents.

from sklearn.feature_extraction.text import CountVectorizer

features = CountVectorizer()
features.fit(reviews_df["clean_review_text"])
# the learned vocabulary maps each token to a column index
features.vocabulary_

{'bought': 615, 'hair': 1916, 'oil': 2879, 'viewing': 4476, 'many': 2541, 'good': 1842, 'comments': 917, 'product': 3235, 'enough': 1454, 'first': 1645, 'expensive': 1521, 'second': 3650, 'thing': 4207, 'amount': 322, 'low': 2479, 'half': 1925, 'bottle': 611, 'yes': 4671, 'completely': 946,

"transform" converts the input documents into a document-term matrix, which gives the BoW representation. Each row is a document and each column is a word from the training vocabulary.

bagofWords = features.transform(reviews_df[“clean_review_text”])

(0, 322)	1
(0, 611) 2
(0, 615) 1
(0, 626) 2
(0, 698) 1
(0, 805) 1
(0, 917) 1

print(bagofWords.toarray())

[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...

print(features.get_feature_names())

'aamras', 'abandon', 'abck', 'able', 'abroad', 'absence', 'absolute', 'absolutely', 'absorbed', 'absorbing', 'abt', 'accept', 'accepted', 'access', 'accessibility', 'accompanied', 'accompaniment', 'according', 'account', 'accounts', 'accurate', 'accustomed', 'acetate', 'acid', 'acidic', 'acne', 'acnei', 'acnes', 'across', 'action', 'actions', 'active', 'actives', 'actual', 'actually', 'ad', 'add', 'added', 'addicted', 'adding', 'addition', 'additional', 'additives', 'address', 'adds', 'adequate', 'adjust', 'admit', 'adrak', 'adult', 'adulterated', 'adulteration', 'adultery', 'adults', 'advantage', 'adverse', 'adversely', 'advertised'

We build the training and test data by combining all the positive and all the negative rows, and train a classification model with Logistic Regression.

Checking how the trained model behaves on new data: "great" and "bad" are rightly predicted, but the model gets "sucks" and "not good" wrong.
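A minimal sketch of this step, reusing the CountVectorizer fitted above (the exact modelling code is not shown in the article, so names like lr_model are placeholders):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# keep only the positive / negative reviews and encode the label as 1 / 0
binary_df = reviews_df[reviews_df["scoreStatus"].isin(["positive", "negative"])]
X_bow = features.transform(binary_df["clean_review_text"])
y = (binary_df["scoreStatus"] == "positive").astype(int)

X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.3, random_state=0)
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
print(lr_model.score(X_test, y_test))

# spot check on unseen phrases: "great" and "bad" come out right, while
# out-of-vocabulary "sucks" and the negation "not good" trip the model up
print(lr_model.predict(features.transform(["great", "bad", "sucks", "not good"])))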

The top tokens include product, good, nice, etc.:

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

tokenized_word = word_tokenize(reviews_df['clean_review_text'].to_string())
# frequency distribution
fdist = FreqDist(tokenized_word)
# frequency distribution plot
fdist.plot(30, cumulative=False)
plt.show()

Next, we will do the review text analysis with the spaCy library.


Tokenizing the Text

Tokenization is the process of breaking text into pieces, called tokens, while ignoring characters like punctuation marks (, . " ') and spaces. spaCy's tokenizer takes input in the form of unicode text and outputs a sequence of token objects.

We import spaCy and its English-language model, assign a review text string to text, process that text with nlp(text), and assign the result to a variable called doc. spaCy stores the tokenized text as a Doc object.
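A minimal sketch of that setup (which review is assigned to text is not shown in the article, so one review is picked here purely for illustration):

import spacy
from spacy import displacy

# the small English model is used here for illustration;
# the article later switches to the large model en_core_web_lg
nlp = spacy.load("en_core_web_sm")

text = reviews_df["review"][0]   # any single review works as the example
doc = nlp(text)
print([token.text for token in doc])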

# stop words
from spacy.lang.en.stop_words import STOP_WORDS
stopWords = list(STOP_WORDS)
len(stopWords)

for token in doc:
    if not token.is_stop:
        print(token)

oily
acne
prone
skin
product
days
acne
dried
significant
improvement
dark
spots
fragrance
mild
brilliant

Lemmatization

Finding the roots of all the words using spaCy lemmatization.
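For example, iterating over the Doc and printing each token's lemma_ attribute:

# root form (lemma) of every non-stop-word token in the sample review
for token in doc:
    if not token.is_stop:
        print(token.text, '---->', token.lemma_)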

The displacy.render method renders a dependency parse tree or a named entity visualization. NER with spaCy can be done with the ents attribute of the doc object.
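A sketch of both calls (inside a Jupyter notebook, displacy.render draws the visualization inline):

# named-entity visualization for the sample review
displacy.render(doc, style="ent", jupyter=True)

# the recognized entities are exposed on doc.ents
for ent in doc.ents:
    print(ent.text, ent.label_)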

We download and load the large pretrained model:

import en_core_web_lg

nlp = en_core_web_lg.load()
doc = nlp(text)
for token in doc:
    print(token.text, '---->', token.has_vector)

I ----> True
have ----> True
oily ----> True
acne ----> True
prone ----> True
skin ----> True
and ----> True
have ----> True
been ----> True
using ----> True
this ----> True
product ----> True
for ----> True
a ----> True

Similarity score
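With the large model's word vectors loaded, similarity between tokens or whole documents can be computed with the similarity method. The sentences below are illustrative examples, not ones taken from the dataset:

# document-level similarity between two illustrative review-like sentences
doc1 = nlp("This face wash is gentle on the skin")
doc2 = nlp("The cleanser is mild and soothing")
print(doc1.similarity(doc2))

# token-level similarity between two words
print(nlp("oil")[0].similarity(nlp("shampoo")[0]))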

We combine the positive and negative datasets to build the training data for a classification model.

df = pd.concat([positiveReviews_df, negativeReviews_df])
df = df[["clean_review_text", "scoreStatus"]]
df['scoreStatus'] = (df['scoreStatus'] == 'positive') * 1

# tokenization
punct = string.punctuation
print(punct)

def cleanText(sent):
    doc = nlp(sent)
    tokens = []
    for token in doc:
        # keep pronoun placeholders as-is, lower-case everything else
        if token.lemma_ != "-PRON-":
            tokens.append(token.lemma_.lower().strip())
        else:
            tokens.append(token.lemma_)

    cleanTokens = []
    for token in tokens:
        if token not in stopWords and token not in punct:
            cleanTokens.append(token)
    return cleanTokens

# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

tfidf = TfidfVectorizer(tokenizer=cleanText)
classifier = LinearSVC()
X = df["clean_review_text"]
y = df["scoreStatus"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Training the model with SVM
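A minimal sketch, chaining the TfidfVectorizer and LinearSVC defined above into a pipeline and scoring it on the held-out split:

from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

svm_pipeline = Pipeline([("tfidf", tfidf), ("clf", classifier)])
svm_pipeline.fit(X_train, y_train)

y_pred = svm_pipeline.predict(X_test)
print(accuracy_score(y_test, y_pred))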

Analyzing with TextBlob

polarity varies from -1 (negative) to +1 (positive)

subjectivity varies from 0 to 1

from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity
reviews_df["polarity"] = reviews_df["review"].apply(pol)
reviews_df["subjectivity"] = reviews_df["review"].apply(sub)

print("negative reviews")
most_negative = reviews_df[reviews_df.polarity == -1].review.head()
print(most_negative)
print("positive reviews")
most_positive = reviews_df[reviews_df.polarity == 1].review.head()
print(most_positive)

negative reviews
346 Worst product Product is not working after 10d...
356 Worst product Product is not working after 10d...
1566 taste is terrible
1572 taste is terrible
Name: review, dtype: object
positive reviews
384 Best Product in these price segment
394 Best Product in these price segment
402 After 1 week of use Seems perfect 10 10 for lo...
407 Simply awesome
412 After 1 week of use Seems perfect 10 10 for lo...
Name: review, dtype: object

Text analysis with gensim and word2vec


Word2Vec

Word2Vec captures the semantic meaning of words and their relationships with other words. It learns the internal relationships between words and represents each word as a dense vector.

sentences = reviews_df['review_tokens'][1:10]
sentences

1    [used, mama, earth, newly, launched, onion, oi...
2 [bad, product, hair, falling, increase, much, ...
3 [product, smells, similar, navarathna, hair, o...
4 [trying, different, onion, oil, hair, hair, he...
5 [using, product, time, roommate, planning, ord...
6 [purchased, oil, shampoo, watching, fake, yout...
7 [good, product, mamaearth, oil, gives, hair, f...
8 [showing, onion, oil, benefits, ad, ate, givin...
9 [used, one, time, say, hairfall, control, stop...
Name: review_tokens, dtype: object

Gensim's Word2Vec takes parameters such as min_count; with min_count=1, every word that appears at least once in the data is kept in the vocabulary.

from gensim.models import Word2Vec

# train model
model = Word2Vec(sentences, min_count=1)
print(model)

Word2Vec(vocab=229, size=100, alpha=0.025)

# vocabulary
words = list(model.wv.vocab)
print(words)

'used', 'mama', 'earth', 'newly', 'launched', 'onion', 'oil', 'twice', 'must', 'say', 'im', 'already', 'impressed', 'results', 'prevents', 'hair', 'loss', 'helps', 'control', 'premature', 'greying', 'dryness', 'dandruff', 'scalp', 'eruptions', 'many', 'problems', 'regular', 'use', 'avoid', 'dry', 'frizzy', 'make', 'sure', 'hairs', 'week', 'oiling', 'provides', 'essential', 'nutrients', 'also', 'strengthens', 'roots', 'mamaearth', 'works', 'best', 'seasons', 'bad', 'product', 'falling', 'increase', 'much', 'order', 'shampoo', 'mask', 'nothing', 'stop', 'hairfallafter', '3', '4', 'wash', 'badly', 'smells', 'similar', 'navarathna', 'strong', 'sticky', 'applying', 'three', 'drops', 'review', 'usage', '2', 'months1', 'worst', 'product2', 'fall', 'increased', 'lot3', 'brought', 'watching', 'youtube', 'influencer', 'mumbaiker', 'nikhil4', 'totally', 'misguided', 'never', 'take', 'suggestions', 'influencers', '5', 'using', 'since', 'months', 'result', 'losing', 'hair6', 'wasted', 'money', 'well', 'damaged', 'hair7', 'better', 'provide',

reviewsText = reviews_df.clean_review_text.values
reviewsVec = [nltk.word_tokenize(review) for review in reviewsText]
len(reviewsVec)

2778

We can use the trained model to find the words most similar to a given word.
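A sketch of that step, retraining Word2Vec on the full tokenized corpus built above and querying for neighbours of an illustrative word ("hair" is an assumption here, not necessarily the author's query):

# train on all 2778 tokenized reviews and look up the nearest neighbours
full_model = Word2Vec(reviewsVec, min_count=1)
print(full_model.wv.most_similar("hair", topn=5))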

Conclusion

We went through different techniques for encoding text data into numerical vectors and training classification models. Which technique is appropriate for a given machine learning model depends on the structure of the data and the business problem.
