Sentiment analysis on smartphone reviews (LSTM model)

Siddhartha Malladi
Published in Analytics Vidhya

5 min read · Dec 1, 2020


I recently learned NLP as part of my deep learning course.

So, I decided to write a blog on sentiment analysis of smartphone reviews scraped from amazon.in, which should be very useful for beginners exploring the NLP domain.

In this blog we will train an LSTM model on the text. It is a many-to-one model: it takes a sequence of inputs and produces a single output.

Contents-

  1. Scraping reviews
  2. Preprocessing and EDA
  3. Training the NLP model

Technologies used- Python, TensorFlow, Seaborn, BeautifulSoup

Scraping Reviews

I used BeautifulSoup for scraping the reviews.

First, we have to get the ASIN numbers of all the smartphones we want.

I wrote some helper functions:

getAmazonSearch: takes a search query and page number and returns the HTML search page

Searchasin: takes an ASIN number and returns the product page

SearchReviews: takes the "all reviews" link and returns the reviews page of the product
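The post's helper code is embedded elsewhere, so here is a minimal sketch of what those three helpers might look like, assuming the requests library; the URL patterns and User-Agent header are my assumptions, not taken from the post:

```python
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # Amazon tends to block default clients

def build_search_url(query, page):
    # construct the amazon.in search URL for a query and page number
    return "https://www.amazon.in/s?k={}&page={}".format(query.replace(" ", "+"), page)

def getAmazonSearch(query, page):
    # fetch the search-results HTML for one page (function names follow the post)
    return requests.get(build_search_url(query, page), headers=HEADERS)

def Searchasin(asin):
    # fetch a product page given its ASIN
    return requests.get("https://www.amazon.in/dp/" + asin, headers=HEADERS)

def SearchReviews(review_link):
    # fetch the "see all reviews" page given its relative link
    return requests.get("https://www.amazon.in" + review_link, headers=HEADERS)
```

In practice you would also add delays between requests and handle non-200 responses, since Amazon rate-limits scrapers aggressively.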

Function to extract ASIN numbers:
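A minimal sketch of this extraction step, assuming BeautifulSoup: each search-result `div` carries a `data-asin` attribute, and the hypothetical `extract_asins` helper below simply collects the non-empty ones:

```python
from bs4 import BeautifulSoup

def extract_asins(html):
    # collect the data-asin attribute of every search-result div, skipping blanks
    soup = BeautifulSoup(html, "html.parser")
    return [d["data-asin"]
            for d in soup.find_all("div", attrs={"data-asin": True})
            if d["data-asin"]]
```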

data_asin[:5]
output:['B07SDPJ4XJ', 'B089MQ622N', 'B07X4R63DF', 'B07WPVLKPW', 'B086KCCMCP']

Then, by passing the data-asin numbers, we go to each product page and grab its "see all reviews" link.

Using these "see all reviews" links and stepping through the page numbers, we scrape all the reviews (mobile name, review title, body, stars) and save them to a CSV file.

That completes the scraping. Next, we preprocess and visualize the data.

https://github.com/msiddhu/sentiment-analysis_on_phone-reviews/blob/main/reviews-scraping.ipynb

Preprocessing and EDA

Now, we have to do data cleaning.

The data contains noise such as emojis, numbers, stopwords (is, the, for), and extra blank spaces. We have to clean these out and convert all sentences to lowercase so the model trains more easily.

Example:

RAW DATA: Nice phone from Samsung in this price. Display is good . Camera is not awesome but average. Battery will last 1 day with normal usage. N it has all necessary features. I got this for 8999 . So good phone under 10 k.As samsung so last for years.

FILTERED: nice phone samsung price display good camera awesome average battery last 1 day normal usage necessary features got good phone 10 kas samsung last years

Using this filter_text() function, we can clean all the data.
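A sketch of what such a filter_text() function might look like; the stopword list here is illustrative (a real run would use NLTK's full list), and the exact regex is my assumption:

```python
import re

# illustrative stopword list; a real pipeline would use NLTK's full set
STOPWORDS = {"is", "the", "for", "a", "an", "and", "in", "it",
             "this", "that", "from", "not", "will", "with", "so"}

def filter_text(text):
    # lowercase, keep only letters/digits/spaces, then drop stopwords
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)
```

Applied to a short review, filter_text("The phone is good!") would return "phone good".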

Let's visualize which words in the text are most commonly used:

Most common words in the reviews

From this we can see that bigrams like "battery life", "value money", "camera quality", and "don't buy" are crucial for deciding the rating of a smartphone.

Plot a graph for distribution of ratings:

distribution of ratings

Rating 1 or 2 — Negative

Rating 3 — Neutral

Rating 4 or 5 — Positive
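The rating-to-label mapping above can be sketched as a small helper (the function name is mine, the mapping is the post's):

```python
def rating_to_sentiment(stars):
    # map a 1-5 star rating onto the three sentiment classes
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"
```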

https://github.com/msiddhu/sentiment-analysis_on_phone-reviews/blob/main/preprocess-and-eda.ipynb

Training the LSTM model for Sentiment analysis

Then download GloVe vectors, which are pre-trained on a large text corpus using word-word co-occurrence statistics. Simply put, words with similar meanings (or words that lead to similar conclusions) have similar GloVe vectors.

We are using 50-dimensional GloVe vectors downloaded from Kaggle.

Steps for preparing the data:

  1. Read csv file
  2. Split the data into train data and test data.
  3. Find the avg. length of sentences.
  4. Define hyperparameters
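The hyperparameters might look something like the fragment below; apart from the 50-dimensional embeddings (stated in the post), every value here is an illustrative assumption, since the post does not list its exact settings:

```python
# illustrative hyperparameters -- only embedding_dim = 50 is stated in the post
max_len = 100        # padded sentence length (assumed)
embedding_dim = 50   # matches the 50-d GloVe vectors
lstm_units = 128     # assumed
batch_size = 32      # assumed
epochs = 10          # assumed
num_classes = 3      # negative / neutral / positive
```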

Now, we have to create an embedding layer to convert the sentences into vectors of numbers.

The functions pretrained_embedding_layer, sentences_to_indices, and read_glove_vecs are taken from the Sequence Models course on Coursera.


Output:

the index of good in the vocabulary is 164328

the 50030th word in the vocabulary is al-gama

And the sentences_to_indices function,

Input:

“the phone is good”,

“very bad”

“no star rating”

Output:

[357266, 283483, 192973, 164328.],

[377946, 65963]

[262350, 341678, 301038]
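A sketch of what sentences_to_indices does (the real version comes from the Coursera course; mapping out-of-vocabulary words to 0 is my assumption here):

```python
import numpy as np

def sentences_to_indices(sentences, word_to_index, max_len):
    # map each sentence to a fixed-length row of vocabulary indices,
    # zero-padded on the right; unknown words map to 0 (assumption)
    X = np.zeros((len(sentences), max_len))
    for i, sentence in enumerate(sentences):
        for j, w in enumerate(sentence.lower().split()[:max_len]):
            X[i, j] = word_to_index.get(w, 0)
    return X
```

With a toy vocabulary {"the": 1, "phone": 2, "is": 3, "good": 4} and max_len 6, "the phone is good" becomes [1, 2, 3, 4, 0, 0].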

Build the 2-layer LSTM model using Keras, which is comparatively easier than other frameworks.
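A minimal sketch of such a 2-layer LSTM in Keras; the 128 units and 0.5 dropout are my assumptions, and a plain trainable Embedding stands in here for the GloVe-initialized pretrained_embedding_layer the post actually uses:

```python
import numpy as np
import tensorflow as tf

def build_lstm_model(vocab_size, emb_dim, num_classes):
    # two stacked LSTM layers with dropout, softmax over the 3 classes;
    # layer sizes are illustrative, not taken from the post
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, emb_dim),
        tf.keras.layers.LSTM(128, return_sequences=True),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model
```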

Text-to-indices process:

1. Convert to NumPy arrays
2. sentences_to_indices
3. Pad the sequences

Now finally, Compile and train the model:

model.fit(X_train_indices, Y_train_oh, epochs = epochs, batch_size = batch_size, shuffle=True)

Testing the model

Take the train and test data, get the predictions, and print the accuracy score and Cohen kappa score.
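The evaluation step can be sketched as follows, assuming scikit-learn's metrics; the `evaluate` helper and its argument names are mine:

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

def evaluate(model_probs, y_true):
    # model_probs: softmax outputs of shape (n_samples, n_classes);
    # take the argmax as the predicted class, then score against y_true
    y_pred = np.argmax(model_probs, axis=1)
    return accuracy_score(y_true, y_pred), cohen_kappa_score(y_true, y_pred)
```

Cohen's kappa is worth reporting alongside accuracy here because the rating distribution is imbalanced, and kappa corrects for chance agreement.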

Output:

Test Cohen Kappa score: 0.993

Test Accuracy score : 0.986

Train Cohen Kappa score: 0.996

Train Accuracy score : 0.991

Now that training and testing are done, let's plot a word cloud of the positive reviews.

positive reviews wordcloud

Common words: phone price, fast charging, good phone, battery life.

Note: there are some words that should not be there, such as product names like "galaxym21", "samsung galaxy", "redmi note".

Bigrams

Bigrams give a very clear view of which adjacent word pairs lead to good or bad reviews.
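Counting bigrams is straightforward with the standard library; this sketch (the `top_bigrams` helper is mine) pairs each word with its neighbor and tallies the pairs:

```python
from collections import Counter

def top_bigrams(texts, n=5):
    # count adjacent word pairs across all cleaned review texts
    counts = Counter()
    for t in texts:
        words = t.split()
        counts.update(zip(words, words[1:]))
    return counts.most_common(n)
```

For example, over ["battery life good", "battery life bad"] the top bigram is ("battery", "life") with a count of 2.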

Bigram-image
