Sentiment analysis on smartphone reviews (LSTM model)
I recently learned NLP as part of my deep learning course.
So I decided to write a blog post on sentiment analysis of smartphone reviews scraped from amazon.in, which should be useful for beginners exploring the NLP domain.
In this blog we will train an LSTM model on the review text. It is a many-to-one model: it takes a sequence of inputs (the words of a review) and produces a single output (the sentiment).
Contents:
- Scraping reviews
- Preprocessing and EDA
- Training the NLP model
Technologies used: Python, TensorFlow, Seaborn, BeautifulSoup
Scraping Reviews
I used BeautifulSoup to scrape the reviews.
First we have to get the ASINs (Amazon Standard Identification Numbers) of all the smartphones we want.
I wrote some helper functions:
getAmazonSearch: takes a search query and a page number and returns the HTML of the search results page
Searchasin: takes an ASIN and returns the product page
SearchReviews: takes the “all reviews” link and returns the reviews page of the product
Function to extract ASIN numbers:
data_asin[:5]
Output: ['B07SDPJ4XJ', 'B089MQ622N', 'B07X4R63DF', 'B07WPVLKPW', 'B086KCCMCP']
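As a rough illustration of the extraction step, the sketch below collects the data-asin attribute values from a search results page. It uses Python's built-in html.parser instead of BeautifulSoup so that it runs without extra packages. The names AsinCollector and extract_asins and the sample HTML are hypothetical; this is not the code from the notebook.

```python
from html.parser import HTMLParser


class AsinCollector(HTMLParser):
    """Collects non-empty data-asin attribute values from HTML tags."""

    def __init__(self):
        super().__init__()
        self.asins = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        for name, value in attrs:
            if name == "data-asin" and value:
                self.asins.append(value)


def extract_asins(html):
    """Return the ASINs found in the page, in order, without duplicates."""
    parser = AsinCollector()
    parser.feed(html)
    return list(dict.fromkeys(parser.asins))


# Made-up snippet shaped like a search results page (assumption)
sample_html = """
<div data-asin="B07SDPJ4XJ"></div>
<div data-asin=""></div>
<div data-asin="B089MQ622N"></div>
"""
data_asin = extract_asins(sample_html)
print(data_asin)
```

Empty data-asin values are skipped because a results page may contain placeholder elements that do not correspond to a product.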
Then, by passing these data-asin values, we go to each product page and get the “see all reviews” link.
Using these “see all reviews” links and setting the page number, we scrape all the reviews (mobile name, review title, body, stars) and save them to a CSV file.
We have now finished scraping the reviews. Next, we preprocess and visualize the data.
https://github.com/msiddhu/sentiment-analysis_on_phone-reviews/blob/main/reviews-scraping.ipynb
Preprocessing and EDA
Now we have to clean the data.
The data contains noise such as emojis, numbers, frequently used words (is, the, for), and blank spaces. We remove these and convert all sentences to lower case so that training is easier.
Example:
RAW DATA: Nice phone from Samsung in this price. Display is good . Camera is not awesome but average. Battery will last 1 day with normal usage. N it has all necessary features. I got this for 8999 . So good phone under 10 k.As samsung so last for years.
FILTERED: nice phone samsung price display good camera awesome average battery last 1 day normal usage necessary features got good phone 10 kas samsung last years
Using the filter_text() function we can clean all the data.
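A minimal sketch of what such a cleaning function could look like is shown below. The stop-word list here is a tiny made-up one, and the notebook's actual filter_text() may behave differently (for instance, the example output above still contains some digits).

```python
import re

# Tiny illustrative stop-word list (assumption); a real list is much longer
STOP_WORDS = {"is", "the", "for", "in", "this", "a", "it", "and"}


def filter_text(text):
    """Lower-case the text and strip non-letters, stop words and extra spaces."""
    text = text.lower()
    # Replace everything except a-z and whitespace (digits, punctuation, emojis)
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses repeated whitespace
    words = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(words)


print(filter_text("The Display is GOOD!! 😀 Battery lasts 2 days."))
# prints: display good battery lasts days
```

Replacing unwanted characters with a space rather than deleting them keeps neighbouring words from being glued together.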
Let's visualize which words in the text are most commonly used:
From this we can see that commonly used bigrams such as “battery life”, “value money”, “camera quality”, and “don’t buy” are crucial for deciding the rating of a smartphone.
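Counting bigrams like these can be sketched with the standard library alone. The helper name count_bigrams and the three sample reviews are made up for illustration.

```python
from collections import Counter


def count_bigrams(texts):
    """Count adjacent word pairs across a list of already-cleaned texts."""
    counts = Counter()
    for text in texts:
        words = text.split()
        # zip pairs each word with its right-hand neighbour
        counts.update(zip(words, words[1:]))
    return counts


# Made-up cleaned reviews, for illustration only
reviews = [
    "battery life good camera quality average",
    "battery life poor dont buy",
    "camera quality good battery life good",
]
bigram_counts = count_bigrams(reviews)
for (first, second), n in bigram_counts.most_common(3):
    print(first, second, n)
```

The resulting counts can then be passed to a plotting library such as Seaborn to draw a bar chart of the most frequent pairs.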
Plot the distribution of ratings:
Rating 1 or 2: Negative
Rating 3: Neutral
Rating 4 or 5: Positive
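The mapping from star ratings to the three sentiment classes can be written as a small function. The function name rating_to_sentiment is hypothetical.

```python
def rating_to_sentiment(stars):
    """Map a 1-5 star rating to a sentiment label."""
    if stars <= 2:
        return "Negative"
    if stars == 3:
        return "Neutral"
    return "Positive"


ratings = [1, 2, 3, 4, 5]
labels = [rating_to_sentiment(r) for r in ratings]
print(labels)
# prints: ['Negative', 'Negative', 'Neutral', 'Positive', 'Positive']
```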
https://github.com/msiddhu/sentiment-analysis_on_phone-reviews/blob/main/preprocess-and-eda.ipynb
Training the LSTM model for Sentiment analysis
Next, download the GloVe vectors, which are pre-trained on a large text corpus using word-word co-occurrence statistics. Simply put, words with the same meaning, or words that lead to similar conclusions, have similar GloVe vectors.
We use the 50-dimensional GloVe vectors, downloaded from Kaggle.
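To make "similar vectors" concrete, the sketch below parses a few lines in the GloVe text format (a word followed by its vector components) and compares words by cosine similarity. The three-dimensional vectors are invented for illustration and are not real GloVe values; read_vectors is a simplified stand-in, not the course's read_glove_vecs.

```python
import math


def read_vectors(lines):
    """Parse GloVe-style text lines: a word followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors; the real file has 50 numbers per word
toy_lines = [
    "good 0.9 0.1 0.3",
    "great 0.8 0.2 0.3",
    "charger -0.7 0.1 0.4",
]
vectors = read_vectors(toy_lines)
sim_good_great = cosine_similarity(vectors["good"], vectors["great"])
sim_good_charger = cosine_similarity(vectors["good"], vectors["charger"])
print(round(sim_good_great, 3), round(sim_good_charger, 3))
```

With these toy numbers, "good" and "great" point in nearly the same direction, while "good" and "charger" do not.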
Steps for preparing the data:
- Read the CSV file.
- Split the data into train and test sets.
- Find the average length of the sentences.
- Define the hyperparameters.
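The split and average-length steps can be sketched as follows. The rows, the 20% test fraction, the seed, and the helper names split_rows and average_length are all assumptions made for this example.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and set aside test_fraction of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def average_length(sentences):
    """Average number of words per sentence."""
    return sum(len(s.split()) for s in sentences) / len(sentences)


# Made-up (text, label) rows standing in for the CSV contents
rows = [
    ("good phone", "Positive"),
    ("battery life poor", "Negative"),
    ("camera quality average nothing special", "Neutral"),
    ("value money", "Positive"),
    ("dont buy", "Negative"),
]
train_rows, test_rows = split_rows(rows)
avg_len = average_length([text for text, _ in rows])
print(len(train_rows), len(test_rows), avg_len)
```

An average like this can guide the choice of the maximum sequence length used when padding the inputs.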
Now we have to create an embedding layer to convert the sentences into numeric vectors.
The functions pretrained_embedding_layer, sentences_to_indices, and read_glove_vecs are taken from the Sequence Models course on Coursera.
Output:
the index of good in the vocabulary is 164328
the 50030th word in the vocabulary is al-gama
For the sentences_to_indices function:
Input:
“the phone is good”,
“very bad”,
“no star rating”
Output:
[357266, 283483, 192973, 164328],
[377946, 65963],
[262350, 341678, 301038]
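The idea behind this conversion, including the zero-padding that gives every sentence the same length, can be sketched in plain Python. This is a simplified variant written for illustration, not the course's implementation; the toy vocabulary and its indices are made up.

```python
def sentences_to_indices(sentences, word_to_index, max_len):
    """Turn each sentence into a fixed-length list of word indices.

    Unknown words are skipped and short sentences are padded with 0.
    """
    batch = []
    for sentence in sentences:
        words = sentence.lower().split()
        indices = [word_to_index[w] for w in words if w in word_to_index]
        indices = indices[:max_len]                # truncate long sentences
        indices += [0] * (max_len - len(indices))  # zero-pad short ones
        batch.append(indices)
    return batch


# Toy vocabulary (assumption); index 0 is kept free so it can act as padding
word_to_index = {"bad": 1, "good": 2, "is": 3, "phone": 4, "the": 5, "very": 6}
X = sentences_to_indices(["the phone is good", "very bad"], word_to_index, max_len=5)
print(X)
# prints: [[5, 4, 3, 2, 0], [6, 1, 0, 0, 0]]
```

Fixed-length rows are what allow a whole batch of reviews to be fed to the embedding layer as one array.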
Build the 2-layer LSTM model using Keras, which makes this comparatively easy.
Text-to-indices process:
1. Convert to NumPy arrays
2. Apply sentences_to_indices
3. Pad the sequences
Now, finally, compile and train the model:
model.fit(X_train_indices, Y_train_oh, epochs = epochs, batch_size = batch_size, shuffle=True)
Testing the model
Using the train and test data, get the predictions and print the accuracy score and the Cohen's kappa score.
Output:
Test Cohen Kappa score: 0.993
Test Accuracy score : 0.986
Train Cohen Kappa score: 0.996
Train Accuracy score : 0.991
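For readers unfamiliar with these two metrics, the sketch below computes them from their definitions in plain Python. scikit-learn provides accuracy_score and cohen_kappa_score for the same purpose; the helper names and the labels here are made up for illustration and are not the results above.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability that both pick the same class independently
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Made-up labels: 0 = Negative, 1 = Neutral, 2 = Positive
y_true = [0, 0, 1, 2, 2, 2, 1, 0]
y_pred = [0, 0, 1, 2, 2, 1, 1, 0]
acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(acc, round(kappa, 3))
```

Cohen's kappa is useful alongside accuracy when the classes are imbalanced, because it discounts the agreement a model would get by chance.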
With training and testing done, we plot a word cloud of the positive reviews.
Common words: phone price, fast charging, good phone, battery life.
Note: Some words that should not be there, such as “galaxym21”, “samsung galaxy”, and “redmi note”, still appear.
Bigrams
Bigrams give a very clear view of which adjacent words lead to bad or good reviews.
Project Link:
I hope this explanation was clear.
Feel free to contact me if you have any questions about this article.
Thank you.