Web Scraping and Natural Language Processing (Sentiment Analysis of Product Reviews) With Python

Beatrice Kemboi · Published in CodeX · 7 min read · Oct 29, 2022

I was recently scouring Amazon for a new pair of Apple AirPods after I lost the ones I had. As is the case with most items on Amazon, there were tons of options to choose from. I usually don’t rely on just the star rating to order something online; instead, I prefer to read most of the customer reviews first. However, each of the options I considered had far too many reviews, and going through all of them soon became taxing. That is when my Data Science instincts kicked in, and I ended up coding an algorithm that can speed up and automate the process of going through and analyzing all the reviews for a product. Stay tuned as I am about to share how I did it! The three main areas of Data Science that I employed are:

  1. Web Scraping
  2. Natural Language Processing — Sentiment Analysis
  3. Python Data Structures and Algorithms

Photo by Clément Hélardot on Unsplash

Part 1: Web Scraping

There are many resources available online for learning web scraping if you don’t know how to do it, but I am going to share how I did it for this particular project. So, first things first, I imported the required modules and libraries (BeautifulSoup, requests, and urlencode) as shown below.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

Next, because the reviews are many and long, only ten reviews are shown per page. To see the reviews after the first ten, you have to click the ‘next page’ button. For this particular product, there are a total of 9 pages of customer reviews. Every time you click the next-page button, the URL changes slightly, so to web scrape all the pages you have to get the URL for each of the 9 pages, as shown below, and create a list of them.

#The number of URLs depends on how many pages of reviews a product has so far.
url1= "https://www.amazon.com/Apple-Generation-Cancelling-Personalized-Customizable/product-reviews/B0BDHWDR12/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
url2 = "https://www.amazon.com/Apple-Generation-Cancelling-Personalized-Customizable/product-reviews/B0BDHWDR12/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2"
url3 = "https://www.amazon.com/Apple-Generation-Cancelling-Personalized-Customizable/product-reviews/B0BDHWDR12/ref=cm_cr_getr_d_paging_btm_next_3?ie=UTF8&reviewerType=all_reviews&pageNumber=3"
url4 = "https://www.amazon.com/Apple-Generation-Cancelling-Personalized-Customizable/product-reviews/B0BDHWDR12/ref=cm_cr_getr_d_paging_btm_next_4?ie=UTF8&reviewerType=all_reviews&pageNumber=4"
url5 = "https://www.amazon.com/Apple-Generation-Cancelling-Personalized-Customizable/product-reviews/B0BDHWDR12/ref=cm_cr_getr_d_paging_btm_next_5?ie=UTF8&reviewerType=all_reviews&pageNumber=5"
url6 = "https://www.amazon.com/Apple-Generation-Cancelling-Personalized-Customizable/product-reviews/B0BDHWDR12/ref=cm_cr_getr_d_paging_btm_next_6?ie=UTF8&reviewerType=all_reviews&pageNumber=6"
url7 = "https://www.amazon.com/Apple-Generation-Cancelling-Personalized-Customizable/product-reviews/B0BDHWDR12/ref=cm_cr_getr_d_paging_btm_next_7?ie=UTF8&reviewerType=all_reviews&pageNumber=7"
url8 = "https://www.amazon.com/Apple-Generation-Cancelling-Personalized-Customizable/product-reviews/B0BDHWDR12/ref=cm_cr_getr_d_paging_btm_next_8?ie=UTF8&reviewerType=all_reviews&pageNumber=8"
url9 = "https://www.amazon.com/Apple-Generation-Cancelling-Personalized-Customizable/product-reviews/B0BDHWDR12/ref=cm_cr_getr_d_paging_btm_next_9?ie=UTF8&reviewerType=all_reviews&pageNumber=9"
URLs = [url1,url2,url3,url4,url5,url6,url7,url8,url9]
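
Hard-coding each URL works, but since it is mostly the pageNumber parameter that changes from page to page, the list could also be built in a loop. Here is a rough sketch assuming the URL pattern above (the ref segment is essentially a tracking token, and it varies slightly between pages, so this is not an exact reproduction of the hard-coded list):

#A sketch of building the review-page URLs in a loop instead of hard-coding them.
#Assumes the URL pattern observed above; Amazon's ref segments vary slightly by page.
base = "https://www.amazon.com/Apple-Generation-Cancelling-Personalized-Customizable/product-reviews/B0BDHWDR12/"
URLs = [base + "ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"]
for page in range(2, 10):  #pages 2 through 9
    URLs.append(base + f"ref=cm_cr_getr_d_paging_btm_next_{page}"
                       f"?ie=UTF8&reviewerType=all_reviews&pageNumber={page}")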

I then iterated through the list to web scrape each of the URLs as follows.

#Using the imported modules and libraries to carry out web scraping
for i, url in enumerate(URLs): #for each page
    print("\npage ", i+1, ":") #Keeps track of page numbers
    params = {'api_key': ".......................", 'url': url}
    response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
    soup = BeautifulSoup(response.text, 'html.parser')

    #Scraping the star rating part from the first page of the reviews pages.
    if i == 0:
        item = soup.find("span", {"data-hook": "rating-out-of-text"})
        product_star_rating = item.get_text()

    data_string = ""
    #Scraping the titles for each page of reviews/url
    titles = dict()
    for review_number, item in enumerate(soup.find_all("a", "review-title")): #for each review on a page
        data_string = data_string + item.get_text()
        if (review_number+1) not in titles:
            titles[review_number+1] = data_string.strip()
        data_string = ""
    data_string = ""
    print("\nTITLES: ", titles)

    #Scraping review content contained in each page corresponding to the titles above
    reviews = []
    for item in soup.find_all("span", {"data-hook": "review-body"}):
        data_string = data_string + item.get_text()
        reviews.append(data_string)
        data_string = ""

Notice that I used ‘if i == 0’ to get the star rating, since I only need the first page for that. Also, when web scraping, to know exactly what to target (e.g. data-hook, span, etc. as highlighted above), right-click the element you are trying to scrape on the web page and click Inspect. Notice, too, that I am using an API key. You will need an API key to web scrape websites such as Amazon; I used ScraperAPI’s service for mine.
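
To make the selector idea concrete, here is a simplified, made-up snippet of the kind of markup those data-hook and class arguments target (illustrative only, not copied from Amazon’s actual page source):

#Illustrative only: simplified markup showing what the selectors above match
sample_html = """
<a data-hook="review-title" class="review-title">Huge improvement</a>
<span data-hook="review-body">Noise cancellation is much better than the first generation.</span>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
print(sample_soup.find("a", "review-title").get_text())                  #matched by CSS class
print(sample_soup.find("span", {"data-hook": "review-body"}).get_text()) #matched by attribute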

Part 2: Natural Language Processing (NLP)

For NLP, the most widely used module is NLTK (Natural Language Toolkit), which has a variety of linguistic tools for preprocessing texts for analysis. To use NLTK, you first have to install it with pip install nltk in your computer’s terminal and then import it and download its data in your Python IDE as illustrated below.

#Importing and downloading the Natural Language Toolkit (NLTK)
import nltk
nltk.download("all") #downloads the entire NLTK data collection
from nltk import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

You don’t have to download everything as I did. You can just do nltk.download(“punkt”) if, for instance, all you want to do is tokenize your texts.

In most Data Science projects, the first and most important step is to preprocess/clean data. In my case, I preprocessed the reviews in three steps shown in the code below.

import string #needed for string.punctuation below

#Removing unnecessary punctuation
punctuationfree_reviews = [review.translate(str.maketrans('', '', string.punctuation)) for review in reviews]

#Creating a list of tokens for each of the reviews
tokenized_reviews = [word_tokenize(review) for review in punctuationfree_reviews]

#Removing special characters and common stop words from the tokens list.
stopWords = set(nltk.corpus.stopwords.words('english'))
alphalowercase_reviews = []
for review_tokens in tokenized_reviews:
    alreview = []
    for i in range(len(review_tokens)):
        if review_tokens[i].isalpha() and review_tokens[i] not in stopWords:
            r = review_tokens[i].lower()
            alreview.append(r)
    alphalowercase_reviews.append(alreview)
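
To make the effect of these three steps concrete, here is what they do to a single made-up review (the input string is illustrative, not one of the scraped reviews):

#Illustrative only: tracing one made-up review through the three preprocessing steps
sample = "the sound quality is absolutely fantastic, and worth every penny!"
no_punct = sample.translate(str.maketrans('', '', string.punctuation))
sample_tokens = word_tokenize(no_punct)
cleaned = [t.lower() for t in sample_tokens if t.isalpha() and t not in stopWords]
print(cleaned)  #['sound', 'quality', 'absolutely', 'fantastic', 'worth', 'every', 'penny']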

There is a whole lot more you could do to preprocess your texts depending on what is relevant to what you are trying to achieve, including stemming, lemmatization, and part-of-speech tagging, among others; two of these are illustrated briefly below.
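
As a quick illustration of two of those extra steps (not used in this project), NLTK’s WordNetLemmatizer and pos_tag work roughly like this:

from nltk.stem import WordNetLemmatizer

#Illustrative only: lemmatization and part-of-speech tagging as extra preprocessing steps
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("improved", pos="v"))      #reduces the word to its base form, "improve"
print(nltk.pos_tag(["hugely", "improved", "sound"]))  #tags each token, roughly: adverb (RB), verb/adjective, noun (NN)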

Also, you can download vocabularies that represent different parts of speech from WordNet and use them to build a collection of vocabulary, which you can then use to validate the words in your texts. For mine, after reading many product reviews online, I realized that nouns and verbs don’t contribute much toward the polarity or sentiment score of a review. Adverbs contribute a little but are very contextual. Adjectives, on the other hand, play a significant role in determining whether a review is considered negative, neutral, or positive. Adjectives are words such as bad, good, fantastic, etc. Therefore, for my project, I built only a vocabulary of adjectives to validate my reviews against, as demonstrated below.

##building a vocabulary of acceptable ADJECTIVES found in WordNet
vocab = []
with open("wordnetAdj.txt") as WordNetinputfile:
    for line in WordNetinputfile:
        newTerm = line.split()
        vocab.append(newTerm[0])
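
If you don’t have a pre-extracted wordnetAdj.txt file handy, a similar adjective vocabulary can be built directly from NLTK’s WordNet corpus. This is a sketch of an alternative, not the approach used above:

from nltk.corpus import wordnet as wn

#A sketch of building an adjective vocabulary straight from NLTK's WordNet corpus
vocab = sorted({lemma.name().lower()
                for synset in wn.all_synsets(pos=wn.ADJ)
                for lemma in synset.lemmas()})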

I then validated the reviews against this vocabulary as follows.

#Limiting our Reviews to just valid words (as per WordNet above)
validated_reviews = []
for review in alphalowercase_reviews:
    valid_review = []
    for token in review:
        if token in vocab:
            valid_review.append(token)
    validated_reviews.append(valid_review)

Part 3: Sentiment Analysis

To assign a polarity score to every validated word of each review, I used the polarity scores from NLTK’s SentimentIntensityAnalyzer.

To see this part of the code, check out my code for this project on my GitHub account linked below.
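
For readers who want the general idea without leaving this page, here is a minimal sketch of what that step can look like. It averages the compound polarity score of each validated word in a review; the averaging and the labeling thresholds are illustrative assumptions here, not necessarily identical to the version on GitHub.

#A minimal, illustrative sketch of the sentiment-scoring step
sia = SentimentIntensityAnalyzer()
for review_number, valid_review in enumerate(validated_reviews, start=1):
    word_scores = [sia.polarity_scores(word)["compound"] for word in valid_review]
    sentiment_score = sum(word_scores) / len(word_scores) if word_scores else 0.0
    if sentiment_score > 0:
        label = "positive"
    elif sentiment_score < 0:
        label = "negative"
    else:
        label = "neutral"
    print("Sentiment score =", round(sentiment_score, 4), ", Review number", review_number, "is", label)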

Output

Below is what the last part of my output looks like. You can see the entire output in my GitHub account linked above.

page 9 :

TITLES: {1: ‘A much improved version..but’, 2: ‘Huge improvement’, 3: ‘Noise Cancellation is extraordinary!’, 4: ‘😘🥰😍 I’m in love.’, 5: ‘Better than the 1st Gen in every single way’, 6: ‘Ok upgrade from gen 1’, 7: ‘Fantastic’, 8: ‘Worth the upgrade’, 9: ‘Best earbuds in the market’, 10: ‘Just Amazing 🤩’}
Sentiment score = 0.0929 , Review number 1 is positive
Sentiment score = 0.0862 , Review number 2 is positive
Sentiment score = -0.0068 , Review number 3 is negative
Sentiment score = 0.1101 , Review number 4 is positive
Sentiment score = 0.126 , Review number 5 is positive
Sentiment score = 0.1697 , Review number 6 is positive
Sentiment score = 0.0 , Review number 7 is neutral
Sentiment score = 0.1202 , Review number 8 is positive
Sentiment score = 0.0232 , Review number 9 is positive
Sentiment score = 0.1717 , Review number 10 is positive

Total Reviews = 90
Total Very Positive Reviews = 0 Total Positive Reviews = 84
Total Very Negative Reviews = 0 Total Negative Reviews = 5
Total Neutral Reviews = 1

Overall Positive Reviews: 93.3333%
Overall Negative Reviews: 5.5556%
Overall Neutral Reviews: 1.1111%

Compare to the product’s star rating of 4.7 out of 5

Conclusion

Comparing the percentage of overall positive reviews arrived at by my algorithm with the product’s star rating from the Amazon website, as seen in the output above, makes me believe that my algorithm performed quite well. Of course, there are several ways in which its accuracy could be further improved. I could, for instance, include adverbs in the vocabulary used to validate the reviews and then use NLTK’s Bigram and Trigram CollocationFinder to account for the contextual sentiment contributed by adverbs; a rough sketch of that idea is shown below. In some cases, even though the title is positive, as in review number 3 above, the algorithm rated the contents as negative. This is probably because the reviewer used negative adjectives in the body of their review to discuss the previous version of the product in comparison with the one they are reviewing. I aim to incorporate more steps to improve the accuracy of this algorithm in the future.
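
The sketch below shows how such collocations could be pulled from a tokenized review; it is illustrative only and not part of the current algorithm.

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

#Illustrative only: finding the most strongly associated word pairs in the first review
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(alphalowercase_reviews[0])
print(finder.nbest(bigram_measures.pmi, 10))  #top ten bigrams ranked by pointwise mutual information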

I hope you learned something. Thanks for reading and stay tuned for more insightful Data Science and Software Engineering stuff!
