How to Scrape Amazon Reviews with Python [Code]

Jesal Maddy
Nov 3 · 4 min read

This is a post about how to use Python to scrape Amazon reviews for free. First, install the required packages from your command prompt/terminal:

'pip install requests lxml python-dateutil ipython pandas'

If that doesn't work, try installing each package individually, i.e. 'pip install requests', press Enter, then the next one. If that still doesn't work, do the same thing, but replace pip with 'python -m pip'. For example:

'python -m pip install requests'

If nothing on the command prompt confirms that the package you entered was installed, there's something wrong with your Python installation. I'd uninstall Python, restart the computer, and then reinstall it.
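Before debugging further, one quick sanity check (assuming the pip commands above succeeded) is to import every package in a single line; if any one of them is missing, Python names it in a ModuleNotFoundError:

```shell
python -c "import requests, lxml, dateutil, pandas; print('all packages OK')"
```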

Scripting your own solution to scraping Amazon reviews is one method that yields a reliable success rate and a limited margin for error, since the script will always do exactly what it is told, untethered by other factors. That said, certain scraping services such as Octoparse have built-in applications for this task in particular.

You may type the following script line by line into IPython. First, open your command prompt/terminal and navigate to a directory where you wish to have your scrapes saved. Do so by typing 'cd [PATH]' into the prompt, with the path written out directly (for example, 'cd C:/Users/me/Documents/amazon'). Then type 'ipython' into the command prompt and it should open an interactive session, like so:

Then, you can try copying and pasting this script, found here, into IPython.

The Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from lxml import html
from json import dump, loads
from requests import get
from re import sub
from dateutil import parser as dateparser
from time import sleep

def ParseReviews(asin):
    # NOTE: the review-page URL was stripped from the original post; the
    # standard Amazon product-reviews URL pattern is assumed here.
    amazon_url = 'https://www.amazon.com/product-reviews/' + asin
    # Add some recent user agent to prevent Amazon from blocking the request
    # Find some Chrome user agent strings here
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    for i in range(5):
        response = get(amazon_url, headers=headers, verify=False, timeout=30)
        if response.status_code == 404:
            return {"url": amazon_url, "error": "page not found"}
        if response.status_code != 200:
            # Retry on any other non-success status code
            continue

        # Removing the null bytes from the response.
        cleaned_response = response.text.replace('\x00', '')

        parser = html.fromstring(cleaned_response)
        XPATH_AGGREGATE = '//span[@id="acrCustomerReviewText"]'
        XPATH_REVIEW_SECTION_1 = '//div[contains(@id,"reviews-summary")]'
        XPATH_REVIEW_SECTION_2 = '//div[@data-hook="review"]'
        XPATH_AGGREGATE_RATING = '//table[@id="histogramTable"]//tr'
        XPATH_PRODUCT_NAME = '//h1//span[@id="productTitle"]//text()'
        XPATH_PRODUCT_PRICE = '//span[@id="priceblock_ourprice"]/text()'

        raw_product_price = parser.xpath(XPATH_PRODUCT_PRICE)
        raw_product_name = parser.xpath(XPATH_PRODUCT_NAME)
        total_ratings = parser.xpath(XPATH_AGGREGATE_RATING)
        reviews = parser.xpath(XPATH_REVIEW_SECTION_1)

        product_price = ''.join(raw_product_price).replace(',', '')
        product_name = ''.join(raw_product_name).strip()

        if not reviews:
            reviews = parser.xpath(XPATH_REVIEW_SECTION_2)
        ratings_dict = {}
        reviews_list = []

        # Grabbing the rating section in product page
        for ratings in total_ratings:
            extracted_rating = ratings.xpath('./td//a//text()')
            if extracted_rating:
                rating_key = extracted_rating[0]
                raw_rating_value = extracted_rating[1]
                rating_value = raw_rating_value
                if rating_key:
                    ratings_dict.update({rating_key: rating_value})

        # Parsing individual reviews
        for review in reviews:
            XPATH_RATING = './/i[@data-hook="review-star-rating"]//text()'
            XPATH_REVIEW_HEADER = './/a[@data-hook="review-title"]//text()'
            XPATH_REVIEW_POSTED_DATE = './/span[@data-hook="review-date"]//text()'
            XPATH_REVIEW_TEXT_1 = './/div[@data-hook="review-collapsed"]//text()'
            XPATH_REVIEW_TEXT_2 = './/div//span[@data-action="columnbalancing-showfullreview"]/@data-columnbalancing-showfullreview'
            XPATH_REVIEW_COMMENTS = './/span[@data-hook="review-comment"]//text()'
            XPATH_AUTHOR = './/span[contains(@class,"profile-name")]//text()'
            XPATH_REVIEW_TEXT_3 = './/div[contains(@id,"dpReviews")]/div/text()'

            raw_review_author = review.xpath(XPATH_AUTHOR)
            raw_review_rating = review.xpath(XPATH_RATING)
            raw_review_header = review.xpath(XPATH_REVIEW_HEADER)
            raw_review_posted_date = review.xpath(XPATH_REVIEW_POSTED_DATE)
            raw_review_text1 = review.xpath(XPATH_REVIEW_TEXT_1)
            raw_review_text2 = review.xpath(XPATH_REVIEW_TEXT_2)
            raw_review_text3 = review.xpath(XPATH_REVIEW_TEXT_3)

            # Cleaning data
            author = ' '.join(' '.join(raw_review_author).split())
            review_rating = ''.join(raw_review_rating).replace('out of 5 stars', '')
            review_header = ' '.join(' '.join(raw_review_header).split())

            try:
                review_posted_date = dateparser.parse(''.join(raw_review_posted_date)).strftime('%d %b %Y')
            except ValueError:
                review_posted_date = None
            review_text = ' '.join(' '.join(raw_review_text1).split())

            # Grabbing hidden comments if present
            if raw_review_text2:
                json_loaded_review_data = loads(raw_review_text2[0])
                json_loaded_review_data_text = json_loaded_review_data['rest']
                cleaned_json_loaded_review_data_text = sub('<.*?>', '', json_loaded_review_data_text)
                full_review_text = review_text + cleaned_json_loaded_review_data_text
            else:
                full_review_text = review_text
            if not raw_review_text1:
                full_review_text = ' '.join(' '.join(raw_review_text3).split())

            raw_review_comments = review.xpath(XPATH_REVIEW_COMMENTS)
            review_comments = ''.join(raw_review_comments)
            # Strip the letters, leaving just the comment count
            review_comments = sub('[A-Za-z]', '', review_comments).strip()
            review_dict = {
                'review_comment_count': review_comments,
                'review_text': full_review_text,
                'review_posted_date': review_posted_date,
                'review_header': review_header,
                'review_rating': review_rating,
                'review_author': author
            }
            reviews_list.append(review_dict)

        data = {
            'ratings': ratings_dict,
            'reviews': reviews_list,
            'url': amazon_url,
            'name': product_name,
            'price': product_price
        }
        return data

    return {"error": "failed to process the page", "url": amazon_url}

def ReadAsin():
    # Add your own ASINs here
    AsinList = ['B01ETPUQ6E', 'B017HW9DEW', 'B00U8KSIOM']
    extracted_data = []

    for asin in AsinList:
        print("Downloading and processing page for " + asin)
        extracted_data.append(ParseReviews(asin))
        sleep(5)  # pause between requests to avoid hammering the site
    f = open('data.json', 'w')
    dump(extracted_data, f, indent=4)
    f.close()

if __name__ == '__main__':
    ReadAsin()
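Once the script finishes, data.json holds one entry per ASIN with the nested structure built above. A standard-library sketch of flattening that output into a CSV of reviews, using hypothetical sample data in place of a real scrape (the field names mirror the review_dict in the script):

```python
import csv
import json  # needed if you load the real data.json instead of the sample below

# Hypothetical sample mirroring what ParseReviews() returns; in practice,
# replace this with: extracted_data = json.load(open('data.json'))
extracted_data = [
    {
        'name': 'Example Product',
        'url': 'https://www.amazon.com/product-reviews/B01ETPUQ6E',
        'price': '19.99',
        'ratings': {'5 star': '70%'},
        'reviews': [
            {'review_author': 'Jane', 'review_rating': '5.0 ',
             'review_header': 'Great', 'review_posted_date': '01 Nov 2019',
             'review_text': 'Works well.', 'review_comment_count': '0'},
        ],
    },
]

# Write one CSV row per review, prefixed with the product name
with open('reviews.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['product', 'author', 'rating', 'date', 'header', 'text'])
    for product in extracted_data:
        for review in product['reviews']:
            writer.writerow([product['name'], review['review_author'],
                             review['review_rating'].strip(),
                             review['review_posted_date'],
                             review['review_header'], review['review_text']])
```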

It appears to be plug and play, except for where the user must enter the specifics of which products they want to scrape reviews from. Be sure to read all lines that begin with #, because those are comments that will instruct you on what to do. For example, when it says,

'# Find some Chrome user agent strings here'

it's advised to follow those instructions in order to get the script to work. Also, notice near the bottom where the script defines an ASIN list and tells you to create your own. Find the ASIN of each product you intend to crawl (it appears in the product's Amazon URL), and paste each one into this list, following the same formatting. With this and everything else in place, the script should work as explained.
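The ASIN is the 10-character code embedded in each product's URL, so it can also be pulled out programmatically. A hypothetical helper for this (the function name and regex are my own, not part of the original script):

```python
from re import search

def asin_from_url(url):
    # Amazon product URLs carry the ASIN after "/dp/" or "/gp/product/"
    match = search(r'/(?:dp|gp/product)/([A-Z0-9]{10})', url)
    return match.group(1) if match else None

print(asin_from_url('https://www.amazon.com/dp/B01ETPUQ6E'))  # B01ETPUQ6E
```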

