Text Mining: How to extract Amazon Reviews using Scrapy

Bharati Hitesh Koli
Published in CodeX · 6 min read · Aug 21, 2021

Ever wondered how much easier life would be if there were a way to know how well your product performs and how people feel about it? The solution: text mining techniques.

First, let's understand what NLP and Text Mining are.

NLP stands for Natural Language Processing, which deals with both text and speech. The linguistic analysis of text is called Text Mining: it uses computational methods and techniques to extract high-quality, structured information from the unstructured text found in books, financial reports, news articles, social media messages, Wikipedia, and so on.

In today's era, the internet is flooded with voluminous data, of which roughly 80% is estimated to be unstructured. Interpreting unstructured content is generally easy for people but very hard for a machine or computer program. The reasons: it is full of ambiguous, fuzzy, and probabilistic terms and phrases; it often relies strongly on common sense, knowledge, and reasoning; and it sometimes involves sarcasm. The data therefore needs to be converted into a structured form before we can analyse it and generate meaningful insights.

There are numerous techniques available to perform Text Mining; the purpose of this article is to learn how to gather the raw text using Scrapy.

Now, let's learn how to extract product reviews from Amazon using Scrapy.

To extract reviews about a product sold on Amazon, follow the steps below:

Step 1:

If you are using conda, you can install Scrapy from conda-forge with the following command:

conda install -c conda-forge scrapy

If you are not using conda, you can install it with pip instead (the leading "!" is only needed inside a Jupyter notebook):

pip install scrapy

Step 2:

After installing Scrapy, open a command prompt (the Anaconda Prompt, if you installed via conda).

To create a Scrapy project, run the following command:

scrapy startproject Scrape_AmazonReviews

Once you have created the project, you will find a “Scrape_AmazonReviews” directory wherever you ran the command. It contains two things: a folder holding your Scrapy code, and the Scrapy configuration file (scrapy.cfg). The configuration file helps in running and deploying the Scrapy project on a server.

Step 3:

Once we have the project in place, we need to create a spider. A spider is a chunk of Python code that determines how a web page will be scraped. It is the main component that crawls web pages and extracts content from them. In our case, this is the code that will visit Amazon and scrape the reviews. To create a spider, type the following command in the same command prompt:

scrapy genspider amazon_review https://www.amazon.in/

This will create a Python file named “amazon_review.py”. If you ran the command outside the project folder, move the file into the “Scrape_AmazonReviews\Scrape_AmazonReviews\spiders” folder.

The “amazon_review.py” file contains the boilerplate spider code below:

Step 4:

The spider lives in the “spiders” folder inside the project directory. Once you go into the “Scrape_AmazonReviews” project folder, you will see a directory structure like the one below.
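A freshly generated project typically looks like this (exact file names can vary slightly across Scrapy versions):

```
Scrape_AmazonReviews/
├── scrapy.cfg            # deploy/configuration file
└── Scrape_AmazonReviews/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── amazon_review.py
```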

Scrapy files description:

Let us understand the “Scrape_AmazonReviews” project structure and its supporting files in a bit more detail. The main files inside a Scrapy project directory include:

items.py

Items are containers that will be loaded with the scraped data.

middlewares.py

The spider middleware is a framework of hooks into Scrapy’s spider processing mechanism where you can plug custom functionality to process the responses that are sent to Spiders for processing and to handle the requests and items that are generated from spiders.
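As a sketch, a spider middleware is just a class exposing hook methods; the hypothetical one below counts every response handed to the spider:

```python
# Hypothetical spider middleware: counts responses passed to the spider.
# It would be enabled via the SPIDER_MIDDLEWARES setting in settings.py.
class ResponseCountMiddleware:
    def __init__(self):
        self.seen = 0

    def process_spider_input(self, response, spider):
        # Called for each response before it reaches the spider's callback
        self.seen += 1
        return None  # returning None lets processing continue normally
```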

pipelines.py

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially. Each item pipeline component is a Python class.
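A minimal, hypothetical pipeline for our reviews might just strip stray whitespace from each scraped field:

```python
# Hypothetical item pipeline: trims whitespace from scraped string fields.
# It would be enabled via the ITEM_PIPELINES setting in settings.py.
class CleanReviewPipeline:
    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item
```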

settings.py

It allows one to customize the behaviour of all Scrapy components, including the core, extensions, pipelines and spiders themselves.
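An excerpt of typical settings one might adjust for a crawl like this (the values here are illustrative, not from the generated file):

```python
# settings.py (excerpt) -- illustrative values
BOT_NAME = "Scrape_AmazonReviews"
ROBOTSTXT_OBEY = True       # respect the site's robots.txt
DOWNLOAD_DELAY = 1          # pause between requests to avoid hammering the server
FEED_EXPORT_ENCODING = "utf-8"
```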

spiders folder

The “spiders” directory contains all the spiders/crawlers as Python classes. Whenever you run/crawl a spider, Scrapy looks into this directory and finds the spider by the name you provide. Spiders define how a certain site (or a group of sites) will be scraped, including how to perform the crawl and how to extract data from the pages.

Step 5:

Analyzing the HTML structure of the web page:

For your understanding, let's take the example of the following product available on Amazon: https://www.amazon.in/product-reviews/9387779262/ref=cm_cr_getr_d_paging_btm_prev_1?ie=UTF8&pageNumber= (we will extract the product reviews from this Amazon web page).

Now, before we actually start writing spider implementation in python for scraping Amazon reviews, we need to identify patterns in the target web page.

Below is the page we are trying to scrape which contains different reviews about the product ‘My First Library: Boxset of 10 Board Books for Kids’ on Amazon.

We start by opening the web page with the browser's inspect-element feature, which shows the page's HTML code. After a little exploration, I found the following HTML structure, which renders the reviews on the page.

On the reviews page, there is a division with the id “cm_cr-review_list”. This division has multiple sub-divisions in which the review content resides. We plan to extract both the star rating and the review text from the page. Upon further inspection, we can see that every review sub-division is itself divided into multiple blocks.

One of these blocks contains the star rating we need, and another contains the review text. Looking more closely, we can see that the star-rating division carries the class attribute “review-rating”, while the review text carries the class “review-text”. All we need to do now is pick these patterns up with our Scrapy parser.

Step 6:

Next, we define a parse function that fires whenever our spider visits a new page. In the parse function, we specify the patterns in the targeted page structure; the spider then looks for these patterns and extracts the matching content from the web page.

Below is a code sample of a Scrapy parser for scraping Amazon reviews. Let's name the file “extract_reviews.py” and save it in the “Scrape_AmazonReviews\Scrape_AmazonReviews\spiders” folder.

Step 7:

Finally, we have successfully built our spider. The only task left is to run it, which we do with the runspider command. It takes as input the spider file to run and the output file in which to store the collected results. In the case below, the spider file is extract_reviews.py and the output file is extract_reviews.csv.

To run this, open cmd prompt and type below command:

scrapy runspider Scrape_AmazonReviews\Scrape_AmazonReviews\spiders\extract_reviews.py -o extract_reviews.csv

The extracted “extract_reviews.csv” file is saved to the current working directory.

Step 8:

The extracted reviews file is ready to use and can be opened with Python.
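For instance, with pandas (the CSV written below is a tiny hand-made stand-in for the spider's real output):

```python
import pandas as pd

# Tiny stand-in for the CSV produced by the runspider command above
with open("extract_reviews.csv", "w") as f:
    f.write("stars,comment\n"
            "5.0 out of 5 stars,Great book set for kids\n"
            "4.0 out of 5 stars,Good print quality\n")

df = pd.read_csv("extract_reviews.csv")
print(df.shape)        # (2, 2)
print(df["stars"][0])  # 5.0 out of 5 stars
```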


Step 9:

Testing our code on different product reviews: let's check whether it works for another product, say a Bosch front-loading washing machine, with the following link: https://www.amazon.in/Bosch-Inverter-Control-Automatic-Loading/product-reviews/B08SR372S7/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=

Replace the web link in the previous “extract_reviews.py” with this one, save the file as “extract_reviews_test2.py”, and run:

scrapy runspider Scrape_AmazonReviews\Scrape_AmazonReviews\spiders\extract_reviews_test2.py -o extract_reviews_test2.csv

Then read the output using pandas, as before.

Yeay! It works!

Now your structured data file is ready for NLP (linguistic analysis) using Machine Learning or Artificial Intelligence algorithms.

P.S. In coming articles I shall write more about performing NLP Text Mining, including text preprocessing, feature extraction, named entity recognition, and emotion mining or sentiment analysis of Amazon product reviews. So do follow my posts on Medium 😃 Happy Learning!

Please follow me on GitHub, where I have 170+ such repositories.

Also, do let me know what you think about this article.
