Scraping data from Amazon
3 min read · Dec 8, 2021

The technologies I use:
- Scrapy
- Flask
- Crochet
Scrapy is a fast, open-source web crawling framework written in Python, used to extract data from web pages with the help of XPath-based selectors.
- First, go into your terminal, activate your virtual environment, and if you haven’t installed Scrapy yet, install it with:
pip install scrapy
- From your terminal, go into the directory where you want to start your project and run the following (note: tutorial is the name of the project we’re creating):
scrapy startproject tutorial
- Now create an “amazon_scraping.py” file in the spiders directory.
Writing Code
Open the “amazon_scraping.py” file you just created and let’s start coding.
- First, import these essential libraries,
- Create a Python class defining all the variables that we want to scrape.
- Create the main spider class that Scrapy will use to scrape the data.
- In the same class, define a function that scrapes the link you mentioned above to find the “all reviews” link on the Amazon product page.
- Now Scrapy is on the “all reviews” page of Amazon, so we will write a function that scrapes that page for all the above-mentioned items and stores them in a JSON file.
That’s it for the Scrapy part!
Now we will integrate the Scrapy code with Flask and build a web form so that, at the click of a button, the entire Scrapy code runs and returns the scraped data.
Creating an HTML Form
- Now go into the app.py file and import the following libraries.
- Note: if you haven’t installed these libraries yet, go into your terminal and install them with pip:
pip install crochet
pip install flask
pip install scrapy
- Now we will define the basic Flask structure, get the link from the form, and store it in the myBaseUrl variable.
- After getting the link, we will pass it to the scraping function described in the next step, store the scraped data in the output_data list, and send that data back to the HTML template.
- Now we will define two functions that keep the code running until the entire scraping process is complete.
- First, the scrape_with_crochet function connects to the dispatcher, which helps maintain that loop.
- The code then goes to Scrapy’s built-in crawl runner; with each item the spider yields, control passes to the crawl_result function, which appends that item to the output_data list.
Then, create a new HTML template to display the data.
Finally, run the code:
python main.py
Conclusion
- So with this, we have integrated our entire Scrapy code with Flask so that, at the click of a button, all the product review data gets scraped and stored in a JSON file.
- Note that whenever you scrape a link, its data gets stored in a list and displayed; but if you submit the same link again, the entire scraping procedure runs from scratch, which is very inefficient.
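One simple way to address that inefficiency (an idea of mine, not part of the article) is to cache results per URL, so a repeated submission skips the crawl entirely; `scraper` below stands in for whatever function actually runs the Scrapy crawl:

```python
scrape_cache = {}


def get_reviews(url, scraper):
    # Run the crawl only on the first request for a given URL;
    # later requests for the same URL are served from the cache.
    if url not in scrape_cache:
        scrape_cache[url] = scraper(url)
    return scrape_cache[url]
```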