Scraping data from Amazon

The technologies I use:


  1. Scrapy
  2. Flask
  3. Crochet

Scrapy is a fast, open-source web-crawling framework written in Python, used to extract data from web pages with the help of XPath-based selectors.
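For example, an XPath selector in Scrapy looks like this (a generic illustration, not code from this project):

response.xpath('//h1/text()').get()  # extract the text of the first <h1> on the page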

  • First, go into your terminal, activate your virtual environment, and if you haven’t installed Scrapy yet, install it with:
pip install scrapy
  • Then, from your terminal, go into the directory where you want to start your project and run the following (note: tutorial is the name of the project we’re creating; the resulting layout is shown after this list):
scrapy startproject tutorial
  • Finally, create an “amazon_scraping.py” file in the spiders directory.
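After running scrapy startproject tutorial, the generated project looks like this (amazon_scraping.py is the file we create by hand):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            amazon_scraping.py   # <- create this file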

Writing Code

Open the “amazon_scraping.py” file you just created and let’s start coding.

  • First, import the essential libraries.
  • Create a Python item class defining all the variables that we want to scrape.
  • Create the spider class that Scrapy will use to scrape the data.
  • In the same class, define a parse function that scrapes the link you pass in to find the “all reviews” link on the Amazon product page.
  • Once Scrapy is on the “all reviews” page, a second function scrapes it for all the above-mentioned variables and stores them in a JSON file. A sketch of the whole file follows this list.
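A minimal sketch of “amazon_scraping.py”, assuming typical review fields. The field names and XPath expressions here are illustrative guesses (Amazon’s markup changes frequently), not guaranteed selectors, and the FEEDS setting needs Scrapy 2.1+:

# amazon_scraping.py -- a minimal sketch; field names and XPath
# expressions are assumptions and will need adjusting to Amazon's
# current markup.
import scrapy


class AmazonReviewItem(scrapy.Item):
    # The variables we want to scrape, one Field per value.
    reviewer = scrapy.Field()
    rating = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()


class AmazonReviewSpider(scrapy.Spider):
    name = "amazon_reviews"
    custom_settings = {
        # Feed export: store every yielded item in a JSON file.
        "FEEDS": {"reviews.json": {"format": "json"}},
    }

    def __init__(self, url="", **kwargs):
        # The product-page URL is passed in; later it will come
        # from the Flask form.
        super().__init__(**kwargs)
        self.start_urls = [url]

    def parse(self, response):
        # Find the "see all reviews" link on the product page and
        # follow it to the reviews page.
        all_reviews = response.xpath(
            '//a[@data-hook="see-all-reviews-link-foot"]/@href').get()
        if all_reviews:
            yield response.follow(all_reviews, callback=self.parse_reviews)

    def parse_reviews(self, response):
        # Yield one item per review block on the "all reviews" page.
        for review in response.xpath('//div[@data-hook="review"]'):
            yield AmazonReviewItem(
                reviewer=review.xpath(
                    './/span[@class="a-profile-name"]/text()').get(),
                rating=review.xpath(
                    './/i[@data-hook="review-star-rating"]//text()').get(),
                title=review.xpath(
                    './/a[@data-hook="review-title"]/span/text()').get(),
                text=review.xpath(
                    './/span[@data-hook="review-body"]/span/text()').get(),
            )

You can test the spider on its own before wiring it to Flask:

scrapy crawl amazon_reviews -a url="https://www.amazon.com/dp/<some-product>"

Keep in mind that Amazon blocks aggressive bots, so throttling and a realistic user agent may be needed.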

That’s it for the Scrapy part!

Next, we will integrate the Scrapy code with Flask and build a web form, so that at the click of a button the entire Scrapy code runs and returns the scraped data to us.

Creating an HTML Form

  • Now go into the app.py file and import the following libraries.
  • Note: if you haven’t installed these libraries yet, install them first from your terminal using pip:
pip install crochet
pip install flask
pip install scrapy
  • Now we will define the basic Flask structure, get the link from the form, and store it in the myBaseUrl variable.
  • After getting the link, we pass it to the scraping function described in the next step, store the scraped data in the output_data list, and return that data to the HTML page.
  • Then we define two functions that keep the event loop running until the entire scraping process is complete (a sketch of app.py follows this list).
  • First, the scrape_with_crochet function connects to the dispatcher, which helps maintain that loop.
  • The code then hands off to Scrapy’s built-in CrawlerRunner; with each item the spider yields, control passes to the crawl_result function, which appends that item to the output_data list.
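A minimal sketch of “app.py” implementing the steps above. The inline template, the timeout value, and the spider’s import path are assumptions of this sketch, not fixed parts of the project:

# app.py -- a minimal sketch of the Flask + crochet glue.
from flask import Flask, request, render_template_string
import crochet
crochet.setup()  # must run before Twisted's reactor is touched

from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher

# Assumed import path: the spider file created inside the Scrapy project.
from tutorial.spiders.amazon_scraping import AmazonReviewSpider

app = Flask(__name__)
crawl_runner = CrawlerRunner()
output_data = []

# Inline template with the form and a dump of the results; a separate
# HTML file would work the same way.
PAGE = """
<form method="POST">
  <input type="text" name="url" placeholder="Amazon product URL" size="60">
  <button type="submit">Scrape</button>
</form>
{% if data %}<pre>{{ data }}</pre>{% endif %}
"""


@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        myBaseUrl = request.form["url"]  # the link from the FORM
        output_data.clear()
        # Block this request until the crawl finishes (timeout is arbitrary).
        scrape_with_crochet(myBaseUrl).wait(timeout=120)
        return render_template_string(PAGE, data=output_data)
    return render_template_string(PAGE, data=None)


@crochet.run_in_reactor
def scrape_with_crochet(base_url):
    # Connect to the dispatcher so every item the spider yields is
    # routed to crawl_result, then start the crawl in the reactor.
    dispatcher.connect(crawl_result, signal=signals.item_scraped)
    return crawl_runner.crawl(AmazonReviewSpider, url=base_url)


def crawl_result(item, response, spider):
    # Called once per yielded item; append it to the output list.
    output_data.append(dict(item))


if __name__ == "__main__":
    app.run(debug=True)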

Then, create a new HTML page to show the data; the sketch above inlines a minimal template for brevity.

Finally, run the app:

python app.py

Conclusion

  • With this, we have integrated the entire Scrapy code with Flask, so that at the click of a button the product-review data gets scraped and stored in a JSON file.
  • Note, however, that whenever you scrape a link its data gets stored in a list and displayed, but if you submit the same link again, the entire scraping procedure runs again, which is very inefficient.
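A simple, hypothetical fix (not part of the original project) would be an in-memory cache in app.py keyed by the submitted URL, so repeated submissions reuse the earlier result:

# Hypothetical improvement: reuse earlier results instead of re-scraping.
scrape_cache = {}

def get_reviews(url):
    if url not in scrape_cache:
        output_data.clear()
        scrape_with_crochet(url).wait(timeout=120)
        scrape_cache[url] = list(output_data)
    return scrape_cache[url]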
