Scraping data from Amazon
3 min read · Dec 8, 2021

The technologies I use:
- Scrapy
- Flask
- Crochet
Scrapy is a fast, open-source web crawling framework written in Python, used to extract data from web pages with the help of XPath-based selectors.
- First, go into your terminal, activate your virtual environment, and if you haven’t installed Scrapy yet, install it with:
pip install scrapy
- From your terminal, go into the directory where you want to start your project and run the following (note: tutorial is the name of the project we’re creating):
scrapy startproject tutorial
- Now create an “amazon_scraping.py” file in the spiders directory.
Writing Code
Open the “amazon_scraping.py” file you just created and let’s start coding.
- First, import these essential libraries,
- Create a Python class defining all the variables that we want to scrape.
- Create the main spider class that Scrapy will use to scrape the data.
- In the same class, define a function that scrapes the link you mentioned above to find the “all reviews” link on the Amazon product page.
- Now Scrapy is on the “all reviews” page of Amazon, so we will write a function that scrapes that page for all the above-mentioned items and stores them in a JSON file.
That’s it for the Scrapy part!
Now we will integrate the Scrapy code with Flask and build a web form so that, at the click of a button, the entire Scrapy code runs and returns the scraped data.
Creating an HTML Form
- Now go into the app.py file and import the following libraries.
- Note: if you haven’t installed these libraries yet, go into your terminal and install them with pip:
pip install crochet
pip install flask
pip install scrapy
- Now we will define the basic Flask structure, get the link from the form, and store it in the myBaseUrl variable.
- After getting the link, we will pass it to the scraping function described in the next step, store the scraped data in the output_data list, and send that data back to the HTML template.
- Now we will define two functions that keep the code running until the entire scraping process is complete.
- First, the scrape_with_crochet function connects to the dispatcher, which helps maintain that loop.
- The code then goes to Scrapy’s built-in crawl runner; with each item the spider yields, control passes to the crawl_result function, which appends that item to the output_data list.
Then, create a new HTML template to display the data.
Finally, run the code:
python main.py
Conclusion
- So with this, we have integrated our entire Scrapy code with Flask so that, at the click of a button, all the product review data gets scraped and stored in a JSON file.
- Note that whenever you scrape a link, its data gets stored in a list and displayed; but if you submit the same link again, the entire scraping procedure runs from scratch, which is very inefficient.
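One simple way to address that inefficiency (an idea of mine, not part of the article) is to cache results per URL, so a repeated submission skips the crawl entirely; `scraper` below stands in for whatever function actually runs the Scrapy crawl:

```python
scrape_cache = {}


def get_reviews(url, scraper):
    # Run the crawl only on the first request for a given URL;
    # later requests for the same URL are served from the cache.
    if url not in scrape_cache:
        scrape_cache[url] = scraper(url)
    return scrape_cache[url]
```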