Scrapy Tutorial — Part 1
Beginner tutorial on how to scrape any website using Scrapy
What is Scrapy?
Scrapy is a Python framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications like data mining, information processing, or historical archival.
I have divided the tutorial into 5 different parts to give you a deep understanding of what the Scrapy framework is and how we can crawl any website with it.
Part 1: Environment setup and writing your first spider
Part 2: How to create a Scrapy project and extract data from the web
Part 3: How to crawl multiple pages using the Scrapy framework
Part 4: How to store the crawled data in a database
Part 5: How to deploy a Scrapy spider to production
For this tutorial series we are going to crawl:
- Quotes
- Author Name
from the famous website https://quotes.toscrape.com/
Installing Scrapy
If you are using Anaconda:
conda install -c conda-forge scrapy
If you are using pip, the Python package manager:
pip install scrapy
Now that we have installed all the dependencies, let's write some code. Here's the code for a spider that scrapes famous quotes from the website https://quotes.toscrape.com
Put this in a text file and name it something like first_spider.py
Then run the spider using the runspider command:
scrapy runspider first_spider.py -o quotes.json
If successful, you should see Scrapy's crawl log scroll by in the terminal. When this finishes, the quotes.json file will contain a list of the quotes in JSON format, each with the quote text and author, looking like this:
{"quote": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein"},{"quote": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling"},
That's it! You have built your first spider and crawled data from the web.
But what actually happened when you ran the runspider command? Let me explain.
- When you run the runspider command, Scrapy looks for the start_urls list in the spider class.
- It assumes the crawl should start from start_urls, so it sends a web request to each URL in the start_urls list and fetches the entire HTML response; in our case the URL is `https://quotes.toscrape.com/`. The response is nothing but the entire HTML content of the webpage. You don't have to worry about hitting the website yourself to get the response; Scrapy does that for us in an optimised way. All you have to do is specify the URLs you want to crawl in start_urls. How cool is that?
- This response is passed as an argument to the default callback method (the method that extracts your data from the HTML content). The default callback method is def parse().
- In the parse callback method, we loop through the quote elements using an XPath selector and yield a Python dict with the extracted quote text and author.
Don't worry if XPath does not make sense yet; I did not understand it my very first time either. We will get better.
In the next post, I will explain how to create a Scrapy project and extract data from a webpage.
Happy scraping!!! 🕷
Please leave a comment if you face any issues