Scrapy Tutorial — Part 2
A step-by-step guide to creating a Scrapy project and extracting data
PART 1, PART 2, PART 3, PART 4, PART 5
This is part 2 of the Scrapy tutorial. If you have not read part 1, please visit it first to learn how Scrapy works and how to set up the environment.
In the last tutorial we learnt how to create a simple Scrapy spider (a simple Python module). In this tutorial we will learn:
- How to create a scrapy project?
- How to write a spider to crawl a site and extract the data from it?
Why do we need a Scrapy project when we can create a simple Python file and extract the data as we did in part 1? This is a good question. The reason is that a Scrapy project offers a lot of functionality, such as post-processing of the data, deduplication, etc. We will see these in detail in the next tutorials.
Creating a Scrapy Project
TL;DR: The GitHub repo for the tutorial is here
Make sure you have Scrapy installed:
$ scrapy version
2.6.1
If not, install Scrapy via
$ pip install scrapy
Now we can create a project
$ scrapy startproject quotesspider
This will create the following structure.
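With recent Scrapy versions, the generated layout typically looks like the tree below (file names may vary slightly between versions):

```
quotesspider/
    scrapy.cfg            # deploy configuration file
    quotesspider/         # project's Python module
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory where you'll put your spiders
            __init__.py
```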
You don’t have to worry about anything right now except the spiders
folder, where you will put your spider code. Let’s move on.
First Spider in Our Project
Spiders are python classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider
and define the initial requests to make and how to parse the downloaded page content to extract data.
Put the following code in quotesspider/spiders/quotes_spiders.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # save the downloaded page to a file
        with open('quotes.html', 'wb') as f:
            f.write(response.body)
        self.log('Saved file quotes.html')
As you can see, our Spider subclasses scrapy.Spider
and defines some attributes and methods:
name
— the spider’s name (unique to each spider class)
start_urls
— the spider will begin the crawl from these URLs. It can be a list of URLs
parse
— the default callback method for responses. The HTML response (the entire webpage) for each URL in start_urls
is downloaded internally by Scrapy and passed as an argument to this method.
How to execute the spider
In order to execute the spider you must be inside the project dir
$ cd quotesspider
# scrapy crawl <spider_name>
$ scrapy crawl quotes
Example output
You can see that the spider opened, started to crawl, and created a new file quotes.html
in the current directory. Basically, what we have done is download the HTML into a file. Now let us shift gears and see how to extract data using Scrapy.
How to extract the data?
The best way to learn how to extract data with Scrapy is trying selectors using the Scrapy shell.
The Scrapy shell is an interactive shell where you can try and debug your scraping code very quickly, without having to run the spider. It’s meant to be used for testing data extraction code, but you can actually use it for testing any kind of code, as it is also a regular Python shell.
Run
$ scrapy shell 'https://quotes.toscrape.com'
Using the shell you can learn and debug how to extract the data
>>> response
<200 https://quotes.toscrape.com/>
Try printing the entire content of the webpage
>>> response.text
You can try selecting any elements using Xpath
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
What is XPath?
XPath stands for XML Path Language. It is a query language for finding any element on a web page using path expressions over the HTML DOM structure. A typical XPath
expression to extract an element from a webpage looks like
//element[@attr_key="attr_value"]
//element[@attr_key="attr_value"]/text() # if you want the text
For example, consider the following HTML code
<html>
<span class="test">Hello</span>
</html>
If you want to extract Hello
from the above HTML, the `xpath` would be
# //element[@attr_key="attr_value"]/text() -> format
//span[@class="test"]/text()
We won’t cover much of XPath here, but you can read more about using XPath with Scrapy Selectors here.
Extracting the data — quotes and author name
Each quote in https://quotes.toscrape.com is represented by HTML elements that look like this:
<div class="quote">
<span class="text">“The world as we have created it is a process of our
thinking. It cannot be changed without changing our thinking.”</span>
<span>
by <small class="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
You can find this by hovering over any quote, then right-click -> Inspect.
Let’s open up scrapy shell and play a bit to find out how to extract the data we want:
$ scrapy shell 'https://quotes.toscrape.com'
First we need to select the entire quote
element; then we can extract the quote text and the author name. The XPath expression to select the quote
element:
>>> response.xpath('//div[@class="quote"]')
The result of running response.xpath('//div[@class="quote"]')
is a list-like object called SelectorList
, which represents a list of Selector
objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.
Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our Xpath selectors directly on a particular quote:
>>> quote = response.xpath('//div[@class="quote"]')[0]
>>> quote
<Selector xpath='//div[@class="quote"]' data='<div class="quote" itemscope itemtype...'>
Now, let’s extract the text
and author
from that quote using the quote
object we just created:
>>> quote.xpath('span/text()').get()  # xpath to get the quote text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> quote.xpath('span/small/text()').get()  # xpath to get the author
'Albert Einstein'
Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary.
Full code for our spider
Integrate the above logic into the quotesspider/spiders/quotes_spiders.py
file by updating it to the following code.
Run the spider
$ scrapy crawl quotes -o quotes.json
If you run this spider, it will output the extracted data in the log and store it in quotes.json.
In this tutorial we crawled data from the first page alone. In the next tutorial, we will see how to crawl multiple pages using Scrapy.
Happy Scraping!! 🕷
Please leave a comment if you face any issues.