Scrapy Tutorial — Part 2

Jebaseelan Ravi
6 min read · Apr 16, 2022


A step-by-step guide to creating a Scrapy project and extracting data


PART 1, PART 2, PART 3, PART 4, PART 5

This is part 2 of the Scrapy tutorial. If you have not read part 1, please visit it to learn how Scrapy works and how to set up the environment.

In the last tutorial, we learnt how to create a simple Scrapy spider (a simple Python module). In this tutorial, we will learn:

  1. How to create a scrapy project?
  2. How to write a spider to crawl a site and extract the data from it?

Why do we need a Scrapy project when we can create a simple Python file and extract the data as we did in part 1? This is actually a good question. The reason is that a Scrapy project offers a lot of functionality, such as post-processing of the data, deduplication, etc. We will see these in detail in the next tutorials.

Creating a Scrapy Project

TL;DR: The GitHub repo for the tutorial is here

Make sure you have Scrapy installed:

$ scrapy version
2.6.1

If not, install Scrapy via:

$ pip install scrapy

Now we can create a project:

$ scrapy startproject quotesspider

This will create the following structure:

Scrapy Project structure
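The layout below is the standard skeleton that scrapy startproject generates:

quotesspider/
    scrapy.cfg            # deploy configuration file
    quotesspider/         # project's Python module
        __init__.py
        items.py          # items definition file
        middlewares.py    # project middlewares
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # folder where you will put your spiders
            __init__.py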

You don’t have to worry about anything right now except the spiders folder, where you will be putting your spider code. Let’s move on.

First Spider in our Scrapy Project

Spiders are Python classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make and how to parse the downloaded page content to extract data.

Put the following code in quotesspider/spiders/quotes_spiders.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        with open('quotes.html', 'wb') as f:
            f.write(response.body)
        self.log('Saved file quotes.html')

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

  • name — spider name (unique to each spider class)

  • start_urls — the spider will begin the crawl from these URLs. It can be a list of URLs.

  • parse — the default callback method for the response. The HTML response (the entire webpage) for each of the start_urls will be downloaded internally by Scrapy and passed as an argument to this method.

How to execute the spider

In order to execute the spider, you must be inside the project directory:

$ cd quotesspider
# scrapy crawl <spider_name>
$ scrapy crawl quotes

In the example output (the Scrapy logs), you can see that the spider is opened, starts to crawl, and creates a new file quotes.html in the current directory. Basically, what we have done is download the HTML into a file. Now let us shift gears and see how to extract data using Scrapy.

How to extract the data?

The best way to learn how to extract data with Scrapy is by trying selectors in the Scrapy shell.

The Scrapy shell is an interactive shell where you can try and debug your scraping code very quickly, without having to run the spider. It’s meant for testing data extraction code, but you can actually use it for testing any kind of code, as it is also a regular Python shell.

Run

$ scrapy shell 'https://quotes.toscrape.com'

Using the shell, you can learn and debug how to extract the data:

>>> response
<200 https://quotes.toscrape.com/>

Try printing the entire content of the webpage

>>> response.text

You can try selecting any element using XPath:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'

What is XPath?

XPath stands for XML Path. It is a syntax, or language, for finding any element on a web page using an XML path expression; it locates elements using the HTML DOM structure. A typical XPath expression to extract an element from a webpage is:

//element[@attr_key="attr_value"]
//element[@attr_key="attr_value"]/text()  # if you want the text

For example, consider the following HTML code:

<html>
<span class="test">Hello</span>
</html>

If you want to extract Hello from the above HTML, the XPath would be:

# format: //element[@attr_key="attr_value"]/text()
//span[@class="test"]/text()

We won’t cover much of XPath here, but you can read more about using XPath with Scrapy Selectors here.

Extracting the data — quotes and author name

Each quote in https://quotes.toscrape.com is represented by HTML elements that look like this:

<div class="quote">
<span class="text">“The world as we have created it is a process of our
thinking. It cannot be changed without changing our thinking.”</span>
<span>
by <small class="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>

You can find this by hovering over any quote, then right-clicking and selecting Inspect.


Let’s open up scrapy shell and play a bit to find out how to extract the data we want:

$ scrapy shell 'https://quotes.toscrape.com'

First we need to select the entire quote element; then we can extract the quote and the author name.

The XPath expression to select the quote elements:

>>> response.xpath('//div[@class="quote"]')
[<Selector xpath='//div[@class="quote"]' data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath='//div[@class="quote"]' data='<div class="quote" itemscope itemtype...'>,
 ...]

The result of running response.xpath('//div[@class="quote"]') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our XPath selectors directly on a particular quote:

>>> quote = response.xpath('//div[@class="quote"]')[0]
>>> quote
<Selector xpath='//div[@class="quote"]' data='<div class="quote" itemscope itemtype...'>

Now, let’s extract the text and the author from that quote using the quote object we just created:

>>> quote.xpath('span/text()').get()  # xpath to get the quote text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> quote.xpath('span/small/text()').get()  # xpath to get the author
'Albert Einstein'

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary, as shown below.

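A quick sketch of that loop in the shell, reusing the XPaths from above (output trimmed to the first quote):

>>> for quote in response.xpath('//div[@class="quote"]'):
...     text = quote.xpath('span[@class="text"]/text()').get()
...     author = quote.xpath('span/small[@class="author"]/text()').get()
...     print({'text': text, 'author': author})
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein'}
...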

Full code for our spider

Let’s integrate the above logic into our quotesspider/spiders/quotes_spiders.py file. Please update the file to the following code:
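A minimal version of the spider, putting the shell experiments above into the parse method and yielding one dictionary per quote:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # loop over every quote block and yield the extracted fields
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('span[@class="text"]/text()').get(),
                'author': quote.xpath('span/small[@class="author"]/text()').get(),
            }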

Run the spider

$ scrapy crawl quotes -o quotes.json

If you run this spider, it will output the extracted data in the log and store it in quotes.json:
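The first entry of quotes.json should look roughly like this (Scrapy escapes non-ASCII characters by default):

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein"},
...
]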

In this tutorial, we crawled data from the first page only. In the next tutorial, we will see how to crawl multiple pages using Scrapy.

Happy Scraping!! 🕷

Please leave a comment if you face any issues.

PART 1, PART 2, PART 3, PART 4, PART 5
