Scrapy Tutorial — Part 3

Jebaseelan Ravi
Apr 16, 2022 · 3 min read


How to crawl multiple pages using Scrapy?


PART 1, PART 2, PART 3, PART 4, PART 5

In the previous blog, we created a Scrapy spider to crawl quotes and author names from our favorite website, https://quotes.toscrape.com/

So far we have only scraped the first page of https://quotes.toscrape.com. What if you want quotes from all the pages on the website, such as:

https://quotes.toscrape.com/page/2/
https://quotes.toscrape.com/page/3/
https://quotes.toscrape.com/page/4/

Now that you know how to extract data from a single page (from our previous blog), let’s see how to follow links to the next page and crawl data recursively.

The first thing to do is extract the link to the next page we want to follow. Examining our page, we can see there is a link to the next page (bottom right) with the following markup:

Right-click on Next in the webpage and click Inspect. The HTML for the next-page link looks like this:


<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

$ scrapy shell 'https://quotes.toscrape.com/'

You should see the Scrapy shell startup log. Then try extracting the next-page link:

>>> response.xpath('//li[@class="next"]/a/@href').get()
'/page/2/'
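
The same link can also be grabbed with a CSS selector instead of XPath; both return the relative URL:

>>> response.css('li.next a::attr(href)').get()
'/page/2/'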

Now let’s see our spider, modified to recursively follow the link to the next page and extract data from it:

Spider code to crawl next pages
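
A minimal sketch of what that spider looks like, assuming the same text/author selectors from the previous part (your field names may differ):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote and its author from the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Look for the link to the next page and follow it recursively
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page is not None:
            # The href is relative (e.g. '/page/2/'), so build an absolute URL
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)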

Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative), and yields a new request to the next page, registering itself as the callback to handle the data extraction there and to keep the crawl going through all the pages.

What you see here is Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page it’s visiting.

In our example, it creates a sort of loop, following all the links to the next page until it doesn’t find one — handy for crawling blogs, forums and other sites with pagination.
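
As a side note, recent Scrapy versions also provide response.follow(), which accepts relative URLs directly, so the urljoin() call can be dropped; the pagination part of parse() then becomes something like:

        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page is not None:
            # response.follow builds the absolute URL for us
            yield response.follow(next_page, callback=self.parse)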

Run the spider

scrapy crawl quotes -o quotes.json

If everything is running fine, it should give you the following results:

Scrapy Crawl logs

Check whether the results are stored in quotes.json:

Data stored in quotes.json
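
The exact contents depend on the fields your spider yields, but with the text/author fields from the sketch above, quotes.json should start with entries shaped roughly like this (quotes abbreviated):

[
  {"text": "“The world as we have created it is a process of our thinking. ...”", "author": "Albert Einstein"},
  {"text": "“It is our choices, Harry, that show what we truly are, ...”", "author": "J.K. Rowling"},
  ...
]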

You can see that we have recursively crawled 100 quotes from 10 pages (10 quotes/page).

So far we have been storing the crawled data in a JSON file; in the next tutorial, we will see how to store the crawled data in a database.

Happy Scraping!! 🕷

Please leave a comment if you face any issues.

PART 1, PART 2, PART 3, PART 4, PART 5
