Scrapy Tutorial — Part 3

Jebaseelan Ravi
Apr 16, 2022 · 3 min read


How to crawl multiple pages using Scrapy?


PART 1, PART 2, PART 3, PART 4, PART 5

In the previous blog, we created a Scrapy spider to crawl quotes and author names from our favorite website, https://quotes.toscrape.com/

So far we have only scraped the first page of https://quotes.toscrape.com. What if you want quotes from all the pages on the website, such as:

https://quotes.toscrape.com/page/2/
https://quotes.toscrape.com/page/3/
https://quotes.toscrape.com/page/4/

Now that you know how to extract data from a single page (from our previous blog), let’s see how to follow links to the next page and crawl data recursively.

The first thing to do is extract the link to the next page we want to follow. Examining our page, we can see there is a link to the next page (bottom right) with the following markup:

Right-click on Next in the webpage and click Inspect. The HTML for the next-page link looks like this:


<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

$ scrapy shell 'https://quotes.toscrape.com/'

You should see the Scrapy shell startup log. Then try extracting the next-page link:

>>> response.xpath('//li[@class="next"]/a/@href').get()
'/page/2/'
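
The same link can also be grabbed with a CSS selector instead of XPath; both return the relative URL:

>>> response.css('li.next a::attr(href)').get()
'/page/2/'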

Now let’s see our spider, modified to recursively follow the link to the next page and extract data from it:

Spider code to crawl next pages
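
A minimal sketch of what that spider looks like, assuming the same text/author selectors from the previous part (your field names may differ):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote and its author from the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Look for the link to the next page and follow it recursively
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page is not None:
            # The href is relative (e.g. '/page/2/'), so build an absolute URL
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)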

Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative), and yields a new request to the next page, registering itself as the callback to handle the data extraction there and to keep the crawl going through all the pages.

What you see here is Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page it’s visiting.

In our example, it creates a sort of loop, following all the links to the next page until it doesn’t find one — handy for crawling blogs, forums and other sites with pagination.
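
As a side note, recent Scrapy versions also provide response.follow(), which accepts relative URLs directly, so the urljoin() call can be dropped; the pagination part of parse() then becomes something like:

        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page is not None:
            # response.follow builds the absolute URL for us
            yield response.follow(next_page, callback=self.parse)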

Run the spider

scrapy crawl quotes -o quotes.json

If everything is running fine, it should give you the following results:

Scrapy Crawl logs

Check whether the results are stored in quotes.json:

Data stored in quotes.json
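
The exact contents depend on the fields your spider yields, but with the text/author fields from the sketch above, quotes.json should start with entries shaped roughly like this (quotes abbreviated):

[
  {"text": "“The world as we have created it is a process of our thinking. ...”", "author": "Albert Einstein"},
  {"text": "“It is our choices, Harry, that show what we truly are, ...”", "author": "J.K. Rowling"},
  ...
]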

You can see that we have recursively crawled 100 quotes from 10 pages (10 quotes/page).

So far we have been storing the crawled data in a JSON file; in the next tutorial, we will see how to store the crawled data in a database.

Happy Scraping!! 🕷

Please leave a comment if you face any issues.

PART 1, PART 2, PART 3, PART 4, PART 5
