Scrapy Tutorial — Part 1

Jebaseelan Ravi
Apr 16, 2022


A beginner tutorial on how to scrape any website using Scrapy

What is Scrapy?


PART 1, PART 2, PART 3, PART 4, PART 5

Scrapy is a Python framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing, or historical archival.

I have divided the tutorial into five parts to give you a deep understanding of what the Scrapy framework is and how we can crawl any website with it.

Part 1: Environment setup and writing your first spider

Part 2: How to create a Scrapy project and extract data from the web

Part 3: How to crawl multiple pages using the Scrapy framework

Part 4: How to store the crawled data in a database

Part 5: How to deploy a Scrapy spider to production

For this tutorial series, we are going to crawl:

  1. Quotes
  2. Author Name

from the famous website https://quotes.toscrape.com/.

Installing Scrapy

If you are using Anaconda:

conda install -c conda-forge scrapy

If you are using pip, the Python package manager:

pip install scrapy
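
Either way, you can confirm the installation worked by asking Scrapy for its version:

scrapy version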

Now that we have installed all the dependencies, let’s write some code. Here’s the code for a spider that scrapes famous quotes from https://quotes.toscrape.com.

Simple spider example
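
Here is a minimal sketch of such a spider. The class name, spider name, and exact XPath expressions are my own choices, based on the markup of quotes.toscrape.com:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Scrapy starts the crawl from the URLs listed here
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote on the page lives inside a <div class="quote"> element
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                "quote": quote.xpath("./span[@class='text']/text()").get(),
                "author": quote.xpath(".//small[@class='author']/text()").get(),
            }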

Put this in a text file, name it something like first_spider.py, and run the spider using the runspider command:

scrapy runspider first_spider.py -o quotes.json
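
One thing worth knowing: on recent Scrapy versions (2.0 and later), -o appends to an existing file, so delete quotes.json before re-running, or use -O quotes.json to overwrite it instead.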

If successful, you should see something like this:

Scrapy Output

When this finishes, the quotes.json file will contain a JSON array of the quotes, each with the quote text and author, looking like this:

{"quote": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein"},{"quote": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling"},

That’s it! You have built your first spider and crawled data from the web.

But what actually happened when you ran the runspider command? Let me explain.

  1. When you run the runspider command, Scrapy looks for the start_urls attribute in the spider class.
  2. It assumes the crawl should start from start_urls, sends a web request to each URL in that list, and fetches the entire HTML response. In our case that is https://quotes.toscrape.com/.
  3. The response is nothing but the entire HTML content of the webpage. You don’t have to worry about hitting the website to get that response yourself; Scrapy does it for you in an optimised way. All you have to do is specify the URLs you want to crawl in start_urls. How cool is that?
  4. This response is passed as an argument to the default callback method, the method that extracts your data from the HTML content. The default callback method is parse().
  5. In the parse callback method, we loop through the quote elements using an XPath selector and yield a Python dict with the extracted quote text and author (see the sketch after this list).
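
If you want to experiment with these XPath expressions outside a full crawl, you can use Scrapy’s Selector class directly. A minimal sketch; the HTML snippet below is my own and simply imitates the markup of quotes.toscrape.com:

from scrapy.selector import Selector

# A tiny HTML snippet imitating the structure of quotes.toscrape.com
html = """
<div class="quote">
  <span class="text">“Sample quote”</span>
  <small class="author">Sample Author</small>
</div>
"""

sel = Selector(text=html)
for quote in sel.xpath("//div[@class='quote']"):
    print(quote.xpath("./span[@class='text']/text()").get())      # the quote text
    print(quote.xpath(".//small[@class='author']/text()").get())  # the author name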

Don’t worry if XPath does not make sense yet; I did not understand it my very first time either. We will get better.

In the next post, I will explain how to create a Scrapy project and extract data from a webpage.

Happy scraping!!! 🕷

Please leave a comment if you face any issues.

PART 1, PART 2, PART 3, PART 4, PART 5
