Scrapy Tutorial — Part 4

Jebaseelan Ravi
3 min read · Apr 16, 2022

How to store data into a DB in Scrapy?

PART 1, PART 2, PART 3, PART 4, PART 5

In the previous tutorial, we created a Scrapy spider to crawl https://quotes.toscrape.com/ and stored the crawled data in a JSON file. In the real world, however, we typically store the data in a database such as MySQL, SQLite, Elasticsearch, or Postgres.

In this tutorial, we will learn how to store the crawled data in an SQLite database.

What is SQLite?

SQLite is a file-based DB, which means you don’t need to install any packages or run an installer. I have chosen it because it is easy to get started with, since it requires no installation, but you can adapt this tutorial to any DB of your choice.

SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day.

How to store crawled data into DB using Scrapy?

After an item/data has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially. You can think of item pipeline as a component used for post processing of the data.

The item pipeline is invoked whenever you yield an item from your spider in quotesspider/spiders/quotes_spiders.py.
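
As a refresher, the spider from the earlier parts yields items roughly like the sketch below. The exact selectors may differ slightly from your file, but the yielded fields (text, author, tags) are the ones we will store in the DB.

# quotesspider/spiders/quotes_spiders.py (sketch of the relevant part)
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            # Every yielded item is handed to the item pipeline.
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }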

Each item pipeline component (sometimes referred to as just “Item Pipeline”) is a Python class that implements a simple method. It receives an item and performs an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.
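
In skeleton form, a pipeline component looks something like this (the class name and the checked field are purely illustrative):

from scrapy.exceptions import DropItem


class ExamplePipeline:  # illustrative name, not part of our project
    def process_item(self, item, spider):
        # Perform some action on the item, e.g. validate a required field...
        if not item.get("text"):
            # ...and drop items that should not continue through the pipeline.
            raise DropItem("Missing text in item")
        # Returning the item passes it on to the next pipeline component.
        return item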

Typical uses of item pipelines are:

  • cleansing HTML data
  • validating scraped data (checking that the items contain certain fields)
  • checking for duplicates (and dropping them)
  • storing the scraped item in a database

We will be focusing on use case #4 (storing the scraped item in a database) in this tutorial.

Item pipeline example

Writing items to SQLite DB

Update your quotesspider/pipelines.py with the following code:
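
Here is a sketch of the pipeline using Python’s built-in sqlite3 module. The DB file name quotes.db and the table name quotes match what we query later in this tutorial; the item fields (text, author, tags) are assumed to match what the spider yields.

# quotesspider/pipelines.py
import sqlite3


class QuotesspiderPipeline:

    def open_spider(self, spider):
        # Connect to the DB file (created automatically if it does not exist)
        # and make sure the quotes table is there.
        self.connection = sqlite3.connect("quotes.db")
        self.cursor = self.connection.cursor()
        self.cursor.execute(
            "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, tags TEXT)"
        )
        self.connection.commit()

    def close_spider(self, spider):
        # Release the DB connection once crawling is finished.
        self.connection.close()

    def process_item(self, item, spider):
        # Insert every scraped item and return it so it continues through the pipeline.
        self.cursor.execute(
            "INSERT INTO quotes (text, author, tags) VALUES (?, ?, ?)",
            (item.get("text"), item.get("author"), ",".join(item.get("tags", []))),
        )
        self.connection.commit()
        return item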

Item pipeline to store data to SQLite

Let’s understand what this pipeline does:

open_spider — This method is called when the spider is opened. It is a good place to open the database connection and create the table.

close_spider — This method is called when the spider is closed. We use it to close the database connection.

process_item — This method is called for every item scraped by our spider. We insert the item into the database and return it so it can continue through the pipeline.

Now that we have created our pipeline, we need to tell Scrapy to use this pipeline class.

Activating an Item Pipeline component

To activate an Item Pipeline component you must add its class to the ITEM_PIPELINES setting, like in the following example. The integer value (300 here) determines the order in which multiple pipelines run: items go through lower-valued pipelines first, and the values conventionally range from 0 to 1000.

quotesspider/settings.py

ITEM_PIPELINES = {
    'quotesspider.pipelines.QuotesspiderPipeline': 300,
}

That’s it! Now run your spider (you need to be inside the quotesspider folder):

scrapy crawl quotes

How to verify whether the data is stored in the DB?

Open https://sqliteonline.com/ in your browser; it provides a UI for your SQLite DB.

Click File -> Open DB -> select your SQLite file quotes.db.

Once the file is opened, you can run select * from quotes to see the results.

You should see something like this.

Crawled data stored in the SQLite DB

There is no auto-refresh in this online tool. If you run your spider again, you need to reload your file (File -> Open DB -> select your file quotes.db) to see the changes.
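
Alternatively, if you prefer to check from a Python shell instead of the online tool, here is a quick sketch (run it from the same folder that contains quotes.db):

import sqlite3

connection = sqlite3.connect("quotes.db")
cursor = connection.cursor()

# Print every stored row: (text, author, tags)
for row in cursor.execute("SELECT * FROM quotes"):
    print(row)

connection.close()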

Happy Scraping!! 🕷

Please leave a comment if you face any issues.

PART 1, PART 2, PART 3, PART 4, PART 5
