Scrapy Tutorial — Part 4
How to store data in a database with Scrapy?
PART 1, PART 2, PART 3, PART 4, PART 5
In the previous tutorial we created a Scrapy spider to crawl https://quotes.toscrape.com/ and stored the crawled data in a JSON file, but in the real world we typically store the data in databases such as MySQL, SQLite, Elasticsearch, PostgreSQL, etc.
In this tutorial we will learn how to store the crawled data into SQLite database.
What is SQLite?
SQLite is a file-based database, which means you don’t need to install any server or packages. I have chosen it because it is easy to get started with, since it requires no installation, but you can adapt this tutorial to any DB of your choice.
SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day.
How to store crawled data into DB using Scrapy?
After an item/data has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially. You can think of item pipeline as a component used for post processing of the data.
The item pipeline is invoked whenever you yield an item from your spider (quotesspider/spiders/quotes_spiders.py).
Each item pipeline component (sometimes referred to as just an “item pipeline”) is a Python class that implements a simple method. It receives an item and performs an action on it, also deciding whether the item should continue through the pipeline or be dropped and no longer processed.
Typical uses of item pipelines are:
- cleansing HTML data
- validating scraped data (checking that the items contain certain fields)
- checking for duplicates (and dropping them)
- storing the scraped item in a database
We will be using use case #4 (storing the scraped item in a database) in this tutorial.
Item pipeline example
Writing items to SQLite DB
Update your quotesspider/pipelines.py with the following code.
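A minimal sketch of such a pipeline is shown below. It assumes the spider yields items with text, author, and tags fields, and that we write to a quotes.db file in the project folder; adjust the table schema to match your own items:

```python
import sqlite3

class QuotesspiderPipeline:
    def open_spider(self, spider):
        # Called when the spider starts: open the DB file and create the table.
        self.connection = sqlite3.connect("quotes.db")
        self.cursor = self.connection.cursor()
        self.cursor.execute(
            "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, tags TEXT)"
        )
        self.connection.commit()

    def close_spider(self, spider):
        # Called when the spider finishes: close the DB connection.
        self.connection.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields: insert a row, then
        # return the item so later pipeline components can still process it.
        self.cursor.execute(
            "INSERT INTO quotes VALUES (?, ?, ?)",
            (item["text"], item["author"], ",".join(item.get("tags", []))),
        )
        self.connection.commit()
        return item
```

Tags are joined into a single comma-separated string here because SQLite has no list column type.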
Let’s understand what this pipeline does:
open_spider
— This method is called when the spider is opened.
close_spider
— This method is called when the spider is closed.
process_item
— This method is called whenever an item is scraped by our spider.
Now that we have created our pipeline, we need to tell Scrapy to use this pipeline class.
Activating an Item Pipeline component
To activate an Item Pipeline component you must add its class to the ITEM_PIPELINES
setting, like in the following example (the integer value, in the 0–1000 range, determines the order in which pipelines run; lower values run first):
quotesspider/settings.py
ITEM_PIPELINES = {
'quotesspider.pipelines.QuotesspiderPipeline': 300,
}
That’s it! Now run your spider (you need to be inside the quotesspider folder):
scrapy crawl quotes
How to verify whether the data is stored in the DB?
Open up https://sqliteonline.com/ in your browser, which provides a UI for your SQLite DB.
Click File->Open DB-> select your SQLite file quotes.db
Once the window is opened you can run select * from quotes
to see the results.
You should see something like this.
There is no auto-refresh in this online tool. If you run your spider again, you need to reload your file (File->Open DB-> select your file quotes.db) to see the changes.
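If you prefer to stay on the command line, you can also inspect the file with Python’s built-in sqlite3 module (assuming the pipeline wrote a quotes table into quotes.db as above):

```python
import sqlite3

def fetch_quotes(db_path="quotes.db", limit=5):
    # Open the SQLite file written by the pipeline and return a few rows.
    connection = sqlite3.connect(db_path)
    try:
        return connection.execute(
            "SELECT * FROM quotes LIMIT ?", (limit,)
        ).fetchall()
    finally:
        connection.close()
```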
Happy Scraping!! 🕷
Please leave a comment if you face any issues.