Web Scraping with Python Scrapy

Umair Khalid
8 min read · May 25, 2023


Python Scrapy is an open-source web scraping framework that allows you to extract data from websites. It provides a set of tools and libraries for crawling websites, fetching web pages, and extracting structured data from HTML or XML documents.

Key features of Python Scrapy include:

Spider-based crawling: Scrapy operates using spiders, which are Python classes that define how to navigate a website and extract data. Spiders can be customized to follow links, submit forms, and handle different types of web content.

Scrapy Shell: It provides an interactive shell for testing and debugging your scraping code. You can inspect the website’s response, experiment with XPath or CSS selectors to extract data, and test your spider logic.

Item pipelines: Scrapy pipelines are used for processing and storing the scraped data. You can define pipelines to clean, validate, or transform the extracted data before saving it to a file or database.

Request scheduling and throttling: Scrapy allows you to control the rate at which requests are sent to a website, helping you avoid overloading the server or getting blocked. You can set download delays, concurrent request limits, and prioritize certain requests.

Extensibility: Scrapy is highly extensible and provides a wide range of extensions and middlewares to customize its behavior. You can add custom middleware for handling cookies, user agents, or proxy rotation, as in the sketch below.
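
As a small taste of that extensibility, here is a minimal sketch of a downloader middleware that rotates user agents. The class name and agent list are hypothetical, and it would still need to be registered under DOWNLOADER_MIDDLEWARES in settings.py:

import random

class RotateUserAgentMiddleware:
    # Hypothetical pool; a real project would use a larger, up-to-date list.
    USER_AGENTS = [
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36",
    ]

    def process_request(self, request, spider):
        # Pick a random user agent for every outgoing request.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None tells Scrapy to continue handling the request normally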

Let’s start with how to set up a simple Scrapy project.

Install Scrapy with the following command:

pip install Scrapy

Setup a new Scrapy project with the following command:

scrapy startproject quotes_scraper
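
This generates a project skeleton that should look roughly like this:

quotes_scraper/
    scrapy.cfg            # deploy/config file
    quotes_scraper/       # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # custom middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spiders live here
            __init__.py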

To keep this tutorial simple, we will scrape data from quotes.toscrape.com.

First of all, we need to decide which data we need to scrape.

Each quote on the site has the quote text (which we will call the title), an author, and a few tags, so we will store the title, the author, and the tags (all tags joined into one string).
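
In the page source, each quote block looks roughly like this (simplified; the XPath selectors we write later map directly onto it):

<div class="quote">
    <span class="text">"The world as we have created it is a process of our thinking. ..."</span>
    <span>by <small class="author">Albert Einstein</small></span>
    <div class="tags">
        <a class="tag" href="/tag/change/">change</a>
        <a class="tag" href="/tag/deep-thoughts/">deep-thoughts</a>
    </div>
</div>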

The first step is to define our item (a Quote with title, author, and tags fields) in the items.py file.

import scrapy


class Quote(scrapy.Item):
    # One scraped quote: the quote text (title), its author, and its tags
    title = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

The next step is to write the code for our spider to crawl the webpage and scrape the data.

import scrapy
from quotes_scraper.items import Quote


class QuotesSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
            "https://quotes.toscrape.com/page/3/",
            "https://quotes.toscrape.com/page/4/",
            "https://quotes.toscrape.com/page/5/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote_selector in response.xpath("//div[@class='quote']"):
            quote = Quote()
            quote["title"] = quote_selector.xpath("./span[@class='text']/text()").get()
            print("Quote Title: ", quote["title"])
            quote["author"] = quote_selector.xpath("./span/small[@class='author']/text()").get()
            print("Quote Author: ", quote["author"])
            tags = quote_selector.xpath("./div[@class='tags']/a[@class='tag']/text()").getall()
            print("Quote Tags: ", tags)
            quote["tags"] = "\n".join(tags)
            yield quote

Let’s see what this spider does. First of all, we import our Quote item. The name identifies the spider and is used to run it from the console, while allowed_domains restricts the spider to this domain and no other. Next we define the start_requests method, which holds the URLs we want to crawl; this is where our spider starts its journey. For each of those URLs a Scrapy Request is made with parse as the callback function. These requests are asynchronous, so they do not halt the flow of the program.
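
Hard-coding five page URLs works for this small site, but you do not have to: the site exposes a “Next” button, and the spider can follow it instead. A minimal sketch, assuming the pagination link stays inside an li element with class 'next', would replace start_requests and extend parse like this:

def start_requests(self):
    # Start from page 1 only and let the spider discover the rest
    yield scrapy.Request(url="https://quotes.toscrape.com/page/1/", callback=self.parse)

def parse(self, response):
    # ... extract and yield the quotes exactly as before ...

    # then follow the "Next" link, if the page has one
    next_page = response.xpath("//li[@class='next']/a/@href").get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)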

The next part is about XPath. There are two types of selectors we can use to extract data from a page, CSS selectors and XPath selectors. I find XPath easier to understand.

XPath expressions are used to select elements in an XML or HTML document. An XPath expression can be a path to an element, a combination of paths and operators, or a function call. Here are some examples:

  • Select all elements with a certain tag name: //tagname
  • Select an element with a specific ID: //*[@id="myid"]
  • Select the first child element of a parent element: //parent/child[1]
  • Select all elements that contain a certain text: //*[contains(text(), "mytext")]

Axes are used to select elements relative to the current element in an XPath expression. There are several axes available, including the child axis, parent axis, descendant axis, and sibling axis. Here are some examples:

  • Select all child elements of a parent element: //parent/child
  • Select the parent element of a child element: //child/..
  • Select all descendant elements of a parent element: //parent/descendant::*
  • Select the following sibling of a current element: //current/following-sibling::sibling

Predicates are used to filter elements based on specific criteria. Predicates are written in square brackets and can include functions, operators, and comparisons. Here are some examples:

  • Select all elements with a certain attribute: //tagname[@attribute="value"]
  • Select all elements that are not empty: //*[text()!=""]
  • Select all elements that have more than one child: //*[count(child::*) > 1]
  • Select all elements that contain a certain word: //*[contains(text(), "word")]
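
The easiest way to experiment with these expressions is the Scrapy shell mentioned earlier. Here is a quick interactive session against the quotes site (outputs omitted):

scrapy shell "https://quotes.toscrape.com/page/1/"
>>> response.xpath("//div[@class='quote']/span[@class='text']/text()").get()        # text of the first quote
>>> response.xpath("//small[@class='author']/text()").getall()                      # all authors on the page
>>> response.xpath("(//div[@class='quote'])[1]//a[@class='tag']/text()").getall()   # tags of the first quote

With those basics in mind, here is the parse loop from our spider again:
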
for quote_selector in response.xpath("//div[@class='quote']"):
    quote = Quote()
    quote["title"] = quote_selector.xpath("./span[@class='text']/text()").get()
    print("Quote Title: ", quote["title"])
    quote["author"] = quote_selector.xpath("./span/small[@class='author']/text()").get()
    print("Quote Author: ", quote["author"])
    tags = quote_selector.xpath("./div[@class='tags']/a[@class='tag']/text()").getall()
    print("Quote Tags: ", tags)
    quote["tags"] = "\n".join(tags)
    yield quote

So, in this loop, we first select all the div tags with the class ‘quote’. Then we make an object of the Quote class. Next we grab the title, the author, and the tags, which are children of the quote_selector; text() gives us the string value of each node. The tags are all stored in one string, separated by the \n (newline) character, so we can easily split them apart when needed.

Finally, we yield the quote object, which sends it to the pipelines for processing. That is the third step in scraping this website.

The pipelines are used for processing and storing the data. In this tutorial, I will be using a Postgres database to store our data. Install the required packages:

pip install scrapy psycopg2

Our pipelines file looks like this right now:

import psycopg2
from scrapy.utils.project import get_project_settings


class QuotesScraperPipeline:
    def process_item(self, item, spider):
        return item

Here, we first need to connect to our database. Then we need a query to check whether the quote we just yielded is already in the database. If it is not, a second query will store the quote in the database.

Because this is outside the scope of this article, I have gone ahead and set up a database with a quotes table; you will need to know basic SQL queries for this. To connect Python Scrapy to the database, the database name, user name, user password, host, and port are required. By default, on a local environment, the user name is postgres, the host is localhost, and the port is 5432.
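
If you want to reproduce that setup, a rough one-off script like this creates the quotes table with psycopg2, using the same credentials we are about to put in settings.py; the column types are an assumption, so adjust them to your needs:

import psycopg2

# One-off helper: create the quotes table the pipeline will write to.
connection = psycopg2.connect(
    host="localhost", port="5432", user="brainx",
    password="123", dbname="web_scraper_development",
)
cur = connection.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS quotes (
        id SERIAL PRIMARY KEY,
        title TEXT,
        author TEXT,
        tags TEXT
    )
""")
connection.commit()
connection.close()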

It is a good practice to write these credentials in the settings.py file and retrieve them from there.

POSTGRES_DBNAME = 'web_scraper_development'
POSTGRES_USER = 'brainx'
POSTGRES_PASSWORD = '123'
POSTGRES_HOST = 'localhost'
POSTGRES_PORT = '5432'

While we are in the settings, let’s take a look at some features that Scrapy gives us out of the box.

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"

We can define a user agent (the browser identity sent with each request) for our scraper.

ROBOTSTXT_OBEY = False

We can choose whether or not to obey the robots.txt file of a website, which tells web crawlers which parts of the site they are allowed to access.

CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

These configure the maximum number of concurrent requests performed by Scrapy, overall, per domain, and per IP.

DOWNLOAD_DELAY = 3

This configures a delay (in seconds) between consecutive requests to the same website.
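
Instead of a fixed delay, Scrapy also ships with an AutoThrottle extension that adapts the delay to the server’s response times. A typical configuration looks like this:

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0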

ITEM_PIPELINES = {
    "quotes_scraper.pipelines.QuotesScraperPipeline": 300,
}

Enable this setting to activate our pipeline; the number (0–1000) controls the order in which multiple pipelines run, with lower values running first.

Now that we have written the credentials here, let’s move on to connecting the database and the queries.

import psycopg2
from scrapy.utils.project import get_project_settings


class QuotesScraperPipeline:
    def __init__(self):
        settings = get_project_settings()
        self.connection = psycopg2.connect(
            host=settings.get('POSTGRES_HOST'),
            port=settings.get('POSTGRES_PORT'),
            user=settings.get('POSTGRES_USER'),
            dbname=settings.get('POSTGRES_DBNAME'),
            password=settings.get('POSTGRES_PASSWORD'),
        )
        self.cur = self.connection.cursor()

    def process_item(self, item, spider):
        result = self.find_quote(item)

        if result:
            spider.logger.warning("QUOTE ALREADY IN THE DATABASE")
        else:
            self.insert_quote(item)
            spider.logger.debug("QUOTE INSERTED IN THE DATABASE")

        return item

    def find_quote(self, quote):
        # Look for an existing quote with the same title
        self.cur.execute("SELECT * FROM quotes WHERE title = %s", (
            quote["title"],
        ))
        result = self.cur.fetchone()
        return result if result else None

    def insert_quote(self, quote):
        self.cur.execute("INSERT INTO quotes (title, author, tags) VALUES (%s, %s, %s)", (
            quote["title"],
            quote["author"],
            quote["tags"],
        ))
        self.connection.commit()

We get the project settings and, in the __init__ method, connect to the database using the five credentials. When we yield a quote in the spider, it arrives as an item in process_item. There, we first look for a quote with the same title; if one is found we log a warning in the console, otherwise we insert it. Pretty straightforward.
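
One small improvement worth adding: Scrapy calls a close_spider method on each pipeline when the spider finishes, which is a natural place to release the database connection. A minimal sketch to add to QuotesScraperPipeline:

def close_spider(self, spider):
    # Called by Scrapy once when the spider finishes; release DB resources.
    self.cur.close()
    self.connection.close()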

Run the scraper by entering the project folder (the one containing scrapy.cfg) and using the following command:

scrapy crawl quotes_spider

This is the spider name we defined earlier.

The spider will go to each URL we specified, and get each record and store it in the database.

You can view the records from the Postgres console.
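
If you just want to inspect the scraped items without a database, Scrapy’s built-in feed exports can also write them straight to a file, for example:

scrapy crawl quotes_spider -o quotes.json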

This was just a basic example. There are many applications for web scraping.

For example, suppose we need to scrape products from an online store. They are divided into categories, and further into sub-categories.

Here is a theoretical example of how that could work.

base_url = "https://www.shop.com"


class ShopSpider(scrapy.Spider):
    name = "shop_spider"
    allowed_domains = ["www.shop.com"]

    def start_requests(self):
        url = "https://www.shop.com"
        print("Main URL: ", url)
        yield scrapy.Request(url=url, callback=self.parseCategories)

    def parseCategories(self, response):
        for link in response.xpath('//div[@class="grid"]/div[@class="category-card"]/a/@href').getall():
            url = base_url + link
            print("Category URL: ", url)
            yield scrapy.Request(url=url, callback=self.parseSubCategories)

    def parseSubCategories(self, response):
        for link in response.xpath('//div[@class="grid"]/div[@class="sub-category-card"]/a/@href').getall():
            url = base_url + link
            print("Sub Category URL: ", url)
            yield scrapy.Request(url=url, callback=self.parseProducts)

    def parseProducts(self, response):
        for link in response.xpath('//div[@class="grid"]/div[@class="product-card"]/a/@href').getall():
            url = base_url + link
            print("Product URL: ", url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Product is an item class defined like Quote, with the fields below
        product = Product()

        product['store'] = "Shop.com"
        print("Product Store: ", product['store'])

        product['name'] = response.xpath('//h1[@class="name"]/text()').get()

        product['price'] = response.xpath('//h3[@class="price"]/text()').get()

        product['barcode'] = response.xpath('//p[@class="barcode"]/text()').get()

        yield product

Here, we first crawl the main page and get the URL of each category. These URLs are usually relative, in the form ‘/fruits’ or ‘/vegetables’, so we need to concatenate them with the main URL. This way we traverse all the categories; for each category we traverse its sub-categories, and finally for each sub-category we get its products. In the parse function, we extract the attributes we need from each product page.
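
As a side note, the manual base_url concatenation is only one way to handle relative links. Scrapy’s response.follow resolves them for you, so parseCategories could just as well be sketched like this:

def parseCategories(self, response):
    for link in response.xpath('//div[@class="grid"]/div[@class="category-card"]/a/@href').getall():
        # response.follow resolves the relative link against the current page's URL
        yield response.follow(link, callback=self.parseSubCategories)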

Python Scrapy can be made even better with the many plugins available for dynamic content, proxies, fake user agents, and much more. It is a great framework designed specifically for web scraping.
