Web Scraping with Scrapy
Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
This tutorial will walk you through these tasks:
- Installing Scrapy
- Creating a new Scrapy project
- Writing a spider to crawl a site and extract data
- Testing the spider
Install
Scrapy runs on Python 2.7 and Python 3.4 or above. Install it with pip:
pip install Scrapy
Create a Project
After installing Scrapy, run the following command to create a new project:
scrapy startproject test_scrapy
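The startproject command generates a project skeleton roughly like the one below (the exact set of files varies slightly between Scrapy versions). Spider modules go in the spiders/ directory.

```
test_scrapy/
    scrapy.cfg            # deploy configuration
    test_scrapy/          # the project's Python package
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # put your spider modules here
            __init__.py
```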
Writing a spider to crawl
A spider defines the initial URLs to crawl (here, http://something.com/) and must define these attributes:
- name: the spider’s unique identifier
- start_urls: URLs the spider begins crawling at
- parse: method that parses and extracts the scraped data, which will be called with the downloaded Response object of each start URL
import scrapy

class MySpider(scrapy.Spider):
    name = "something"
    allowed_domains = ["something.com"]
    start_urls = ["http://something.com"]

    def parse(self, response):
        # Select every <span class="pl"> element on the page
        for title in response.xpath("//span[@class='pl']"):
            text = title.xpath("a/text()").extract_first()
            link = title.xpath("a/@href").extract_first()
            print(text, link)
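To see what the XPath expressions in parse() actually match, here is a stand-alone sketch using Python's standard-library ElementTree on a made-up HTML fragment. The span/a markup below is an assumption about the target page's structure, chosen to mirror the selector in the spider; a real page would be fetched and parsed by Scrapy itself.

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the markup the spider targets (assumption)
html = """
<html><body>
  <span class="pl"><a href="/post/1">First post</a></span>
  <span class="pl"><a href="/post/2">Second post</a></span>
</body></html>
"""

root = ET.fromstring(html)
# Same idea as //span[@class='pl'] in the spider, scoped to this fragment
for span in root.findall(".//span[@class='pl']"):
    a = span.find("a")
    print(a.text, a.get("href"))
```

Each iteration pairs a link's text with its href, which is exactly what the spider prints for every matching span on the downloaded page.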
Test
Now you are ready to run the spider. From the root directory of your Scrapy project, run the following command to print the scraped data to the screen:
$ scrapy crawl something
Wrapping Up
This is just a simple web-crawler tutorial, but you can do far more powerful things by customizing this basic script.