Web Scraping with Scrapy

Ashraful Alam · Oceanize Lab Geeks · Feb 1, 2018

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

This tutorial will walk you through these tasks:

  1. Installing Scrapy
  2. Creating a new Scrapy project
  3. Writing a spider to crawl a site and extract data
  4. Testing the spider

Install

Scrapy runs on Python 2.7 and Python 3.4 or above. To install Scrapy, use pip:

pip install Scrapy
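
If the installation succeeded, the scrapy command is now available on your path; printing its version is a quick sanity check:

scrapy version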

Create a Project

After installing Scrapy, run the following command to create a new project:

scrapy startproject test_scrapy
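
startproject generates a skeleton for you to build on. With Scrapy releases from around this time, the layout looks roughly like this (exact file names can vary between versions):

test_scrapy/
    scrapy.cfg            # deploy configuration file
    test_scrapy/          # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py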

Writing a spider to crawl

The spider starts from an initial URL (http://something.com/ in this example) and must define these attributes:

  • name: the spider’s unique identifier
  • start_urls: URLs the spider begins crawling at
  • parse: method that parses and extracts the scraped data, which will be called with the downloaded Response object of each start URL

import scrapy


class MySpider(scrapy.Spider):
    name = "something"
    allowed_domains = ["something.com"]
    start_urls = ["http://something.com"]

    def parse(self, response):
        # Each <span class="pl"> wraps one title link on the page
        titles = response.xpath("//span[@class='pl']")
        for title in titles:
            text = title.xpath("a/text()").extract_first()
            link = title.xpath("a/@href").extract_first()
            print(text, link)
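
Printing is handy for a first look, but the idiomatic Scrapy pattern is to yield items from parse so the framework can collect and export them. A minimal variant of the parse method above, assuming the same page structure:

    def parse(self, response):
        for title in response.xpath("//span[@class='pl']"):
            yield {
                "title": title.xpath("a/text()").extract_first(),
                "link": title.xpath("a/@href").extract_first(),
            }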

Test

Now you are ready to run the spider. From the root directory of your Scrapy project, run the following command to print the scraped data to the screen:

$ scrapy crawl something
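
If your parse method yields items (as in the variant above), Scrapy's feed exports can also write them straight to a file; the -o flag picks the output file, with the format inferred from the extension:

$ scrapy crawl something -o titles.json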

Wrapping Up

This is just a simple web crawler tutorial. There is still a lot you can do by customizing this basic spider, such as following links to other pages or exporting the scraped items through pipelines.
