Web Scraping with Scrapy
Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
This tutorial will walk you through these tasks:
- Installing Scrapy
- Creating a new Scrapy project
- Writing a spider to crawl a site and extract data
- Testing the spider
Install
Scrapy runs on Python 2.7 and Python 3.4 or above. Install it with pip:
pip install Scrapy
Create a Project
After installing Scrapy, run the following command to create a new project:
scrapy startproject test_scrapy
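The startproject command generates a project skeleton roughly like the one below (the exact set of files varies slightly between Scrapy versions). Spider modules go in the spiders/ directory.

```
test_scrapy/
    scrapy.cfg            # deploy configuration
    test_scrapy/          # the project's Python package
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # put your spider modules here
            __init__.py
```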
Writing a spider to crawl
A spider defines the initial URLs to crawl (here, http://something.com/) and must define these attributes:
- name: the spider’s unique identifier
- start_urls: URLs the spider begins crawling at
- parse: method that parses and extracts the scraped data, which will be called with the downloaded Response object of each start URL
import scrapy

class MySpider(scrapy.Spider):
    name = "something"
    allowed_domains = ["something.com"]
    start_urls = ["http://something.com"]

    def parse(self, response):
        # Select every <span class="pl"> element on the page
        for title in response.xpath("//span[@class='pl']"):
            text = title.xpath("a/text()").extract_first()
            link = title.xpath("a/@href").extract_first()
            print(text, link)
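To see what the XPath expressions in parse() actually match, here is a stand-alone sketch using Python's standard-library ElementTree on a made-up HTML fragment. The span/a markup below is an assumption about the target page's structure, chosen to mirror the selector in the spider; a real page would be fetched and parsed by Scrapy itself.

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the markup the spider targets (assumption)
html = """
<html><body>
  <span class="pl"><a href="/post/1">First post</a></span>
  <span class="pl"><a href="/post/2">Second post</a></span>
</body></html>
"""

root = ET.fromstring(html)
# Same idea as //span[@class='pl'] in the spider, scoped to this fragment
for span in root.findall(".//span[@class='pl']"):
    a = span.find("a")
    print(a.text, a.get("href"))
```

Each iteration pairs a link's text with its href, which is exactly what the spider prints for every matching span on the downloaded page.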
Test
Now you are ready to run the spider. From the root directory of your Scrapy project, run the following command to print the scraped data to the screen:
$ scrapy crawl something
Wrapping Up
This is just a simple web-crawler tutorial, but you can do far more powerful things by customizing this basic script.