PySpider — Part 2

Learning the basics

Vorathep Sumetphong
3 min read · Feb 11, 2018

Disclaimer: This is for educational purposes only. I do not own any of the content on the website, nor am I promoting it.

MovDB

We will be learning the basics of PySpider in this part, but our end goal for this series is to crawl MovDB.

Let’s go!

  • Create a new folder for this project
$ mkdir MovDBCrawler; cd MovDBCrawler;
  • Start PySpider
$ pyspider

You should now have the PySpider dashboard running on port 5000.

PySpider Dashboard

Click on the Create button and give it a project name and a starting URL. We are going to use script mode, as slime mode is still in development.

Our starting point

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Project: MovDBCrawler

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://movdb.net/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

Understanding the methods:

Every method is a task

@every(minutes=24 * 60)
def on_start(self):
    self.crawl('https://movdb.net/', callback=self.index_page)
  • on_start: our entry point; its job is to create our first crawling task. A task that creates tasks? Yes!

self.crawl takes a URL and a callback method to run on the response that comes back (more on this later).
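To give a feel for what else it accepts, here is a hedged sketch of self.crawl with a few of its optional keyword arguments; the URL and the save payload are made up for illustration and are not taken from MovDB.

# A sketch of self.crawl with some optional keyword arguments;
# the URL and the save payload are illustrative only.
self.crawl(
    'https://movdb.net/page/2',   # hypothetical listing URL
    callback=self.index_page,
    save={'page': 2},             # arbitrary data handed to the callback as response.save
    validate_cert=False,          # skip TLS verification if the site's certificate misbehaves
)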

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    for each in response.doc('a[href^="http"]').items():
        self.crawl(each.attr.href, callback=self.detail_page)
  • index_page: receives the response from the on_start task, finds every link tag whose href starts with http, loops over them, and queues each one to be crawled with detail_page as the callback (see the sketch below for how we might narrow this down).
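For MovDB we will eventually want to follow only the site's own links rather than every external URL on the page. A minimal sketch, assuming we simply filter on the domain (the check is our own addition, not part of the default template):

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    for each in response.doc('a[href^="http"]').items():
        # Stay on the target site; this domain check is an assumption about
        # how we will scope the crawl in later parts.
        if 'movdb.net' in each.attr.href:
            self.crawl(each.attr.href, callback=self.detail_page)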
@config(priority=2)
def detail_page(self, response):
    return {
        "url": response.url,
        "title": response.doc('title').text(),
    }
  • detail_page: returns the URL and the title taken from the response object (a richer version is sketched below).
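Once we have looked at MovDB's markup, detail_page is where the real extraction will happen. A hedged sketch with placeholder selectors that we would still need to confirm against the site's actual HTML:

@config(priority=2)
def detail_page(self, response):
    return {
        "url": response.url,
        "title": response.doc('title').text(),
        # The selectors below are placeholders until we inspect MovDB's pages.
        "heading": response.doc('h1').text(),
        "link_count": len(list(response.doc('a[href]').items())),
    }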

Understanding the decorators:

  • @every — a method with this decorator is re-run automatically at the given interval; no more cron jobs!
  • @config(age) — tasks created by this method stay valid for the given age, so the same URL will not be crawled again until the age expires.
  • @config(priority) — tasks with a higher priority value are scheduled ahead of others in the queue (the sketch below shows the per-request equivalents).
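These settings do not have to live on a decorator; PySpider also accepts them per request. A hedged sketch passing age and priority straight to self.crawl (the URL is an invented example):

# Per-request equivalents of @config(age) and @config(priority);
# the URL below is a made-up detail page.
self.crawl(
    'https://movdb.net/some-movie',
    callback=self.detail_page,
    age=10 * 24 * 60 * 60,   # treat a fetched copy as fresh for 10 days
    priority=2,              # schedule ahead of lower-priority tasks
)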

We will go through more decorators and method structure as we move forward.

Run!

We will still be running in editor mode.

Save and run.

Nothing?! Guess you didn’t notice…

Our on_start task has executed and created another task, which is now waiting in the ‘follows’ tab.

Open the follows tab and you will see an index_page task in orange; click the play button to run it. Boom! All the links from the index page are ready. You can run the detail_page method on any one of them and it will print out the page’s title and URL.

Awesome!

In the next part we will start crawling MovDB, so do your research on the website and think about what data we could extract and how to split the work into tasks.


If you enjoyed reading, there’s 50 ways (claps) to show your appreciation :)

Have a question, or run into trouble? Please leave a comment!

← Part 1
