From 0 to 1: how to build a web crawler from scratch with Python. Part I.

Lena Li
Nov 4 · 7 min read

Why am I doing this

I was doing LeetCode problems and learning system design over the past few months. Solving problems on LeetCode is fun, but system design? Hmm… I learned the elements of a system piece by piece and tried to think like an architect. Still, the concepts, the differences between technologies, and the pros and cons of different models are all dry and hard to remember.

One Friday night, I was reading some articles about multithreading, for the third time. I was tired and thought: why not work on something I find genuinely interesting and awesome? Like building a search engine? Nah, building a Google is too hard. But how about starting with a simple web crawler?

Here’s how I did it just for fun.

Finished product:

link: MyFirstWebCrawler

search engine for books
Search results

What is a web crawler

A web crawler is a program that gathers data from websites. You can use it to build a search engine like Google, of course, or just to track price changes for the shoes you like. Web crawlers are a great way to get the data you need.

Before we start building the web crawler, you need to know how web crawling works. The basic architecture is shown in the following graph. A web crawler, or web spider, is really just an application that scans the World Wide Web and extracts information automatically. It's as simple as taking a set of seed URLs as input and producing a set of HTML pages (data) as output. With this idea, we will build our web crawler in two steps: 1. Grab destination URLs; 2. Extract data from those URLs.
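The two-step idea can be sketched as a minimal crawl loop. This is a hypothetical illustration using only the standard library (the `crawl` function and the fake in-memory "web" are my own names, not part of Scrapy); the real crawler we build below uses Scrapy instead.

```python
from collections import deque
import re

def crawl(seed_urls, fetch, max_pages=100):
    """Minimal crawl loop: start from seed URLs, download each page,
    extract new URLs from it, and queue them for crawling."""
    seen = set(seed_urls)
    frontier = deque(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)              # step 2: get the page (data) for this URL
        pages[url] = html
        # naive link extraction; a real crawler parses the HTML properly
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)  # step 1: queue newly discovered URLs
    return pages

# demo with a fake in-memory "web" instead of real HTTP requests
site = {
    "a": '<a href="b">B</a>',
    "b": '<a href="a">A</a><a href="c">C</a>',
    "c": "no links here",
}
pages = crawl(["a"], fetch=site.get)
print(sorted(pages))  # all three pages are reached from the single seed "a"
```

Starting from the seed "a", the loop discovers "b" and then "c", so all three pages end up crawled.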

The architecture of Web Crawler. Resources: WEB CRAWLER FOR MINING WEB DATA

Rules before we begin

Before scraping, we need to know the basic “rules” of web crawling:

1. First of all, please read the Terms and Conditions of the website. Some websites explicitly prohibit web crawling without permission, or mention legal or copyright restrictions on the use of their data. Please make sure your crawling is legal.

2. Please be ‘polite’ to the website! Do not ignore the website’s robots.txt without good reason. Space out your requests so that you don’t hammer the server, and run your spider during the website’s off-peak traffic hours.

3. Some web crawling and other bot activities are outright illegal if they cause direct or indirect damage to the company that owns the data. Please use common sense before crawling.
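In Scrapy, the politeness rules above map to a few entries in settings.py. This is a sketch with illustrative values, not tuned recommendations:

```python
# settings.py — illustrative "politeness" settings
ROBOTSTXT_OBEY = True               # respect the site's robots.txt by default
DOWNLOAD_DELAY = 1.0                # seconds between requests, to space them out
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # don't hammer any single server
AUTOTHROTTLE_ENABLED = True         # back off automatically if the server slows down
```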

Resources and tools:

Scrapy: the tool I use here to build the web crawler. You can follow the tutorial to install it on Windows or Linux.

MongoDB: the database I use to store the data extracted from the website. You can get a 500 MB free account, which is good enough for this project.


Part I: Grab URLs

For spiders, the scraping cycle goes through something like this: (resource: https://docs.scrapy.org/en/latest/topics/spiders.html)

  • You start by generating the initial Requests to crawl the first URLs and specify a callback function to be called with the response downloaded from those requests. The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.
  • In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
  • In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.
  • Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.

Now let’s begin crawling with Scrapy. We will start with a scrape-friendly site: a books website that serves as a demo for web scraping.

http://books.toscrape.com/
  • Start a project:

You can name your project anything you like. Here I name it ‘bookstore’.

scrapy startproject bookstore
  • Build a spider:

We will generate a new spider to crawl all the books from the website.

scrapy genspider books books.toscrape.com

Please note: don’t include www. or http:// in this command, or it will raise an error.

Now you will have a project folder containing an __init__.py file, other .py files, and a subfolder that holds your spider’s Python file.

Since this website does not provide a robots.txt, you need to disable ROBOTSTXT_OBEY in settings.py.
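Concretely, that means flipping one setting in the project's settings.py (for this demo site only; keep it enabled for sites that do publish a robots.txt):

```python
# settings.py
ROBOTSTXT_OBEY = False  # the demo site has no robots.txt, so skip the check
```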

Tip: if you want to test your syntax, you can open the Scrapy shell and test your code line by line:

scrapy shell http://books.toscrape.com/
scrapy shell

Now open your spider’s .py file and start coding. You can start with the rules in the class, add several attributes, and test how the results differ. For example, follow = True will loop through all the pages automatically.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BooksSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']
    # add attributes
    rules = (Rule(LinkExtractor(deny_domains=('google.com',)),
                  callback='parse', follow=True),)

    # default function
    def parse(self, response):
        pass

To run your spider, you can type:

scrapy crawl books

You will see the results after crawling:

spider results

You can check your crawling results: how many books were crawled, and whether there are any bugs. In the DEBUG log, status 200 means the request succeeded; 404 means the page was not found. If you see 404 in your results, you need to debug your code.

Next, let’s inspect the elements of the webpage and see how to extract info from them. Right-click on the info you want to extract, choose “Inspect” from the menu, and you will find the element.

For example, if we’d like to extract all the hyperlinks of the books, we need to find the <h3> tag, then the <a> tag inside it, and finally extract the value of its href attribute.

Here’s how the XPath code works in the shell:

xpath code to extract all the URLs
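Outside the Scrapy shell, you can approximate the same query with Python's standard library. This is a sketch on a hand-written HTML snippet: ElementTree only supports a limited XPath subset, so the attribute step of `//h3/a/@href` is done with `.get("href")`, whereas Scrapy's Selectors accept the full expression directly.

```python
import xml.etree.ElementTree as ET

html = """<html><body>
  <h3><a href="catalogue/book-one/index.html">Book One</a></h3>
  <h3><a href="catalogue/book-two/index.html">Book Two</a></h3>
</body></html>"""

root = ET.fromstring(html)
# select the <a> element under each <h3>, then read its href attribute
links = [a.get("href") for a in root.findall(".//h3/a")]
print(links)
```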

XPath is an XML path language; you can check this tutorial for XPath syntax:

OK, now we have all the URLs on the home page. How do we extract the URLs from all 50+ pages? Yes, use the ‘next’ button! Let’s inspect the ‘next’ button element:

inspect “next” element

Let’s test the code to extract “next” in scrapy shell:
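What the shell session checks can also be reproduced with the standard library's urljoin, which is what Scrapy's response.urljoin does under the hood: take the relative href from the "next" link and resolve it against the current page URL. The URLs here are illustrative.

```python
from urllib.parse import urljoin

current_page = "http://books.toscrape.com/catalogue/page-1.html"
next_href = "page-2.html"  # e.g. response.xpath('//a[text()="next"]/@href').extract_first()

# resolve the relative href against the page it came from
absolute = urljoin(current_page, next_href)
print(absolute)  # http://books.toscrape.com/catalogue/page-2.html
```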

Let’s put these lines together to extract all the URLs:

from scrapy import Spider
from scrapy.http import Request

class BooksSpider(Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            absolute_url = response.urljoin(book)
            yield Request(absolute_url, callback=self.parse_book)
        # process next page
        next_page_url = response.xpath('//a[text()="next"]/@href').extract_first()
        absolute_nxt_page_url = response.urljoin(next_page_url)
        yield Request(absolute_nxt_page_url)

    def parse_book(self, response):
        pass

The parse(response) function is the default callback used by Scrapy to process downloaded responses when their requests don’t specify a callback.

The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Request callbacks have the same requirements as the Spider class.

This method, as well as any other Request callback, must return an iterable of Requests and/or dicts or Item objects.

Let’s test it with

scrapy crawl books

Now we have crawled all 50+ pages of the website, 1050 URLs in total. It’s as simple as that. In the next section, we’ll extract the information from these URLs and export it to MongoDB.
