
Why am I doing this
I've been leetcoding and studying system design over the past few months. Solving problems on LeetCode is fun, but system design? Hmm… I learned the elements of a system piece by piece and tried to think like an architect. Still, the concepts, the differences between technologies, the pros and cons of different models: all of it is dry and hard to remember.
One Friday night, I was reading some articles about multithreading, for the third time. I was tired and thought: why not work on something I find genuinely interesting and awesome? Like building a search engine? Nah, building a Google is too hard. But how about starting with a simple web crawler?
Here’s how I did it just for fun.
Finished product:
link: MyFirstWebCrawler


What is a web crawler
A web crawler is a program that gathers data from websites. You can use it to build a search engine like Google, of course, or just to track price changes for the shoes you like. Either way, web crawlers are a great way to get the data you need.
Before we build one, you need to know how web crawling works. The basic architecture is shown in the following graph. A web crawler, or web spider, is really just an application that scans the World Wide Web and extracts information automatically. It's as simple as taking a set of seed URLs as input and producing a set of HTML pages (data) as output. With this idea, we will build our web crawler in two steps: 1. grab destination URLs; 2. extract data from those URLs.
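To make the two-step idea concrete, here is a minimal sketch of that crawl loop in plain Python. The `fetch` and `extract_links` functions are placeholders of my own; in a real crawler they would wrap an HTTP client and an HTML parser:

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Breadth-first crawl: takes seed URLs, returns fetched pages keyed by URL."""
    queue = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)      # avoid fetching the same URL twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)      # step 1: grab the destination URL
        pages[url] = html
        for link in extract_links(html):  # step 2: extract data (here, more links)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

With a tiny in-memory "site" standing in for the web, you can see the seed-URLs-in, pages-out behavior without any network access.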

Rules before we begin
Before scraping, we need to know the basic “rules” of web crawling:
1. First of all, please read the Terms and Conditions of the website. Some websites clearly prohibit web crawling without permission, or mention legal or copyright restrictions on the use of their data. Please make sure your crawling is legal.
2. Please be ‘polite’ to the website! Don’t ignore the site’s robots.txt without good reason. Space out your requests a bit so that you won’t hammer the server, and run your spider during the site’s off-peak traffic hours.
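In Scrapy, these courtesies map to a few settings in settings.py. The values below are illustrative, not prescriptive; tune them for your target site:

```python
# settings.py: politeness knobs (example values)
ROBOTSTXT_OBEY = True                 # respect the site's robots.txt (the default in new projects)
DOWNLOAD_DELAY = 2                    # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # one request at a time per domain
AUTOTHROTTLE_ENABLED = True           # back off automatically if the server slows down
```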
3. Web crawling and other robot activity can be outright illegal if it causes any direct or indirect damage to the company that owns the data. Please employ some common sense before crawling.
Resources and tools:
Scrapy: the tool I used to build this web crawler. You can follow the official tutorial to install it on Windows or Linux.
MongoDB: the database I used to store the data extracted from the website. You can get a free account with about 500 MB of storage, which is good enough for this project.
Part I: Grab URLs
For spiders, the scraping cycle goes through something like this (source: https://docs.scrapy.org/en/latest/topics/spiders.html):
- You start by generating the initial Requests to crawl the first URLs and specify a callback function to be called with the response downloaded from those requests. The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each URL in start_urls, with the parse method as the callback function for those Requests.
- In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same one) and will then be downloaded by Scrapy and their responses handled by the specified callback.
- In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.
- Finally, the items returned from the spider will typically be persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
Now let’s begin crawling with Scrapy. We will start with a scrape-friendly site: a books website built as a demo for practicing web scraping.

- Start a project:
You can name your project anything you like. Here I name it ‘bookstore’.
scrapy startproject bookstore
- Build a spider:
We will generate a new spider to crawl all the books from the website.
scrapy genspider books books.toscrape.com
Please note: don’t include www. or http:// in this command, or it will raise an error.
Now you will have a project folder containing an __init__.py file and other .py files, plus a folder that holds your spider’s Python file.
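For reference, after startproject and genspider the layout typically looks like this (the generated files may vary slightly between Scrapy versions):

```
bookstore/
├── scrapy.cfg            # deploy configuration
└── bookstore/
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py
    ├── pipelines.py      # item pipelines (e.g. database export)
    ├── settings.py       # project settings
    └── spiders/
        ├── __init__.py
        └── books.py      # the spider we just generated
```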
Since this website does not have a robots.txt, you need to set ROBOTSTXT_OBEY = False in settings.py.

Tip: if you want to test your syntax, you can open Scrapy shell and test your code line by line:
scrapy shell http://books.toscrape.com/
Now open your spider’s .py file and start coding. You can start with the rules attribute in the class, then add several attributes and test how the results differ. For example, follow = True makes the crawler loop through all the pages automatically.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BooksSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']
    # add attributes
    rules = (Rule(LinkExtractor(deny_domains=('google.com',)), callback='parse_item', follow=True),)

    # callback (avoid naming it 'parse': CrawlSpider uses parse internally)
    def parse_item(self, response):
        pass
To run your spider, you can type:
scrapy crawl books
You will see the results after crawling:

You can check your crawling results: how many books were crawled, and whether there are any bugs. In the DEBUG log, status 200 means the request succeeded; 404 means the page was not found. If you see 404 in your results, you need to debug your code.
Next, let’s inspect the elements of the webpage and check how to extract info from them. Right-click the piece of information you want to extract, choose “Inspect” from the menu, and you will find the element.

For example, if we’d like to extract all the hyperlinks of the books, we need to find each <h3> tag, then the <a> tag inside it, and finally read the href attribute.
Here’s how the XPath code works in the shell:

XPath is an XML path language; you can check this tutorial for XPath syntax:
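If you want to see the idea outside the Scrapy shell, here is a stdlib stand-in: the snippet below mimics what response.xpath('//h3/a/@href').extract() returns, using ElementTree's limited XPath support on a hand-made HTML fragment (the book titles and hrefs are made up for illustration):

```python
import xml.etree.ElementTree as ET

# a tiny, well-formed stand-in for part of the bookstore's home page
snippet = """
<div>
  <h3><a href="catalogue/book-one/index.html">Book One</a></h3>
  <h3><a href="catalogue/book-two/index.html">Book Two</a></h3>
</div>
"""

root = ET.fromstring(snippet)
# same idea as response.xpath('//h3/a/@href').extract()
hrefs = [a.get("href") for a in root.findall(".//h3/a")]
print(hrefs)  # ['catalogue/book-one/index.html', 'catalogue/book-two/index.html']
```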
OK, now we have all the URLs on the home page. How do we extract the URLs of all 50+ pages? Yes, use the ‘next’ button! Let’s inspect the element of the ‘next’ button:

Let’s test the code that extracts “next” in the Scrapy shell:

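The relative href that extract_first() returns still needs to be joined with the page URL. The stdlib equivalent of Scrapy's response.urljoin shows the idea; the path below is an example value of what the 'next' link looks like on page 1:

```python
from urllib.parse import urljoin

# a relative 'next' href like the one //a[text()="next"]/@href returns (example value)
next_page = "catalogue/page-2.html"
absolute = urljoin("http://books.toscrape.com/", next_page)
print(absolute)  # http://books.toscrape.com/catalogue/page-2.html
```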
Let’s put these lines together to extract all the URLs:
from scrapy import Spider
from scrapy.http import Request

class BooksSpider(Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            absolute_url = response.urljoin(book)
            yield Request(absolute_url, callback=self.parse_book)
        # process next page
        next_page_url = response.xpath('//a[text()="next"]/@href').extract_first()
        absolute_nxt_page_url = response.urljoin(next_page_url)
        yield Request(absolute_nxt_page_url)

    def parse_book(self, response):
        pass
The parse(response) function is the default callback Scrapy uses to process downloaded responses when their Requests don’t specify a callback.
The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. This method, as well as any other Request callback, must return an iterable of Requests and/or dicts or Item objects.
Let’s test it with
scrapy crawl books
Now we have crawled all 50+ pages of the website, 1050 URLs in total. It’s as simple as that. In the next section, let’s extract all the information from these URLs and export it into MongoDB.
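As a preview of that export step, a MongoDB writer can live in a Scrapy Item Pipeline. This is only a sketch under my own assumptions: the URI, database name, and "books" collection are placeholders, and it assumes pymongo is installed:

```python
# pipelines.py: a minimal MongoDB pipeline sketch (placeholder names throughout)
class MongoPipeline:
    def __init__(self, uri="mongodb://localhost:27017", db_name="bookstore"):
        self.uri = uri
        self.db_name = db_name
        self.client = None

    def open_spider(self, spider):
        import pymongo  # imported lazily so the module loads even without pymongo
        self.client = pymongo.MongoClient(self.uri)
        self.db = self.client[self.db_name]

    def process_item(self, item, spider):
        self.db["books"].insert_one(dict(item))  # one document per scraped item
        return item

    def close_spider(self, spider):
        self.client.close()
```

To enable it, you would register the class in settings.py, e.g. ITEM_PIPELINES = {"bookstore.pipelines.MongoPipeline": 300}.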
