Crawling the Web for a Search Engine
an introduction to sleuthing the web with Scrapy
A while back, I worked in a two-man team with Bruno Bachmann on Sleuth, a UBC Launch Pad project to build a domain-specific search engine. This project included building everything from the website and server to a scraper that would handle populating our database with websites to search. The goal was to be able to search for detailed UBC-related content, particularly content on obscure course sites and the like that has a hard time getting surfaced by search engines like Google. This post will go over how we implemented the crawlers and scrapers that curated content for this search engine.
To do this, I decided to use Scrapy which, from what I gathered, is more or less the go-to tool for all your web scraping needs.
Scrapy, a fast high-level web crawling & scraping framework for Python (github.com)
However, most of the common use cases for Scrapy involve working with specific websites and predictable page structures, which wasn’t really all that helpful for this particular use case. Sleuth was also my first “real” software project, my first time using Python, and my first time doing any sort of scraping… so I sort of had to stab about in the dark for a solution.
The thing I struggled with the most at first was how to gather data that accurately summarizes each web page a crawler would have to traverse. These web pages could take a nearly infinite number of forms, making them rather difficult to describe. For example, a Wikipedia page might easily be described by its title and first few paragraphs, both of which are easily recognizable elements of an HTML document. Those two pieces of information will usually provide a reasonably accurate summary of a page, making it easy to surface this result in relevant contexts.
But what about a Reddit post? Is that best described by the post content, the post title, the comments, the subreddit name, the sidebar, or a combination of the above? Should karma play a role in how to prioritize this post in results? Do we want searches for “procrastinating” to return this page just because it is in the sidebar?
For more UBC-specific pages, it gets even trickier:
The above screenshot is from UBC’s course selector. When would Sleuth want to display this page in a search result? The title and description work fairly well, but in this case they are both pretty non-standard HTML elements, so we can’t just grab them the same way we grab content from a Wikipedia page. What about section data? Do we want this clumped as one result, or do we want 100 results, with each “BIOL 200” section as its own result? Link traversal is frustrating too: the easy solution is to simply visit every link on every page, but “Save To Worklist” here is a link too… one the crawler probably won’t want to visit.
Google itself seems to cheat a bit around some of these problems by offering the Google Search Console, where you can request hits from Google crawlers and learn how to have “rich results” show up when someone makes a search that includes your website. You do this by using metadata tags in specific ways that Google’s crawlers can look for and retrieve, which seems to be how we get these nice, neatly formatted search results.
So I built my solution along those lines by defining a fixed number of categories. At the top level, I set up two crawlers: a broad_crawler that would traverse all the links in the pages it visits, and a custom_crawler that would decide on an appropriate parser module to handle web pages that conform to an expected format. Along the way, each page’s child links are also tracked, which allows us to form “links” between results regardless of their type. These crawlers can then identify pages based on their content and pick the best parser to use to collect data. This approach has the advantage of being modular, and new types can be added at any time, although it is a pretty manual process.
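The "pick a parser per page" idea can be illustrated with a minimal, Scrapy-free sketch. Everything here is hypothetical and for illustration only: the parser functions, the rule list, and the select_parser helper are not Sleuth's actual code, and the sketch matches on URLs only, whereas the real crawlers also looked at page content.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlparse

Parser = Callable[[str, str], Dict[str, str]]


# Hypothetical parsers: each turns a (url, html) pair into a dict of fields.
def parse_generic_page(url: str, html: str) -> Dict[str, str]:
    return {"type": "generic", "url": url}


def parse_reddit_post(url: str, html: str) -> Dict[str, str]:
    return {"type": "reddit_post", "url": url}


def is_reddit_thread(url: str) -> bool:
    """True for URLs on reddit.com (or a subdomain) whose path contains /comments/."""
    parts = urlparse(url)
    host = parts.hostname or ""
    on_reddit = host == "reddit.com" or host.endswith(".reddit.com")
    return on_reddit and "/comments/" in parts.path


# Ordered (predicate, parser) rules; the first match wins.
# Supporting a new page type means appending one more rule here.
PARSER_RULES: List[Tuple[Callable[[str], bool], Parser]] = [
    (is_reddit_thread, parse_reddit_post),
]


def select_parser(url: str) -> Parser:
    """Return the first parser whose rule matches, else the generic parser."""
    for matches, parser in PARSER_RULES:
        if matches(url):
            return parser
    return parse_generic_page
```

Keeping the rules in a plain list is what makes this kind of design easy to extend, at the cost of writing each rule and parser by hand.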
The data flow looks a bit like this:
I made the distinction between the broad_crawler and the custom_crawler because UBC course data had to be crawled in a very specific manner from the UBC course site, and we wanted to be able to retrieve very specific information (such as rows of tables on the page for course section data). The idea was that custom_crawler would be an easily extendable module that could be used to target specific sites. Because of this, the course_crawler itself was pretty simple, and to start it up I could just attach the appropriate parser and let it run free.
broad_crawler is where the modularised parser design really shone, I think, allowing me to dynamically assign parsers after processing each request. I also set up some very rudimentary filtering when retrieving a page’s links:
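def process_request(self, req):
    # Prioritise Reddit links; only comment threads get the Reddit post parser
    if 'reddit.com' in req.url:
        req = req.replace(priority=100)
        if 'comments' in req.url:
            req = req.replace(callback=self.parse_reddit_post)
        else:
            # assumed branch: other Reddit pages are followed but not parsed
            req = req.replace(callback=self.no_parse)
    return req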
We also set up a few “datatypes” that would represent our web pages and what kind of data we wanted to retrieve:
url = scrapy.Field()
title = scrapy.Field()
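For a sense of what such a "datatype" looks like as a whole, here is a small sketch using a standard-library dataclass instead of scrapy.Item, so that it runs without Scrapy installed (recent Scrapy versions also accept dataclass objects as items). The class name GenericPage and the description and links fields are assumptions for illustration; only url and title appear in the snippet above.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class GenericPage:  # hypothetical name, not necessarily Sleuth's
    url: str
    title: str = ""
    description: str = ""  # assumed field
    links: List[str] = field(default_factory=list)  # assumed field: child links


page = GenericPage(url="https://www.ubc.ca/", title="UBC")
page.links.append("https://www.ubc.ca/about/")

# A plain dict is convenient for handing the record to a pipeline or database
record = asdict(page)
```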
There is a wide variety of tags to rely on for most pages. For our “generic” pages, I didn’t need to go too in depth: some simple descriptive metadata would be sufficient. Some examples:
# normal title
# OpenGraph title
# OpenGraph description
I think there are probably a few other metadata systems besides OpenGraph that can be leveraged for interesting metadata, though I only got around to implementing one. Again, xpath was my friend here — I used Chrome’s inspector to quickly pick out xpath elements I needed.
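As a rough, Scrapy-free sketch of the metadata lookup described above, the following uses html.parser from the standard library rather than XPath. With Scrapy selectors, the equivalent lookups would typically be XPath expressions along the lines of //title/text() and //meta[@property="og:title"]/@content. The MetaTagParser class and summarize helper are hypothetical names, and preferring the OpenGraph title over the plain title is an assumption, not necessarily what Sleuth did.

```python
from html.parser import HTMLParser
from typing import Dict


class MetaTagParser(HTMLParser):
    """Collects the <title> text and any <meta property="og:*"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.og: Dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr_map = dict(attrs)
            prop = attr_map.get("property") or ""
            content = attr_map.get("content")
            if prop.startswith("og:") and content is not None:
                self.og[prop] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html: str) -> Dict[str, str]:
    """Return a title and description, preferring OpenGraph values (assumed policy)."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    title = parser.og.get("og:title") or parser.title.strip()
    return {"title": title, "description": parser.og.get("og:description", "")}


SAMPLE = """<html><head>
<title> Plain title </title>
<meta property="og:title" content="OG title">
<meta property="og:description" content="A short summary.">
</head><body></body></html>"""

result = summarize(SAMPLE)
```

Pages without OpenGraph tags fall back to the plain title and an empty description.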
I could get a bit more in-depth with some of the more specific sites. For example, for Reddit posts, I could make post parsing dependent on karma, and retrieve comments above a certain karma threshold as well.
post_section = response.xpath('//*[@id="siteTable"]')
karma = post_section.xpath(
if karma == "" or int(karma) < POST_KARMA_THRESHOLD:
Each parser creates a Scrapy item (in this case, a ScrapyRedditPost) and populates its fields with the data retrieved from the crawled page. Perhaps the Reddit API would have been easier, but I felt that parsing Reddit as standard web pages would be the most organic way to gather “interesting” links, for example any links that might be in a post or comment. More importantly, the Reddit API is likely rate-limited.
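The karma check mentioned above can be sketched on its own, independent of how the karma text is extracted from the page. The threshold value and the helper names below are assumptions for illustration; the post does not state the real threshold. Unlike a bare int(karma), this version also tolerates non-numeric text, such as a placeholder shown when a score is hidden.

```python
from typing import Optional

POST_KARMA_THRESHOLD = 5  # assumed value, not taken from Sleuth


def parse_karma(raw: Optional[str]) -> Optional[int]:
    """Convert scraped karma text to an int, or None if it is not a number."""
    if raw is None:
        return None
    try:
        return int(raw.strip())
    except ValueError:
        # Covers "" as well as non-numeric placeholders
        return None


def should_index(raw_karma: Optional[str], threshold: int = POST_KARMA_THRESHOLD) -> bool:
    """True only when the karma is known and meets the threshold."""
    karma = parse_karma(raw_karma)
    return karma is not None and karma >= threshold
```

The same helper could be reused for comments by passing a different threshold.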
Performance is pretty important when it comes to web crawling for a search engine. The more data you have in store, the higher the chances that you will have good, interesting results. For us, more data also meant more links, which was the foundation of how we wanted our data to be displayed on the Sleuth frontend.
In the interest of that, I made a few tweaks to the Scrapy settings and pipelines (which I won’t go over in this post). These two areas were more or less the only places I could realistically make optimizations — we didn’t have the time, skill, or resources to set up systems like distributed crawling, so we stuck with the basics.
The first thing I wanted to change was depth priority. Because we start with a few seed URLs (scroll back up to the flowchart for a reminder), I didn’t want Scrapy to spend all our system resources chasing links from the first seed URL, so I set a positive DEPTH_PRIORITY, which lowers the priority of deeper requests, so that the crawlers would be able to get a greater “breadth” of results from a wider range of sources.
# Process lower depth requests first
DEPTH_PRIORITY = 50
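To see why a positive value favours breadth, here is a small simulation. It assumes, based on my understanding of Scrapy's depth middleware, that each request's priority is adjusted to priority - depth * DEPTH_PRIORITY and that higher-priority requests are scheduled first. The crawl_order helper and the example requests are made up for illustration; this is not Scrapy's scheduler.

```python
import heapq
from typing import List, Tuple

DEPTH_PRIORITY = 50  # same value as the setting above


def adjusted_priority(base_priority: int, depth: int,
                      depth_priority: int = DEPTH_PRIORITY) -> int:
    """Deeper requests lose depth * depth_priority from their base priority."""
    return base_priority - depth * depth_priority


def crawl_order(requests: List[Tuple[str, int, int]]) -> List[str]:
    """Take (url, base_priority, depth) tuples; return URLs, highest priority first."""
    heap = []
    for index, (url, base, depth) in enumerate(requests):
        # heapq is a min-heap, so negate the priority;
        # the index keeps ties in insertion order
        heapq.heappush(heap, (-adjusted_priority(base, depth), index, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


pending = [
    ("seed-a/deep/deeper", 0, 2),
    ("seed-a/deep", 0, 1),
    ("seed-b", 0, 0),
    ("reddit-post", 100, 2),
]
order = crawl_order(pending)
# order == ["seed-b", "reddit-post", "seed-a/deep", "seed-a/deep/deeper"]
```

Note how the priority=100 boost from process_request is cancelled out at depth 2, so a boosted deep link only ties with an unvisited seed.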
I also allowed Scrapy to abuse my laptop for resources:
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
# Increase max thread pool size for DNS queries
REACTOR_THREADPOOL_MAXSIZE = 20
And I changed a few other things that might slow down crawling:
# Disobey robots.txt rules - sorry!
ROBOTSTXT_OBEY = False
# Reduce download timeout to quickly discard stuck requests
DOWNLOAD_TIMEOUT = 15
# Reduce logging verbosity ('DEBUG' for details)
LOG_LEVEL = 'INFO'
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
As far as crawling manners go (yes, there seem to be crawling manners! Scrapy includes a “crawl responsibly” message on its default settings), this is pretty rude. Sorry, site admins. 😛
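For contrast, a better-mannered settings fragment might look something like the sketch below. These are all standard Scrapy settings, but the values are illustrative guesses rather than anything Sleuth used or that has been tuned.

```python
# Respect robots.txt rules
ROBOTSTXT_OBEY = True

# Wait between requests to the same site (seconds)
DOWNLOAD_DELAY = 0.5

# Limit how hard any single domain gets hit
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy adapt the delay to how quickly each server responds
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
```

The trade-off is throughput: a crawl configured like this covers far fewer pages per hour than the aggressive settings above.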
🔍 Further Reading
I have also written in a bit more depth about the crawlers mentioned above here, and you can read a quick rundown of how the pipeline works here. The Scrapy documentation is a must-read as well.
You can also check out the Sleuth repositories to learn more about this particular project! Development has ceased, but it was a pretty fun learning experience, and we got some pretty nice code coverage in the end. 🔥 🎉