Things you should know before you build a web crawler
Part 2: Writing a quick and dirty web crawler in Elixir is published!
So you have a web project that you need to scrape data for? This is a fairly common need, and Elixir's process-based concurrency is excellent for it, especially when backed by a job queue for robustness. But before we walk through an implementation, let's talk about the things you should know before writing a crawler.
First off, let's discuss a simple architecture for a web crawler (a minimal code sketch of the loop follows the list).
- A queue of URLs that various workers pull from; each worker fetches the page and extracts elements of interest such as links and content.
- The content is then indexed into a conventional relational database, a document store, or an open-source search engine such as Elasticsearch or Apache Solr.
- Any links that are discovered are added back to the queue for workers to continue processing.
- The workers process URLs from the queue until the queue is empty, a depth limit is hit, or they hit a cycle and wind up back on pages they have already crawled.
- Rinse and repeat
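To make that loop concrete, here is a minimal single-process sketch in Elixir. The `fetch_and_parse/1` and `index/1` functions are hypothetical placeholders for your HTTP client and storage layer; a real crawler would split this work across worker processes, which is exactly what we'll do in Part 2.

```elixir
defmodule Crawler.Loop do
  # `fetch_and_parse/1` and `index/1` are hypothetical placeholders for your
  # HTTP client and storage layer.

  # `queue` is a list of URLs; `visited` is a MapSet of URLs already crawled.
  def crawl([], _visited), do: :done

  def crawl([url | rest], visited) do
    if MapSet.member?(visited, url) do
      crawl(rest, visited)
    else
      {content, links} = fetch_and_parse(url)  # fetch the page, extract content + links
      index(content)                           # store the content in your database / index
      new_links = Enum.reject(links, &MapSet.member?(visited, &1))
      crawl(rest ++ new_links, MapSet.put(visited, url))
    end
  end

  defp fetch_and_parse(_url), do: {"<html>...</html>", []}  # placeholder
  defp index(_content), do: :ok                             # placeholder
end
```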
While this is a simple enough structure, there are a lot of nuances to the implementation.
- How many workers do you run?
- How do you ensure you are not overloading a website you are crawling?
- Are you respecting their robots.txt?
- What is the depth limit of the crawl and how do you ensure you are only crawling unique webpages?
- What if your app crashes? How do you pick up from the same spot in your crawl without losing data or position?
See, it can get complex pretty quickly. Then you have to consider your data storage, which is a bit outside the scope of this article, though I will follow up with another article discussing just that. Anyway, let's answer the questions I posed above.
Number of workers
For the number of workers, you generally want to consider your hardware constraints. For instance, my laptop has a 6-core processor with 2 hardware threads per physical core, giving me 12 logical cores total. If I had to figure out how many workers to run, I would start at 12 and tune up or down depending on performance. In the crawler we are going to build, each worker is a separate process on the Erlang virtual machine, which is what runs your Elixir code after it is compiled into bytecode. Elixir, via Erlang, has first-class support for asynchronous processing: when your worker process is making a request and waiting on a response, the virtual machine can switch execution over to another process and handle its work in the meantime. Elixir does this for you through the Erlang VM.
This is why I said 12 workers is a good place to start on my machine: 12 processes can be working at any time without fighting for resources. That said, Elixir processes run inside the Erlang VM, which has one scheduler per core, so you can likely tune up to many more processes running concurrently without resource contention or slowdown. For one of the crawlers I wrote recently for Whize, I was running 50 or so processes at once with no slowdown. Finally, if you have fewer processes than cores, you wind up with the opposite problem: idle cores doing too little work, which again means you are not crawling as fast as you could be.
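As a rough illustration (not the implementation we will build in Part 2), here is one way to cap concurrency in Elixir with `Task.async_stream/3`. The `urls`, `fetch_and_parse/1`, and `handle/1` names are placeholders; the key idea is starting from `System.schedulers_online/0` and tuning the `:max_concurrency` option from there.

```elixir
# Start from the number of schedulers (usually one per logical core).
workers = System.schedulers_online()   # 12 on the laptop described above

urls
|> Task.async_stream(&fetch_and_parse/1,
  max_concurrency: workers,   # tune this up for IO-bound crawling
  timeout: 30_000             # don't let one slow site block the stream forever
)
|> Enum.each(fn {:ok, result} -> handle(result) end)
```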
Not overloading the site
Okay, so that answers how many workers; how about not overloading the websites you are crawling? The general rule for being a respectful crawler is to treat the website as if it were experiencing normal traffic from regular users. You never want to slow down or take down a website with your crawler. The simplest thing to do here is to use a sensible delay between requests, roughly what you would expect from a regular user. That said, you can usually go a bit faster while still respecting the site, and the easiest way to do that is something called throttling.
The simplest form of throttling is to track your response times, and if you notice the website slowing down, slow down how quickly you process URLs against that site by some number of seconds across all your workers. If you continue to notice a slowdown, you double or triple (and so on) the number of seconds you wait before making another request against the site. Sometimes, after a short period, you can reduce the blanket slowdown and resume normal operation; other times, if you are crawling a single large site, leaving it slow is okay. Check back in a few hours or days. It's better than attracting the attention of the site's owner, especially since web crawling is still a legal grey area.
There are other methods for throttling, such as per-request automatic throttling, which is more complex but gets the best request rate for your crawler while still being respectful of the site. In this method, you track the average response time of every outbound request to a particular site, and you adjust the delay before the next request from that group of workers by adding the difference between your last response time and that tracked average. This lets your workers pace requests relative to a tracked baseline and adjust their speed automatically: if the site slows down, you slow down with it, and you speed back up when the issue is resolved.
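Here is one possible sketch of that idea, using an Elixir `Agent` to keep a per-site baseline. The moving-average weights and the 1-second base delay are arbitrary choices for illustration, not values from a real crawler.

```elixir
defmodule Crawler.Throttle do
  # Tracks a running average of response times for one site and derives the
  # delay workers should wait before their next request. The weights and the
  # 1-second base delay are illustrative only.
  use Agent

  @base_delay_ms 1_000

  def start_link(_opts) do
    Agent.start_link(fn -> %{avg: nil, delay: @base_delay_ms} end, name: __MODULE__)
  end

  # Call this after every response with how long the request took (in ms).
  def record(response_ms) do
    Agent.update(__MODULE__, fn
      %{avg: nil} = state ->
        %{state | avg: response_ms}

      %{avg: avg} = state ->
        new_avg = avg * 0.9 + response_ms * 0.1   # exponential moving average
        extra = max(response_ms - new_avg, 0)     # how far the last request lagged the baseline
        %{state | avg: new_avg, delay: @base_delay_ms + round(extra)}
    end)
  end

  # Workers sleep for this long before their next request to the site.
  def delay_ms, do: Agent.get(__MODULE__, & &1.delay)
end
```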
Respecting the robots.txt
Many websites have a file on their server called robots.txt. This file indicates which sections of the website they would like you to keep your crawler out of. Sometimes the website specifies that it does not wish to be crawled at all. Whatever the case may be, respecting the robots.txt helps ensure you run into the fewest problems. That said, nothing enforces the robots.txt; it is up to you to be a responsible developer and respect it, or not, but understand the potential legal consequences of not doing so. If there is no robots.txt, you should still treat the website respectfully and try to limit the crawl to the specific information you need.
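If you want to honor the Disallow rules programmatically, a deliberately simplified check might look like the following. It assumes HTTPoison as the HTTP client and ignores the finer points of robots.txt (user-agent groups, Allow rules, wildcards, crawl-delay), so treat it as a starting point rather than a complete parser.

```elixir
defmodule Crawler.Robots do
  # Fetch robots.txt once per host and collect its Disallow paths.
  # Assumes HTTPoison as the HTTP client; ignores user-agent groups,
  # Allow rules, and wildcards for the sake of brevity.
  def disallowed_paths(host) do
    case HTTPoison.get("https://#{host}/robots.txt") do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        body
        |> String.split("\n")
        |> Enum.filter(&String.starts_with?(String.downcase(&1), "disallow:"))
        |> Enum.map(fn line ->
          line |> String.split(":", parts: 2) |> List.last() |> String.trim()
        end)
        |> Enum.reject(&(&1 == ""))

      _ ->
        []  # no robots.txt (or it failed to load): nothing explicitly disallowed
    end
  end

  # Check a URL's path against the collected Disallow rules before crawling it.
  def allowed?(url, disallowed) do
    path = URI.parse(url).path || "/"
    not Enum.any?(disallowed, &String.starts_with?(path, &1))
  end
end
```

You would fetch the Disallow list once per host, cache it, and call `allowed?/2` before enqueuing each URL.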
I would be lying if I said I had never crawled websites that marked certain sections as off-limits to crawling. This is a grey area. I make no recommendation one way or the other, and it is up to you to understand the potential consequences of doing so. Technically, if that area of the website is public, there are a few legal cases floating around that may protect your ability to crawl those pages regardless of the robots.txt. However, I am not a lawyer, so this is not legal advice, and those cases, even if they provide some measure of precedent, do not absolutely protect you from being dragged into court for crawling something a company or person doesn't want crawled. They can still make your life really unpleasant for a while, even if you are ultimately in the right.
There are other, less self-preserving reasons to respect the robots.txt as well. Depending on the website or company, you may be creating a headache for some of their staff. Sometimes things are on the do-not-crawl list because they know the service can only handle so much traffic, and if you're not careful you could bring that service down. In any company I've worked for, that means you've gotten somebody out of bed at night for their on-call duties, or, if it's during the day, you've likely disrupted whatever they had planned to work on while they figure out how to ban your crawler and/or send you a nasty email or a legal cease and desist. Side note: you should never crawl a website during its peak hours. You are adding traffic on top of their normal traffic, and if you bring their site down you could cause real problems and lost revenue, which affects people's jobs.
Helping your crawler be the best crawler it can be
We’ve covered how fast you should crawl and what you should and shouldn’t crawl. There are a few things left that I think I should mention:
- How to keep your crawler from crawling infinitely
- Keeping your crawler on target so you aren’t visiting domains with information you don’t want
- Handling crashes robustly so you can pick up where you left off.
Keeping your crawler on task and within some limits
Many times when you are writing a web crawler you have a specific goal in mind, with a particular subset of information you are looking to extract. Usually this means you have either one website you want to crawl or a small set of related websites; more often than not you are not attempting to index the whole internet (unless you're a search engine like us). That means you want to keep your crawler confined to a set of domains. The easy way to do this, if you have a known set of domains you want to explore, is to keep those domains in a list, and before adding new URLs to the queue, use the URL parser in your language (in Elixir this is URI.parse()) to compare the hostname against the domains in your list. If it is in the list, add the new URL to the queue; if it isn't, skip it. Very simple.
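A minimal version of that allowlist check could look like this; the hosts in `@allowed_hosts` are placeholders for whatever domains your crawl is scoped to.

```elixir
defmodule Crawler.Scope do
  # The hosts here are placeholders for whatever domains your crawl is scoped to.
  @allowed_hosts MapSet.new(["example.com", "blog.example.com"])

  # Parse the URL and only accept it if its host is on the allowlist.
  def in_scope?(url) do
    case URI.parse(url) do
      %URI{host: host} when is_binary(host) -> MapSet.member?(@allowed_hosts, host)
      _ -> false
    end
  end
end

# Before enqueuing discovered links:
# links |> Enum.filter(&Crawler.Scope.in_scope?/1) |> enqueue()
```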
Where things get tricky is when your set of domains is not so neatly contained. This is where depth limits come into play. In this scenario, you add every URL to your queue and keep track of how many links deep your crawler has gone. When your crawler hits some number you've specified, you finish crawling that URL but don't add any links it finds to the queue. This ensures there is some limit and your crawler doesn't go too far off the rails. Of course, this means you'll inevitably get some garbage data, but web crawling is rarely neat. The third option is to create a blacklist of domains you don't want your crawler to explore. This is generally much harder, because the number of domains you don't want to crawl is far higher than you could realistically keep in a list or even know about; that said, for things like known advertising domains or links to social media, it may be a good option.
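A sketch of that depth limit might look like the following, where each queue entry carries the depth at which it was found. `seen?/1` and `enqueue/2` are hypothetical hooks into your visited set and queue, and the limit of 3 is arbitrary.

```elixir
defmodule Crawler.Depth do
  # `seen?/1` and `enqueue/2` are hypothetical hooks into your visited set and queue.
  @max_depth 3

  # Past the depth limit: finish this page, but don't follow its links any deeper.
  def maybe_enqueue(links, parent_depth) when parent_depth >= @max_depth do
    {:skipped, length(links)}
  end

  # Otherwise, drop anything already seen and enqueue the rest one level deeper.
  def maybe_enqueue(links, parent_depth) do
    links
    |> Enum.reject(&seen?/1)
    |> Enum.each(&enqueue(&1, parent_depth + 1))
  end

  defp seen?(_url), do: false          # placeholder: check your visited set
  defp enqueue(_url, _depth), do: :ok  # placeholder: push {url, depth} onto the queue
end
```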
Generally speaking, a combination of all three methods will get you pretty good results when you have some idea of the domains you need to be on, while branching out might reward you with more of the data you want to scrape. These methods are neither exhaustive nor fool-proof, but they are simple and cover enough surface area that they should get you a lot of mileage.
Handling and recovering from crashes
You've been crawling your targets for hours, and there is a huge list of URLs queued up to be crawled. Satisfied, you go to bed. In the morning you see that your crawler had a memory leak, and at some point in the night the process was killed or your box crashed.
From here two things can happen:
- Scenario A — Realizing this might happen, you dumped your queue of URLs to a JSON file every 5 seconds and added a mechanism to load that file and pick up where you left off. You fix the leak, start the crawler back up, load the file, and go on your way knowing you've only lost a little time.
- Scenario B — Not realizing this might happen, you did not dump your queue of URLs to a JSON file and have no mechanism to restore the crawler's state. You fix the leak, start the crawler, and realize you've wasted the night, since you have to start from the beginning to get back to where the queue was before it died.
You definitely want to be the person in Scenario A.
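As a sketch of the Scenario A approach, here is a small GenServer that owns the queue, checkpoints it to a JSON file every 5 seconds, and reloads that file on startup. It assumes the Jason library for JSON encoding; the file name and interval are arbitrary.

```elixir
defmodule Crawler.Checkpoint do
  # Owns the URL queue, dumps it to a JSON file every 5 seconds, and reloads
  # the file on startup so the crawl can resume after a crash.
  use GenServer

  @dump_file "crawler_queue.json"
  @interval :timer.seconds(5)

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def push(url), do: GenServer.cast(__MODULE__, {:push, url})
  def pop, do: GenServer.call(__MODULE__, :pop)

  @impl true
  def init(:ok) do
    queue =
      case File.read(@dump_file) do
        {:ok, json} -> Jason.decode!(json)  # pick up where we left off
        _ -> []
      end

    Process.send_after(self(), :dump, @interval)
    {:ok, queue}
  end

  @impl true
  def handle_cast({:push, url}, queue), do: {:noreply, queue ++ [url]}

  @impl true
  def handle_call(:pop, _from, []), do: {:reply, :empty, []}
  def handle_call(:pop, _from, [url | rest]), do: {:reply, {:ok, url}, rest}

  @impl true
  def handle_info(:dump, queue) do
    File.write!(@dump_file, Jason.encode!(queue))  # checkpoint the queue to disk
    Process.send_after(self(), :dump, @interval)
    {:noreply, queue}
  end
end
```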
The method discussed above is only one way (and a simple, even naive one) to recover from crashes. Many storage technologies take care of saving to disk for you, so if a crash occurs you can simply load your state back by querying them. Document stores such as MongoDB, distributed databases such as Riak, and classic relational databases like PostgreSQL all write to disk at some point in their operations. You can also use technologies like Redis and flush to disk every so often, and there are many more options besides. However, each of these choices has trade-offs that are outside the scope of this article, and they often come down to decisions between speed and fault tolerance.
That said, regardless of what you go with, it is important to have a way of restoring your progress from some form of storage. You can even get fancy and rig things up so that if your crawler dies, the process is restarted at the OS level and loads from your store of choice. Have a safety net, because your crawlers are very likely to crash.
Let’s Recap!
- There are many nuances to web scraping, and some are specific to the topic you are crawling.
- The number of workers your scraper should use is, at minimum, the number of cores on your machine. Beyond that, it's a matter of tuning through trial and error.
- Make sure to crawl your target as if you were a regular user and never overload the website. Do your crawl in off-peak hours.
- Respecting the robots.txt is important; you don't have to, but it is your responsibility to understand the consequences if you don't.
- Limiting your crawler to a minimal set of websites for your target information, and making sure it doesn't go in circles, is key to getting the information you want instead of garbage.
- Whether you write to a database and use that as your source of truth or keep everything in memory and dump to a flat-file occasionally, you need a mechanism to recover from crashes. They both have their tradeoffs.
- Have fun with it! Web crawling is a powerful tool to have in your kit as a developer; you can build some really cool things if you do it right.
What’s next?
Okay, so these are some of the high-level things I believe are important to know to have the best possible experience writing a web crawler. In the next article, I will show you how to put together a web crawler in Elixir using the simple concepts discussed above. This series came about because I wanted to share a simplified version of the work my friend and I are doing for an early alpha of our search engine, Whize. So if you are interested in a search engine that focuses on early small businesses and independent content creators, sign up there to receive notifications and check out my other article:
Existing search engines fail independent and small businesses. Enter Whize.
Part 2 of this three-part series will be released in the coming weeks, where I will show you how to build a quick and dirty web crawler in Elixir from start to finish.
If you enjoyed this, please clap and/or subscribe! Thanks!
Check out Part 2: Writing a quick and dirty web crawler in Elixir