How to build a scalable crawler to crawl a million pages with a single machine in just 2 hours
There have been lots of articles about how to build a Python crawler. If you are new to Python and not familiar with multiprocessing or multithreading, this tutorial may be the right choice for you.
You don’t need to know how to manage processes, threads, or even queues. Just input the URLs you want to scrape, extract the parts of the page structure you need, set the number of crawlers and their concurrency, and the rest is “Fire it up”!
All the code here just simulates an efficient distributed crawler and shows how to use Docker and Celery. There are a few things to declare before we get started:
1. No matter which website you want to scrape, please check its robots.txt policy and terms of service (TOS) first.
2. Don’t send a large number of requests to the same website at one time. Please be gentle.
3. Don’t do anything that violates the local laws of your country.
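As a minimal sketch of the first two points, Python's standard library can read a robots.txt policy and tell you whether a URL may be fetched and how long to wait between requests. The robots.txt content, the user agent name, and the example.com URLs below are made up for illustration, and the policy is inlined so the snippet runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "my-crawler"  # hypothetical crawler name


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Allowed path, disallowed path, and the delay the site asks for.
print(parser.can_fetch(USER_AGENT, "http://example.com/index.html"))
print(parser.can_fetch(USER_AGENT, "http://example.com/private/data"))
print(parser.crawl_delay(USER_AGENT))
```

Against a real site you would point the parser at the live file with set_url() and read() instead of an inlined string.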
This tutorial is an upgraded version of a previous post, How to build docker cluster with celery and RabbitMQ in 10 minutes. That post covers lots of technical details about how to write a Dockerfile, use docker-compose, and configure Celery and RabbitMQ, so I won’t go over them again.
Because of Docker, we can scale up any application easily, so there are only two key files in this article. Check the previous post for more details.
List of URLs: you can get the Alexa top 1 million domain list from this website and store it in a database or a text file as you need. For a quick test, I just built an nginx hello page on my cloud server, then repeated its URL to make a list of 1,000,000.
Then use the longtime_add.delay method to send all of them to RabbitMQ, the message broker.
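The two steps above can be sketched roughly as follows. The test URL is a placeholder for the nginx hello page, and RecordingTask is a hypothetical stand-in for the real Celery task so that the snippet runs without a broker. The only thing it assumes about the task is that it has a .delay() method.

```python
from itertools import repeat

TEST_URL = "http://example.com/"  # placeholder for the nginx hello page


def build_url_list(url, count):
    """Repeat one URL `count` times, as in the quick test."""
    return list(repeat(url, count))


def enqueue_all(task, urls):
    """Queue one task per URL via task.delay() and return how many were sent."""
    queued = 0
    for url in urls:
        task.delay(url)
        queued += 1
    return queued


class RecordingTask:
    """Hypothetical stand-in for a Celery task; it only records the calls."""

    def __init__(self):
        self.sent = []

    def delay(self, url):
        self.sent.append(url)


demo_task = RecordingTask()
print(enqueue_all(demo_task, build_url_list(TEST_URL, 10)))
```

In the real project you would pass longtime_add itself (imported from the tasks file) instead of RecordingTask, and raise the count to 1000000.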
With the command ‘docker-compose scale worker=10’, docker-compose will generate a cluster of 10 workers from this “tasks” file. The workers take the URLs from the queue, fetch each one with requests, return the status_code, and store the status_code together with the time.time() timestamp. If it works, you will see the status codes and timestamps appear in the Mongo database.
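Here is a dependency-free sketch of what each task does for one URL. It is not the original tasks file: it uses urllib from the standard library in place of requests, takes the storage function as a parameter instead of talking to Mongo directly, and the field names "status" and "create_time" are assumptions.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url (stdlib stand-in for requests)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def make_record(status_code):
    """Build the document to store: status code plus current timestamp."""
    return {"status": status_code, "create_time": time.time()}


def crawl_one(url, store, fetch=fetch_status):
    """Fetch one URL, store its record, and return the status code."""
    record = make_record(fetch(url))
    store(record)
    return record["status"]


# Offline demo: a fake fetch and a plain list instead of a Mongo collection.
stored = []
print(crawl_one("http://example.com/", stored.append, fetch=lambda url: 200))
print(stored[0]["status"])
```

With pymongo, the store argument could be a collection's insert_one method.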
Generally, because of the network at your home or server host, connections are not always stable, and when a request fails we have to try again. So we set the retry delay to 10 seconds here: when a task fails, it is retried 10 seconds later.
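The retry behaviour can be illustrated in plain Python as below. Celery tasks expose the same idea through the default_retry_delay option and self.retry(), so this is only a sketch of the logic, not the task code itself. The limit of 3 retries mirrors Celery's default max_retries and is an assumption, since the tutorial does not state a limit.

```python
import time


def run_with_retry(func, *args, retry_delay=10, max_retries=3, sleep=time.sleep):
    """Call func(*args); on failure wait retry_delay seconds and try again.

    The last exception is re-raised once max_retries retries are used up.
    """
    attempt = 0
    while True:
        try:
            return func(*args)
        except Exception:
            if attempt >= max_retries:
                raise
            attempt += 1
            sleep(retry_delay)


# Demo: a function that fails twice before succeeding. The sleep function is
# replaced by a recorder so the demo does not actually wait 20 seconds.
calls = {"count": 0}
waits = []


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated unstable network")
    return 200


print(run_with_retry(flaky_fetch, "http://example.com/", sleep=waits.append))
print(waits)
```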
2. Configure Dockerfile and docker-compose
ENTRYPOINT celery -A test_celery worker --concurrency=20 --loglevel=info
One thing to note here: do not set the concurrency in the Dockerfile too high. In this case, 20 is big enough for my machine.
What we need to add to the docker-compose file is the Mongo database, with the port mapping set to ‘27018:27017’. The first number (the host port) should be the same as the one used in tasks.
3. Let’s Run
My local machine has two CPUs, each with 6 cores and 12 threads, and 32 GB of memory in total. It’s powerful enough to run 40 workers with a concurrency of 20 each.
The higher the concurrency of each worker, the more memory it needs. In this case, the 40 workers use about 12 GB of memory in total, with about half left. If I increased the concurrency or the number of workers a little, the crawler could be faster, but the bandwidth of my home network is just 1 mb/s. Even with more workers, the speed doesn’t increase at all. In this test, the bottleneck is my poor network.
With 40 workers, 15 minutes later I had about 100,000 items in my Mongo database, using a single machine.
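As a back-of-the-envelope check, the figures above can be turned into a throughput estimate. The inputs are the approximate numbers from this test, so the results are rough: about 111 pages per second, which on this network would put a million pages at roughly two and a half hours.

```python
def crawl_estimates(workers, concurrency, items, minutes, memory_gb, target_pages):
    """Derive rough throughput and resource figures from one test run."""
    pages_per_second = items / (minutes * 60)
    return {
        "slots": workers * concurrency,  # total concurrent task slots
        "pages_per_second": pages_per_second,
        "hours_for_target": target_pages / pages_per_second / 3600,
        "memory_per_worker_gb": memory_gb / workers,
    }


# Numbers from this test: 40 workers x 20 concurrency, ~100000 items in
# 15 minutes, ~12 GB of memory used, and a target of 1000000 pages.
estimates = crawl_estimates(40, 20, 100_000, 15, 12, 1_000_000)
for name, value in estimates.items():
    print(f"{name}: {value:.2f}")
```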
If you have a powerful server or PC, deploying a Docker cluster helps you get the most out of the machine, and with small servers such as a Raspberry Pi, Docker is a good choice too. Of course, you can build a distributed crawler without Docker as well. Software like Fabric can deploy your application to a cluster of servers with a few commands.
If you have any questions or suggestions, please feel free to leave a response here, and you are welcome to submit pull requests to this project. See you in the next tutorial!