In a previous article, How to build a scalable crawler to crawl a million pages, I wrote about building a scalable crawler with Docker Compose. Now imagine a scenario where you have lots of servers around the world: do you still need to install requirements, configure the system, and run the script on each one by hand? Docker and Docker Compose alone can't help you there, and Docker Swarm or k8s seem too complicated for a project like this.
So which tool should we choose to get this done quickly?
As mentioned in the last paragraph of the previous post, it's Fabric.
Fabric is a Python (2.5–2.7) library and…
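To make the idea concrete, here is a minimal fabfile sketch using the classic Fabric 1.x API (the Python 2.5–2.7 line the article refers to). The host names, user, and paths are placeholders, not values from the article.

```python
# fabfile.py -- a minimal deployment sketch with the Fabric 1.x API.
# Hosts, user, and the /opt/crawler path are hypothetical placeholders.
from fabric.api import env, run, cd, parallel

env.hosts = ["crawler1.example.com", "crawler2.example.com"]
env.user = "deploy"

@parallel
def deploy():
    # Runs on every host in env.hosts, in parallel.
    with cd("/opt/crawler"):
        run("git pull")
        run("docker-compose up -d")
```

With this in place, `fab deploy` pushes the same update to every server in one command instead of SSHing into each box by hand.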
There have been lots of articles about how to build a Python crawler. If you are new to Python and not familiar with multiprocessing or multithreading, perhaps this tutorial will be the right choice for you.
You don't need to know how to manage processes, threads, or even queues: just input the URLs you want to scrape, describe the page structure you need to extract, set the number of crawlers and the concurrency, and the rest is "Fire it up!"
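Under the hood, "fire it up" boils down to a pool of workers pulling URLs and fetching them concurrently. Here is a minimal stdlib-only sketch of that idea; `fetch` is a stand-in for a real HTTP request and the URLs are placeholders.

```python
# A minimal sketch of what a distributed crawler manages for you:
# a pool of worker processes, each fetching URLs concurrently.
from multiprocessing import Pool

def fetch(url):
    # Placeholder for a real request, e.g. requests.get(url) plus parsing.
    return (url, "ok")

def crawl(urls, concurrency=4):
    # Pool distributes the URL list across `concurrency` worker processes.
    with Pool(processes=concurrency) as pool:
        return pool.map(fetch, urls)

if __name__ == "__main__":
    results = crawl(["http://example.com/page/%d" % i for i in range(8)])
    print("crawled %d pages" % len(results))
```

In the real setup, Celery workers replace this local pool, so the same pattern scales across many machines instead of one.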
All the code here simply simulates an efficient distributed crawler, showing you how to use…
There are lots of tutorials about how to use Celery with Django or Flask in Docker. Most of them are good tutorials for beginners, but here I don't want to talk about Django; I just want to explain how to run Celery with RabbitMQ in Docker, and generate worker clusters with just ONE command.
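A setup like that can be sketched with a small Compose file; the image, build context, and the `tasks` module name below are assumptions for illustration, not the article's exact files.

```yaml
# docker-compose.yml -- a minimal sketch: RabbitMQ broker plus a
# scalable Celery worker service. Names here are placeholders.
version: "3"
services:
  rabbitmq:
    image: rabbitmq:3-management
    ports:
      - "5672:5672"    # AMQP broker port
      - "15672:15672"  # management UI
  worker:
    build: .
    command: celery -A tasks worker --loglevel=info
    depends_on:
      - rabbitmq
```

The "ONE command" then looks like `docker-compose up -d --scale worker=10`, which starts the broker and a cluster of ten identical workers.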
Of course, you could build an efficient crawler cluster with it!
Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.