A crawler that beats bot detection
In this article I will present an example of a resilient crawler that is able to change its IP on demand.
Keywords: Python, Mac OS X, Tor, Proxy, IP rotation
In my recent work I scraped a lot of web data.
When I built my first crawler, it was very basic, so it was easily detected as a bot. Over time, I learned some tricks to avoid detection. The main idea is to make the crawler simulate human behaviour. So I tried the following methods:
- Sleeping randomly: Waiting a few seconds between requests to the same website proved to be really effective. This is kind of intuitive: when someone is browsing, they spend time consuming the content of a page, so switching between pages instantly isn’t very human-like.
- Putting a user agent of a valid browser: This is a must-do! If you don’t, you are either not describing yourself to the website, which is suspicious, or you are actually sending it a user agent declaring that you are a robot.
- Switching between user agents: In my experience this works poorly and sometimes makes things worse. If you keep the same IP but change the user agent, you are telling the website that you are constantly switching between different devices and browsers, and this is suspicious because no human does that.
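The first two tricks can be sketched in a few lines of Python. This is only an illustration: the User-Agent string, the 2 to 6 second range, and the helper names `human_delay`, `build_request` and `crawl` are my own assumptions, not something a particular website requires.

```python
import random
import time
import urllib.request

# Assumption: any real browser's User-Agent string can be used here.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def human_delay(min_seconds=2.0, max_seconds=6.0):
    """Pick a random pause length, in seconds, between two requests."""
    return random.uniform(min_seconds, max_seconds)


def build_request(url):
    """Build a request that always carries the same browser User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def crawl(urls, timeout=30):
    """Fetch each URL in turn, sleeping a random amount between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(human_delay())  # no pause before the very first page
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            pages[url] = response.read()
    return pages
```

Note that the user agent is a single constant on purpose: per the third point, it stays the same for as long as the IP stays the same.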
These tricks were really helpful for most websites. For some others, however, the challenge was a little harder.
After many trials, I understood the following:
Why avoid bot detection when you can counterattack?
So the idea here is to stop looking for ways to avoid detection and instead find a way to respond to it.
Solution: IP Rotation
When a website flags you as a robot, it actually flags your IP. So if you swap your compromised IP for a new one, you’re in the clear. OK cool, but how do you do it? I came across an awesome tutorial from 2014 called “Crawling anonymously with Tor in Python” which explains everything. Plus, it uses the Tor network, so your traffic is anonymized along the way.
In this article I’m doing the Mac OS X version. (Here’s the same configuration for Linux)
First, you need brew, aka the missing package manager for macOS. Then use it to install Tor:
brew install tor
Next, do the following:
- Enable the ControlPort listener so that Tor listens on port 9051, as this is the port on which Tor will accept any communication from applications talking to the Tor controller.
- Hash a new password that prevents random access to the port by outside agents.
- Implement cookie authentication as well.
And this is how it’s done:
You can create a hashed password out of your password using:
tor --hash-password my_password
Then, update /usr/local/etc/tor/torrc with the port, hashed password, and cookie authentication.
# content of torrc
ControlPort 9051
# hashed password below is obtained via `tor --hash-password my_password`
HashedControlPassword YOUR_HASHED_PASSWORD
CookieAuthentication 1
Restart Tor so that the configuration changes are applied.
brew services restart tor
Tor itself is not an HTTP proxy. So in order to get access to the Tor network, use privoxy as an HTTP proxy that forwards through SOCKS5.
Install privoxy via the following command:
brew install privoxy
Now tell privoxy to use Tor by routing all traffic through the SOCKS server at localhost port 9050. To do that, append the following line to /usr/local/etc/privoxy/config:
forward-socks5t / 127.0.0.1:9050 . # the dot at the end is important
Restart privoxy after making the change to the configuration file.
brew services restart privoxy
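At this point it can be worth checking that requests sent to privoxy really leave through Tor. Here is a minimal sketch using only the standard library. It assumes privoxy listens on its default port 8118 and that the Tor Project's check endpoint answers with JSON containing `IsTor` and `IP` fields; the helper names are hypothetical.

```python
import json
import urllib.request

PRIVOXY_URL = "http://127.0.0.1:8118"  # privoxy's default listen address
# Assumption: this endpoint replies with JSON like {"IsTor": true, "IP": "..."}.
CHECK_URL = "https://check.torproject.org/api/ip"


def make_tor_opener(proxy_url=PRIVOXY_URL):
    """Return a urllib opener that sends HTTP and HTTPS traffic via privoxy."""
    proxy = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(proxy)


def parse_check_response(body):
    """Extract (is_tor, ip) from the JSON body of the check endpoint."""
    data = json.loads(body)
    return bool(data.get("IsTor")), data.get("IP")


def check_tor(opener=None, timeout=30):
    """Ask the check endpoint, through privoxy, whether we exit via Tor."""
    opener = opener or make_tor_opener()
    with opener.open(CHECK_URL, timeout=timeout) as response:
        return parse_check_response(response.read())
```

With Tor and privoxy both running, `check_tor()` should return a pair whose first element is `True`. If it raises a connection error, privoxy is probably not listening on 8118.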
Finally, install stem, which is a Python-based module used to interact with the Tor controller, letting us send and receive commands to and from the Tor control port programmatically.
pip install stem
Aaaand that’s it for the requirements 🎉
It’s clearly not the easiest configuration, but it’s really worth the hassle.
In the script below, urllib is using privoxy, which listens on port 8118 by default and forwards the traffic to port 9050, on which the Tor SOCKS proxy is listening.
Additionally, in the renew_connection() function, a signal is sent to the Tor controller via port 9051 to change the identity.
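A minimal sketch of that flow, assuming the setup above (privoxy on 8118, control port 9051, password `my_password`), could look like this. The check URL, the 5-second settle pause, and every helper name other than `renew_connection()` are assumptions made for illustration.

```python
import time
import urllib.request

PRIVOXY_URL = "http://127.0.0.1:8118"  # privoxy, forwarding to Tor on 9050
CONTROL_PORT = 9051                    # ControlPort from torrc
CONTROL_PASSWORD = "my_password"       # the password that was hashed for torrc
# Assumption: any service that echoes the caller's IP would do here.
IP_CHECK_URL = "https://check.torproject.org/api/ip"


def build_tor_opener(proxy_url=PRIVOXY_URL):
    """Return a urllib opener that routes requests through privoxy."""
    proxy = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(proxy)


def fetch(url, opener, timeout=30):
    """Download a URL through the given opener and return it as text."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def renew_connection(password=CONTROL_PASSWORD, port=CONTROL_PORT, settle_seconds=5):
    """Ask the Tor controller for a new identity (new circuits, new exit IP)."""
    # Imported here so the rest of the module loads even without stem installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=port) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)
    # Assumption: a short pause to give Tor time to build fresh circuits.
    time.sleep(settle_seconds)


def demo(rounds=3):
    """Print the exit IP as seen from outside, renewing it after each round."""
    opener = build_tor_opener()
    for _ in range(rounds):
        print(fetch(IP_CHECK_URL, opener))
        renew_connection()
```

Calling `demo()` with Tor and privoxy running should print a response per round, usually with a different IP each time. Tor does not guarantee a different exit node after every signal, so an occasional repeat is normal.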
And here is the
I have also created a repo for convenience.
tor-ip-rotation-python-example - An example of Tor IP rotation in Python (github.com)
With this code you should be able to renew your IP whenever you need to.
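One way to use this in a crawler is to renew the IP only when a response looks like a block, then retry. The sketch below takes the fetch and renew functions as parameters, so it does not depend on any particular HTTP library. Treating 403 and 429 as "blocked", the 10-second pause, and the name `fetch_with_rotation` are all assumptions to adapt to the target website.

```python
import time

# Assumption: status codes that we interpret as "the website flagged us".
BLOCK_STATUSES = {403, 429}


def fetch_with_rotation(fetch, renew, url, max_attempts=3,
                        pause_seconds=10.0, sleep=time.sleep):
    """Fetch a URL, renewing the identity and retrying while we look blocked.

    fetch(url) must return a (status_code, body) pair.
    renew() must request a new identity, e.g. by signalling the Tor controller.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status not in BLOCK_STATUSES:
            return status, body
        if attempt < max_attempts:
            renew()
            sleep(pause_seconds)  # give the new circuits time to come up
    # Still blocked after the last attempt: hand back the last response.
    return status, body
```

In a real crawler, `renew` would be the function that signals the Tor controller, and `fetch` a small wrapper that returns the status code together with the page body.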
Note, however, that since we are using Tor, some websites may get suspicious and even block you in some cases, simply because they dislike Tor users.
Anyways, enjoy 🔥
Both article and code are heavily inspired by the two following resources:
There are a lot of valid usecases when you need to protect your identity while communicating over the public internet… (sacharya.com)