Crawling the web with TOR
In this article I will present an example of a resilient crawler that is able to change its IP on demand.
Keywords: Python, Mac OS X, Tor, Proxy, IP rotation
In my recent work I scraped a lot of web data.
When I built my first crawler, it was very basic, so it was easily detected as a bot. Over time, I learned some tricks to avoid detection. The main idea is to make the crawler simulate human behaviour. So I tried the following methods:
- Sleeping randomly: Waiting a few seconds between requests to the same website proved to be really effective. This is kind of intuitive, because someone who is browsing spends time consuming the content of a page. So switching between pages instantly isn’t very human-like.
- Putting a user agent of a valid browser: This is a must do! If you don’t do it, you are either not describing yourself to the website, which is suspicious, or you are actually sending it a user agent declaring that you are a robot.
- Switching between user agents: In my experience this works poorly and sometimes makes things worse. If you keep the same IP but change the user agent, you are telling the website that you are constantly switching between different devices and browsers, and this is suspicious because no human does that.
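Here is a minimal sketch of the first two tricks using only the standard library. The function names, the delay bounds, and the exact user agent string are my own illustrative choices, not something prescribed by any website or library.

```python
import random
import time
import urllib.request

# Assumption: any user agent string of a real browser works; this one is illustrative.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)


def random_delay(min_seconds=2.0, max_seconds=6.0):
    """Pick a random pause length so requests are not evenly spaced."""
    return random.uniform(min_seconds, max_seconds)


def build_request(url):
    """Attach the same browser user agent to every request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def polite_get(url, timeout=30):
    """Sleep for a random interval, then fetch the page and return its bytes."""
    time.sleep(random_delay())
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Note that the user agent is a single constant: the crawler keeps one browser identity for as long as it keeps the same IP.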
These tricks were really helpful for most websites. For some others, however, the challenge was a little harder.
After many trials, I came to the following realization:
Why avoid bot detection when you can counterattack?
So the idea here is to stop looking for ways to avoid detection, and instead find a way to respond to it.
Solution: IP Rotation
When a website flags you as a robot, it actually flags your IP. So if you swap your compromised IP for a new one, you’re in the clear. OK, cool, but how do you do it? I came across an awesome tutorial from 2014 called “Crawling anonymously with Tor in Python” which explains everything. Plus, it uses the Tor network, so the websites you crawl never see your real IP address.
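In outline, the strategy could look like the sketch below. `fetch` and `renew_ip` are hypothetical callables that you supply yourself, and treating HTTP 403 and 429 as "I have been flagged" is an assumption: some sites signal a block differently, for example with a CAPTCHA page.

```python
def fetch_with_rotation(url, fetch, renew_ip, max_attempts=3,
                        blocked_statuses=(403, 429)):
    """Fetch url; when the answer looks like a block, rotate the IP and retry.

    fetch(url) must return a (status_code, body) tuple.
    renew_ip() must switch the crawler to a new IP address.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status not in blocked_statuses:
            return body
        # Flagged: respond by changing IP, unless this was the last attempt.
        if attempt < max_attempts:
            renew_ip()
    raise RuntimeError(f"still blocked after {max_attempts} attempts: {url}")
```

Passing the two callables in keeps the retry logic independent of how the IP is actually changed.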
Requirements
In this article I’m doing the Mac OS X version. (Here’s the same configuration for Linux)
Brew
First, you need brew, aka the missing package manager for macOS.
Tor
brew update
brew install tor
Next, do the following:
- Enable the ControlPort listener on port 9051. This is the port on which Tor listens for commands from applications talking to the Tor controller.
- Hash a new password that prevents random access to the port by outside agents.
- Implement cookie authentication as well.
And this is how it’s done:
You can create a hashed password out of your password using:
tor --hash-password my_password
Then, update the /usr/local/etc/tor/torrc
with the port, hashed password, and cookie authentication.
# content of torrc
ControlPort 9051
# hashed password below is obtained via `tor --hash-password my_password`
HashedControlPassword 16:E600ADC1B52C80BB6022A0E999A7734571A451EB6AE50FED489B72E3DF
CookieAuthentication 1
Restart Tor so that the configuration changes are applied.
brew services restart tor
Privoxy
Tor itself is not an HTTP proxy. So in order to get access to the Tor network, use privoxy as an HTTP proxy that forwards traffic through SOCKS5.
Install privoxy
via the following command:
brew install privoxy
Now, tell privoxy to use Tor by routing all traffic through the SOCKS server at localhost port 9050. To do that, append the following line to /usr/local/etc/privoxy/config:
forward-socks5t / 127.0.0.1:9050 . # the dot at the end is important
Restart privoxy
after making the change to the configuration file.
brew services restart privoxy
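Before going further, it can help to confirm that all three services are actually listening. This is a small sketch using only the standard library. The port numbers are the ones used in this article (8118 is privoxy’s default listen port), and the function names are my own.

```python
import socket

# Ports used in this article: privoxy (default 8118), Tor SOCKS, Tor control.
SERVICES = {"privoxy": 8118, "tor-socks": 9050, "tor-control": 9051}


def port_is_open(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="127.0.0.1"):
    """Map each service name to whether its port is reachable."""
    return {name: port_is_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, is_up in check_services().items():
        print(f"{name}: {'listening' if is_up else 'NOT listening'}")
```

This only tells you that a port is open, not that authentication or forwarding is configured correctly, but it catches the most common problem: a service that did not restart.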
Stem
Next, install stem
which is a Python-based module used to interact with the Tor Controller, letting us send and receive commands to and from the Tor Control port programmatically.
pip install stem
Aaaand that’s it for the requirements 🎉
It’s clearly not the easiest configuration, but it’s really worth the hassle.
Example Script
In the script below, urllib uses privoxy, which listens on port 8118 by default and forwards the traffic to port 9050, on which the Tor SOCKS proxy is listening.
Additionally, in the renew_connection()
function, a signal is being sent to the Tor controller via port 9051 to change the identity.
And here is the TorHandler:
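A minimal sketch of such a handler, assuming the configuration from this article (privoxy on 127.0.0.1:8118, control port 9051, and the password my_password). Apart from `renew_connection()`, the method names, the default arguments, and the placeholder user agent are illustrative and may differ from the code in the repo.

```python
import urllib.request


class TorHandler:
    """Fetch pages through privoxy -> Tor and request a new identity on demand."""

    def __init__(self, proxy="127.0.0.1:8118", control_port=9051,
                 control_password="my_password", user_agent="Mozilla/5.0"):
        self.control_port = control_port
        self.control_password = control_password
        # Placeholder user agent (assumption): replace with a full browser string.
        self.headers = {"User-Agent": user_agent}
        # Send both http and https traffic to privoxy, which forwards it to Tor.
        proxy_handler = urllib.request.ProxyHandler(
            {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        )
        self.opener = urllib.request.build_opener(proxy_handler)

    def open_url(self, url, timeout=30):
        """Return the decoded body of url, fetched through the proxy chain."""
        request = urllib.request.Request(url, headers=self.headers)
        with self.opener.open(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")

    def renew_connection(self):
        """Send the NEWNYM signal to the Tor controller to change identity."""
        # Imported here so the class can be loaded where stem is not installed.
        from stem import Signal
        from stem.control import Controller

        with Controller.from_port(port=self.control_port) as controller:
            controller.authenticate(password=self.control_password)
            controller.signal(Signal.NEWNYM)
```

To try it, call `open_url()` on an IP echo service such as https://icanhazip.com/ before and after `renew_connection()`. The two addresses should usually differ, although Tor rate-limits NEWNYM requests, so renewing several times in quick succession may not give you a new IP each time.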
I have also created a repo for convenience.
Final words
With this code you should be able to renew your IP whenever you need to.
However, you should note that since we are using Tor, some websites may become suspicious and even block you, simply because they dislike Tor users.
Anyways, enjoy 🔥
References
Both the article and the code are heavily inspired by the two following resources: