A crawler that beats bot detection

In this article I will present an example of a resilient crawler that is able to change its IP on demand.

Keywords: Python, Mac OS X, Tor, Proxy, IP rotation

Crawler life

In my recent work I scraped a lot of web data.
When I built my first crawler, it was very basic, so it was easily detected as a bot. Over time, I learned some tricks to avoid detection. The main idea is to make the crawler simulate human behaviour. So I tried the following methods:

  • Sleeping randomly: Waiting a few seconds between requests to the same website proved to be really effective. This is fairly intuitive: someone who is browsing spends time consuming the content of a page, so switching between pages instantly isn’t very human-like.
  • Setting the user agent of a valid browser: This is a must!
    If you don’t, you are either not describing yourself to the website, which is suspicious, or you are actually sending it a user agent declaring that you are a robot.
  • Switching between user agents: In my experience this works poorly and sometimes makes things worse. If you keep the same IP but change the user agent, you are telling the website that you are constantly switching between different devices and browsers, which is suspicious because no human does that.
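
The first two tricks can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the function names, the delay bounds, and the user agent string are hypothetical example values, not something a particular website requires.

```python
import random
import time
import urllib.request

# Example value only: any user agent string of a real, current browser will do.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def random_delay(min_seconds=2.0, max_seconds=6.0):
    """Pick a random pause length, in seconds, between two requests."""
    return random.uniform(min_seconds, max_seconds)


def build_request(url):
    """Build a request that always carries the same browser user agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def polite_fetch(urls):
    """Fetch each URL in turn, sleeping a random amount between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(random_delay())
        with urllib.request.urlopen(build_request(url), timeout=30) as response:
            pages.append(response.read())
    return pages
```

Note that the user agent stays fixed for the whole crawl, in line with the third point above.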

These tricks were really helpful for most websites. However, for some others, the challenge was a little harder.
After many trials I understood the following:

Why avoid bot detection when you can counterattack?

So the idea here is to stop looking for ways to avoid detection, and to find a way to respond to it instead.

Solution: IP Rotation

When a website flags you as a robot, it actually flags your IP. So if you swap your compromised IP for a new one, you’re in the clear. OK cool, but how do you do it? I came across an awesome tutorial from 2014 called “Crawling anonymously with Tor in Python” which explains everything. Plus, it uses the Tor network, so your real IP stays hidden from the websites you crawl.

Requirements

In this article I’m doing the Mac OS X version. (Here’s the same configuration for Linux)

Brew

First, you need brew, aka the missing package manager for macOS.

Tor

brew update
brew install tor

Next, do the following:

  • Enable Tor’s ControlPort listener on port 9051. This is the port on which Tor listens for commands from applications talking to the Tor controller.
  • Hash a new password that prevents random access to the port by outside agents.
  • Enable cookie authentication as well.

And this is how it’s done:

You can create a hashed password out of your password using:

tor --hash-password my_password

Then, update /usr/local/etc/tor/torrc with the port, the hashed password, and cookie authentication.

# content of torrc
ControlPort 9051
# hashed password below is obtained via `tor --hash-password my_password`
HashedControlPassword 16:E600ADC1B52C80BB6022A0E999A7734571A451EB6AE50FED489B72E3DF
CookieAuthentication 1

Restart Tor so that the configuration changes are applied.

brew services restart tor

Privoxy

Tor itself is not an HTTP proxy. So in order to get access to the Tor network, use privoxy as an HTTP proxy that forwards traffic to Tor through SOCKS5.

Install privoxy via the following command:

brew install privoxy

Now, tell privoxy to use Tor by routing all traffic through the SOCKS server at localhost port 9050. To do that, append the following line to /usr/local/etc/privoxy/config:

forward-socks5t / 127.0.0.1:9050 . # the dot at the end is important

Restart privoxy after making the change to the configuration file.

brew services restart privoxy
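
Before going further, it can save some debugging time to check that both services are actually listening. The snippet below is a small sketch that only tests whether the three local ports accept TCP connections; it does not verify the password or the forwarding rule. The function names are hypothetical, and the ports are the defaults used in this article (9050 for Tor’s SOCKS proxy, 9051 for the ControlPort, 8118 for privoxy).

```python
import socket

# Default ports used in this setup; adjust them if your configuration differs.
SERVICES = {
    "Tor SOCKS": 9050,
    "Tor ControlPort": 9051,
    "privoxy": 8118,
}


def is_port_open(port, host="127.0.0.1", timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES):
    """Map each service name to whether its port is reachable."""
    return {name: is_port_open(port) for name, port in services.items()}
```

Calling check_services() returns a dictionary with one True or False per service; a False usually means the service was not restarted or its configuration file was not picked up.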

Stem

Next, install stem, a Python module used to interact with the Tor controller. It lets us send commands to the Tor control port, and receive its replies, programmatically.

pip install stem

Aaaand that’s it for the requirements 🎉
It’s clearly not the easiest configuration, but it’s really worth the hassle.

Example Script

In the script below, urllib sends its requests through privoxy, which listens on port 8118 by default and forwards the traffic to port 9050, where Tor’s SOCKS proxy is listening.

Additionally, the renew_connection() function sends a signal to the Tor controller via port 9051 to change the identity.

And here is the TorHandler :
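
The class below is a minimal sketch of what such a handler can look like, not necessarily the exact code from the repo. It assumes the setup described above: privoxy on its default address 127.0.0.1:8118, the ControlPort on 9051, and the password my_password that was hashed into torrc. The names TorHandler and renew_connection() come from the text; open_url(), the constants, and the user agent string are hypothetical names and values chosen for illustration. The pause after the signal is also an assumption: Tor rate-limits identity changes, and a new circuit needs a moment before it is usable.

```python
import time
import urllib.request

PRIVOXY_URL = "http://127.0.0.1:8118"  # privoxy's default listening address
CONTROL_PORT = 9051                    # ControlPort value set in torrc
CONTROL_PASSWORD = "my_password"       # the plain password that was hashed into torrc

# Example value only: any user agent string of a real browser will do.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


class TorHandler:
    """Fetch pages through privoxy/Tor and ask Tor for a new identity on demand."""

    def __init__(self, proxy_url=PRIVOXY_URL, user_agent=USER_AGENT):
        # Route both HTTP and HTTPS requests through privoxy.
        proxy = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        self.opener = urllib.request.build_opener(proxy)
        self.opener.addheaders = [("User-Agent", user_agent)]

    def open_url(self, url, timeout=30):
        """Return the body of the page at url, fetched through the proxy."""
        with self.opener.open(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")

    def renew_connection(self, wait_seconds=5):
        """Send the NEWNYM signal to the Tor controller, then wait a little."""
        # stem is imported here so the class can be loaded even without it.
        from stem import Signal
        from stem.control import Controller

        with Controller.from_port(port=CONTROL_PORT) as controller:
            controller.authenticate(password=CONTROL_PASSWORD)
            controller.signal(Signal.NEWNYM)
        # Assumed pause: give Tor time to build a fresh circuit.
        time.sleep(wait_seconds)
```

To see the rotation at work, call open_url() on a service that echoes the caller’s IP (the choice of service is up to you), then renew_connection(), then open_url() again: the two responses should usually show different addresses.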

I have also created a repo for convenience.

Final words

With this code you should be able to renew your IP whenever you need to.

However, note that since we are using Tor, some websites may become suspicious and even block you, because they simply dislike Tor users.
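
One way to put the rotation to work is to renew the IP only when a response looks like a block, and to give up after a few attempts so that a site rejecting Tor altogether does not trap the crawler in a loop. The sketch below assumes that a block shows up as an HTTP 403 or 429 status, which is a guess that will not hold for every website. fetch_with_rotation() is a hypothetical helper: it takes any function that fetches a URL and any function that renews the IP.

```python
import urllib.error

# Assumption: these status codes mean "you have been flagged"; sites differ.
BLOCK_STATUS_CODES = {403, 429}


def fetch_with_rotation(url, fetch, renew, max_attempts=3):
    """Call fetch(url); on a block status, call renew() and try again.

    Errors that do not look like a block, and the last failed attempt,
    are re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as error:
            blocked = error.code in BLOCK_STATUS_CODES
            if not blocked or attempt == max_attempts:
                raise
            renew()
```

Any pair of callables works here, for example a function that opens the URL through privoxy and one that sends the new-identity signal to Tor.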

Anyways, enjoy 🔥

References

Both the article and the code are heavily inspired by the following two resources: