Fighting Back: Turning the Tables on Web Scrapers Using Rust
Ever wanted to mess with people scanning the web for vulnerabilities? I certainly did. This is the story of how I found a way to punish them, how I used Rust to improve on it, and how I then killed my web server with a van.
Step 0: Getting Annoyed
Alright, so if you’ve ever run a website at any scale and happen to look at the access logs, you will soon find that a lot of the requests coming in have nothing to do with your website. Many of them instead ask for paths like /wp-login.php, /.env, and /.git/config. Turns out a lot of different people want to either steal your database password or try to log in to your WordPress site. While not surprising, it is a bit annoying when you are trying to check the stats of your site.
This is of course an automated process (or, well, some maniac might be doing it manually; it’s a big internet after all). It won’t help to update your /robots.txt (a file describing which parts of your website bots are allowed to visit), because no self-respecting password-stealing bot would ever bother to read it. However, big companies like Google do respect this file (with some exceptions). Could we somehow use this to our advantage?
Step 1: Finding the Gates of Hell
Of course we can do something about it! While looking into ways to mess with our annoying bot friends, I stumbled upon HellPot, an HTTP honeypot designed to crash bots scraping a website by simply giving them what they asked for. Any HTTP request to HellPot on specified paths (like the aforementioned /wp-login.php) will be met with an eternal stream of data, generated from The Birth of Tragedy (Hellenism and Pessimism) by Friedrich Nietzsche, that kind of looks like a website. We just make sure to put the same paths in our robots.txt to avoid bingbot experiencing Nietzsche at several MB/s.
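For example, a robots.txt along these lines tells well-behaved crawlers to stay away from the honeypot paths (the paths here are just the ones mentioned above; list whatever you actually route to the honeypot):

```
User-agent: *
Disallow: /wp-login.php
Disallow: /.env
Disallow: /.git/
```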
How is this possible? Well, as it turns out, HTTP responses can be streams. Usually this is used for transferring large files, but it is great for our purposes too: we can just keep generating data on the same response forever, never ending the stream and saying “Done!”.
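To get a feel for the idea, here is a minimal sketch of an endless streaming response using axum and tokio. This is only an illustration of the technique, not pandoras_pot’s actual code, and the path and chunk contents are made up:

```rust
use axum::{body::Body, routing::get, Router};
use futures_util::stream;

// Respond with a body that never ends: every chunk is valid HTML-ish text,
// and the stream never finishes, so the connection stays open forever.
async fn endless() -> Body {
    let chunks = stream::repeat_with(|| {
        Ok::<_, std::io::Error>("<p>an eternal stream of profound nonsense</p>\n")
    });
    Body::from_stream(chunks)
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/wp-login.php", get(endless));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```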
Ok, so what? Well let’s consider some things:
- Most web scrapers are poorly written scripts
- Most web scrapers run on cloud instances
- Memory and storage are limited on these instances
- And so is bandwidth
So what happens when shiddy_wp_scraper.py connects to a website that just keeps sending data forever? Well, generally it just slurps it all up! And if you slurp up more gigabytes of data than you have memory, your OS will generally decide that it is time for some process to die, and badabim badaboom, you have +1 crashed web scraper. One can then only hope that the person who set it up forgot to add an automatic restart.
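To see why, consider what a naively written scraper does (a hypothetical example in Rust using reqwest; the typical Python equivalent behaves the same way): it buffers the whole response body in memory before doing anything with it.

```rust
// A naive scraper: `.text()` buffers the ENTIRE response body in memory
// before returning. Against an endless response it never finishes; it just
// keeps allocating until the OS's OOM killer steps in.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let body = reqwest::get("https://example.com/wp-login.php")
        .await?
        .text()
        .await?;
    println!("scraped {} bytes", body.len()); // never reached for a honeypot path
    Ok(())
}
```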
The process doesn’t have to crash, though. Other outcomes are also fun:
- Saving it to disk (will crash eventually, when your disk is maxed out!)
- Racking up bandwidth fees (if you are charged for this)
Note: If you have a metered connection yourself, or have limited bandwidth, I would not recommend this, since you might lose more than the web scraping pirates!
Step 2: Turning up the Heat (in Hell)
I really enjoyed crashing web scrapers using HellPot, but I felt like the whole process lacked a personal touch. I made some contributions to HellPot, but Go is not my first language, and it lacked some features I wanted. So what did I do? Well, I rewrote it in Rust, of course.
Some minor Rust hacking later, I published pandoras_pot on crates.io. The basic principle is the same: request, connect, drink kool-aid, crash. But I took the liberty of adding some nice features I wanted:
- Better performance; Rust is blazingly fast after all. And safe!
- More ways of generating data. Why not send a static file over and over, or just random strings, instead of Markov chain output? (It does still support Markov chains via my Markov chain library markovish.) See the sketch after this list.
- You the user can now provide the source of pain. Don’t like Nietzsche? What about Kant? Yesterday’s newspaper? That love note you sent your crush in third grade?
- A health port, so you can have active load balancing between instances. Now bots can play Russian roulette: either they connect to a Raspberry Pi with modest output, or to your sick gaming computer with LEDs that can crush even the most beefed-up bot. If the LED monster is not online, the load balancer will pick the Pi instead!
- Anti-abuse features, such as a maximum number of concurrent streams, and rate limiting.
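As an illustration of the random-string idea (a sketch only, not pandoras_pot’s actual generator), producing an endless supply of vaguely HTML-shaped chunks can be as simple as this:

```rust
// Uses the rand 0.8-style API.
use rand::{distributions::Alphanumeric, Rng};

// Produce one pseudo-HTML chunk of random alphanumeric text. Each call yields
// a fresh chunk to push down the never-ending response.
fn random_chunk(len: usize) -> String {
    let text: String = rand::thread_rng()
        .sample_iter(&Alphanumeric)
        .take(len)
        .map(char::from)
        .collect();
    format!("<p>{text}</p>\n")
}

fn main() {
    for _ in 0..3 {
        print!("{}", random_chunk(64));
    }
}
```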
After iterating on pandoras_pot, I think I have reached a level I’m happy with. And of course you can try it out yourself! The README contains a full set-up guide.
Step 3: Murdering a Web Server with a Van
So how did it work? Well, great! Turns out a lot of bots really do love downloading data. Here is where I would put nice graphs of incoming connections. Unfortunately for us, it turns out that the crappy e-machine that I bought in a grocery store parking lot three years ago, and had been using as a web server, did not enjoy its ride in the back of a moving van. Who could have predicted that? While I think I will be able to mentally recover, the hard drive won’t physically. And no, I did not back up the access logs, because I reserve backups exclusively for things that won’t get lost, to save on storage.
I do have some screenshots and observations though! The following is a screenshot from a (really quick-and-dirty) stats page I made a while back for listing connections to pandoras_pot:
First of all, most connections come from public cloud providers. I suspected that a lot of the traffic was sent through the Tor network, but checking the top hell-consuming IP addresses revealed that none of them were. Some of them came from cloud providers boasting that they were “private” and “anonymous”. Perhaps an NSA honeypot, perhaps legit. I won’t link them here.
Another interesting observation is that the amounts downloaded by many connections ended up around nice numbers like 0.5 GB, 2 GB, or 3 GB. While it is possible that the bots have this as a download limit, there is no reason for one to be set that high. No, I think this is actually the memory limit of the cloud instance. A quick look at any big cloud provider shows that RAM is one of the deciding factors when it comes to price, and it usually comes in nice round numbers like these. I simply think the bots fill up all their memory with some text from The Sorrows of Young Werther, crash, and then hopefully become more enlightened individuals in the process.
I also saw that a lot of bots use 30 s as a timeout, so it makes sense to focus on performance: pandoras_pot has 30 s to crash some bots. Far from impossible, as it can easily reach speeds of over 100 MB/s depending on your internet connection and hardware (it can go much higher; this was on the ol’ e-machine, with performance comparable to my toaster, or possibly kettle). We must note that encryption and compression are big factors here. Speaking of compression, there seem to be people who respond to malicious requests with zip bombs. Perhaps a future feature?
Step 4: Ideas for the Future
Alright so that was fun. We can now crash most badly configured bots that access a website. But could we go further?
Anyone who has ever put a Contact Us box on their website knows that some bots will always manage to get through the reCAPTCHA (especially if you’re cheap with the implementation). But hey, that “Submit” button is just sending a request to your server. It would be sad if it led somewhere else…
You could also do all kinds of crazy routing. What about redirecting suspected traffic? Checking incoming requests against AbuseIPDB and redirecting the offenders immediately?
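As a rough sketch of that last idea, a lookup against AbuseIPDB’s v2 check endpoint could look something like the following. The endpoint, headers, and response field here are based on my reading of their API docs, so treat them as assumptions and verify against the real documentation before using this:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct CheckResponse {
    data: CheckData,
}

#[derive(Deserialize)]
#[serde(rename_all = "camelCase")]
struct CheckData {
    // 0-100, where higher means more reports of abuse for this address.
    abuse_confidence_score: u8,
}

// Ask AbuseIPDB how shady an IP address is.
async fn abuse_score(ip: &str, api_key: &str) -> Result<u8, reqwest::Error> {
    let resp: CheckResponse = reqwest::Client::new()
        .get("https://api.abuseipdb.com/api/v2/check")
        .query(&[("ipAddress", ip), ("maxAgeInDays", "90")])
        .header("Key", api_key)
        .header("Accept", "application/json")
        .send()
        .await?
        .json()
        .await?;
    Ok(resp.data.abuse_confidence_score)
}

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Hypothetical usage: look up a documentation address and decide where to route it.
    let score = abuse_score("203.0.113.7", "YOUR_API_KEY").await?;
    if score > 50 {
        println!("suspicious ({score}/100): route this one to the honeypot");
    }
    Ok(())
}
```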
Note: Don’t blindly redirect IP addresses. Some people need Tor to access the free internet. Others are forced by their ISP to share public addresses with others.
Conclusion
When you set up a website, a lot of people will look at it. Many of them won’t be people. Most of them are harmless, but if you actually did include your .env in your public website, you might be in for a bad time.
Some people might point out that it is possible to write better bots that can detect pandoras_pot. While that is certainly true, why not mess with the ones that do not? We also have the advantage here: pandoras_pot can change how its responses look using a config, which is much easier than detecting it in code. The more people who use their own custom configuration with pandoras_pot, the harder it is to avoid.
In short: Be nice, respect /robots.txt, or suffer eternal consequences.