Show bad-mannered bots the door to 403

In a perfect world any unsolicited bot would first query /robots.txt to check if it is welcome in the first place.

Many bots seem to share Bender’s catchphrase when it comes to respecting your crawling policies

Well, in a perfect world. Many bots do not actually extend that courtesy, whether unwittingly or not. And while their requests usually do not pose an actual problem in terms of load, traffic, or bandwidth, they can still be a nuisance: they clutter log files, distort statistics, “leak” data, and are generally clientes non grati.

Unfortunately there is no straightforward way to detect and block them. However, as so often in IT, fingerprinting can get you closer to an informed decision, and HTTP requests fortunately provide plenty of clues to base one on.

Even though the following rules are specifically for Apache’s rewrite module, the approach is applicable to any server supporting similar checks. In the case of Apache it is, for performance reasons, best to keep the directives in the server configuration, though if necessary they can also be configured on the fly via .htaccess files.
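
As a rough sketch of both options (a name-based virtual host is assumed; example.com and the paths are placeholders, adjust to your own setup):

# Preferred: directly in the server configuration, parsed once at startup
<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/example.com
    # ... RewriteEngine and the rules from the sections below go here ...
</VirtualHost>

# Alternative: a per-directory access file such as /var/www/example.com/.htaccess
# (requires at least "AllowOverride FileInfo" for that directory)
# ... RewriteEngine and the rules from the sections below go here ...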

First and foremost we should make sure the rewriting engine is actually enabled

RewriteEngine on

Once we have made sure of that we can continue on to actually keep out the bad boys ....

.... though hang on a sec ....

Regardless of what we want to block, it is a good idea to keep the following line as one of the first ones, before any other rule.

RewriteRule ^/?robots\.txt$ - [L]

That way we never block access to robots.txt and give even the shadiest-looking bot the benefit of the doubt. (The optional leading slash keeps the pattern working both in the server configuration, where the matched path starts with a slash, and in per-directory .htaccess context, where it does not.)

In medias res

Welcome to 1996

HTTP 1.0: no remotely modern browser still uses it. It was superseded by 1.1 in ‘97, only a year after its introduction (2.0 took just a tad longer, 18 years).

That was almost 20 years ago, and there is no reason for a regular client to send 1.0 requests anymore, particularly when your site is served from a name-based virtual host, which effectively requires HTTP 1.1 and its Host header.

Nonetheless, many (presumably simple) HTTP crawlers still send 1.0 requests (and, oddly enough, often include a Host header anyway).

This configuration blocks HTTP 1.0 requests

RewriteCond %{THE_REQUEST} 1\.0$
RewriteRule .* - [F,L]

May I have your scheme?

Whenever a request originates from a link, a client will usually send a referrer header specifying the URL the request came from. Despite it being an optional header, bots sometimes set it, and they set it to values a normal client never would, for example omitting the URL scheme.

This configuration blocks requests providing a referrer which does not start with a valid scheme such as “http”, “https”, etc.

RewriteCond %{HTTP_REFERER} !^(?:[a-z]+://.+)?$
RewriteRule .* - [F,L]

A nitpick: while the regular expression is perfectly adequate for everyday use (the vast majority of referrer URLs these days start with either “http” or “https”), it technically does not follow the definition of scheme names given in section 3.1 of RFC 3986. Should this be important to you, you might want to replace the pattern with the following one

!^(?:[a-zA-Z][a-zA-Z0-9+.-]*://.+)?$
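
Plugged into the condition from above, the stricter variant would then read (behaviour otherwise unchanged):

RewriteCond %{HTTP_REFERER} !^(?:[a-zA-Z][a-zA-Z0-9+.-]*://.+)?$
RewriteRule .* - [F,L]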

Grown up referrers

Host names are typically not written in upper case (e.g. EXAMPLE.COM instead of example.com), even though that is technically valid. Nonetheless, crawlers sometimes believe this to be the right thing to do.

This configuration blocks requests with a referrer starting with http://EXAMPLE
(adjust to your setup and liking: HTTPS, with or without WWW, with or without TLD, etc.)

RewriteCond %{HTTP_REFERER} ^http://EXAMPLE
RewriteRule .* - [F,L]

Referrer, the Third

Yes, surprisingly, referrers are actually often a good indicator of a client’s legitimacy.

The following rule depends heavily on your page setup and your internal links. However, it often does catch crawlers which request a URL while claiming in the referrer to miraculously come from that very same location.

This configuration blocks requests to a given URL which send a referrer for the same URL.

RewriteCond "%{REQUEST_SCHEME}://%{HTTP_HOST}%{REQUEST_URI} %{HTTP_REFERER}" "^([^ ]+)/? \1/?$"
RewriteRule .* - [F,L]

If you are only concerned about the root URL and don’t mind hardcoding the domain, you could also go with the following simpler and probably slightly better-performing version.

RewriteCond %{HTTP_REFERER} ^http://(?:www\.)?example\.com
RewriteRule ^/?$ - [F,L]

Again, whether you can and should use this rule depends heavily on your setup; applied carelessly, it stands a good chance of breaking certain types of regular requests.

So much for referrers .... anything else?

User Agents

The user agent is often a good indicator as well. For example, there is not a single browser which does not send a user agent, so we can safely block every request that does not provide one at all.

This configuration blocks requests with no user agent header at all, or one that is just an empty string

RewriteCond %{HTTP_USER_AGENT} =""
RewriteRule .* - [F,L]

The current browser catches the worm

The following directives are admittedly somewhat controversial and can potentially have a huge impact on your users, so proceed with caution.

These days major browser vendors release new versions faster than [for-lack-of-a-joke-insert-your-favourite-one]. Furthermore, these updates have become largely automatic and more or less obligatory.

Taking this into account, along with the fact that many crawler operators do not seem to update their user agent strings or underlying browser engines very often, we could decide to only allow reasonably recent versions.

Nonetheless, even with automatic and somewhat opaque updates, people are not necessarily always on the latest version. Because of this we shouldn’t be so strict as to only let the most recent version through, but rather adopt a more lenient approach and still accept a few versions back.

This configuration blocks requests from Firefox, Chrome, and IE versions matching the respective regular expressions

RewriteCond %{HTTP_USER_AGENT} "rv:(?!38)(?:[1-3]?\d|4[0-1])\.(?!.+like Gecko)" [OR]
RewriteCond %{HTTP_USER_AGENT} Firefox/(?!38)(?:[1-3]?\d|4[0-1])\. [OR]
RewriteCond %{HTTP_USER_AGENT} Chrome/(?:[1-3]?\d|4[0-5])\.(?!.+Edge) [OR]
RewriteCond %{HTTP_USER_AGENT} "MSIE [1-6]\.0"
RewriteRule .* - [F,L]

As of the time of this article (April 2016) Firefox is on version 45 (both for release and ESR), IE on 11 (and likely to remain), Edge on 25 (but disguises itself as Chrome 42 anyhow), and last, but not least, Chrome on 49 (parentheses here just because every other browser got a side note too).

Based on these versions, the configuration above would only permit Firefox >=42 (+ the previous ESR 38), IE >=7, and Chrome >=46 (+ those with an “Edge” tag because of Edge’s quirky versioning).

Space: the final frontier

While legitimate clients usually send user agent strings which contain at least a few spaces, bots often send a single unbroken string without any whitespace instead.

This configuration blocks requests whose user agent does not contain a single whitespace character

RewriteCond %{HTTP_USER_AGENT} !\s
RewriteRule .* - [F,L]

It’s Mozillaaaaaa

Instead of checking for whitespace as in the previous example, we could also take a certain identification for granted (and explicitly include the whitespace in the pattern). For example, virtually all major browsers (except older versions of Opera) still identify as Mozilla/5.0.

This configuration blocks requests with user agents not starting with Mozilla/5.0

RewriteCond %{HTTP_USER_AGENT} "!^Mozilla/5\.0 "
RewriteRule .* - [F,L]

Be careful, evaluate & fine-tune

None of these checks is foolproof or 100% accurate (except maybe the one for the empty user agent) and, depending on your site structure and visitors, some of them have the potential to fundamentally break your site in terms of sheer accessibility.

Because of this it is of paramount importance to first verify how “compatible” these checks are with your typical visitors before you enforce any of them.
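
One way to do such a trial run, sketched here under the assumption that the rules live in the server configuration (the variable name suspect_bot and the log file are placeholders): replace the blocking flags with an environment variable and merely log the matching requests for a while.

# Mark the request instead of answering with 403 (empty user agent as an example)
RewriteCond %{HTTP_USER_AGENT} =""
RewriteRule .* - [E=suspect_bot:1]

# Write marked requests to a separate log for later review
CustomLog logs/suspects.log combined env=suspect_bot

Once that log only contains traffic you are happy to turn away, swap the [E=suspect_bot:1] flag back for [F,L].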

However, if done properly they can offer a fair share of upfront protection against automated crawlers, particularly those catching you by surprise.

I have had this configuration in production for several months now and have been positively surprised by its overall effectiveness. Of course it cannot keep out every single crawler, and from time to time one slips through, but in general it does catch a lot of unsolicited bots which appear out of the blue and believe themselves entitled to crawl your site and collect its data.
