Published in Binary Passion

Show bad-mannered bots the door to 403

In a perfect world any unsolicited bot would first query /robots.txt to check if it is welcome in the first place.

Many bots seem to share Bender’s catchphrase when it comes to respecting your crawling policies

Well, in a perfect world. Many bots actually do not extend that courtesy, be it unwittingly or not. And while their requests usually do not pose an actual problem in terms of load, traffic, or bandwidth, they can still be a nuisance as they clutter log files, distort statistics, “leak” data, and are generally clientes non grati.

Unfortunately there is no straightforward way to detect and block them. However, as so often in IT, fingerprinting can get you closer to an informed decision, and fortunately HTTP requests often provide plenty of clues on which to base one.

Even though the following rules are written for Apache’s rewrite module, the approach is applicable to any server supporting similar checks. In the case of Apache it is best, for performance reasons, to keep the directives in the server configuration, though if necessary they can be configured on the fly via access files (.htaccess) as well.

First and foremost we should make sure the rewriting engine is actually enabled

RewriteEngine on

Once we have made sure of that we can continue on to actually keep out the bad boys ....

.... though hang on a sec ....

Regardless of what we want to block, it is a good idea to keep the following line as one of the first ones, before any other rule.

RewriteRule ^/?robots\.txt$ - [L]

(The optional leading slash makes the pattern work both in the server configuration, where the matched path starts with “/”, and in .htaccess context, where it does not.)

In that way we never block access to robots.txt and give even the shadiest-looking bot the benefit of the doubt.

In medias res

Welcome to 1996

HTTP 1.0: no remotely modern browser still uses it. It was superseded by 1.1 in ‘97, only a year after its introduction (2.0 took just a tad longer, 18 years).

That was almost 20 years ago, and no major browser sends HTTP 1.0 requests anymore, especially not to name-based virtual hosts, which rely on the Host header that HTTP 1.1 made mandatory.

Nonetheless, many (presumably simple) HTTP crawlers still send 1.0 requests (while, oddly enough, often including a Host header anyway).

This configuration blocks HTTP 1.0 requests

RewriteCond %{THE_REQUEST} 1\.0$
RewriteRule .* - [F,L]
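As a quick sanity check outside Apache, the condition’s pattern can be exercised against sample request lines in Python (THE_REQUEST holds the first line of the HTTP request; the request lines below are illustrative):

```python
import re

# THE_REQUEST contains the full request line, e.g. "GET /index.html HTTP/1.1";
# the condition's pattern anchors "1.0" at its end.
is_http10 = re.compile(r"1\.0$")

assert is_http10.search("GET /index.html HTTP/1.0") is not None  # blocked
assert is_http10.search("GET /index.html HTTP/1.1") is None      # allowed
assert is_http10.search("POST /form HTTP/2.0") is None           # allowed
```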

May I have your scheme?

Whenever a request originates from a link, a client will usually send a referrer header specifying the URL the request came from. Despite it being an optional header, bots sometimes set it, and they set it to values a normal client never would, such as omitting the URL scheme.

This configuration blocks requests providing a referrer which does not start with a valid scheme such as “http”, “https”, etc.

RewriteCond %{HTTP_REFERER} !^(?:[a-z]+://.+)?$
RewriteRule .* - [F,L]
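The negation can be confusing: the condition fires (and blocks) when the referrer does not match the inner pattern. A small Python sketch of the same regex makes the allowed cases explicit (the URLs are just examples):

```python
import re

# A referrer is acceptable if it is empty (header absent) or starts with
# a scheme such as "http://" or "https://".
acceptable = re.compile(r"^(?:[a-z]+://.+)?$")

assert acceptable.match("") is not None                          # no referrer: allowed
assert acceptable.match("https://example.com/page") is not None  # allowed
assert acceptable.match("example.com/page") is None              # no scheme: blocked
assert acceptable.match("//example.com") is None                 # no scheme: blocked
```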

Nitpicking: while the regular expression is generally perfectly applicable to everyday use (the vast majority of referrer URLs these days start with either “http” or “https”), technically it still doesn’t strictly follow the definition of scheme names given in section 3.1 of RFC 3986, which allows letters, digits, “+”, “-”, and “.” after the initial letter. Should this be important to you, you might want to replace the pattern with one mirroring the RFC grammar, for instance

RewriteCond %{HTTP_REFERER} !^(?:[a-z][a-z0-9+.-]*://.+)?$
RewriteRule .* - [F,L]
Grown up referrers

Even though technically valid, host names are typically not written in upper case (e.g. EXAMPLE.COM instead of example.com). Nonetheless, crawlers sometimes believe this to be the right thing to do.

This configuration blocks requests with http://EXAMPLE as referrer
(adjust to your setup and liking: HTTPS, w/(o) WWW, w/(o) TLD, etc.)

RewriteCond %{HTTP_REFERER} ^http://EXAMPLE
RewriteRule .* - [F,L]

Referrer, the Third

Yes, surprisingly, referrers are actually often a good indicator of a client’s legitimacy.

The following rule highly depends on your page setup and your internal links. However, it often does catch crawlers requesting a URL and claiming in the referrer to miraculously come from the very same location.

This configuration blocks requests to a given URL which send a referrer for the same URL.

RewriteCond "%{REQUEST_SCHEME}://%{HTTP_HOST}%{REQUEST_URI} %{HTTP_REFERER}" "^([^ ]+)/? \1/?$"
RewriteRule .* - [F,L]
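Since the back-reference trick is easy to get wrong, here is a small Python reproduction of the test string Apache assembles (“scheme://host/uri referrer”) together with the same pattern (the URLs are illustrative):

```python
import re

# Group 1 captures the request URL, \1 requires the referrer to repeat it,
# each side optionally ending in a trailing slash.
self_referrer = re.compile(r"^([^ ]+)/? \1/?$")

# Request and referrer identical (modulo a trailing slash): blocked
assert self_referrer.match("http://example.com/ http://example.com") is not None
assert self_referrer.match("http://example.com/page http://example.com/page/") is not None

# Referrer points to a different URL: allowed
assert self_referrer.match("http://example.com/page http://example.com/") is None
```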

If you are only concerned about the root URL and don’t mind hardcoding the domain, you could also go with the following simpler and probably slightly faster version.

RewriteCond %{HTTP_REFERER} ^http://(?:www\.)?example\.com/?$
RewriteRule ^/?$ - [F,L]

Again, whether you can and should use this rule highly depends on your setup, and without care there is a good chance you might break certain types of regular requests.

So much for referrers .... anything else?

User Agents

The user agent is often a good indicator as well. For example, there is not a single major browser which does not send a user agent. Hence, we can safely block every request not providing one at all.

This configuration blocks requests with no user agent header or only an empty string

RewriteCond %{HTTP_USER_AGENT} =""
RewriteRule .* - [F,L]

The current browser catches the worm

The following directives can definitely be somewhat controversial and can potentially have a huge impact on your users, so proceed with caution.

These days major browser vendors release new versions faster than [for-lack-of-a-joke-insert-your-favourite-one]. Furthermore these updates have become rather automatic and obligatory.

Taking this into account, along with the fact that many crawler operators do not seem to update their user agent strings or underlying browser engines very often, we could decide to only allow somewhat recent versions.

Nonetheless, even despite the automatic and somewhat opaque updates, people are not necessarily always on the latest version. Because of this we shouldn’t be too strict and only let the most recent version through, but rather adopt a somewhat more lenient approach and still accept a few versions back.

This configuration blocks requests of Firefox, Chrome, and IE versions matching the respective regex

RewriteCond %{HTTP_USER_AGENT} "rv:(?!38)(?:[1-3]?\d|4[0-1])\.(?!.+like Gecko)" [OR]
RewriteCond %{HTTP_USER_AGENT} Firefox/(?!38)(?:[1-3]?\d|4[0-1])\. [OR]
RewriteCond %{HTTP_USER_AGENT} Chrome/(?:[1-3]?\d|4[0-5])\.(?!.+Edge) [OR]
RewriteCond %{HTTP_USER_AGENT} "MSIE [1-6]\.0"
RewriteRule .* - [F,L]
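Version lookaheads like these are fragile, so it pays to test them against real user agent strings before deploying. A Python sketch for the Chrome condition (the user agent strings are abbreviated samples):

```python
import re

# Blocks Chrome up to version 45 unless the string also carries an "Edge"
# token (Edge masquerades as an older Chrome).
outdated_chrome = re.compile(r"Chrome/(?:[1-3]?\d|4[0-5])\.(?!.+Edge)")

assert outdated_chrome.search("Chrome/41.0.2272.96 Safari/537.36") is not None  # blocked
assert outdated_chrome.search("Chrome/49.0.2623.87 Safari/537.36") is None      # allowed
assert outdated_chrome.search("Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240") is None  # allowed
```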

As of the time of this article (April 2016) Firefox is on version 45 (both for release and ESR), IE on 11 (and likely to remain), Edge on 25 (but disguises itself as Chrome 42 anyhow), and last, but not least, Chrome on 49 (parentheses here just because every other browser got a side note too).

Based on these versions, the configuration above would only permit Firefox >=42 (+ the previous ESR 38), IE >=7, and Chrome >=46 (+ those with an “Edge” tag because of Edge’s quirky versioning).

Space: the final frontier

While legitimate clients usually send user agent strings which contain at least a few spaces, bots often send a single unbroken string instead.

This configuration blocks requests with user agents not containing any whitespace

RewriteCond %{HTTP_USER_AGENT} !\s
RewriteRule .* - [F,L]

It’s Mozillaaaaaa

Instead of checking for whitespace as in the previous example, we could also take a certain identification for granted (and explicitly include the whitespace in the pattern). For example, virtually all major browsers (except older versions of Opera) still identify as Mozilla/5.0.

This configuration blocks requests with user agents not starting with Mozilla/5.0

RewriteCond %{HTTP_USER_AGENT} "!^Mozilla/5\.0 "
RewriteRule .* - [F,L]
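Again in Python, a quick check of the prefix rule against a few typical user agent strings (examples chosen purely for illustration):

```python
import re

# Only user agents starting with "Mozilla/5.0 " (note the trailing space)
# pass; everything else gets blocked by the negated condition.
mozilla = re.compile(r"^Mozilla/5\.0 ")

assert mozilla.match("Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0") is not None
assert mozilla.match("python-requests/2.9.1") is None
assert mozilla.match("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)") is None
```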

Be careful, evaluate & fine-tune

None of these checks is foolproof or provides 100% accuracy (except maybe the empty user agent one) and, depending on your site structure and visitors, some of them bear the potential to break your site fundamentally in terms of sheer accessibility.

Because of this it is of paramount importance to first verify how “compatible” these checks are with your typical visitors before you implement any of them.

However, if done properly they can offer a fair share of upfront protection against automated crawlers, particularly those catching you by surprise.

I have had this configuration in production for several months now and have been positively surprised by its overall effectiveness. Of course it can’t keep out every single crawler, and from time to time one slips through, but in general it does catch a lot of unsolicited bots which come out of the blue and believe themselves entitled to crawl your site and collect its data.




01010111 01100101 00100000 01100001 01110010 01100101 00100000 01100001 00100000 01100011 01110101 01110010 01101001 01101111 01110101 01110011 00100000 01100010 01110101 01101110 01100011 01101000 00101100 01100001 01110010 01100101 01101110 01110100 00100000 01110111 01100101
