Simple Bad Bot Mitigation With Signal Sciences

Px Mx · Published in Compass True North · 5 min read · Nov 28, 2022

There is a ton of bot activity on the Internet. Anyone tasked with defending high-traffic websites will know this well. Some bots are good and desired — think search engine crawlers. Some bots are benign, such as scripts making API calls or monitoring site availability. But often, bots are either a nuisance impacting performance or outright malicious. These are the “bad” bots you wish you could easily block and not deal with anymore. In this post, I would like to share a simple, automated approach to implementing pretty good bad bot mitigation using the Signal Sciences Web Application Firewall (WAF).

Defining Bot Types

Before I get into the details, I’d like to offer how I define a bot and a bad bot. A bot is any client making requests to your site that is not a human operating a modern web browser. So bots are scripts, command line tools, programs, malware, etc., that operate autonomously or are invoked by a human. What makes a bot bad is its behavior. For example, is it sending malicious payloads or probing for private directories and files?

Example Bot request:

A Google ad bot sending a benign request

Example Bad Bot request:

A scan tool sending malicious requests

Identifying Bots

The first step is identifying any bot that may be probing, crawling, or poking at your site. Keeping things simple, we can use a very rudimentary method for identifying bots — based solely on user-agent string length. A user-agent string is a request header, set by the client, that lets servers and network peers identify it. Humans using a modern browser will always have a user-agent string of at least some minimum length, x. While bots can spoof the user-agent string, most do not, and their user-agent strings tend to be much shorter than a browser’s. As a result, we can assume any user-agent string shorter than x minus some number n is not a human — it’s a bot. You can decide what your thresholds for x and n should be.
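The length heuristic above can be sketched in a few lines. This is a minimal illustration, not Signal Sciences code; the threshold of 70 characters matches the example used later in this post, and the sample user-agent strings are typical values, not data from the rule.

```python
# Minimal sketch: classify a request as a likely bot based on
# user-agent string length. 70 is an illustrative x - n threshold;
# tune it against your own traffic.
UA_LENGTH_THRESHOLD = 70

def is_likely_bot(user_agent: str) -> bool:
    """True when the user-agent is empty or shorter than the
    threshold -- which rarely holds for a modern browser."""
    return len(user_agent) < UA_LENGTH_THRESHOLD

# A typical Chrome user-agent is well over 100 characters.
browser_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/107.0.0.0 Safari/537.36")

print(is_likely_bot(browser_ua))      # False
print(is_likely_bot("curl/7.85.0"))   # True
print(is_likely_bot(""))              # True
```

Note that an empty user-agent string also falls below the threshold, which is deliberate: clients that send no user-agent at all are almost never browsers.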

There are various sources on the Internet where you can find lists of user-agent strings. Here is an example of one of these sources:

In Signal Sciences we create a corp-level rule, which applies to all sites. The rule performs a regular expression match on the user-agent string, and if there is a match it adds the “Bot” signal. Notice in the screenshot below that the regular expression specifies a length of up to 70 characters, so in this example x - n = 70. Also, notice the minimum length is zero, which means it will also match empty user-agent strings.
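To make the rule concrete, here is a hedged sketch of the matching logic. The pattern `^.{0,70}$` is my reconstruction of a regex that matches any user-agent string of 0 to 70 characters, as described above; the exact expression in your rule may differ, and the `signals_for` helper is purely illustrative.

```python
import re

# Reconstruction of the corp-level rule's match: any user-agent
# string of 0-70 characters (including empty) gets tagged "Bot".
BOT_UA_PATTERN = re.compile(r"^.{0,70}$")

def signals_for(user_agent: str) -> list:
    """Return the list of signals this rule would add to a request."""
    signals = []
    if BOT_UA_PATTERN.fullmatch(user_agent):
        signals.append("Bot")
    return signals

print(signals_for("python-requests/2.28.1"))  # ['Bot']
print(signals_for(""))                        # ['Bot']
```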

With this rule and signal in place, reviewing the data it produces over time is worthwhile. It can provide visibility to interesting traffic you may not have been aware of previously.

Example results from the Bot rule:

A quick explanation of corp-level and site-level rules, in case you are unfamiliar with the concepts: rules applied at the corp level are implemented across all sites running the Signal Sciences agent, while rules applied at the site level are implemented only in the agents for that particular site. Use the option that makes the most sense in your environment.

Identify Bad Bots

Now that you can identify bots, how are they behaving? In Signal Sciences, we can create another simple corp-level rule to define bad behavior for bots. The first condition for this rule looks for the Bot signal, and the second condition looks for any number of attacks, anomalies, or custom signals. You can easily adjust the rule to include signals that make the most sense for your scenario. In the example below, we also included an IP list match with threat intel data. To summarize the rule: if it’s a Bot and it’s doing any number of bad things, then it is a bad bot, and the rule adds the “Bad Bot” signal to the request.
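The rule's logic boils down to a conjunction, sketched below. The attack signal names and the threat-intel IP are placeholders I chose for illustration; in Signal Sciences you would select the actual attack, anomaly, or custom signals and IP lists relevant to your environment.

```python
# Sketch of the "Bad Bot" rule: a request already tagged "Bot" that
# also carries an attack/anomaly signal, or comes from a threat-intel
# IP, gets the additional "Bad Bot" signal. Signal names and the IP
# below are illustrative placeholders.
ATTACK_SIGNALS = {"SQLI", "XSS", "TRAVERSAL", "CMDEXE"}
THREAT_INTEL_IPS = {"203.0.113.7"}  # documentation-range example IP

def tag_bad_bot(signals: set, client_ip: str) -> set:
    """Return the request's signals, adding 'Bad Bot' when warranted."""
    is_bot = "Bot" in signals
    is_bad = bool(signals & ATTACK_SIGNALS) or client_ip in THREAT_INTEL_IPS
    if is_bot and is_bad:
        signals = signals | {"Bad Bot"}
    return signals

print(sorted(tag_bad_bot({"Bot", "SQLI"}, "198.51.100.9")))
# ['Bad Bot', 'Bot', 'SQLI']
```

A bot with no attack signals and a clean IP keeps only its "Bot" tag, which is exactly the behavior you want: benign bots pass through, and only the combination of bot plus bad behavior escalates.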

Example results from the Bad Bot rule:

Enforcement

The final step is to create our enforcement rule: either a rate-limit rule or a block rule. Note that rate-limit rules are applied at the site level. A block rule can be implemented at the corp level or site level and will block all traffic with the Bad Bot signal. Use caution with this level of aggressive blocking; consider monitoring the data over time to gain assurance that the block rule would not impact legitimate production traffic. Alternatively, a rate-limit rule provides the flexibility to set your tolerance level before blocking Bad Bot traffic. In the example below, the rate-limit rule matches any request with the Bad Bot signal and begins blocking after seeing 10 Bad Bot signals within 1 minute. The blocking decision remains active for 5 minutes. The count threshold and action duration can easily be modified to your desired values.
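For intuition, the rate-limit behavior described above can be modeled as a toy sliding-window counter. This is not how Signal Sciences implements rate limiting internally; it is only a sketch of the stated policy (10 Bad Bot signals per client within 60 seconds triggers a 300-second block), with those example values hard-coded.

```python
from collections import defaultdict, deque

# Toy sliding-window rate limiter mirroring the example rule: after
# 10 "Bad Bot" signals from a client within 60 seconds, block that
# client for 300 seconds. Values are the post's example thresholds.
THRESHOLD = 10
WINDOW = 60       # seconds
BLOCK_FOR = 300   # seconds

hits = defaultdict(deque)   # client -> timestamps of Bad Bot signals
blocked_until = {}          # client -> time the block expires

def should_block(client: str, now: float) -> bool:
    """Record one Bad Bot signal and report whether to block."""
    if blocked_until.get(client, 0) > now:
        return True  # an earlier blocking decision is still active
    q = hits[client]
    q.append(now)
    while q and now - q[0] > WINDOW:
        q.popleft()  # drop signals outside the 60-second window
    if len(q) >= THRESHOLD:
        blocked_until[client] = now + BLOCK_FOR
        return True
    return False

# The 10th signal inside one minute triggers the 5-minute block.
results = [should_block("198.51.100.9", float(i)) for i in range(10)]
print(results[-1])  # True
```

After the threshold trips, the client stays blocked for the full 5 minutes even if it stops sending Bad Bot traffic, matching the "blocking decision will be active for 5 minutes" behavior.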

Conclusion

Leveraging user-agent string length for bot detection is a simple approach, and in my experience, effective at identifying the majority of bot traffic. By determining the user-agent string length threshold for your scenario, identifying bad behavior, and applying rate limiting, you’ll be off to a solid start mitigating unwanted bot traffic. Signal Sciences makes this approach easy to implement with its rules capability. However, if you do not have Signal Sciences, this approach is simple enough that it is worth exploring whether the tools you do have can identify bots, flag bad bots, and enforce against them.
