Bot Defense Strategies

Brent Lassi
Bluecore Engineering
Feb 1, 2019

A bot (short for “robot”) is an automated program that runs over the Internet. Bots range from the entirely legitimate, such as search engine web crawlers, to the ultra-malicious, such as denial-of-service tools. When exposing any service or content to the internet, the threat of malicious bot traffic is ever present. Sometimes the impact is merely a minor lift in traffic, but left unchecked, bot traffic can result in extensive bandwidth costs, degradation of service, data theft, or data pollution.

At Bluecore, bot traffic is carefully inspected, analyzed, and managed to ensure that no harmful traffic degrades service or corrupts data, while helpful bots, such as the kind that boost Search Engine Optimization, are still allowed to do the work for which they were designed. Any web-based service whose integrity relies on processing only valid traffic depends on effective bot management to ensure that customers, users, and clients get quality results. Bluecore’s bot management and click-based pricing aim to manage these threats, but solutions that fit one organization may not suit another.

The strategies for managing these threats vary and depend heavily on the types of threats present, the application features, and the user scenarios at hand. The following strategies provide options to consider when building defenses against bot traffic.

Login Validation

Forcing a login or otherwise identifying a user is the classic method of defending against bots, but the use case is limited to scenarios where there is a known user in play. Nothing within this method prevents a registered user from performing malicious activities, but the user association facilitates detection, prevention, and cleanup. This method can be augmented by any and all of the following mechanisms.
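
As a minimal sketch of what this looks like in code, the following Flask snippet gates an ingestion endpoint behind a session-based login; the endpoint, session keys, and decorator are illustrative assumptions rather than Bluecore’s implementation.

    from functools import wraps
    from flask import Flask, abort, session

    app = Flask(__name__)
    app.secret_key = "replace-with-a-real-secret"

    def login_required(view):
        @wraps(view)
        def wrapper(*args, **kwargs):
            user_id = session.get("user_id")
            if user_id is None:
                abort(401)  # reject anonymous traffic outright
            return view(*args, user_id=user_id, **kwargs)
        return wrapper

    @app.route("/api/events", methods=["POST"])
    @login_required
    def ingest_event(user_id):
        # Every accepted record is tied to a known account, which is what
        # makes later detection, prevention, and cleanup possible.
        return {"accepted": True, "user": user_id}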

Email Validation (or another out-of-band method: certificate, SMS, authenticator, etc.)

Email validation of activities such as registration, password reset requests, or other actions needing out-of-band assurance augments Login Validation (or some similar form of identity verification) and is frequently used to deter malicious activity. In scenarios where a known user is in play, out-of-band validation based on email, phone, or a second factor of authentication such as a certificate or authentication device can be used to validate the source of incoming data. Much like Login Validation, this does not preclude malicious behavior, but it does tie all activity to a registered entity, facilitating detection, prevention, and cleanup. This method also provides defense against the compromise and malicious use of valid accounts.
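
The sketch below shows one way to issue and check such an out-of-band token, assuming the itsdangerous library for signed, expiring tokens; the salt, secret, and expiry values are illustrative.

    from itsdangerous import BadSignature, SignatureExpired, URLSafeTimedSerializer

    serializer = URLSafeTimedSerializer("replace-with-a-real-secret")

    def issue_confirmation_token(email):
        # Sent to the user's inbox; only someone with access to that inbox
        # can return it, which is the out-of-band assurance.
        return serializer.dumps(email, salt="email-confirm")

    def confirm_token(token, max_age_seconds=3600):
        try:
            return serializer.loads(token, salt="email-confirm", max_age=max_age_seconds)
        except (SignatureExpired, BadSignature):
            return None  # treat as unverified; do not complete the action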

Captchas

Captchas are typically the go-to solution for defense in any scenario where an anonymous entity can automatically post, crawl, or otherwise interact with a web property, be it the submission of a form or simply accessing content pages. Captchas can be deployed in a number of different manners. Some scenarios might warrant a captcha for every data submission, while others may be configured conditionally. Conditional captchas are often inserted onto web pages when a user has submitted invalid data too many times (typically an authentication failure) or when a user has submitted data more rapidly than is humanly possible or appropriate for the given service. In a sense, conditional captchas can enforce a modicum of data integrity and a rudimentary form of velocity control. The downsides of captcha use revolve around a degraded user experience and the inability to remediate malicious activity on API endpoints.
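
One way to wire up a conditional captcha is sketched below, assuming Redis for failure counters and Google reCAPTCHA’s siteverify endpoint for verification; the thresholds and key names are illustrative.

    import redis
    import requests

    r = redis.Redis()
    FAILURE_THRESHOLD = 5
    WINDOW_SECONDS = 600

    def record_failure(ip):
        # Count recent failures per source; the counter expires on its own.
        key = f"login-failures:{ip}"
        count = r.incr(key)
        r.expire(key, WINDOW_SECONDS)
        return count

    def captcha_required(ip):
        count = r.get(f"login-failures:{ip}")
        return count is not None and int(count) >= FAILURE_THRESHOLD

    def captcha_passed(captcha_response, secret):
        # Verify the captcha token the client submitted with the form.
        resp = requests.post(
            "https://www.google.com/recaptcha/api/siteverify",
            data={"secret": secret, "response": captcha_response},
            timeout=5,
        )
        return resp.json().get("success", False)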

User-Agent Filtering

A user agent is a software agent working on behalf of a user or an automated process, and it typically identifies the client software that is attempting to interact with a service. The most common user agents encountered are those identifying web browsers such as Mozilla Firefox, Google Chrome, Apple Safari, and Microsoft Edge. Most often a user agent identifies itself, its application type, operating system, software vendor, and software version by submitting a characteristic identification string in an HTTP header field named User-Agent. This header can be used to reject undesirable request traffic. Unfortunately, spoofing a user agent is a trivial technical matter. While valid products such as web browsers, networking code modules, security tools, quality assurance tools, and other responsibly configured programs accurately identify themselves, malicious users or bots that wish to mislead can submit HTTP requests with any user agent string. That renders this method useful only for policing legitimate user activity or shunning unnecessary traffic from web crawlers and bots that properly identify themselves.
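
A minimal sketch of this kind of filtering, assuming a Flask application and an illustrative deny list, might look like the following; note that a spoofed header will sail straight past it.

    from flask import Flask, abort, request

    app = Flask(__name__)

    DENIED_AGENT_FRAGMENTS = ("python-requests", "curl", "scrapy")

    @app.before_request
    def filter_user_agent():
        agent = request.headers.get("User-Agent", "").lower()
        if not agent or any(fragment in agent for fragment in DENIED_AGENT_FRAGMENTS):
            abort(403)  # only shuns clients that identify themselves honestly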

Generic Scan Detection and Rejection

There are a variety of security scanners, web crawlers, search engines, and hacking tools that make their presence evident based on behavior signatures. Common indicators of these tools include the following:

  • Progressive linear scanning of IP addresses
  • Predictable path browsing of hypertext links
  • Rapid speed and multi-threaded requests or probes
  • Requests for common files that are out of context (e.g., requesting .asp files on an Apache server)
  • Attempts to enumerate directory contents
  • Requests with an inordinate amount of post data that does not conform to the application’s use
  • Requests that include OS commands, source code, or directory traversal strings

Defending against these common reconnaissance activities is usually managed by a traditional network-level intrusion detection and prevention system or a web application firewall (layer-7 intrusion detection and prevention). While detection techniques can be built into an application or configured within most web servers or load balancers, those methods would prove difficult to maintain and monitor.
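
For illustration, a rough application-layer version of this signature matching might look like the sketch below; the patterns are illustrative, and in practice a WAF or IDS handles the job far more completely.

    import re

    SUSPICIOUS_PATTERNS = [
        re.compile(r"\.(aspx?|php)\b"),              # out-of-context file types on a Python stack
        re.compile(r"\.\./"),                        # directory traversal
        re.compile(r"(;|\|\||&&)\s*(cat|ls|rm)\b"),  # OS command injection attempts
        re.compile(r"/(wp-admin|phpmyadmin)\b"),     # probes for software that is not installed
    ]

    def looks_like_a_scan(path, query_string=""):
        # Flag requests whose path or query carries known scanner signatures.
        target = f"{path}?{query_string}" if query_string else path
        return any(pattern.search(target) for pattern in SUSPICIOUS_PATTERNS)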

Velocity and Volume Traps

Related closely to Generic Scan Detection and Rejection is the concept of velocity and volume traps. These defenses can be implemented effectively in a variety of ways, including code-level implementations, intrusion detection systems, web application firewalls, web/app servers, load balancers, next-gen firewalls, DDoS defense appliances, cloud-based DDoS defense services, and some network routers. This brand of defense can even be built on top of a log server (StackDriver/Splunk/ELK, etc.) that monitors flow statistics and performs automated firewall rule updates, effectively shunning traffic based on log analysis. Regardless of the tools utilized, this method requires a keen understanding of the typical data flows, data sources, and request volume. To achieve a well-tuned velocity control setting, it is best to configure a number of traps at varying levels in a mode that takes no defensive action but logs any tripping of the traps’ thresholds. Analysis of the resulting data will dictate the proper threshold at which to take automated or manual defensive action against the offending traffic.
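
A code-level version of such a log-only trap might look like the following sketch; the window, thresholds, and in-memory counters are illustrative assumptions, and a production deployment would more likely lean on one of the systems listed above.

    import logging
    import time
    from collections import defaultdict, deque

    log = logging.getLogger("velocity-trap")

    WINDOW_SECONDS = 60
    THRESHOLDS = (100, 500, 1000)  # several trap levels, per source, per window

    request_times = defaultdict(deque)

    def observe_request(source_ip):
        # Maintain a sliding window of request timestamps per source.
        now = time.time()
        window = request_times[source_ip]
        window.append(now)
        while window and window[0] < now - WINDOW_SECONDS:
            window.popleft()
        # Log-only mode: record each threshold crossing, block nothing.
        for threshold in THRESHOLDS:
            if len(window) == threshold:
                log.warning("velocity trap tripped: %s made %d requests in %ds",
                            source_ip, threshold, WINDOW_SECONDS)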

Input Integrity and Sanity Testing

This category requires little description. In situations where the range of expected inputs is known, the receiving application can reject anything outside of the valid parameters, thereby preventing the ingestion of garbage data and malicious injections (such as malicious scripts). XML and JSON validation are two basic examples that assess the data format, if not the contents. In-app ingestion validation could become extremely sophisticated but is typically not a value-adding business feature for the application. For this reason, these features should be added to the development pipeline sparingly unless they add business value beyond the defense against bot input.
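
As a basic example of this kind of sanity testing, the sketch below validates an incoming payload with the jsonschema library; the schema is illustrative, not Bluecore’s actual event format.

    from jsonschema import ValidationError, validate

    EVENT_SCHEMA = {
        "type": "object",
        "properties": {
            "event": {"type": "string", "enum": ["view", "add_to_cart", "purchase"]},
            "product_id": {"type": "string", "maxLength": 64},
            "quantity": {"type": "integer", "minimum": 1, "maximum": 1000},
        },
        "required": ["event", "product_id"],
        "additionalProperties": False,
    }

    def is_valid_event(payload):
        try:
            validate(instance=payload, schema=EVENT_SCHEMA)
            return True
        except ValidationError:
            return False  # reject rather than ingest garbage or injected scripts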

Bot Defense Services (source filtering and advanced behavioral filtering)

The past half-decade has produced a number of bot defense services that can be configured to defend a network or application from both expected and unexpected threats with a high level of granularity and accuracy. The benefit of these services lies in the data collected across the internet from the service’s clients as well as from a myriad of detection arrays positioned worldwide. That data allows for accurate and up-to-date defense. The downsides of these services include the need to re-route incoming traffic through the service provider’s data center and the cost of the service subscription.

Non-human Request Traps

This method can be particularly useful in detecting robotic actions performed on a webpage or within an email. By creating an invisible hypertext link within HTML content, one can lay a trap for any non-human process that clicks the link. If a webpage or email has 10 visible, legitimate links and one invisible link wrapping a pixel or an empty string, the application can assess with considerable accuracy that any actor requesting the hidden link is not a human user. With this data, the application can ignore all requests and input from the source and exempt the traffic from all click and traffic reporting. The challenge with this method is the labor required to include a transactional-style request cleanup model in the application, as the application should ignore not only the use of the hidden link but all other requests from the source. If the data is not purged, additional work must be done to ensure it is ignored by the application’s processing and reporting activities.
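
A minimal version of such a trap might look like the sketch below, assuming Flask and Redis; the route, key names, and hidden-link markup are illustrative.

    import redis
    from flask import Flask, request

    app = Flask(__name__)
    r = redis.Redis()

    # Rendered invisibly into pages or emails, e.g.
    # <a href="/t/beacon-7f3a" style="display:none" aria-hidden="true"></a>
    @app.route("/t/beacon-7f3a")
    def honeypot():
        # No human should ever request this URL, so flag the source.
        r.sadd("nonhuman-sources", request.remote_addr)
        return ("", 204)

    def is_nonhuman(source_ip):
        # Click and traffic reporting can call this to exempt flagged sources.
        return r.sismember("nonhuman-sources", source_ip)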

Conclusion

These nine solutions, if properly selected and configured to the needs of the business, will positively impact the success of an internet-based business by controlling costs, improving response times, improving uptime, and ensuring data integrity. Information security programs and compliance initiatives often focus heavily on the principle of confidentiality, forgetting that integrity and availability are core parts of the security triad (described here: https://en.wikipedia.org/wiki/Information_security). Bots are one of the primary threats aimed at disrupting the latter two security tenets. Pick and choose solutions wisely, and early, before bot threats disrupt the data integrity or performance of your business.
