Fighting Bots — Part I

Serguei Dmitriev
Published in SSENSE-TECH · May 12, 2023

Best practices to protect your public API in a serverless context.

Exposing an API to the outside world begins a new chapter in the cyber adventure of a business: pursuing a wider audience and inviting larger groups of potential customers. In an ideal world, API design and implementation could ignore the harsh reality brought about by the cyber creatures strolling through the Wild Wide Web, causing all sorts of unexpected disturbances.

In this first article of a 2-part series, we will provide a high-level view of the challenges that arise when maintaining and protecting a public API. In the next article, we will cover some of the best practices to mitigate bot traffic when running on an AWS serverless stack.

Types of Bots 🤖

A bot is commonly defined as an automated request executed against an online service. Bots can run from a personal computer in a basement somewhere, operated by an amateur coder; from a web browser plugin installed by a random user; from a powerful distributed bot farm behind large proxy pools, built by seasoned developers and sold as an official online service; and sometimes even from a physical mobile device farm that completely emulates human interactions.

Here are the main categories of bots affecting both mobile and website endpoints:

Search Engine Crawlers

These bots mainly land on a website to index its content and sometimes validate published product feeds. We consider them legitimate bots that we don’t want to suppress at all. They may unintentionally increase the load on the infrastructure, but typically have no major impact on server performance.

Vulnerability Scanners

Almost any public endpoint will attract these bad actors, who scan your resources for known and newly discovered vulnerabilities. They typically don’t put any significant traffic load on the infrastructure, as they are filtered out and suppressed at the very top level due to their primitive signatures and very specific requests. They are malicious traffic we want to suppress entirely.

DDoS Bots

These bots can be quite dangerous. Coming from a bad actor, they can bring your infrastructure to its knees in a matter of minutes if no protection and mitigation strategies are in place. DDoS may happen as an intentional attack or as a side effect of too many automated requests running at the same time. As with vulnerability scanners, we want to suppress them entirely.

Automatic Checkout Bots

These are retail-specific bots. Similar bots exist for travel booking, hunting for the best seats and prices on your behalf, and they are quite popular within fashion e-commerce websites too. Typically, a third party provides a paid service through which any online user can run an automated sequence of commands against a target website: create user accounts, log in, search for specific products by SKU, add them to a cart, and perform a fast checkout with a specified payment method and shipping information.

Some of them are quite powerful, able to imitate mobile client behavior and solve CAPTCHA challenges at very high rates. Requests are distributed across multiple IP addresses to lower the odds of detection and blocking. Checkout bots represent the largest volume of automated traffic we have to deal with at the mobile endpoint.

There is also minor traffic originating from self-made amateur scripts for automatic checkout, often run from home computers or cheap shared hosting servers. Most of the time they do not even work properly, but they create noise in the traffic logs and monitors, which we would like to suppress as well. Some are left forgotten by their creators and keep running for years with no specific purpose.

We consider automatic checkout bots semi-legitimate traffic: they are fine as long as they do not abuse our infrastructure; otherwise, we need to slow them down to leave bandwidth for actual legitimate users.

How Checkout Bots Work

Checkout bots leverage the underlying public API of a website or mobile application directly, instead of simulating user interaction at the UI level, which would be more complicated and slower. Public APIs, meant to be used only by the website and the mobile application, are relatively easy to discover by reverse engineering the traffic going to the servers: capturing the endpoint URLs, headers, request and response payloads, the authentication sequence, and the entire flow for searching, adding to cart, and ordering products.

Once the API contract is known, the next step is to automate the required checkout steps, simulating requests from a real application. To be effective, a bot must also be able to bypass various protection layers like the WAF, solve CAPTCHAs, spread requests across a range of IP addresses, and act on behalf of multiple registered users simultaneously.
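To make this concrete, here is a minimal sketch of the kind of direct API call a bot replays once the contract has been reverse engineered. The host, path, headers, and payload below are entirely hypothetical; the point is that, at the HTTP level, nothing distinguishes this request from one issued by the real mobile application:

```typescript
// Hypothetical sketch: a single reverse-engineered "add to cart" call.
// The endpoint, headers, and payload are invented for illustration.
const API = "https://api.example-shop.com";

async function addToCart(token: string, sku: string): Promise<void> {
  const res = await fetch(`${API}/cart/items`, {
    method: "POST",
    headers: {
      // Headers captured from real app traffic and replayed verbatim.
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
      "User-Agent": "ExampleShop/5.2.1 (iOS 16.4)", // spoofed mobile client
    },
    body: JSON.stringify({ sku, quantity: 1 }),
  });
  if (!res.ok) throw new Error(`add-to-cart failed: ${res.status}`);
}
```

Chaining a handful of such calls (login, search, add to cart, checkout) reproduces the entire purchase flow without ever rendering the UI.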

Where do I Begin?

Depending on the popularity of your website, bot traffic can be quite significant, spiking drastically during sale periods or hype product drops, or even during the off-season, randomly exceeding all organic traffic several-fold. As a developer and e-commerce solution architect, you don’t want to risk exposing unprotected public APIs to the whole world.

The recommended approach is to learn industry best practices, plan ahead, and design the system with external risks in mind. This means moving in two parallel directions: building safe, redundant, protected infrastructure and automatically monitoring incoming traffic. Then analyze the results, adjust, and repeat.

Plan for multi-layered safety measures at almost every system component and tier. The main principle when dealing with outside traffic is: never trust a caller. Always check a request for anything that could disrupt the system, jeopardize the infrastructure downstream, or simply doesn’t make sense, and suppress it upon detection. You can never be too strict about your contract either: give the caller as little input freedom as possible. This keeps the contract simple and gives you better control over the incoming traffic.

In reality, of course, you have to find a compromise between the effort and cost of protecting your resources, the threats your system must sustain, and how much effort (and money) it would take the other side to bypass all those measures.

Reference Mobile Endpoint Architecture

Let’s now focus on a reference architecture and highlight the responsibility of each of its components.

Figure 1 illustrates the main components:

  • Mobile application
  • AWS WAF
  • AWS CloudFront
  • AWS AppSync
  • AWS Lambda
  • Microservices running on some sort of elastic environment (for example: Kubernetes)

Your first line of defense is a WAF (web application firewall), a must-have component protecting the connection between your web application and the internet. While it does not protect against all types of attacks, it enforces a large set of filters and blocks many of the vulnerabilities that exploit HTTP traffic.

Next comes AWS CloudFront, which among other features provides DDoS attack protection and can be configured to plug in a third-party service for heuristic traffic analysis based on ML and AI. This in itself is already an important first step towards online protection against the typical malicious traffic any public endpoint would otherwise suffer from. Configuring WAF or CloudFront in detail is out of scope for this article, but consider them an industry standard essential for any online service.
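As a rough illustration of what that first layer can look like, here is how a WebACL combining an AWS managed rule group with a rate-based rule might be declared in AWS CDK. This is a minimal sketch: the construct IDs, metric names, priorities, and the 2,000-requests-per-5-minutes limit are placeholder values, not recommendations:

```typescript
import { aws_wafv2 as wafv2 } from "aws-cdk-lib";

// Declared inside a Stack; CLOUDFRONT-scoped ACLs must be deployed in us-east-1.
const webAcl = new wafv2.CfnWebACL(this, "MobileApiAcl", {
  scope: "CLOUDFRONT",
  defaultAction: { allow: {} },
  visibilityConfig: {
    cloudWatchMetricsEnabled: true,
    metricName: "MobileApiAcl",
    sampledRequestsEnabled: true,
  },
  rules: [
    {
      // AWS-managed baseline rules: common exploits, malformed inputs, etc.
      name: "AWSManagedCommonRules",
      priority: 0,
      overrideAction: { none: {} },
      statement: {
        managedRuleGroupStatement: {
          vendorName: "AWS",
          name: "AWSManagedRulesCommonRuleSet",
        },
      },
      visibilityConfig: {
        cloudWatchMetricsEnabled: true,
        metricName: "CommonRules",
        sampledRequestsEnabled: true,
      },
    },
    {
      // Rate-based rule: block IPs exceeding the limit per 5-minute window.
      name: "RateLimitPerIp",
      priority: 1,
      action: { block: {} },
      statement: {
        rateBasedStatement: { limit: 2000, aggregateKeyType: "IP" },
      },
      visibilityConfig: {
        cloudWatchMetricsEnabled: true,
        metricName: "RateLimitPerIp",
        sampledRequestsEnabled: true,
      },
    },
  ],
});
// The ACL is then referenced by the CloudFront distribution
// (pass webAcl.attrArn as the distribution's webAclId).
```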

In this scenario, the microservices serve both the website and the mobile app, so excessive traffic on one side can affect performance on the other. Appropriate designs like circuit breakers, bottlenecks, rate limiting, and throttling can be used to decouple and isolate systems as much as possible, preventing domino effects across the stack should a stress event happen in one or more microservices.
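As a minimal sketch of one of these patterns (illustrative only, with arbitrary thresholds), a circuit breaker wraps downstream calls so a struggling microservice fails fast instead of dragging its callers down with it:

```typescript
// Minimal circuit-breaker sketch, not production-ready: after `maxFailures`
// consecutive errors, calls are rejected for `cooldownMs`, giving the
// downstream service room to recover instead of cascading the failure.
class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;

  constructor(private maxFailures = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) {
      throw new Error("circuit open: failing fast");
    }
    try {
      const result = await fn();
      this.failures = 0; // a success closes the circuit again
      return result;
    } catch (err) {
      if (++this.failures >= this.maxFailures) {
        this.openUntil = Date.now() + this.cooldownMs;
        this.failures = 0;
      }
      throw err;
    }
  }
}
```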

There are also serverless stack implications: when different application lambdas run under the same AWS account, they share the same pool of Unreserved Concurrency (UC). If the limit is too low, or lambda invocations spike in one application (say, due to bot traffic), other unrelated service lambdas can be affected and automatically throttled by AWS, due to the limited amount of UC left for them to share and operate.

A possible workaround is to move such services away from each other into different AWS accounts. Another isolation measure is to specify an explicit Reserved Concurrency (RC) value for each lambda. This guarantees no other lambda can take away the allotted concurrency from the given lambda. It is important to understand that once the RC value is reached by the lambda’s instances, the function gets throttled by the AWS stack and incoming requests are rejected.
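In AWS CDK, Reserved Concurrency is a single property on the function definition. A hypothetical sketch, where the function name, asset path, and the value of 200 are placeholders:

```typescript
import { aws_lambda as lambda, Duration } from "aws-cdk-lib";

// Hypothetical checkout resolver lambda with explicit Reserved Concurrency.
// reservedConcurrentExecutions both caps this function's concurrency and
// carves that amount out of the account's shared (unreserved) pool for it.
const checkoutResolver = new lambda.Function(this, "CheckoutResolver", {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: "index.handler",
  code: lambda.Code.fromAsset("dist/checkout-resolver"),
  timeout: Duration.seconds(10),
  reservedConcurrentExecutions: 200, // placeholder value
});
```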

In theory, a lambda could scale almost without limit by increasing its concurrent executions. In reality, the downstream services it connects to have their own limit of sustainable traffic. For example, if a microservice’s SLA is 10K RPS for both web and mobile clients, and the website needs 8K RPS during high-traffic periods, we may need to limit (throttle) mobile traffic once it reaches 2K RPS to protect the downstream stack from overloading. Although it is not a perfect solution, slowing down some system components keeps the system functional instead of letting unexpected, uncapped traffic collapse the stack. These are standard system reliability considerations defined in SRE fundamentals.

Using the serverless stack has its obvious benefits but also has limitations and particularities which have to be taken seriously and planned for.

My API — My Rules

Nothing helps a developer’s life like a well-defined API contract with strict and explicit rules. It is equally beneficial for helping other teams integrate with the API, preventing API misuse, and protecting the service from malformed external traffic. The sooner a request can be detected as not serviceable and rejected, the better the overall system performance and security.

Hence, your next level of protection is cutting undesired traffic via API schema validation. For the mobile endpoint, you can leverage AWS AppSync, which supports a basic GQL schema and its validation. Be mindful that the default GQL schema validation executed by AppSync is too relaxed and does not provide the desired level of security. The only native option for GQL parameter validation was a resolver mapping template based on the VTL language, which is developer-unfriendly and not unit-testable. If you want more flexibility, you can shift validation one level lower, which in the reference architecture means the lambda. This also gives you the freedom to use any language and validation package, as well as total control over the implementation, including code reuse and unit testing.

If you choose this approach, you can check each and every input field of a GQL request for its type, allowed values (including ranges), maximum length for string parameters, and even maximum length for arrays of values. You can even perform conditional validation of one field based on another field’s value. This logic can be as complex as needed, yet remain lightweight and fast enough to sanitize requests according to the API contract.
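As a minimal sketch of what this lambda-level validation can look like, here is a hypothetical addToCart input validated with zod (any validation library would do; the field names and constraints are invented for illustration):

```typescript
import { z } from "zod";

// Illustrative schema for a hypothetical addToCart GQL input, enforcing the
// constraints AppSync's default validation does not: value ranges, maximum
// string lengths, maximum array sizes, and cross-field rules.
const AddToCartInput = z
  .object({
    sku: z.string().regex(/^[A-Z0-9]{6,12}$/),      // strict SKU format
    quantity: z.number().int().min(1).max(10),      // bounded numeric range
    giftMessage: z.string().max(200).optional(),    // capped string length
    promoCodes: z.array(z.string().max(20)).max(3), // capped array length
    isGift: z.boolean(),
  })
  .refine((v) => v.isGift || v.giftMessage === undefined, {
    message: "giftMessage is only allowed when isGift is true",
  });

export const handler = async (event: { arguments: unknown }) => {
  const parsed = AddToCartInput.safeParse(event.arguments);
  if (!parsed.success) {
    // Reject early: the request never reaches the downstream services.
    throw new Error(`Invalid input: ${parsed.error.message}`);
  }
  // ...proceed with the validated payload in parsed.data
};
```

Rejecting at this point means a malformed bot request consumes a few milliseconds of lambda time instead of a full trip through the downstream stack.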

Based on our own experience, we noticed:

  • A lot of legacy and malformed requests can be stopped right at the entry point, no longer reaching the downstream services and significantly reducing the error and warning noise in the logs of several services. Reduced traffic allows us to keep the infrastructure at a lower scale, reduce its cost, and keep dashboards cleaner.
  • Downstream API failures caused by request payloads that are too big can be eliminated.
  • Internal issues due to API misuse by your own mobile application can be found, enabling further improvements and better cross-team awareness of the API contract.

To Be Continued…

Any service that is publicly exposed must incorporate bot traffic handling as part of its plan.

Bot traffic comes in many forms and its consequences depend on your specific context. In general, you need a strategy for handling it and preventing negative outcomes, be it blocking legitimate traffic or incurring unnecessary costs.

We presented a reference architecture with defense in depth that aims to provide a starting point for you to improve upon.

In the next article, we will continue our journey by looking at other important practices you may want to leverage as part of your toolkit.

Editorial reviews by Catherine Heim & Mario Bittencourt

Want to work with us? Click here to see all open positions at SSENSE!
