Fighting Bots — Part II

Serguei Dmitriev
Published in SSENSE-TECH
Jun 2, 2023 · 9 min read

Best practices to protect your public API in a serverless context.


In our last article, we started the discussion around dealing with bots and how to leverage AWS services to help with this.

But there is more to discuss! In this second article, we look at additional levers you may want to consider, from the perspective of an online e-commerce application.

Checkout Bot: Friend Or Foe

As the API service provider, not only do you have to handle high traffic volumes from regular users, but you also have to deal with bot-generated traffic and keep it at bay.

As a business, we’re happy to have as many successful orders placed as possible, as long as they come from non-fraudulent sources. Unfortunately, 99% of bot traffic results in failed orders, adding load and cost to the infrastructure.

Here is a typical checkout bot behavior with slight variations depending on its implementation:

  • Create one or multiple user accounts in the target system
  • Log in to the account, or to several accounts at the same time
  • Search for specific products by SKU
  • If found, add the product to the cart. If it is not available, keep trying to add it until it becomes available.
  • If the cart contains the required products, proceed to checkout and pay with a predefined payment method, providing a predefined shipping address.
Checkout bot sequence example

These steps can be repeated by bots at any speed or volume, during any period of time, for one or multiple accounts at the same time. Some bots run such requests for 10 to 20 minutes, while others can run for hours. Typically, if products are available, the orders get placed within the first seconds. Sometimes, though, the same sequence of requests is repeated to no avail: there are no products available anymore, but the traffic still puts unnecessary pressure on the infrastructure, potentially compromising the performance of other components and degrading the experience of actual users.

Unfortunately, many bots are implemented to run at the maximum possible speed, resulting in thousands of requests per second. This exhausts system resources, leaving little to no operational room for legitimate online customers. This brings us to a common dilemma: how do we protect the system from high bot traffic spikes while keeping actual users unaffected?

Play by the Rules or Go Away

Another best practice for a public API is rate limiting. With this technique, the API contract guarantees servicing calls to a given URI at a rate not exceeding a certain limit. The rate can be measured as requests per second, requests per minute, etc. Once the limit is reached, further requests from the caller are rejected, usually with a “Too Many Requests” error (status code 429 in HTTP terms).

Rate limiting in itself is a simple yet efficient technique to protect your system from excessive traffic, and it does not require expensive infrastructure. It rejects abusive requests fast, without the rest of your service even knowing they were coming. But there’s a catch: to rate limit properly, it is important to correctly identify requests from the same origin, making sure they belong to the same user or bot. Otherwise, it is easy to accidentally block unrelated requests and affect legitimate users.

At this point you need to define a rate limit key. A rate limit key can be any unique string that correctly identifies an incoming call. For example, it can be the email the user tries to log in with, an already logged-in user ID, a web browser user-agent signature, an API key, or a combination of similar fields from the request that identifies a caller uniquely and reliably. Once you have determined the key, you check it against a caching solution of your choice to validate whether its appearance frequency stays within the allowed rate. If it doesn’t, you reject the request and move on to processing the next one.
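To make this concrete, here is a minimal sketch of such a check, assuming a Redis-compatible cache accessed through the ioredis client and a simple fixed-window counter; the function name, window size, and limit are illustrative and not taken from our actual implementation.

```typescript
import Redis from "ioredis";

// Shared cache used to track how often each rate limit key appears.
// Defaults to localhost:6379 in this sketch; point it at your real cache.
const redis = new Redis();

const WINDOW_SECONDS = 60; // measurement window
const MAX_REQUESTS = 30;   // allowed requests per key per window (illustrative)

// Returns true when the caller identified by `rateLimitKey`
// has exceeded the allowed rate and should receive a 429.
export async function isRateLimited(rateLimitKey: string): Promise<boolean> {
  const counterKey = `rl:${rateLimitKey}`;

  // Atomically count this appearance of the key.
  const count = await redis.incr(counterKey);

  // First appearance in the window: start the expiry clock.
  if (count === 1) {
    await redis.expire(counterKey, WINDOW_SECONDS);
  }

  return count > MAX_REQUESTS;
}
```

A fixed-window counter like this can let short bursts through around the window boundary; a sliding-window or token-bucket variant smooths that out at the cost of a slightly more involved cache structure.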

Determining the rate limit key is a real problem for anonymous Internet traffic, where users do not have to authenticate or provide any credentials. When an API always requires an API key to access a set of services, the owner has full control over incoming traffic. But once an endpoint allows anonymous requests that do not need any kind of authentication, that’s where the fun begins! This is particularly true for automated bot traffic, which will undoubtedly exploit any opportunity to sneak under the radar in the rush to get its checkout sequence working as fast as possible.

Here is what rate limiting usually looks like: you can see flat and steady traffic for the most part, then suddenly a bot comes along and starts calling the same function at an excessive rate, and only that bot’s requests get automatically blocked. Depending on how poorly a bot is programmed, you can see more than 98% of that traffic blocked just by rate limiting. This means that 98% of those requests do not need to be serviced by the rest of the stack. Moreover, bots never complain about being blocked; they just keep trying and never give up. In turn, we can never give up protecting our infrastructure.

Rate-limited requests vs. all requests

In a typical rate limiting implementation, the automated traffic will still go through at the allowed rate, giving bots a chance to accomplish their sequence at a slower pace. More aggressive and adaptive rate limiting may implement a complete block window approach: once excessive traffic is detected for a key XYZ and confirmed to be lasting for, say, 30 seconds, the key gets blocked completely for a predetermined period of time, say the next 15 minutes. Once the blockade has expired, the traffic is analyzed again and the process repeats. This method further minimizes the impact of lengthy automated attacks that yield no business value but contribute to unnecessary infrastructure scale-up.
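A rough sketch of that block window, again assuming an ioredis-backed cache; the thresholds, key prefixes, and function name are illustrative placeholders.

```typescript
import Redis from "ioredis";

const redis = new Redis(); // shared cache, localhost by default in this sketch

const DETECTION_WINDOW_SECONDS = 30; // how long excessive traffic must last
const EXCESSIVE_REQUESTS = 300;      // threshold within the detection window
const BLOCK_SECONDS = 15 * 60;       // full block duration once confirmed

// Returns true when the key is currently inside a block window.
export async function isBlocked(rateLimitKey: string): Promise<boolean> {
  // Still serving an active blockade? Reject immediately.
  if (await redis.exists(`block:${rateLimitKey}`)) {
    return true;
  }

  // Count requests over the detection window.
  const counterKey = `detect:${rateLimitKey}`;
  const count = await redis.incr(counterKey);
  if (count === 1) {
    await redis.expire(counterKey, DETECTION_WINDOW_SECONDS);
  }

  // Excessive traffic confirmed for the whole window: block the key outright.
  if (count > EXCESSIVE_REQUESTS) {
    await redis.set(`block:${rateLimitKey}`, "1", "EX", BLOCK_SECONDS);
    return true;
  }

  return false;
}
```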

Remember, bots are smart: once they get blocked from using one account, they create another one and continue the pattern. Or even worse, a bot can create hundreds of user accounts at the same time and initiate slow traffic for each of them, never hitting the rate limit threshold, yet the cumulative traffic can still be huge and affect the target backend system.

When faced with this issue, developers must define a better rate limiting key. It requires creativity and a good understanding of the nature of the request. Bots are not there to spam your system; they are trying to reach a specific goal: accelerated purchasing of a target product or service.

Automatically identifying that target set of parameters helps generate a good rate limiting key. For example, if an API sells tickets for various events and bots are trying to get tickets for a specific event, with a specific date, seat range, and payment method, then some or all of these attributes can be used to form the rate limit key, while ignoring the user ID or email they belong to.

It’s important to ensure that the keys used are unique enough to only limit the problematic requests without affecting legitimate users.
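Continuing the ticketing example above, a key built from the targeted attributes might look like the sketch below; the request shape and field names are hypothetical stand-ins for whatever your API actually receives.

```typescript
import { createHash } from "crypto";

// Hypothetical shape of a ticket purchase request, matching the example above.
interface TicketRequest {
  eventId: string;
  eventDate: string;
  seatRange: string;
  paymentMethod: string;
  userId: string; // deliberately NOT part of the key
}

// Builds a rate limit key from the attributes bots actually target, so a swarm
// of throwaway accounts chasing the same seats still collapses into one key.
export function buildRateLimitKey(req: TicketRequest): string {
  const fingerprint = [
    req.eventId,
    req.eventDate,
    req.seatRange,
    req.paymentMethod,
  ].join("|");

  return createHash("sha256").update(fingerprint).digest("hex");
}
```

Hashing the combined fingerprint keeps the key short and avoids leaking request details into cache keys or logs.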

Last Line of Defense or Plan “Th”

Here is another practical example: a sudden spike of traffic causes a huge jump in lambda concurrent executions due to automatic scaling. This traffic can not only overwhelm the downstream microservices, but even cause other, totally unrelated lambdas to get throttled if the unreserved concurrency available for the AWS account is exhausted. A typical sign that your stack cannot handle the spike gracefully is an increase in function execution time (duration). If you rely on the AWS serverless stack, you need additional measures to address this danger.

Lambda metrics during a traffic spike

A properly designed system knows its limits. Regular load testing easily shows bottlenecks and the realistic maximum traffic it can efficiently handle. Based on those SLAs, each component can preventively self-limit to avoid spreading abnormal load increases.

This is where lambda throttling comes in handy. Use explicit Reserved Concurrency (RC) for each lambda in the system to get two important protective features:

  1. A lambda is guaranteed to have concurrent executions available up to its RC limit. This prevents lambdas from affecting each other through the Unreserved Concurrency pool.
  2. Once a lambda reaches its RC limit, it gets throttled by the AWS Lambda service, meaning no new instances of that lambda will be created and any additional inbound requests will error out until existing instances become available for processing.
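If your stack is defined with the AWS CDK, setting the limit is a one-line property. Here is a minimal sketch; the stack name, handler, asset path, and the value of 20 are placeholders rather than our actual configuration.

```typescript
import { Stack, StackProps } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Construct } from "constructs";

export class CheckoutStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Each function gets an explicit Reserved Concurrency ceiling, so it
    // neither starves the account-level unreserved pool nor scales without bound.
    new lambda.Function(this, "CheckoutHandler", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("dist/checkout"),
      reservedConcurrentExecutions: 20, // throttles beyond 20 concurrent executions
    });
  }
}
```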

Side note: always look at the maximum concurrent executions metric (as opposed to the average) for your lambdas. This is the metric that drives the throttling defined by the reserved concurrency setting: once it reaches the threshold, the AWS Lambda service starts throttling.

Setting an explicit RC limit for each lambda is not very elegant, but it’s a compromise solution to stop an unexpected flood of traffic the system cannot scale up for. The consequence of lambda throttling is request rejection. Unlike rate limiting, where the rejection is isolated to a specific key and origin and hence does not affect a wide range of users, throttling is “blind” and global for a given lambda. It will block both malicious and legitimate requests during the throttling period. This can be an acceptable preventive measure to avoid a global system overload: by sacrificing one or two lambdas affected by unexpected traffic, the stack can keep other lambdas and microservices operational. In reality, the throttling of short-lived outbursts of traffic goes almost unnoticed by regular users, assuming the mobile app has a request retry strategy to recover from these situations.
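Such a client-side retry strategy could look like the sketch below: exponential backoff with jitter around 429 and 5xx responses. The function name, attempt count, and delays are illustrative and not taken from our mobile app.

```typescript
// Minimal client-side retry with exponential backoff and jitter, so that
// short-lived throttling windows stay invisible to the user.
async function fetchWithRetry(
  url: string,
  init: RequestInit = {},
  maxAttempts = 4,
): Promise<Response> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetch(url, init);

    // 429 (rate limited) and 5xx (throttled/overloaded) are worth retrying.
    const retryable = response.status === 429 || response.status >= 500;
    if (!retryable || attempt === maxAttempts) {
      return response;
    }

    // Exponential backoff with jitter: ~200ms, ~400ms, ~800ms...
    const delayMs = 200 * 2 ** (attempt - 1) + Math.random() * 100;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }

  // Unreachable, but keeps the compiler happy.
  throw new Error("retry loop exited unexpectedly");
}
```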

Here is an example of a traffic spike for a GQL function: the lambda had a reserved concurrency limit set to 20, and once it reached this level the throttling started. For some time, the function was only partially operational, but the downstream microservices did not even notice this extremely abnormal influx and did not have to scale up or engage circuit breakers and other protective measures of their own. Eventually, the traffic was automatically suppressed at the edge level thanks to other adaptive security measures in place.

Lambda gets throttled once the concurrent executions (CE) limit is reached

Additional Thoughts

Here are some other improvements to consider:

  • Bring request parameter validation closer to the entry layer. Static validation could be executed at the faster Lambda@Edge level instead of the slower application lambda middleware. A Lambda@Edge handler is a lightweight function with low latency and fast cold starts, so moving static validation code there would not dramatically affect its performance, but it would better isolate the application infrastructure from the ingress network and security layers. An important note regarding Lambda@Edge: it has its own pool of unreserved concurrency with its own limit, not shared with the regular AWS Lambda unreserved concurrency of the account. The disadvantage of moving validation to Lambda@Edge is that the schema contract and validation logic have to be kept in sync, and pushing changes to the edge takes a separate deployment step. A sketch of such a handler follows this list.
  • The same approach can be applied to rate limiting. Moving it to earlier processing tiers would spare the slower application lambdas from even being invoked when no legitimate request is coming in. A concern here is the additional latency of connecting to the caching storage that tracks rate limit keys.
  • To have better control over mobile API traffic, it can sometimes be beneficial to impose stricter authentication rules that do not allow guest users to execute any calls at all. That would give better results for rate limiting and connection security overall. It all depends on the specific use cases an API has to support. On the other hand, denying guest traffic completely may lead to fewer potential users in the long term.
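Here is a minimal sketch of edge-level validation as a CloudFront viewer-request handler; the endpoint path, the sku parameter, and its pattern are hypothetical and stand in for whatever static checks your schema actually defines.

```typescript
import type { CloudFrontRequestEvent, CloudFrontRequestResult } from "aws-lambda";

// Hypothetical static check: the product search endpoint must carry a
// plausible SKU before the request is allowed to reach the origin.
const SKU_PATTERN = /^[A-Z0-9]{6,20}$/;

export const handler = async (
  event: CloudFrontRequestEvent,
): Promise<CloudFrontRequestResult> => {
  const request = event.Records[0].cf.request;

  if (request.uri.startsWith("/products/search")) {
    const params = new URLSearchParams(request.querystring);
    const sku = params.get("sku") ?? "";

    if (!SKU_PATTERN.test(sku)) {
      // Reject at the edge: the application lambda never sees this request.
      return {
        status: "400",
        statusDescription: "Bad Request",
        headers: {
          "content-type": [{ key: "Content-Type", value: "application/json" }],
        },
        body: JSON.stringify({ error: "Invalid or missing SKU" }),
      };
    }
  }

  // Valid request: forward it unchanged to the origin.
  return request;
};
```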

Conclusion

No matter what you do to protect your system from external malicious traffic, there are always more sophisticated actors simulating legitimate requests, making them harder to recognize at the entry point. No single method can provide the required level of protection on its own. Therefore, it is important to plan for a multi-layered defense strategy, where each layer handles specific types of threats.

Be extra vigilant to reduce the chances of unexpected events affecting your system’s performance. Never trust external requests, always be prepared for the worst, and expect bad intentions from outside actors. Continuously monitor, investigate, and adjust as you go. Unfortunately, that’s the reality of today’s retail public APIs.

Editorial reviews by Catherine Heim & Mario Bittencourt

Want to work with us? Click here to see all open positions at SSENSE!
