Eliminating Fraudulent Bot Traffic With AI

Mamadou Coulibaly
Peaksys Engineering
6 min read · Jul 18, 2023

Almost 40% of Internet traffic comes from bots. Some of this traffic is legitimate, like that from search engine indexers, and some of it is not, like the heavy volumes from DDoS attacks.

Bots can be sorted into two categories: good bots and bad bots.

Good bots are designed to increase productivity. These include chatbots and indexers essential to search engines.

Bad bots, however, are designed to perform tasks like crawling, DDoS attacks and brute-force attacks on customer accounts.

In this article, we will study how an anti-bot’s artificial intelligence can be used to identify and block the most sophisticated bots: those used for discreet competitive intelligence gathering.

What is a bot?

A bot is an automated or semi-automated piece of software programmed to carry out certain tasks. Bots imitate or replace human users, either by sending simple requests directly or by driving the same web browsers that humans use.

There are several reasons online platforms would want to protect themselves against bots:

· Protect data from competitors

· Limit infrastructure costs

· Optimise site availability

· Improve the browsing experience

A growing share of bot traffic

In recent years, the share of bot traffic on the Internet has grown continuously. Nearly 40% of Internet traffic is thought to come from bots, with 30% coming from bad bots.

In addition to the increase in bot traffic, another trend stands out: bots are becoming increasingly sophisticated in a bid to outsmart anti-bot protections.

We can sort bad bots into three categories:

· Simple bots, which use a single IP address and cannot run simple JavaScript.

· Intermediate bots, which are a bit more complex and can run JavaScript in a typical web browser.

· Advanced bots, which can also use structured browsing sessions to reproduce human behaviour and carry out distributed attacks from several IP addresses at the same time. These bots are estimated to represent a third of bot traffic overall, and their share is the one that is growing.

AI at the heart of the mechanism

To combat the rise of sophisticated bad bots, we have implemented solutions based on Artificial Intelligence and Machine Learning to identify and block them.

First, we set up a captcha supervision system using sampling rules: sessions are challenged at random, within a fixed budget.
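The budgeted sampling described above can be sketched as follows. This is a hypothetical illustration, not the production system: the challenge rate, budget and function names are assumptions.

```python
import random

# Hypothetical sketch of budgeted random captcha sampling: each session
# has a small probability of being challenged, capped by a daily budget.
CHALLENGE_RATE = 0.01      # fraction of sessions challenged at random
DAILY_BUDGET = 10_000      # maximum number of captchas served per day

challenges_served = 0      # reset once per day

def should_challenge(session_id: str) -> bool:
    """Decide whether to show a captcha to this session."""
    global challenges_served
    if challenges_served >= DAILY_BUDGET:
        return False                       # budget exhausted
    if random.random() < CHALLENGE_RATE:   # random sampling rule
        challenges_served += 1
        return True
    return False
```

Every challenged session then yields a labelled example: a solved captcha suggests a human, a failed or ignored one suggests a bot.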

Captcha supervision mechanism

With this, we were able to build a data set with millions of examples of human and bot traffic. This data set was then fed to our AI for learning. Once it was trained, the AI was able to make predictions for users it had never seen before:

AI learning pipeline

The model produces a bot probability between 0 and 1. We then set a threshold above which a session is considered a bot. This threshold is 0.5 by default, but by raising it to 0.9, for example, we can target only those bots we are most certain about and limit the number of false positives.

The higher the threshold, the fewer legitimate users the model will falsely catch, lowering the number of false positives.
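The thresholding step can be sketched in a few lines. The function name is illustrative; only the thresholding logic reflects the mechanism described above.

```python
def label_session(bot_probability: float, threshold: float = 0.5) -> str:
    """Label a session from the model's bot probability.

    Raising the threshold (e.g. to 0.9) keeps only the sessions the
    model is most certain about, reducing false positives at the cost
    of letting some borderline bots through.
    """
    return "bot" if bot_probability >= threshold else "human"

print(label_session(0.70))                 # "bot" at the default threshold
print(label_session(0.70, threshold=0.9))  # "human" once the bar is raised
```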

Visualising our AI

We use a high-performance random forest model using data collected from human and bot browsing behaviour.

The random forest model is made up of several decision trees. Each decision tree results in a classification. These classifications are then pooled to obtain the final classification.

For a random forest with 100 decision trees, each of the decision trees returns its classification. If more than half of the decision trees (here, 50 trees) return the “bot” classification, then the final model will also return “bot”.

This also allows us to infer a bot probability. For example, if 95 trees return “bot”, then the bot probability from the final model is 95%.
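The vote-pooling step can be sketched as follows. This is a minimal illustration of majority voting over tree outputs, not the production model.

```python
def forest_predict(tree_votes: list[str], threshold: float = 0.5):
    """Pool individual tree classifications into a final label and a
    bot probability (the share of trees voting "bot")."""
    bot_share = tree_votes.count("bot") / len(tree_votes)
    label = "bot" if bot_share >= threshold else "human"
    return label, bot_share

# 95 of 100 trees vote "bot" -> final label "bot" with probability 0.95
votes = ["bot"] * 95 + ["human"] * 5
print(forest_predict(votes))  # ('bot', 0.95)
```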

Here is a graphical illustration of one of our decision trees:

Illustration of a Decision Tree

To categorise a user as a bot or a human, we start at the top of the tree. At each node of the tree, we go left or right depending on whether the user’s value for that node’s variable is below or above the threshold.

For example, the top node contains the variable “referer_empty”. The threshold value for this node was set at 1.50 during learning. This results in two cases:

· referer_empty < 1.50: the sequence continues down the left branch

· referer_empty ≥ 1.50: the sequence continues down the right branch

This process continues down to the lowest node. The lowest node is called a “leaf” because it cannot have an underlying node. The leaves are used to return the tree’s classification. The size of the leaves’ circles indicates the share of users that were categorised during learning. The bigger the circle, the greater the chance a user will be found in that leaf node.
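The traversal described above can be sketched with a minimal node structure. The tree below is a toy example: only the root’s `referer_empty` split at 1.50 comes from the figure; the `request_count` node and its threshold are assumptions for illustration.

```python
class Node:
    """One node of a decision tree; leaves carry a label instead of a split."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, label=None):
        self.feature = feature      # variable tested at this node
        self.threshold = threshold  # split value learned during training
        self.left = left            # branch taken when value < threshold
        self.right = right          # branch taken when value >= threshold
        self.label = label          # set only on leaves

def categorise(tree: Node, session: dict) -> str:
    """Walk from the root down to a leaf, going left or right at each node."""
    node = tree
    while node.label is None:
        if session[node.feature] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label

# Toy tree: root splits on "referer_empty" at 1.50, as in the figure;
# the right branch then splits on a hypothetical "request_count".
tree = Node(feature="referer_empty", threshold=1.50,
            left=Node(label="human"),
            right=Node(feature="request_count", threshold=500,
                       left=Node(label="human"),
                       right=Node(label="bot")))

print(categorise(tree, {"referer_empty": 2, "request_count": 900}))  # bot
```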

Example AI classification

Examples of the AI model’s classification

User 1 is considered a bot with a bot probability of 90%, whereas user 2 is considered a human with a bot probability of 4%. These classifications are based on the input variables. We can see that user 1 made a lot of requests, has empty cookies, no referrer and generated a 404 error.

Whilst learning, the AI could see bots with similar behaviour. All these parameters led the AI to identify a bot.

User 2, however, made an average number of requests on Cdiscount.com, does not have an empty cookie, has a referrer and no 404 error. All these variables led the AI to consider this user as a human.

Performance evaluation method

There are several criteria to consider when evaluating our AI’s performance. The goal is to detect as many bots as possible while minimising the number of false positives.

False positives are legitimate users whom the AI categorises as bots.

There are several indicators to evaluate an AI’s performance, such as accuracy, precision, recall and the F1-score. All these indicators can be calculated using a confusion matrix.

Confusion Matrix

The confusion matrix is a table that summarises a model’s classification performances. It shows the number of true positives, false positives, true negatives and false negatives.

Accuracy is an indicator commonly used in Machine Learning to evaluate a classification model’s performance. It represents the share of correct predictions among all the predictions made by the model.

More formally, accuracy is defined as the ratio of the number of correct predictions to the total number of predictions:

The accuracy formula
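Using the standard definitions, accuracy = (TP + TN) / (TP + TN + FP + FN), and the other indicators mentioned above follow directly from the same confusion-matrix counts. The sketch below computes all four; the example numbers are illustrative, not our production figures.

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard classification indicators from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
    precision = tp / (tp + fp)                   # flagged sessions that really are bots
    recall = tp / (tp + fn)                      # bots that were actually caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative numbers: 90 bots caught, 10 humans wrongly flagged,
# 880 humans passed, 20 bots missed.
print(metrics(tp=90, fp=10, tn=880, fn=20))
```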

Retraining the model

Each night, we retrain the model with the new data collected during the day. Several dozen models are generated, and the one with the best score on the collected data is compared to the model in production to decide whether it will replace it.

Our results

In 2022, we made improvements that included the addition of new data that increased the model’s decision-making capabilities. We combined server data with browser data:

Examples of client-side and server-side variables

With this new data, the number of bots detected grew fivefold, whilst the number of false positives was divided by five. We can say that our bot detection is 25 times more effective.

False positives divided by 5
Number of bots detected multiplied by 5

Our Bot Detection AI model currently has an accuracy of almost 99.9% in production, and we decided to offer it with a SaaS model through the product at https://baleen.cloud.

We know that bots will continue to evolve to elude detection, so Baleen will continue to improve its AI over time to counter threats in the future.
