Practical Aspects of Machine Learning for Fraud Prevention in Production

Vinícius Cousseau
Published in Incognia Tech Blog · Aug 17, 2021 · 7 min read

In addition to being a research field in constant evolution, Machine Learning (ML) is used today in several industrial applications, from image classification to user mobility. Incognia uses ML to help detect real-time authentication fraud attempts in mobile applications, thus reducing user friction. Using ML in production, however, brings forth some crucial challenges, from how to effectively choose the right model for the task at hand, to how to deploy and maintain it. This post highlights our key learnings about developing and deploying an end-to-end ML pipeline from a practical standpoint, and shows how we combined it with heuristics-based solutions to improve our risk assessment engine.

Mobile Fraud detection — a brief introduction

Mobile transactions are growing rapidly. The global mobile payments market, for instance, is projected to reach US$4.7 trillion by 2025. Concurrently, it is estimated that in 2019, 75% of all fraudulent transactions originated on mobile devices. Fraudulent events vary in type and perpetrators: account takeover (ATO), location spoofing, and SIM swap are a few examples. ATO fraud takes place when a fraudster gains full access to a legitimate user’s credentials and other information necessary for login and thus can perform actions without repercussion.

ATO prevention may be implemented in multiple stages of a user’s journey, the main one being at login. When solving this problem, there are two main challenges: detecting fraudulent login attempts and avoiding increased friction for legitimate users. If all fraudulent login attempts are detected, but legitimate users have to verify every attempt through a secondary factor such as an email, the resulting friction may harm the application by causing user churn. Conversely, accepting all logins without additional verification steps keeps friction low but misses most fraud attempts, which is equally undesirable. Thus, good ATO prevention solutions should strive to strike a balance between fraud detection and friction reduction.

Challenges in ATO prevention

Detecting fraudulent login attempts is a challenging task, and when using machine learning to tackle it, there are some key points to keep in mind:

  • Most login attempts are legitimate, while fraudulent ones are rare. This means that datasets are expected to have a high class-imbalance factor, which makes training and metric selection trickier;
  • Success metrics vary from client to client and through time. Classifiers should be able to be easily tuned to either increase fraud detection at the cost of more friction or decrease fraud detection to reduce friction;
  • Labeled datasets are not widely available, and even if they were, features and class distribution vary wildly throughout different app user bases;
  • Login attempts are typically validated in a matter of a few seconds, at most. Thus, classifiers should be served through APIs with millisecond-level latency and be, in general, easy to deploy and maintain in production.

Training a classifier

Given the challenges in obtaining and dealing with labeled datasets, one should consider whether to use unsupervised strategies, such as Isolation Forests, or supervised ones, interpreting the problem as binary classification. Since we had already developed successful Trusted Location heuristics, we chose a supervised approach to leverage our domain knowledge and better understand the latent relationships between our features and fraudulent logins.

Obtaining labeled datasets

To obtain labeled datasets, we collected manually labeled data through a Feedback API from clients that had been using our solution. For instance, if a login attempt is assessed as fraudulent by us, the client can send a request to our Feedback API indicating whether the end user confirmed the fraud attempt (true positive) or denied it (false positive). This was the first challenge we overcame to use a supervised strategy.
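As a purely hypothetical illustration of how such a feedback loop might be closed programmatically (the endpoint, field names, and authentication below are placeholders, not the actual Feedback API contract):

```python
# Hypothetical sketch of a client reporting a confirmed fraud attempt back to a
# feedback endpoint; the URL and payload fields are placeholders, not the real API.
import requests

requests.post(
    "https://api.example.com/v1/feedbacks",   # placeholder URL
    json={
        "event": "account_takeover",          # end user confirmed the fraud (true positive)
        "login_id": "6c2d0a1e-example",       # identifier of the assessed login attempt
    },
    headers={"Authorization": "Bearer <api-token>"},
    timeout=5,
)
```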

Adaptability

The second challenge was choosing a classifier that could handle class imbalance and be easily tuned, deployed, and maintained. We found all of that in LightGBM (LGBM), a gradient boosting framework built on top of tree-based models. In addition to tree-based models being good alternatives for dealing with class imbalance, LGBM also offers easy ways to choose different decision functions, which makes it possible to plug in state-of-the-art class-balanced losses such as Focal Loss and the one published by Cui et al.
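As a minimal sketch of this setup (the synthetic data, hyper-parameters, and re-weighting choice below are illustrative assumptions, not our production configuration):

```python
# Minimal sketch: training an LGBM classifier on a heavily imbalanced dataset.
# The synthetic data stands in for real login features.
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# ~1% of samples belong to the positive (fraudulent) class.
X, y = make_classification(n_samples=50_000, n_features=20, weights=[0.99], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

clf = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    # Re-weight the rare fraudulent class; a custom class-balanced objective
    # (e.g. Focal Loss) could also be passed through the `objective` parameter.
    scale_pos_weight=(y_train == 0).sum() / max((y_train == 1).sum(), 1),
)
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric="auc")
```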

As for ease of use, LGBM provides a Python API and is easily integrated with Pandas and Sklearn. Its training is fast and can be distributed, it has native support for parallelization, and it has a low memory footprint, which we will show later on. We were also able to use different thresholds on top of the classifier score: one for fraud detection and another one for login approval, giving us a lot of room for tuning, as Equation 1 shows.

Equation 1 — using two thresholds, α and β, on top of classification score S to approve or deny the transaction.
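In sketch form, and assuming the two thresholds simply bracket the score as described (the exact formulation in Equation 1 may differ), the decision rule can be written as:

```latex
% Assumed reconstruction of the two-threshold rule, with alpha < beta.
\[
\text{decision}(S) =
\begin{cases}
\text{approve} & \text{if } S \leq \alpha \\
\text{defer to other evidence} & \text{if } \alpha < S < \beta \\
\text{flag as fraud} & \text{if } S \geq \beta
\end{cases}
\]
```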

Feature selection

Having chosen the classifier, we now had to choose which features to use. This is where the knowledge about the fraud landscape that we had accumulated through our heuristic approach helped us. We held a few brainstorming sessions and came up with several features that correlated with fraudulent logins and ATO in general. Then, we evaluated each candidate feature against two questions before letting it reach production:

How important is this feature to the classifier’s output? We utilized Shapley Additive Explanations (SHAP), a game-theoretic approach to compute feature importance for machine learning models. When compared to standard feature importance computation methods, SHAP offers clearer visual explanations and case-by-case analysis possibilities. We were then able to quickly identify and prune irrelevant features.

How easy is it to compute this feature in production? We sketched architecture diagrams to understand the necessary workload for each remaining feature. Interestingly enough, we noticed that some of the features that would be most troublesome to compute in production were also among the least important in terms of SHAP. Those were pruned as well.
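To illustrate the SHAP-based pruning step, a rough sketch follows; the importance cutoff, variable names, and the assumption of a trained LGBM classifier `clf` with a validation DataFrame `X_val` are ours for illustration:

```python
# Illustrative sketch of SHAP-based feature pruning on a trained LGBM model.
import numpy as np
import shap

explainer = shap.TreeExplainer(clf)            # clf: trained LGBM classifier
shap_values = explainer.shap_values(X_val)     # per-sample, per-feature attributions

# Older SHAP versions return a list [class 0, class 1] for binary classifiers.
values = shap_values[1] if isinstance(shap_values, list) else shap_values

# Global importance: mean absolute SHAP value per feature.
importance = np.abs(values).mean(axis=0)
low_importance = [
    name for name, score in zip(X_val.columns, importance) if score < 1e-3
]
print("candidates for pruning:", low_importance)

# Visual, case-by-case explanations for the whole validation set.
shap.summary_plot(values, X_val)
```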

Deploying the classifier

To deploy the classifier, we had to deal with two different environments: training and production. In the training environment, we wanted to recurrently generate features in batch, train our classifier on top of those, and save the resulting model somewhere. Meanwhile, in the production one, we wanted to compute the features in real-time, feed them to the classifier to obtain a classification result, and use this result to compute a risk assessment to be returned to our clients. At this point, we had little machine learning infrastructure.

Figure 1 — our training and production ML architecture.

Training architecture

As Figure 1 shows, the training pipeline was implemented in two parts, feature generation and model training, both orchestrated with Rundeck. The feature generation step uses Spark to compute the several classifier features in a distributed manner, on top of our data infrastructure. These features are then cataloged in our data lake and consumed by a separate, stand-alone training step written in Python, which trains the model, uses Optuna to optimize its hyper-parameters, and saves the classifier to specific S3 buckets.
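A condensed sketch of what that training step might look like follows; the search space, metric, and bucket names are illustrative assumptions, and `X_train`, `y_train`, `X_val`, `y_val` stand for the feature matrices produced by the feature generation step:

```python
# Sketch of the stand-alone training step: Optuna tunes a few LGBM
# hyper-parameters and the best model is persisted to S3.
import boto3
import joblib
import lightgbm as lgb
import optuna
from sklearn.metrics import average_precision_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
    }
    model = lgb.LGBMClassifier(**params)
    model.fit(X_train, y_train)
    # Average precision is a reasonable metric under heavy class imbalance.
    return average_precision_score(y_val, model.predict_proba(X_val)[:, 1])

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

best_model = lgb.LGBMClassifier(**study.best_params).fit(X_train, y_train)
joblib.dump(best_model, "model.joblib")
boto3.client("s3").upload_file("model.joblib", "example-models-bucket", "ato/model.joblib")
```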

We apply a black box analysis step during this training process, which is responsible for detecting potential issues with the newly trained model. For instance, if the success metrics vary too much from one run to the next, this may indicate that a bug was introduced somewhere along the pipeline. In that case, rather than promoting this version of the classifier to production, we halt the process and notify the team.
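In spirit, the gate works roughly like the sketch below; the metric store, tolerance, and notification hook are assumptions made for illustration:

```python
# Minimal sketch of a metric-drift gate: if a success metric moves too far from
# the previous run, halt promotion and alert the team. `notify_team` is a
# hypothetical hook into whatever alerting system is in place.
def validate_new_model(new_metrics: dict, previous_metrics: dict, tolerance: float = 0.05) -> bool:
    for name, new_value in new_metrics.items():
        old_value = previous_metrics.get(name)
        if old_value is None:
            continue
        if abs(new_value - old_value) > tolerance:
            notify_team(f"{name} moved from {old_value:.3f} to {new_value:.3f}; halting deploy")
            return False
    return True
```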

Production Architecture

In production, we replicated the feature computation queries directly on top of our canonical user database. We performed several tests in staging and pre-production to verify that the results of those queries matched the ones produced by the training pipeline. Then, we implemented a secondary API using the FastAPI framework to process these features.

We chose FastAPI because it is easy to integrate with LGBM’s API, and because of its blazingly fast performance: our stress tests showed p99 latencies as low as 5 milliseconds and a memory footprint of around 300 MB, including the LGBM classifier loaded in memory with joblib. Latency was especially important to us due to a latency budget of 200 milliseconds for login risk assessments in our API.
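A simplified sketch of such a scoring service is shown below; the endpoint name, payload shape, and threshold values are illustrative assumptions, not our production contract:

```python
# Simplified sketch of a FastAPI service scoring pre-computed login features
# with an LGBM classifier loaded via joblib.
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # classifier trained and saved offline

ALPHA, BETA = 0.2, 0.8  # approval / fraud thresholds (assumed values)

class LoginFeatures(BaseModel):
    features: List[float]  # pre-computed, in the same order used at training time

@app.post("/score")
def score(payload: LoginFeatures):
    s = float(model.predict_proba([payload.features])[0][1])
    if s >= BETA:
        evidence = "high_risk"
    elif s <= ALPHA:
        evidence = "low_risk"
    else:
        evidence = "inconclusive"
    return {"score": s, "ml_evidence": evidence}
```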

Heuristics first!

Figure 2 — a sample of our API response.

Our API response consists of a risk assessment backed by a set of evidence. A crucial business choice for us was to use the ML classification result as one additional piece of evidence, complementary to our other heuristic evidence. Albeit powerful, ML approaches may fail to learn cases that are trivial to the human eye, and those cases would then receive an incorrect assessment.

There is also a great gap between training a classifier and understanding its results: in fact, there are whole research initiatives, such as SHAP, dedicated to understanding how a trained model works behind the curtains. Because of that, having safety nets composed of deterministic rules and even human-in-the-loop approaches helps keep solutions sane and safe.
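The sketch below conveys the idea of deterministic evidence taking precedence over the ML signal; the evidence names and rules are assumptions for illustration, not our actual engine:

```python
# Illustrative sketch of "heuristics first": deterministic rules decide the
# trivial cases, and the ML score acts only as one additional piece of evidence.
def risk_assessment(evidence: dict) -> str:
    if evidence.get("device_known") and evidence.get("location_trusted"):
        return "low_risk"       # trivial-for-humans case, decided by rules alone
    if evidence.get("ml_evidence") == "high_risk":
        return "high_risk"      # ML signal tips the balance when rules are silent
    return "unknown_risk"       # fall back to additional verification steps
```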

Closing thoughts

In the course of 2 to 3 months, we were able to create and fully deploy a classifier with a measurable impact for our clients: it increased our approval rate by 6 percentage points and our fraud detection rate by almost fifty percent. When combined with our other evidence, we were able to authenticate 93% of legitimate users without any additional friction and detect fraud with a false-positive rate of 0.0013%.

Our machine learning usage in production is still a work in progress, and we are constantly improving in this regard. We are currently gravitating towards MLflow as a better solution for organizing our models. More importantly, this goes to show that simple, pragmatic solutions using ML responsibly can deliver great value. We hope we were able to highlight some key issues we found along the way and our solutions for them.

We have also developed other work related to ML in the past. Feel free to check some of our previous publications on topics such as Place Deduplication and Social Distancing Analysis. If any of this has piqued your interest and you want to take part in building a leading location-based zero-factor authentication solution, Incognia is hiring!
