Revenge of the nerds

How data scientists catch fraudsters (part II)

Konstantinos Leventis
Fourthline Tech
16 min read · Mar 20, 2024


Leaf-wise tree growth. Source: Reproduced from LightGBM. Image by author.

“Times are changin’, Betty. These nerds are a threat to our way of life.”
– Stan Gable

In part 2 of this series on KYC-fraud detection we dive into the deep end and explore exactly how Fourthline’s data scientists develop algorithms to catch fraud. Whereas part 1 focused on unsupervised-learning techniques and their performance and applicability, in this part we demonstrate how we use supervised learning to strengthen Fourthline’s fraud-prevention capabilities.

Why supervised learning?

In part 1 we used unsupervised learning to explore relationships between different data points (e.g. document type, or time of submission) and to assess their relative strengths. As a result, we gained insights into the data and explored the potential of unsupervised learning by trying half a dozen algorithms and pushing most of them to their limits.

However, unsupervised algorithms can, at best, achieve 40% for both recall and precision, on both our synthetic fraud-detection data sets and data from the real world (in production). Such models lag considerably behind human performance and raise questions over how to serve them in production in a way that adds value to our products. It turns out we can do much better!

Gradient-boosted decision trees

There is no shortage of supervised-learning algorithms available. Even working through the relevant section of scikit-learn’s library should keep one busy for days on end. However, in this post, contrary to the previous one, we will not be comparing textbook approaches. Instead, we will showcase how we achieve state-of-the-art results.

Let’s cut to the chase then and focus on the framework we have been using for years, gradient-boosted decision trees (GBDTs). Why boosting? To start with, numerous publications favor it over all other approaches for tabular data.

GBDTs are also the framework of choice in Kaggle competitions, for example in the IEEE-CIS Fraud Detection competition, which is very similar to KYC fraud detection. Moreover, GBDTs proved to be on a different level in terms of predictive power on our data sets, easily outperforming all other methods we have tried.

Graphical illustration of the learning process of GBDTs. Image by author.

How do GBDTs work? At every stage, a tree is added to an ensemble of trees. Each new tree (without loss of generality, let’s call it the n-th tree of the ensemble) is trained to predict the residuals between the collective prediction of the previous (n-1) trees and the labels of the training data. The ensemble is then boosted by adding the weighted predictions of tree n to those of the preceding trees. By limiting the total number of trees and introducing other tree-growing hyperparameters (e.g. depth, number of leaves), boosting frameworks let us regulate resources and overfitting.
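To make the boosting loop concrete, here is a minimal, hypothetical sketch of the idea for a squared-error objective, where the negative gradient is simply the residual. It uses scikit-learn decision trees as weak learners and illustrates the principle rather than how LightGBM is actually implemented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Minimal gradient boosting for a squared-error objective (illustration only)."""
    base_score = float(np.mean(y))
    prediction = np.full(len(y), base_score)      # start from a constant prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)  # boost the ensemble
        trees.append(tree)
    return base_score, trees

def predict_gbdt(base_score, trees, X, learning_rate=0.1):
    """Sum the base score and the weighted predictions of all trees."""
    return base_score + learning_rate * sum(tree.predict(X) for tree in trees)
```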

There are three main GBDT frameworks on the market: LightGBM, XGBoost and CatBoost. Each represents a slightly different take on gradient boosting, and they can barely be separated in terms of performance. However, LightGBM performed best on our data sets and is considerably faster to train. So, for us, the choice was a no-brainer.

Having decided which algorithm to use, you might assume we could just throw data at it and hope for the best. After all, GBDTs can handle different data types (numerical and categorical), are impervious to missing data (null values) and aren’t particularly sensitive to hyperparameter choices. So, what could go wrong?

While all this is true, as data scientists know very well, there are other tasks involved that spell the difference between a decent model and a remarkable one: data selection and splitting, feature engineering, model training and validation strategy, to name but a few. In the following sections we thoroughly analyze how we approach each step in the process of developing fraud-detection models that set the bar in the industry.

Data preprocessing

The trials of data preparation. Image by xkcd.com

Preparing data for training is one of the most arduous tasks in machine learning. Before even getting to the feature-engineering stage, we preprocess raw data with three basic goals in mind:

  • Eliminate data that either adds nothing to the model or, at worst, confuses it.
  • Homogenize and combine data originating from diverse sources.
  • Create splits while preventing data leakage.

Domain nomenclature

Let’s first define some domain jargon we will need in this section.

Case: All the information collated for an Identity-Verification instance, including the outcome. This is triggered by a natural person submitting an application for a service, e.g. opening a bank account. Each case in our data set is a ‘sample’, with its own set of feature values and a label. Our goal is to predict if the case is fraudulent or not.

Client: The natural person whose identity needs to be verified. Multiple cases can be associated with the same client.

Data selection

In Fourthline’s AI team our job is to use machine learning to turn data into products. We must, therefore, understand:

  • The origin of the data we use, e.g. mobile-phone cameras, lighting conditions, mistyped personal data.
  • Their lifecycle, from receiving the data to storing it in a database and all transformations in between.
  • The meaning of the data.

In addition, we have to understand exactly how we can strengthen Fourthline’s products and ensure our machine-learning models are fit for purpose.

One example is the type of cases the models are trained on. For Identity Verification, used for example when opening a bank account, the client needs to submit a particular set of data. But for Client Authentication, used to re-authenticate the account owner’s identity at some later point, less data is typically required. Therefore, there are systematic differences in the available data in these two types of cases. This heterogeneity is best addressed by developing different models for each use case.

Data splitting

The basic requirement to fulfil here is to prevent data leakage between the train set and the validation and test sets. There are two criteria that need to be simultaneously satisfied to achieve that:

Firstly, splits are based on client ID, so cases associated with the same client end up in the same split. Since there is potential for substantial feature overlap between the cases of the same client, we want to avoid spreading them across splits, because that effectively skews either the validation or the testing metrics. Either way, it ends up compromising the quality of the trained model.

Secondly, splits are based on time. All cases in the test split have dates in the future relative to cases in the train and validation sets. In this way we emulate what happens in production, where a model trained and validated/tested on historical data is used to predict the outcome of new data. The model is therefore susceptible to various mechanisms that can reduce its performance during testing, such as domain shift or previously unseen categories. As Nobel-prize winner Niels Bohr once said: “prediction is very difficult, especially if it’s about the future”.

Splitting data to train (including validation) and test. Two factors are accounted for: time of application (position on the time axis) and client ID (represented by different colors). The test set lies in the future, relative to the train/validation set. However, cases with the same client ID go in the same split, illustrated here by the deep-blue cases, even though one was submitted before and the other after the split date. As we will see later, further splitting into train and validation sets also accounts for client IDs, but not for time. Image by author.
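As an illustration, a split respecting both criteria could be expressed with a rule keyed on each client’s first submission. The column names (client_id, submitted_at) and the rule itself are a hypothetical simplification of our actual procedure.

```python
import pandas as pd

def split_by_time_and_client(cases: pd.DataFrame, split_date: str):
    """Time-based train/test split that keeps all cases of a client on the same side."""
    cutoff = pd.Timestamp(split_date)
    # A client is assigned to train/validation if their first case precedes the cutoff;
    # all of that client's cases then follow, even those submitted after the cutoff.
    first_seen = cases.groupby("client_id")["submitted_at"].transform("min")
    train_val = cases[first_seen < cutoff]
    test = cases[first_seen >= cutoff]
    return train_val, test
```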

Feature engineering

“Coming up with features is difficult, time-consuming, and requires expert knowledge. ‘Applied machine learning’ is basically feature engineering”.

~ Andrew Ng

Fortunately, at Fourthline we relish difficult tasks, otherwise fraud detection would be a terrible choice for a machine-learning project! We have been refining our approach for years and, with the support of in-house fraud experts, continuously strive to convert raw data into powerful features.

Feature types

The features we create shed light on and track the history of the four pillars of a case: who, where, when and what.

Four pillars: who, where, when and what. Image by author.

Who

There are many features that describe who a case relates to. Some straightforward aspects, like age and gender, are not very powerful predictors on their own, but the model can combine them usefully with others. Furthermore, we can track the historical record (if any) of that client in our database.

We can get an even stronger grip on the who by creating more complex features, especially by aggregating individual data points (e.g. nationality, place of birth) into composite descriptions of the client that are nevertheless generic enough to take the same values for very different people. A naïve example is to combine place of birth with information from the where and create a boolean feature that captures whether the client lives in the country they were born in.
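That naïve example boils down to a one-line pandas transformation; the column names below are hypothetical.

```python
import pandas as pd

def add_lives_in_birth_country(cases: pd.DataFrame) -> pd.DataFrame:
    """Boolean composite feature: does the client reside in the country they were born in?"""
    cases = cases.copy()
    cases["lives_in_birth_country"] = cases["country_of_birth"].eq(cases["country_of_residence"])
    return cases
```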

Where

This seemingly simple category actually includes a diverse set of features ranging from obvious to advanced and computationally expensive to create. Obvious features include the home address provided by the client and the geolocation of the device that submitted the application. More complex and more powerful features can be created by considering the history of those and other features, for example, the application history of a given address.

When

Temporal features must be approached with great care to unlock their potential. Even the time of day a case was submitted can be used as a feature. But beyond such simplistic features, which are often of limited use, time is implicit in many of the features we create. For example, whenever the history of an aspect of a case plays a role (think of a person’s or an address’s application record), time needs to be accounted for.

What

This category contains two distinct groups of features. The first group defines what the client used when submitting their case, e.g. ID-document type, device brand/model. Like other simple features, members of this group are especially useful in combination with others. The second group, probably the most important set of features overall, defines what our internal automated processing (spearheaded by our AI-services suite) determined about the case and its artifacts. By far the strongest predictors of fraud are based on Fourthline’s AI models, which are purpose-built for KYC.

Creating features

The backbone of feature engineering is an in-house-developed class based on scikit-learn’s ColumnTransformer. This is an elegant and efficient way of feeding a bunch of raw data points to a single object and getting a pandas DataFrame as output, although it can make major revisions tricky. This choice is effectively a trade-off between flexibility and the convenience of a centralized transformation-factory pipeline.
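For illustration, a bare-bones version of such a transformation factory might look like the sketch below. The column names and transformers are placeholders rather than our actual pipeline, and set_output(transform="pandas") assumes scikit-learn 1.2 or later.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

numerical_cols = ["client_age", "days_since_last_case"]   # hypothetical raw columns
categorical_cols = ["document_type", "device_brand"]

feature_factory = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numerical_cols),            # trees need no scaling
        ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         categorical_cols),
    ],
    remainder="drop",
)
feature_factory.set_output(transform="pandas")             # emit a DataFrame, not a NumPy array
```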

The advantages of our feature-engineering pipeline lie not so much in the objects we use to do the work, but in how we use them, how we choose transformer hyperparameters and the safety guards we implement to prevent data leakage.

Feature-engineering architecture. Features (or, data transformations, more generally) generated at a particular stage can only be used in subsequent stages. Some of the features we use are the result of a series of compound transformations, which is why we need multiple stages. Image by author.

Handling categorical data

A lot of the raw data and features used during training are categories, expressed in strings. However, strings are inefficient during processing and can overconsume precious resources, such as time and RAM. Therefore, a ubiquitous transformation in our pipelines is ordinal encoding of categorical features to integers. The integers are still fed to the training algorithm as categories. To achieve this, we piggyback on the functionality of scikit-learn’s OrdinalEncoder and ensure the models we train never have to manipulate strings.
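A hedged sketch of that pattern follows: strings are encoded to integers once, and the integer columns are then declared categorical to LightGBM so it keeps treating them as categories rather than ordered numbers. The column names are illustrative.

```python
import lightgbm as lgb
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["document_type", "nationality"]               # hypothetical categorical columns
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

def encode_categories(df: pd.DataFrame, fit: bool = False) -> pd.DataFrame:
    """Replace string categories with integer codes (unknowns map to -1, i.e. missing)."""
    df = df.copy()
    codes = encoder.fit_transform(df[cat_cols]) if fit else encoder.transform(df[cat_cols])
    df[cat_cols] = codes.astype(int)
    return df

def to_lgb_dataset(df: pd.DataFrame, label_col: str = "label") -> lgb.Dataset:
    """Build a LightGBM Dataset that still treats the encoded integers as categories."""
    return lgb.Dataset(
        df.drop(columns=label_col),
        label=df[label_col],
        categorical_feature=cat_cols,
    )
```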

Selecting transformer hyperparameters

A lot of the transformers we use come with hyperparameters whose values can be very important both for performance and for generalization. For example, with categorical features, the categories provided or the cut-off frequency can effectively be seen as a bias-variance trade-off: more categories favor in-domain performance over generalization. For all feature types, our selection of hyperparameters is driven by training a model on many instances of the same feature, where each instance has a different hyperparameter set. After the model is trained, feature importance can indicate which choice is best.
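One way to run such a comparison for a cut-off frequency is sketched below with a hypothetical device_model column: generate several variants of the same feature with different cut-offs, train one model on all of them, and let feature importance point at the most useful setting.

```python
import pandas as pd

def rare_to_other(column: pd.Series, min_count: int) -> pd.Series:
    """Bucket categories rarer than min_count into a single 'OTHER' value."""
    counts = column.value_counts()
    rare = counts[counts < min_count].index
    return column.where(~column.isin(rare), other="OTHER")

def add_cutoff_variants(df: pd.DataFrame, column: str, cutoffs=(5, 20, 100)) -> pd.DataFrame:
    """Add one variant of the feature per candidate cut-off frequency."""
    df = df.copy()
    for cutoff in cutoffs:
        df[f"{column}_min{cutoff}"] = rare_to_other(df[column], cutoff)
    return df

# e.g. add_cutoff_variants(train_df, "device_model"); after training one model on all
# variants, feature importance indicates which cut-off to keep.
```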

Preventing data leakage

An important step in preventing data leakage comes at the feature-engineering stage. To ensure that information from the future test set does not trickle into the train set, we only fit the feature-engineering pipeline on the train set. This way the transformer pipeline, which includes the state of every transformer employed, will only have seen the train set when it is called to create features for evaluating the model on the test set.
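In code, this guard boils down to calling fit only on the train split; a minimal sketch, reusing the hypothetical feature_factory from the ColumnTransformer example above.

```python
def make_features(feature_factory, raw_train, raw_test):
    """Fit the transformer pipeline on the train split only; transform test with that frozen state."""
    X_train = feature_factory.fit_transform(raw_train)   # transformer state learned from train only
    X_test = feature_factory.transform(raw_test)          # never fit_transform on the test set
    return X_train, X_test
```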

Feature importance and optimal feature set

Feature importance is a somewhat overloaded term. A trained LightGBM model has feature importance baked into it. Depending on whether we choose “split” or “gain” when calling feature_importance(), we get the features ranked either by the number of times each one was used for splitting during training, or by the total gain it contributed, respectively.
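Both flavours can be pulled from a trained booster as follows; here booster stands for any trained LightGBM model.

```python
import pandas as pd

def builtin_importance(booster, importance_type: str = "gain") -> pd.Series:
    """Rank features by LightGBM's built-in importance ('split' or 'gain')."""
    scores = booster.feature_importance(importance_type=importance_type)
    return pd.Series(scores, index=booster.feature_name()).sort_values(ascending=False)

# builtin_importance(booster, "split")  # how many times each feature was used for splitting
# builtin_importance(booster, "gain")   # total gain each feature contributed during training
```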

However, we are chiefly interested in assessing feature importance during inference. For this, we have developed our own implementation of permutation feature importance. We use it to evaluate the importance of each feature when the model is called to predict labels on a test set. This approach uses one or multiple metrics to estimate how much the performance of a trained model drops when the values of a single feature in the test set are randomly permuted along the axis of samples. Doing this multiple times for each feature (with a different random seed each time) yields statistically sound results.

Permutation feature importance on the test set. The top 40 (anonymized) features are presented in order of decreasing importance, according to the AUC metric. This is a reliable way to assess the relative importance of two (or more) features. But in order to decide whether a feature helps the model overall, we train models with and without that feature and then compare results. Image by author.
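The mechanics of our permutation importance can be sketched as follows, with AUC as the example metric; the in-house implementation supports multiple metrics and other refinements, but the core loop is the same. score_fn is assumed to return a fraud score per case.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_importance(score_fn, X_test, y_test, n_repeats=5, seed=0):
    """Mean drop in AUC when each feature is shuffled; a larger drop means a more important feature."""
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y_test, score_fn(X_test))
    importance = {}
    for column in X_test.columns:
        drops = []
        for _ in range(n_repeats):                     # several shuffles for statistical soundness
            shuffled = X_test.copy()
            shuffled[column] = rng.permutation(shuffled[column].values)
            drops.append(baseline - roc_auc_score(y_test, score_fn(shuffled)))
        importance[column] = float(np.mean(drops))
    return dict(sorted(importance.items(), key=lambda item: item[1], reverse=True))
```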

An importance-driven ordered list of features is also the starting point for determining the optimal feature set. The idea is to use recursive feature elimination: we start a training run with all possible features, assess their importance, and then remove them one by one, starting with the least important. At each step we retrain the model and monitor performance metrics to identify which set of features extracts the best performance out of the trained model.
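A hedged sketch of that elimination loop; train_and_evaluate and rank_features stand in for the cross-validated training and importance assessment described in this post.

```python
def recursive_feature_elimination(features, train_and_evaluate, rank_features):
    """Retrain while dropping the least important feature each step; return the best feature set."""
    history = []
    remaining = list(features)
    while remaining:
        metric, model = train_and_evaluate(remaining)     # retrain on the current feature set
        history.append((list(remaining), metric))
        if len(remaining) == 1:
            break
        ranked = rank_features(model, remaining)           # ordered from most to least important
        remaining.remove(ranked[-1])                        # discard the least important feature
    best_features, best_metric = max(history, key=lambda entry: entry[1])
    return best_features, best_metric, history
```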

Model training

All machine-learning engineers know that training is not the be-all and end-all stage in creating production-worthy machine-learning systems. But it is still pretty important, so let’s dive into it.

Problem definition

We train machine-learning algorithms to distinguish fraudulent cases from non-fraudulent ones, based on their features. So, this means we train binary classifiers, right? Not necessarily. This is another area where domain understanding has helped us optimize our approach. Fraudsters try a variety of methods to ‘fool the system’, from fake personal data and ID documents to deepfake selfie videos. Even within those categories there are sub-groups, separated by finer differences.

We trained and compared binary models (where all types of fraud are bundled under a single class) against both multiclass classifiers and binary classifiers trained on only one fraud type. We found that binary classifiers performed worst, while multiclass models perform virtually the same on specific fraud categories as binary classifiers trained on one fraud type. Therefore, to reduce computational resources and complexity we opted for multiclass classifiers.

Options for class definitions, classifier(s), and their corresponding performance. In the single-fraud-class scenario, a binary classifier can be trained to distinguish frauds from non-frauds. However, in the case of multiple fraud classes, one can train either a multiclass classifier or multiple binary classifiers. Both perform better than a binary classifier. Image by author.

Ensemble

In reality, we don’t train one model, but k models in a k-fold cross-validation fashion. This ensures that, during training, every single data point not in the test set is seen by k-1 models, that is, all but one. As with train/test splitting, fold splitting is based on client IDs, which prevents identity leakage between the k-1 folds used for training at every iteration and the k-th fold used for validation. The k-fold cross-validation approach is pretty standard among machine-learning engineers and is desirable wherever the size of the data set (also accounting for minority-class representation) and the available computational resources permit it.
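A minimal sketch of such an ensemble, using scikit-learn’s GroupKFold so that all cases of a client land in the same fold; the early-stopping rounds and other parameters are placeholders.

```python
import lightgbm as lgb
from sklearn.model_selection import GroupKFold

def train_ensemble(X, y, client_ids, params, k=5):
    """Train k LightGBM models, each validated on a client-disjoint fold."""
    models = []
    for train_idx, val_idx in GroupKFold(n_splits=k).split(X, y, groups=client_ids):
        train_set = lgb.Dataset(X.iloc[train_idx], label=y.iloc[train_idx])
        val_set = lgb.Dataset(X.iloc[val_idx], label=y.iloc[val_idx], reference=train_set)
        model = lgb.train(
            params,
            train_set,
            valid_sets=[val_set],
            callbacks=[lgb.early_stopping(stopping_rounds=50)],  # stop adding trees when validation stalls
        )
        models.append(model)
    return models
```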

LightGBM-hyperparameter optimization

Like all machine-learning algorithms, LightGBM can be tuned to perform optimally on different data sets and objectives by changing the values of hyperparameters. These parameters cannot be learned during training and instead remain constant throughout a single training experiment. For example, we can reduce overfitting by limiting the number (num_iterations) or the variance (max_depth) of the model’s weak learners — in this case decision trees.

When training any model in the ensemble, the corresponding validation set has two roles. It is used for early stopping, whereby no more trees are added to the model once user-defined metrics stop improving on the validation set. And it is also used to select the best combination of model hyperparameters on each fold.

The four main approaches to hyperparameter tuning. Top left: exhaustive grid search. It can also be exhausting as every possible combination of candidate values for different hyperparameters is actually tried out. Top right: non-exhaustive grid search. Same as previous but with some combinations removed. Bottom left: random search. In this case randomness can be manifested in two ways: (a) random combinations of predefined hyperparameter values are tried out and/or (b) random values (drawn from predefined distributions) of hyperparameters are tried out. Bottom right: adaptive search. A diverse family of approaches under which the hyperparameter values are neither predefined, nor random, but are decided by an algorithm based on results obtained at other points in the hyperparameter space. Image by author.

While there are several possible ways to fine-tune LightGBM hyperparameters, GBDTs are very robust to hyperparameter choices. Sophisticated optimization techniques therefore typically yield little gain, provided the starting hyperparameters are not miles off the sweet spot in hyperparameter space; once even remotely close to it, the objective barely changes.
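In practice a light-touch search, such as the random-search sketch below, is usually enough given how flat the objective is near the sweet spot; the parameter grid is illustrative, not a tuned recommendation.

```python
import random

# Illustrative search space over common LightGBM hyperparameters.
SEARCH_SPACE = {
    "num_leaves": [31, 63, 127],
    "max_depth": [-1, 6, 10],
    "learning_rate": [0.01, 0.05, 0.1],
    "min_data_in_leaf": [20, 50, 100],
}

def random_search(evaluate, n_trials=20, seed=0):
    """Try random hyperparameter combinations; evaluate() returns a validation metric to maximize."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        trials.append((evaluate(params), params))
    return max(trials, key=lambda trial: trial[0])        # best (metric, params) pair
```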

Objective and validation metric

The metric we use to decide which model is best is recall (for the fraud class) at business-relevant precision values. However, the loss function optimized during training is a preset objective in LightGBM, namely multiclass. When multiclass is chosen, the model expresses its prediction scores for each class via softmax, and the output is used to compute the multiclass cross-entropy (log loss), which is what gets minimized on the train set. We note here that “multiclass_ova”, whereby the model outputs binary predictions for each class separately, typically yields very similar but slightly inferior models.

We evaluate the validation set using the multi_logloss metric, which is basically the same as the objective. Therefore, the value of this metric on the validation set determines both early stopping and hyperparameter tuning. Other options yield similar, but usually not better, results. Metrics multi_error and auc_mu are also good choices and the resulting models are virtually indistinguishable from those trained with multi_logloss.
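Putting the objective and metric choices together, the core of the LightGBM parameter dictionary might look roughly as follows; the number of classes and the remaining values are illustrative.

```python
params = {
    "objective": "multiclass",        # softmax over the non-fraud class and the fraud sub-classes
    # "objective": "multiclass_ova",  # one-vs-all alternative: very similar, slightly inferior
    "num_class": 4,                   # e.g. one non-fraud class plus three fraud sub-classes
    "metric": "multi_logloss",        # drives both early stopping and hyperparameter selection
    # "metric": "multi_error",        # alternatives yielding virtually indistinguishable models
    # "metric": "auc_mu",
    "num_leaves": 63,
    "learning_rate": 0.05,
}
```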

Inference and evaluation

So far, we have covered every stage of the machine-learning pipeline that results in well-trained, powerful fraud predictors. However, going from an ensemble of multiclass softmax predictions to a single, business-relevant fraud assessment is a twisty road, with some obvious, and some less obvious choices to be made along the way.

From ensemble to single multiclass predictions

First, the softmax outputs for every class from each model are averaged across all k models. The result is a single multiclass (still softmax-like) prediction, as if only one model was doing inference. There are various other possibilities here. However, averaging class predictions across models takes account of every model’s numerical prediction, and therefore best utilizes the ensemble.
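The averaging step itself is essentially a one-liner, assuming the list of models returned by the k-fold training sketch above.

```python
import numpy as np

def ensemble_predict(models, X):
    """Average the softmax outputs of the k models into one (n_samples, n_classes) prediction."""
    return np.mean([model.predict(X) for model in models], axis=0)
```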

From multiclass to binary predictions

Next, we sum up the predictions for all fraud subclasses into a single fraud class. Since the multiclass predictions add up to one, the resulting binary-like prediction also adds up to one, emulating the output of a binary classifier. Now we are ready to impose score thresholds that yield precision performance in line with product requirements.
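A hedged sketch of the reduction and thresholding, assuming class index 0 is the non-fraud class and the threshold has been chosen to hit the required precision.

```python
import numpy as np

def fraud_score(multiclass_probs: np.ndarray) -> np.ndarray:
    """Sum the probabilities of all fraud sub-classes into a single, binary-like fraud score."""
    return multiclass_probs[:, 1:].sum(axis=1)     # column 0 is assumed to be the non-fraud class

def flag_fraud(multiclass_probs: np.ndarray, threshold: float) -> np.ndarray:
    """Flag cases whose fraud score exceeds the product-driven threshold."""
    return fraud_score(multiclass_probs) >= threshold
```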

ROC curve (left) and precision-recall curve (right) for a fraud-detection model after reducing multiclass ensemble predictions to a single binary prediction (fraud score), for each case. At the moment of writing this, these results were almost a year old. We have since added features and improved the training process. It has been a long time since we last saw AUC values lower than 0.99. Image by author.

It is worth taking a moment here to compare performance against the unsupervised-learning algorithms in part 1. Both precision and recall are more than two times better, going from a maximum of 40% to easily above 80%, for thresholds where these two metrics are equal.

Explainability

It is very valuable to have access to the reasons behind model predictions, not only to gain insight into the model’s inner workings, but also because of the stringent compliance requirements in the regulated industry where our models operate. Decision trees are particularly useful in that respect. GBDTs present more of a technical challenge, mainly due to their size. A handful of libraries exist that address this challenge, and this is one of the few instances where Fourthline relies on third-party solutions. There are numerous use cases for this functionality. On the one hand, we can use it internally to assist human specialists in decision making and highlight attention areas. On the other, we can share explanations as needed with our business partners and even regulators seeking to ensure our decisions are justified and based on facts, rather than being the outcome of biased or inadequate machine-learning models.
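As one illustration of what per-prediction explanations can look like, LightGBM itself can emit per-feature contribution values (SHAP-like) via pred_contrib, and dedicated explainability libraries build on the same idea; this is not necessarily the tooling we use in production. The class layout assumed below is the non-fraud class at index 0 and a fraud sub-class at index 1.

```python
import numpy as np
import pandas as pd

def explain_case(model, X_case: pd.DataFrame) -> pd.Series:
    """Per-feature contributions to a single multiclass prediction (SHAP-like values).

    For multiclass models, predict(..., pred_contrib=True) returns (n_features + 1)
    columns per class, the last column of each block being the bias term.
    """
    contribs = model.predict(X_case, pred_contrib=True)[0]
    n_features = X_case.shape[1]
    block = contribs[1 * (n_features + 1): 2 * (n_features + 1) - 1]   # class 1, bias excluded
    return pd.Series(block, index=X_case.columns).sort_values(key=np.abs, ascending=False)
```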

Behind the scenes and in the works

Despite this being a lengthy post, it omits some important aspects of our work that are either in the pre-production stage or relate to tasks other than fraud detection, and which deserve their own blog post. Behind the scenes, and already mature, are two major development efforts.

On the technical side, we have established robust workflow-management procedures that automate lots of previously semi-manual work and enable us to use features that significantly improve our models’ performance. We use Apache Airflow for automating a variety of tasks related to both the research and productization side of fraud-detection models. With help from Fourthline’s data and platform engineers we have created a computing cluster optimized for handling fraud-detection data sets and models, while eliminating manual work performed by data scientists.

High-level visualization of workflow automation using Apache Airflow. Data is fetched and all preprocessing, feature engineering, model training and testing are dispatched to a dedicated cluster. The resulting models are then compared against the model currently running in production. If better, the new models are automatically deployed. Image by author.
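A minimal, hypothetical sketch of what such a DAG can look like (assuming Airflow 2.x); the task names, schedule and deployment logic are placeholders, not the production pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_data(): ...
def build_features(): ...
def train_and_test(): ...
def compare_and_deploy(): ...         # promote the new model only if it beats production

with DAG(
    dag_id="fraud_model_retraining",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",               # placeholder schedule
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_data", python_callable=fetch_data)
    features = PythonOperator(task_id="build_features", python_callable=build_features)
    train = PythonOperator(task_id="train_and_test", python_callable=train_and_test)
    deploy = PythonOperator(task_id="compare_and_deploy", python_callable=compare_and_deploy)

    fetch >> features >> train >> deploy   # linear dependency chain
```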

The second development effort concerns a very different, but extremely important topic. We have developed a framework for assessing and ensuring fairness in our models. This is based on literature and original research on measuring fairness and guaranteeing no bias in the models we productize. The result has been a set of configurable fairness diagnostics that assess both the data sets on which we train our models and the models themselves. What’s more, we have implemented small but crucial modifications to our model-training pipeline that let us control bias when needed, while largely maintaining model performance. As we eagerly await what is, at the time of writing, the world’s first comprehensive piece of AI legislation, we expect that the fintech industry will be among the most affected and that fairness of machine-learning models will be a major factor for regulators when assessing models and permitting them to operate.

Diagnostics for bias in the data are applied before and after data selection (on raw and preprocessed data, respectively). Presence of bias in a trained model is assessed by comparing fairness metrics across sub-groups of the test set. Bias-reduction switches can be turned on during the data-selection and training stages. Image by author.
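At its core, such a diagnostic can be as simple as comparing a metric across sub-groups of the test set; a hedged sketch with recall as the example metric and a hypothetical grouping column.

```python
import pandas as pd
from sklearn.metrics import recall_score

def recall_by_group(y_true, y_pred, groups: pd.Series) -> pd.Series:
    """Recall per sub-group; large gaps between groups are a signal of possible bias."""
    frame = pd.DataFrame({"y_true": y_true, "y_pred": y_pred, "group": groups.values})
    return frame.groupby("group").apply(
        lambda g: recall_score(g["y_true"], g["y_pred"], zero_division=0)
    )

# e.g. recall_by_group(y_test, fraud_flags, test_df["age_band"])   # hypothetical column
```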

Summary and outlook

In this post we have zoomed in on all aspects of how Fourthline develops top-performing data-science models for KYC-fraud detection. We have explained our approach and choices at every step in the process and shared salient insights into the most crucial stages: data preparation, feature engineering, and model training. We hope that this not only sheds light on how we achieve great results, but also motivates other data scientists (in the KYC industry and beyond) to borrow interesting ideas.

With all the side projects and ideas we currently have in the works, stay tuned for the next blog post in the series.

Eleanoor Polder is the Fourthline data scientist behind most of the work presented in this post, with notable contributions from Vuk Glisovic and Sebastian Vater. Fourthline’s AI, Data and Platform teams have continuously supported the work of the Data-Science team. The time and expertise generously shared by the Anti-Financial-Crime team of Fourthline has been absolutely crucial in this project. Finally, this post was polished by Fourthline’s technical writer Francesca Cook.
