Revenge of the nerds: How data scientists catch fraudsters (part I)

Konstantinos Leventis
Published in Fourthline Tech
Feb 16, 2021 · 18 min read
Image by Mcld under a creative-commons license. Reproduced in this article without any changes.

Tony was woken up by thunder, early on a Tuesday morning. But no matter. The excitement about putting his sinister plan into action soon took over. It was a well-thought-out plan, meticulously prepared. He would open a bank account with a made-up identity and a bogus passport he got from Sidney. Nobody’s better than Sid, he assured himself. With a bank account he could finally wash all that dough that had been piling up from pushing nose candy, and whatever else he could come by. He sat confidently behind the computer and began the enrollment process. That was easier than I expected, he thought as he leaned back and hit the submit button, a minute or so later. But, to his absolute surprise, his application was rejected within seconds!

How could that be? How was his attempted fraud detected, with such apparent ease and speed?

In this and future posts we will explain how that is possible. In particular we will present and analyze various unsupervised-learning methods applied to fraud detection, in the context of KYC (know your customer), at Fourthline. After briefly making a case for unsupervised learning we take a closer look at some of the approaches we have tested. We summarize the results and draw comparisons between those unsupervised methods, highlighting their advantages, as well as their drawbacks, and conclude with insights that we have gained by applying them to the challenging problem of fraud detection.

Introduction

KYC is the process of verifying the identity and assessing the suitability of potential customers of, primarily, financial institutions. This is important in order to mitigate risks associated with the initiation of a business relationship between those two parties. Money laundering is at the forefront of the various criminal activities that KYC aims at preventing. According to the United Nations, it reaches annual volumes of about 2–5% of global GDP.

As money laundering attracts more and more attention from regulators, the anti-money-laundering market has been growing substantially over recent years, according to market research.

Opening a bank account is a major step in the money-laundering process and fraudsters go to great lengths in order to achieve that. The different types of fraud they (try to) commit can be roughly grouped into document fraud and identity fraud. From tampered documents to deep-fake videos, we’ve seen it all at Fourthline. What we have understood is that in order to be successful in tackling fraud, a financial institution needs to be able to identify all fraud types.

At Fourthline we take fraud prevention very seriously. We achieve remarkable results not just by having fraud experts employing thorough processes to ensure security and compliance, but also by leveraging their knowledge to combine it with state-of-the-art machine-learning algorithms. As modern digital banks understand, complementing human expertise with insights from data analysis and modelling is essential to be successful in a landscape of increasingly sophisticated fraud attempts and strict regulatory directives.

Machine learning

There are various ways in which machine learning can come in handy when detecting fraud. On the one hand there are computer-vision algorithms, commonly based on deep neural networks, and on the other, data-science methods, relying on heterogeneous features and a variety of data types.

On the computer-vision side, the authenticity of documents can be verified in much the same way the human eye would go about it: by looking for elements in the image that support or undermine the document’s authenticity. Furthermore, biometric data (e.g. face recognition, liveness detection) play an important role in uncovering identity fraud.

Beyond images, though, there is a large body of data that comes with every application that a KYC provider, such as Fourthline, has at its disposal. From the time of the application, to the type of document and (mobile) device used by the applicant, there is always something to be learnt from historical data. Such data is what we use in order to train algorithms into powerful fraud detectors.

The cool thing is that these two disciplines do not have to remain separate. Results of computer-vision algorithms can always be expressed in numbers, representing confidence in, or probability of a particular outcome, or generally, a ‘score’. These scores can then be taken into account within a data-science context, along with all other non-computer-vision data, on which a more holistic, hybrid approach can be built. And that is exactly how we have compiled the data sets which we use in the following sections.

Speaking of data sets, we note that all results presented in this post are reported on synthetic data, which are loosely based on trends we see in production data. Class imbalance of those data sets has the same ballpark value as in actual KYC databases. The exact numbers presented, therefore, should not be taken at face value. However, the qualitative conclusions we draw hold true on production data as well.

Why unsupervised learning?

Scikit-learn has been used heavily throughout this project. Taken from the scikit-learn website, this is an indicative illustration of the performance and wall-clock time of various clustering methods available in that library.

Before getting into the nitty-gritty, it is worth motivating the use of unsupervised learning for fraud detection. After all, since unsupervised learning ignores labels by definition, why would one use it at all in a (binary) classification problem, such as fraud detection? The answer to that is two-fold:

  • Unsupervised-learning methods are often the first course of action for data scientists who want to let the data (but not the labels) do the talking. In other words, when the preexisting understanding of the relationships between features is limited or non-existent, unsupervised learning can provide first insights that may later prove valuable for feature engineering, outlier determination and so on.
  • In fraud detection, correctly labelled data is a luxury. Data sets from all kinds of machine-learning problems suffer from noise and mistakes, but in fraud detection especially, where human mastery of the domain is rare and hard to come by, one has to be suspicious of the correctness of the labels, in particular the non-fraud ones. Ultimately, we are dealing with a heavily imbalanced data set in which the fraud samples are orders of magnitude fewer than the non-fraud ones. With such a data set, even if only a small percentage of the dominant class (non-frauds) should actually carry the opposite label (fraud), a significant part of the under-represented class is effectively mislabeled, which we would like to avoid.

Having said that, we should make clear that when evaluating the usefulness of the various methods and tuning their hyperparameters, we did actually use the labels. Therefore, strictly speaking, we actually performed supervised learning, but the labels only come into play at the hyperparameter-tuning and the evaluation stages, while training is label agnostic. After all, what we are really looking for is to optimize the separation of frauds from non-frauds.

Evaluation metrics

To evaluate just how good this separation can be, we have mainly used two widely known metrics, precision and recall. The steps we follow to get precision/recall values are simple:

  • Use a clustering algorithm (even a self-organizing map can be viewed as a collection of clusters, in a sense; more on that later) where the number of clusters is either predefined, or determined dynamically.
  • Order clusters by fraud potency, i.e. by decreasing fraction of the applications inside each cluster that are fraudulent, regardless of the total number of applications inside the cluster.
  • Evaluate precision and recall under different groupings of the ordered set of clusters.
Abstract example of a generic clustering process. 6 clusters (enumerated 1–6) are used to group the scattered data and, in general, they end up containing different numbers of frauds and non frauds. In terms of precision, cluster 1 is the highest, with a value of 2/3, since 2 of the 3 applications included in cluster 1 are frauds. In terms of recall, cluster 6 is the highest with 0.4, since it contains 4 out of the 10 frauds in the data set. Cluster 3 has 0 precision and 0 recall.
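
The steps above translate into a few lines of Python. The sketch below is ours, not Fourthline’s production code; it assumes two arrays of equal length, one holding the per-application cluster assignments and one the fraud labels.

```python
import numpy as np

def cumulative_precision_recall(cluster_labels, is_fraud):
    """Order clusters by decreasing fraud potency and accumulate precision/recall."""
    cluster_labels = np.asarray(cluster_labels)
    is_fraud = np.asarray(is_fraud, dtype=bool)

    clusters = np.unique(cluster_labels)
    # Fraud potency: fraction of applications inside each cluster that are frauds.
    potency = np.array([is_fraud[cluster_labels == c].mean() for c in clusters])
    order = np.argsort(potency)[::-1]            # most fraud-dense clusters first

    total_frauds = is_fraud.sum()
    n_selected, frauds_selected = 0, 0
    precision, recall = [], []
    for c in clusters[order]:
        in_cluster = cluster_labels == c
        n_selected += in_cluster.sum()
        frauds_selected += is_fraud[in_cluster].sum()
        precision.append(frauds_selected / n_selected)
        recall.append(frauds_selected / total_frauds)
    return np.array(precision), np.array(recall)
```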

K-means

K-means is perhaps the archetypal clustering method. With a simple and intuitive algorithm, and decent performance, it is typically the first (and often the last) stop in the data scientist’s journey towards a clustering approach. Its most notable disadvantage is that the number of clusters has to be provided by the user, i.e. is not inferred by the algorithm itself. However, there are ways around that shortcoming, which we discuss below.

A naïve application of k-means to our problem would be to try clustering the data in terms of the features, which in the case of k-means can only be numerical. Once this is done for many runs with different predefined numbers of clusters, criteria such as the elbow method, or the silhouette score can be used to determine which number of clusters results in the most efficient grouping. These criteria are still label agnostic. Once the number of clusters corresponding to the optimal clustering is found, we can start invoking labels to choose which of the clusters contain a significant number, or portion (or both) of fraudulent applications. These will be the clusters of interest.
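
As a sketch, such a scan over cluster numbers might look as follows (X stands for the numerical feature matrix; the range of cluster numbers is illustrative, not the exact grid we used):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias, silhouettes = {}, {}
for k in range(4, 31, 2):                        # candidate numbers of clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_ / len(X)           # k-means loss per data point (elbow plot)
    # The silhouette score is O(n^2), so a subsample keeps it tractable.
    silhouettes[k] = silhouette_score(X, km.labels_, sample_size=10_000, random_state=0)
```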

K-means loss, divided by the number of data points, as a function of the number of clusters for several k-means runs, each with a different number of clusters. Sample size is a modest 100,000 data points and the number of features is 9.

In the plot above, the elbow method would seem to suggest that a number of clusters between 12 and 20 is optimal, but which one? To try to answer that we can invoke the silhouette score.

Silhouette score for the same runs as above. A tentative local maximum at 12 is visible.

The silhouette score seems to agree with the above conclusion (warning: this is not always the case) and furthermore suggests that the optimal cluster number is 12. It is worth repeating that we still have not accounted for labels in the data. The optimality of 12 clusters is in no way related to which cases are frauds and which are non frauds. It only has to do with label-agnostic measures of the sensibility of grouping data in the parameter space defined by the features.

When we do look at how frauds are distributed across the clusters it turns out that the choice of 12 clusters is indeed a good one. A histogram of the number of clusters as a function of the fraud content can be seen below.

Histogram of clusters as a function of the fraud content within each. Two clusters stand out, containing a fraud fraction much higher than that of the data set as a whole.

The take-home message of this simple exercise is that we can split the data set in such a way that two clusters hold just 7.3% of all the data while containing 56% of all fraud. These results are also confirmed on a test set and, when contrasted with a prior fraud percentage of about 1%, they amount to a marked improvement over the starting point. However, we can do even better.

After devoting a bit more time and computational resources to the problem we found that using a few hundred clusters improves results noticeably. In the following example we have made use of 200 clusters and 400,000 data points. After ordering the clusters in terms of decreasing fraud potency (fraction of applications, i.e. data points, inside each cluster that are frauds), we can create the following plot.

Cumulative precision, recall and fraction of all training data (“prop_all_data”) inside the first N clusters (in order of decreasing fraud fraction) as a function of N.

For a 56% recall, as before, we now get ~17% precision in the training set, which is almost double that of the previous approach. Evaluation on the test set of 100,000 data points shows that the model is generalizing nicely.

Test-set evaluation of the model. Despite the presence of micro-clusters, containing fewer than 1,000 data points, training performance is to a large extent replicated, i.e. the model generalizes.
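
Test-set evaluation only requires the fitted model’s predict method. Continuing the earlier sketches (X_train, X_test and y_test are placeholder names, and cumulative_precision_recall is the helper from the evaluation section):

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=200, n_init=10, random_state=0).fit(X_train)
test_clusters = km.predict(X_test)               # assign test points to the trained centroids
precision, recall = cumulative_precision_recall(test_clusters, y_test)
```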

Perhaps the greatest advantage of using such a large number of clusters is the flexibility that we then have in choosing a high-recall, or a high-precision approach, or a compromise between the two, depending on N. This can make such models a one-stop shop for the various use cases in KYC.

K-prototypes

Categorical data cannot be used with most of the clustering algorithms out there and k-means is no exception. One can always one-hot-encode categorical data, but on the one hand this will drastically increase the dimensionality of the problem (just how drastically, depends on the number of categories) and on the other, it ends up projecting the data on the ‘walls’ of the parameter-space hypercube, distorting the notion of euclidean distance, which is the defining ingredient of the k-means algorithm.

There are smarter ways to go about mixing numerical and categorical data in a clustering algorithm. One such example is k-prototypes, implemented in the kmodes library. The first thing we did when testing this library was a sanity check: we tried to use only numerical data in order to confirm that, in that limit, the k-means results can be reproduced. Unfortunately, this is not possible; attempting it raises an error prompting the user to fall back to k-means, since there is no categorical data in the data set.
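
For reference, a minimal k-prototypes sketch along these lines (the column indices in categorical_idx and all parameter values are placeholders, not the settings used in our experiments):

```python
from kmodes.kprototypes import KPrototypes

# X holds mixed features; categorical_idx lists the indices of the categorical columns.
kproto = KPrototypes(n_clusters=30, init='Cao', n_init=5, random_state=0, n_jobs=-1)
clusters = kproto.fit_predict(X, categorical=categorical_idx)
# kproto.cost_ plays the role of the k-means loss when scanning for an elbow.
```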

Another quickly noticeable shortcoming of k-prototypes is that it is much slower than k-means. This limits the number of experiments we can run and effectively prohibits runs with hundreds of clusters. The slow speed also limits the amount of data we can use, making anything above a few tens of thousands of samples practically infeasible on a high-end PC. For 100,000 data points we could make runs with up to about 30 clusters. Looking for an elbow, we get the following picture.

K-prototypes runs on 100,000 data points and 20 features. A hint of an elbow at around 28 clusters does not amount to anything significant in terms of precision/recall.

When selecting clusters in terms of fraud content, results were rather disappointing, yielding much lower performance than k-means. After some investigations we narrowed down the cause for that to the presence of categorical variables. As we limited the number of categorical features, results slowly approached those of k-means, but without the possibility to run experiments with a large cluster number, we quickly parked this option. Perhaps the greater importance of the numerical features in the data has biased this conclusion, but it seems that there are no benefits in using k-prototypes over k-means for our problem.

Self-organizing maps

Self-organizing maps (SOMs) are a form of unsupervised learning, whereby a relatively simple neural network maps a higher-dimensional input space onto a lower-dimensional (usually 2D) lattice. The biggest difference compared to more conventional neural networks ‘learning’ through backpropagation is that SOMs are trained by utilizing competitive learning, whereby the lattice neurons (the elements of the low-dimensional map) ‘compete’ for the assignment of every training example and the best-matching unit wins. At the end of training each node of the map will generally have many training examples mapped onto it and the same mapping rules can then be used for inference. Essentially, each neuron of the output layer can be viewed as a cluster to which data points are assigned.

SOMs have two main strong points. First, they are fast to train and do inference with, even with many hundreds or thousands of output nodes. Secondly, they make for cool and, more importantly, telling visualizations, since dimensionality reduction is an inherent aspect of this method rather than a post-processing step, as it is, for example, with k-means. On the flip side, they only take numerical features as input, which turns out not to be such a big issue in our case, as explained in the k-prototypes section right above. Two libraries implementing SOMs in Python were tested: Sompy mainly, and MiniSom to a lesser extent.

Choosing a square of 25x25 nodes as output we can train a SOM network within seconds, using tens of thousands of data points, on a conventional computer. Invoking the labels, we can then look at the mapping result.
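
A minimal MiniSom sketch of that setup (X is the normalized feature matrix and y the fraud labels; both names and all parameter values are placeholders, not the exact configuration we ran):

```python
import numpy as np
from minisom import MiniSom

n_rows, n_cols = 25, 25
som = MiniSom(n_rows, n_cols, X.shape[1], sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(X, num_iteration=10_000)

# Per-node counts and fraud fractions, i.e. the ingredients of the heat maps below.
counts = np.zeros((n_rows, n_cols))
frauds = np.zeros((n_rows, n_cols))
for sample, label in zip(X, y):
    i, j = som.winner(sample)                    # best-matching unit for this application
    counts[i, j] += 1
    frauds[i, j] += label
fraud_fraction = np.divide(frauds, counts, out=np.zeros_like(frauds), where=counts > 0)
```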

Heat maps for number and fraction of fraud cases, per output node for an elementary SOM training.

The good news is that there is structure in the output in terms of where fraud is mapped. Not only are the nodes containing many frauds not randomly distributed across the map, they also largely coincide with areas of high fraud potency. For a first run, this is promising.

With very little effort we can strengthen our insight and see what sort of (normalized) feature values correspond to each output node of the map.

Heat maps of feature values at each output node of the SOM.

By correlating these feature maps with the fraud maps, right above, we can gain a better understanding of what the main indicators of fraud are. And of course, we can go even deeper and correlate features to specific types of fraud, but this is beyond the scope of this post. The bottom line is that SOMs are pretty handy and make for insightful visualizations.

But how do they fare where it really matters, separating frauds from non frauds? Well, they are pretty good at that too.
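
Since each output node acts as a cluster, the same cumulative scoring used for k-means applies directly. A minimal continuation of the earlier SOM sketch (reusing som, n_cols, X, y and the cumulative_precision_recall helper defined above):

```python
import numpy as np

# Flatten each (row, col) best-matching unit into a single cluster id,
# then score nodes exactly like clusters.
node_ids = np.array([i * n_cols + j for i, j in (som.winner(sample) for sample in X)])
precision, recall = cumulative_precision_recall(node_ids, y)
```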

Precision/recall curves as a function of accumulated output nodes chosen, for a 50x50 node map and a run with ~600,000 samples. Similarly to clusters, for more conventional algorithms, the nodes here are ordered in terms of fraud potency.

Generalization to the test set is not bad at all, especially considering the dimensions of the map.

Test-set evaluation results in precision-recall curves that closely resemble those of the train set.

As we see, results are pretty similar to the ones we get from k-means. SOMs also offer the flexibility of choosing a high-recall or a high-precision approach (provided there is a large enough number of output nodes) and come with the added benefit of faster run times, and almost out-of-the-box visualizations.

Gaussian-mixture models

Gaussian-mixture models (GMMs) are very similar to k-means, but with some crucial differences. Instead of the N-dimensional spheres implicitly assumed by k-means, clustering is based on N-dimensional gaussian distributions, each with its own mean and covariance. Furthermore, the cluster that a data point belongs to is determined by the gaussian component with the highest (weighted) probability density at that point.

Using the scikit-learn implementation we have experimented with different numbers of components (clusters) and covariance types. We have found that about one hundred components (similar to k-means) and “full” covariance give the best results. However, the computational cost is higher than for k-means and comes with no greater reward. Results on a training set of 200,000 samples, with 5 features and 100 mixture components, are as follows.

Precision/recall trade off for the train set used in a GMM run. Results are almost as good as k-means and SOM. However, they take longer to produce.
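
For completeness, the setup above boils down to a minimal scikit-learn call (a sketch; X_train and X_test are placeholder names), after which the components are scored exactly as the clusters were for k-means:

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=100, covariance_type='full',
                      max_iter=200, random_state=0)
gmm.fit(X_train)
train_clusters = gmm.predict(X_train)    # component with the highest responsibility
test_clusters = gmm.predict(X_test)      # the same components are reused for inference
```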

Similar results are reproduced on a test set, which underlines the generalizing power of this algorithm, as long as the number of components remains reasonable, around 100.

GMM results on the test set look very similar to the train set.

In summary, despite the decent performance of this approach, it does not offer any advantage over the previous ones. In fact, due to its slower run times, exploring all the parameters that the scikit-learn implementation has to offer is not feasible on a PC within a reasonable time frame.

(H)DBSCAN

DBSCAN (density-based spatial clustering of applications with noise) is an algorithm that groups data points based on local-density estimates, rather than on a distance measure from a given point, such as the centre of a cluster. In this way, data points that are quite far apart may be placed in the same cluster, whereas others that are relatively near one another may end up in different clusters. It produces very nice results, especially when applied to non-globular groups of data.

Some of the main selling points of DBSCAN are the flagging of outliers and its dynamic determination of the number of clusters, both greatly easing the practitioner’s job. Another interesting aspect of the scikit-learn implementation is the ability it offers to select from a long list of distance measures to be used in the density calculation, such as “L1“, “cosine“, “Hamming“ and many others, next to the more conventional euclidean distance measure.

Visualization of L1 (or Manhattan, or taxicab) distance. The blue line is the euclidean distance between the two black dots. The green lines represent different paths that have the same L1 distance. Most of the nicknames of this metric allude to how a person, or car would travel through square city blocks in an urban setting. Image taken from the Wikipedia page on Taxicab geometry (colours modified).

Early results of running DBSCAN were promising. With very few tweaks of the initialization parameters, the points flagged as outliers contained more than half of all frauds while making up only 10% of all the data. An important ingredient in this result was the metric: with ‘canberra’ (a weighted L1, essentially) we got better results than with other L1 variants or different metrics. However, DBSCAN has one important drawback: you cannot do inference with it based on previous trainings. In scikit-learn lingo, it does not have a predict function. This is where HDBSCAN comes to the rescue.

HDBSCAN (where the “H” stands for hierarchical) is a more recent variation of DBSCAN which allows for clusters of varying density. This difference is expressed through a distinct set of parameters that take some tweaking to uncover the algorithm’s full potential. More importantly, it comes with a predict function! Its parameters take some getting used to, but they make for some interesting results.
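
A minimal sketch contrasting the two (all parameter values are illustrative, not the ones used in our experiments; X_train and X_test are placeholder names):

```python
import hdbscan
from sklearn.cluster import DBSCAN

# DBSCAN: train-time clustering and outlier flagging only (no predict method).
db = DBSCAN(eps=0.5, min_samples=10, metric='canberra')
db_labels = db.fit_predict(X_train)
outlier_fraction = (db_labels == -1).mean()      # noise points carry the label -1

# HDBSCAN: prediction_data=True enables inference on unseen data later on.
clusterer = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=10, prediction_data=True)
clusterer.fit(X_train)
test_labels, strengths = hdbscan.approximate_predict(clusterer, X_test)
```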

Precision/recall trade off for a training set of 100,000 samples. The jump at around 630 clusters corresponds to including the outliers and results in throwing away more than half the data while retaining all frauds.

The plot above is the result of a compromise between generalization and performance. However, as we see in the test-set plot below, the trained algorithm fails to closely replicate that training-set behaviour on the test set.

Test-set results using the approximate_predict() method of HDBSCAN. A humble 20%/20% precision/recall is achievable, stressing the difficulty in training models that generalize nicely on the test set.

In light of some much better results we could get on the training set with different parameters, the weak point of this algorithm is generalization. Perhaps unsurprisingly, the best training-set performance was achieved by allowing clusters of down to about 5 samples, resulting in many thousands of clusters that then failed to generalize to the test set. An interesting parameter of HDBSCAN is cluster_selection_epsilon, which, in a sort of post-processing manner, merges smaller clusters at the training stage. However, when trying to do inference after training with this parameter, results were impossible to reproduce, even on the training set itself. Concluding, HDBSCAN has great potential for one-off use, both through clustering and outlier determination, but it fails to generalize to unseen data, and generalization is precisely what we are after.

Comparison

In the following table we will undertake the difficult task of comparing all the algorithms we presented, based on 4 criteria:

  • Performance in terms of precision/recall metrics
  • Generalization to the test set
  • Ease of use, including set up, parameters and visualizations
  • Computational resources
*test-set generalization was not thoroughly investigated for k-prototypes, given its performance.

The most important aspect we were looking for when testing these algorithms was good performance that can be replicated on a test set. This is important in order to be able to use a trained model for inference, perhaps as a first step in a machine-learning pipeline, followed by some sort of supervised algorithm. However, despite being secondary, ease of use and resource consumption are also important, especially when experimenting on a laptop and wanting to explore the different possibilities that the parameters of these algorithms present. In that light, it is SOMs and k-means that come out on top, with k-means showing marginally better generalization on the synthetic data, but SOMs offering a great all-around experience without really lagging behind in the performance department.

If we have to pick a single winner, so to speak, it would probably be SOMs: the nature of the mapping lends itself naturally to accurate (no extra dimensionality reduction) and insightful visualizations, combined with excellent run-time performance and practically no weakness in predictive power.

Insights

For k-means, SOMs and gaussian mixtures we were able to explore how the generalization capabilities of the trained models evolved as a function of the number of clusters. For all of them we noticed some degradation as the number of clusters increased. This is to be expected, since large cluster numbers imply few cases per cluster, which makes it easy to cherry-pick clusters in the training set with very high precision and accumulate them to build up recall. However, there is no guarantee that this will be as easy on a test set.

On the other hand, relatively high cluster numbers are necessary to have a smooth transition from high-precision to high-recall cluster selections. Which approach will be selected depends on the use case, ranging from automating part of the KYC verification pipeline (high precision is necessary in this case), to having clustering as the first step in a machine-learning pipeline (where a high-recall approach would be favoured). What is certain is that having the option to go for either is desirable.

Perhaps the most important insight we got from trying out k-prototypes is that the numerical features of our data sets are more important than the categorical ones. This conclusion has been confirmed by subsequent supervised-learning experiments. The implication is that k-prototypes should not be dismissed in general, since there may be data sets where categorical and numerical features coexist and are equally important. For those cases k-prototypes would be one of the very few options for clustering.

Finally, it is worth pointing out that, when circumstances allow, (H)DBSCAN can be a very powerful algorithm, not least because it is the only one analyzed in this post that offers distance metrics other than the euclidean one out of the box. This would not have been noted were it not for the L1 distance (and its variants) producing better results than the euclidean metric. Given that observation, combining an L1-type metric with a nicely generalizing algorithm, for example k-means, presents itself as an attractive exercise. And that is exactly what we did, by combining the k-means implementation of scikit-learn with scikit-learn’s definition of the Manhattan distance. The results from these experiments easily matched the ones we get with a euclidean metric without, however, improving on them.
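
One simple way to pair the two, and not necessarily the exact recipe behind the experiment above, is to fit the centroids as usual and then assign points to the centroid that is nearest in Manhattan distance (X_train and X_test are placeholder names):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

km = KMeans(n_clusters=200, n_init=10, random_state=0).fit(X_train)
l1_distances = pairwise_distances(X_test, km.cluster_centers_, metric='manhattan')
l1_clusters = np.argmin(l1_distances, axis=1)    # nearest centroid under the L1 metric
```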

Conclusion

We have presented an introduction to KYC and one of its main goals, which is to prevent fraudulent applications from succeeding in opening a bank account. We motivated the use of, and focused on, some of the unsupervised-learning algorithms that we tested at Fourthline in order to train algorithms into fraud detectors, using historical data. We showed how each of these algorithms performs and provided insights into their applicability to the problem at hand and beyond.

However, the story does not end here. Unsupervised learning is just one of the things we have tried and, as we shall see in a future post, supervised methods can outperform the algorithms we presented here. So, stay tuned for the follow-up article.

Vuk Glisovic has performed part of the analysis and modelling presented in this post. Sebastian Vater has supervised this project. The entire AI team at Fourthline has contributed through useful discussions and suggestions.
