On the Road to Detect Financial Crime — Isolation Forest Robustness

Roel Bertens
Published in ING Blog
9 min read · Apr 7, 2020

An enormous number of transactions are made every day. ING takes its responsibility, and also has the obligation, to prevent financial crime, so there is a great need to identify which of these transactions are linked to such crime. These transactions usually differ from ‘regular’ transactions, yet the question is: how can we find them without manually reviewing every single one? The number of transactions that have already been identified and linked to financial crime (we call this labelled data) is very low. This makes learning and generalizing from these labels very difficult. If we were to build a classification model, it could only learn from these known historical cases. However, it is very likely that this historical data also contains cases of crime that are not labelled as crime, because they were overlooked. To uncover these cases, we can resort to another kind of machine learning called anomaly detection.

By definition, anomaly detection is an unsupervised technique, which means it doesn’t use labelled data. As a result, it is hard to assess its performance. Of course, one can manually inspect the predicted anomalies, but what about all the other transactions? Since these likely make up more than 99% of the data, we can only sample from them to study what we have missed. But even then, we would probably need a prohibitively large sample, since the expected percentage of crime-related transactions is very low.

In conclusion, when using an anomaly detection model, it is important to have a good understanding of what it can and cannot capture. In other words, when does it work well and when doesn’t it? As a first step, we chose to experiment with the anomaly detection algorithm called Isolation Forest (more details in the next section). By generating synthetic data we can control the size and shape of anomalous and normal data points, so that we actually have a fully labelled data set to assess the model’s performance¹. This will help us understand the robustness of the Isolation Forest algorithm.

The outline of this blog is as follows. First, we briefly describe Isolation Forest and the setting in which we want to evaluate it. We then describe the data generation process, followed by an explanation of how anomalies are injected. Thereafter, we describe the performance of Isolation Forest on the generated data sets and how to tune the algorithm towards better performance. We conclude with a summary of our findings and the questions that are left open to explore.

Isolation Forest vs. high dimensional data

In this blog we zoom in on the Isolation Forest algorithm, which aims to identify anomalies that are, by definition, ‘few and different’. In short, the algorithm scores data points based on how quickly (i.e. with how few splits) they are separated from the rest of the data in random decision trees. The fewer splits needed to isolate a data point, the higher its anomaly score. The authors of the Isolation Forest paper describe it as follows.

.. our proposed method takes advantage of two anomalies’ quantitative properties:
i) they are the minority consisting of fewer instances and
ii) they have attribute-values that are very different from those of normal instances.

We will discuss how well the algorithm performs when dealing with high dimensional data. More specifically, can it correctly identify anomalies in the presence of noise (irrelevant features)?

While the authors claim that the algorithm’s performance scales well with many irrelevant features, this is hard to validate on real-world data sets, since labels are often sparse and subjective. Therefore, we will use synthetic data to better understand the performance of Isolation Forests. In the remainder of this blog, the main question we want to address is: how robust is an Isolation Forest when adding many irrelevant features?
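To make the scoring behaviour concrete, here is a minimal sketch using scikit-learn’s `IsolationForest` (the implementation whose hyperparameters we refer to later in this blog); the toy data below is made up purely for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(500, 2))   # a dense cluster of 'regular' points
X_outlier = np.array([[8.0, 8.0]])           # one point far from the cluster

clf = IsolationForest(random_state=0).fit(np.vstack([X_normal, X_outlier]))

# The outlier is isolated in few splits, so its decision value is low;
# predict() returns -1 for points flagged as anomalies.
print(clf.decision_function(X_outlier))
print(clf.predict(X_outlier))
```

The point far from the cluster needs far fewer random splits to end up alone in a leaf, which is exactly the ‘few and different’ property the algorithm exploits.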

Data generation

When generating synthetic data, there is a plethora of choices to make.
We focus on categorical data, since the data we work with is often largely categorical. The data generation process is based on a probabilistic graphical model, which allows us to control the structure (dependencies between features) in the data.

First, a graph is constructed for which the conditional probabilities will be chosen randomly. Thereafter, data is generated by sampling based on the dependencies defined by the probabilistic graphical model.

In the figure below we can see the graph used to generate synthetic data. The arrows from the first (r1) and second (r2) root nodes to the first child (c1) imply that the value of the first child (c1) will depend on the values of the root nodes (r1 and r2).

Graphical model used to generate synthetic data

For example, suppose that nodes r1, r2 and c1 each have two possible values, A and B. Then node c1 contains 8 conditional probabilities, describing the probability of observing a value for c1 given the values of r1 and r2:

  • P(c1=A | r1=A, r2=A) and P(c1=B | r1=A, r2=A)
  • P(c1=A | r1=A, r2=B) and P(c1=B | r1=A, r2=B)
  • P(c1=A | r1=B, r2=A) and P(c1=B | r1=B, r2=A)
  • P(c1=A | r1=B, r2=B) and P(c1=B | r1=B, r2=B)

Here P(c1=A | r1=A, r2=A) describes the probability of observing value A for c1 when you have already observed value A for both r1 and r2. Note that each of the four pairs above sums to one. That is, given the values of both r1 and r2, we know that c1 must be either A or B.
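A minimal sketch of this sampling step, for just the two root nodes and the first child (the probabilities below are arbitrary illustrations, not the ones used in our experiments, which are chosen randomly):

```python
import numpy as np

rng = np.random.default_rng(42)
values = ["A", "B"]

# Marginal distributions for the root nodes.
p_r1 = [0.7, 0.3]
p_r2 = [0.4, 0.6]

# Conditional probability table for c1: P(c1 | r1, r2).
# Note that each row sums to one.
cpt_c1 = {
    ("A", "A"): [0.9, 0.1],
    ("A", "B"): [0.2, 0.8],
    ("B", "A"): [0.5, 0.5],
    ("B", "B"): [0.1, 0.9],
}

def sample_row():
    # Sample the roots first, then the child conditioned on them,
    # following the arrows in the graph above.
    r1 = rng.choice(values, p=p_r1)
    r2 = rng.choice(values, p=p_r2)
    c1 = rng.choice(values, p=cpt_c1[(r1, r2)])
    return r1, r2, c1

data = [sample_row() for _ in range(1000)]
```

Sampling parents before children (a topological order of the graph) generalizes directly to the full six-node graph in the figure.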

Anomaly injection

The next step is to inject anomalies into the data. Based on a contamination factor (the percentage of data points that is anomalous), data points are modified to represent anomalies. This is done by setting one or more features to values that are unique to anomalies.

For example, consider feature c1 with a cardinality (the number of possible feature values) of 50. When we want to transform an observation into an anomaly, we can change its value on feature c1 to a value unique to anomalies (e.g. somewhere between 50 and 60). For each anomaly we change a subset of its feature values. As a result, the generated data contains anomalies that are ‘few and different’.
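The injection step can be sketched as follows (a simplified stand-in: uniform integer-encoded categoricals instead of the graphical-model data, with made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, cardinality = 1000, 6, 50

# Normal data: integer-encoded categorical values in [0, cardinality).
X = rng.integers(0, cardinality, size=(n_rows, n_features))
y = np.zeros(n_rows, dtype=int)  # ground-truth labels: 1 = anomaly

contamination = 0.02        # 2% of data points become anomalies
n_changed = 3               # number of features modified per anomaly
anomalies = rng.choice(n_rows, size=int(contamination * n_rows), replace=False)

for i in anomalies:
    cols = rng.choice(n_features, size=n_changed, replace=False)
    # Values in [cardinality, cardinality + 10) never occur in normal data.
    X[i, cols] = rng.integers(cardinality, cardinality + 10, size=n_changed)
    y[i] = 1
```

Because the injected values lie outside the normal range, we keep a fully labelled data set (`y`) against which the model’s predictions can be scored.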

How many features to change?

Considering the example graph above, the first question we want answered is:
On how many features do we need to change a data point’s values for it to be identified as an anomaly by Isolation Forest?

In the figure below, we can see the results of some repeated experiments; the boxplots visualize the variation over the repeated runs. The figure shows that after changing the values of one or two features, anomalies are already quite easily separated; from three features onwards, the separation is captured perfectly by Isolation Forest. In practice, this means that data points only have to differ on a few features for Isolation Forest to be able to detect them (in a setting with few features).

The more different the anomalies the better the Isolation Forest’s performance
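An experiment of this shape can be sketched roughly as follows (a simplified stand-in for our setup: uniform categorical features instead of the graphical-model data, a single run per setting instead of repeats, and ROC AUC as the metric):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_rows, n_features, cardinality = 2000, 6, 50

aucs = {}
for n_changed in range(1, n_features + 1):
    # Fresh data set with 40 injected anomalies, each modified on
    # `n_changed` randomly chosen features.
    X = rng.integers(0, cardinality, size=(n_rows, n_features)).astype(float)
    y = np.zeros(n_rows, dtype=int)
    anomalies = rng.choice(n_rows, size=40, replace=False)
    for i in anomalies:
        cols = rng.choice(n_features, size=n_changed, replace=False)
        X[i, cols] = rng.integers(cardinality, cardinality + 10, size=n_changed)
    y[anomalies] = 1

    clf = IsolationForest(random_state=0).fit(X)
    # score_samples is higher for normal points, so negate it.
    aucs[n_changed] = roc_auc_score(y, -clf.score_samples(X))

print(aucs)
```

Plotting `aucs` over `n_changed` (with repeated runs per setting) gives a figure in the spirit of the boxplots above.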

Is Isolation Forest robust to noise?

In our example, when changing values on all six features for anomalies, it is easy to separate them with Isolation Forest (since these data points are very different). However, in a setting with higher dimensional data we expect many features that don’t contain useful information. In other words, these features will have similar values for both anomalies and normal data points. Such features can be regarded as noise. We will test the effect of adding such noise (many irrelevant features) to the data. For our example this means that we add many features with different probability distributions that are not related to any of the other features. Again, these features contain no information to decide whether a data point is an anomaly or not.
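Adding such irrelevant features can be sketched like this (again with integer-encoded categoricals; the cardinalities are drawn at random so that each noise feature gets its own distribution):

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_noise = 1000, 500

# Six informative features, standing in for the example graph above.
X_informative = rng.integers(0, 50, size=(n_rows, 6))

# Irrelevant features: each sampled independently of everything else,
# so they look the same for anomalies and normal data points.
cardinalities = rng.integers(2, 51, size=n_noise)
X_noise = np.column_stack([rng.integers(0, c, size=n_rows) for c in cardinalities])

X = np.hstack([X_informative, X_noise])
```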

In the figure below we can observe that the performance of the Isolation Forest (default hyperparameters) goes down significantly with the added noise.

Isolation Forest’s performance drops significantly when adding noise (irrelevant features)

Checking assumptions

Given the data generation process, we know that even with 1000 irrelevant features the anomalies should be easy to separate, since they have unique feature values. Reflecting on the definition, the anomalies are ‘few and different’. A simple PCA decomposition confirms this claim. In the figure below we can see that it is easy to separate the small group of anomalies from the normal cases if we look at the first two components.

First two PCA components show that anomalies are ‘few and different’

When plotting the Isolation Forest predictions (default hyperparameters, same data with 1000 irrelevant features) on these components, we can see that it is not performing well.

Visual representation (on first two PCA components) of low quality predictions with Isolation Forest

Tuning the algorithm

In the Isolation Forest paper the authors state:

Empirically, we find that setting sample size to 256 generally provides enough details to perform anomaly detection across a wide range of data.

Unfortunately, for our generated data sets the default hyperparameters for Isolation Forest do not seem to work well. We’ve tried to mimic a realistic setting while generating data. More experiments are needed to better understand when Isolation Forest can be expected to perform well.

Next, we explore whether we can improve performance by tuning the model parameters. We use the same data generation process as before, with 500 irrelevant features. In the first plot below we can observe what happens when we change the sample size parameter: performance steadily increases as the sample size grows. In the second plot we see that runtime scales approximately linearly with the sample size. If the runtime allows it, as a rule of thumb it makes sense to also run your algorithm with bigger sample sizes.

Performance steadily increases when increasing `max_samples` parameter
Runtime also steadily increases when increasing `max_samples` parameter
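A sketch of such a tuning experiment, under simplified assumptions (6 informative features plus 500 noise features, anomalies changed on three features, single run per setting):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_rows = 2000

# 6 informative features plus 500 irrelevant ones.
X = rng.integers(0, 50, size=(n_rows, 506)).astype(float)
y = np.zeros(n_rows, dtype=int)
anomalies = rng.choice(n_rows, size=40, replace=False)
# Out-of-range values on the first three features mark the anomalies.
X[np.ix_(anomalies, [0, 1, 2])] = rng.integers(50, 60, size=(40, 3))
y[anomalies] = 1

aucs = {}
for max_samples in [256, 512, 1024, n_rows]:
    clf = IsolationForest(max_samples=max_samples, random_state=0).fit(X)
    aucs[max_samples] = roc_auc_score(y, -clf.score_samples(X))
print(aucs)
```

Timing each fit (e.g. with `time.perf_counter`) alongside the AUC reproduces both plots above.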

We ran a similar experiment for the `max_features` parameter, which didn’t seem to have a clear effect on performance. Another experiment, for the `n_estimators` parameter, shows a small bump when it is increased from 100 to 200 in our setting. Again, when time allows it, using more estimators seems a good choice.

Concluding

Regarding the question of Isolation Forest robustness, we conclude from our experiments that there is a limit to the number of features you want to feed into your model (i.e. features of which you are not sure whether they are relevant).

Furthermore, we found it very important to tune the model’s hyperparameters, most importantly the sample size. Unfortunately, in real life you often don’t have labels, which makes it much more difficult to know which parameter settings to use. Therefore, playing around with synthetic data can help you better understand when to use which settings. Hopefully this blog gives you a head start in using Isolation Forest to detect anomalies in the search for financial crime.

Open questions

Exploration often yields more questions and this time is no different. Below we’ve listed a few of our follow-up questions:

  • How do other anomaly detection algorithms compare to Isolation Forest in these settings?
  • How will Isolation Forest perform if we run similar experiments on real world data sets (to which noise is added)?
  • How will Isolation Forest perform when dealing with continuous (or mixed) data?

Feel free to clone the code repository and experiment yourself. If you do, please keep us posted!

Co-author of this blog: Nathan Bijleveld

¹ In this blog ROC AUC is used as a performance measure, since labels are available for the synthetic data. Note that it can be deceiving to use classification performance metrics, since the concept of an anomaly is quite subjective. Using an anomaly detection algorithm is not the most suitable approach for a classification task. In a setting where labels are collected to score your anomaly detection algorithm, we suggest moving to an active learning approach. That is, use the identified anomalies to find interesting cases to label, in order to improve a classification algorithm.
