Fraud Detection: Anticipating the fraudsters

David Gamba
Mar 4, 2019


Welcome to part two of our short series describing the most important highlights of our analysis of the PaySim fraud detection dataset. This time, we discuss how to make the machine learning models that predict fraud robust. To get the full picture and context, you're welcome to review the introduction, where you can also find links to the other parts of this series. The code can be found in the git repository; the relevant file for this article is /fraud_models/robust_models.Rmd.

The results of our first iteration are bittersweet: despite the good model performance and the clear characterization of fraudulent transactions, the results are flimsy, as minimal changes in the behavior of the fraudsters could easily derail model predictions. Our objective is therefore to improve the current models by considering the possible responses of the fraudsters in the system, so that we end up with a model that stays reliable over time.

First, we must look back at our current characterization of fraudulent transactions and determine what the course of action of a fraudster could be once they realize that most, if not all, of their transactions are being blocked. For example, practically all frauds occur on the critical line where the amount of the fraudulent transaction equals the original balance of the victim, or to put it plainly, the fraudsters just take all the money and run. This begs the question: what would be the effect of reducing the amount just by a little? What would the model predict in that scenario? To answer this, we took 100 fraudulent transactions identified by the model at random and reduced their amount by just 1, an operation that implies changing not only the maximum amount feature, but also the balances and new amounts in both the origin and destination accounts, attributes that the models also weigh heavily.
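
Below is a minimal sketch of how such a perturbation can be built; the column names follow the PaySim data, while `test_set`, the `predicted` column, and the helper itself are illustrative assumptions rather than the exact code in robust_models.Rmd.

```r
# Illustrative helper: reduce the amount of a fraudulent transaction and propagate
# the change to the balances and to the engineered maximum_transfer flag.
perturb_fraud <- function(df, reduction = 1) {
  df$amount           <- pmax(df$amount - reduction, 0)
  df$newbalanceOrig   <- df$oldbalanceOrg - df$amount    # money left behind at origin
  df$newbalanceDest   <- df$oldbalanceDest + df$amount   # less money arriving at destination
  df$maximum_transfer <- df$amount == df$oldbalanceOrg   # flips to FALSE for any reduction > 0
  df
}

# Take 100 frauds the model had caught (predicted is an assumed column of model outputs)
# and reduce their amount by just 1.
set.seed(42)
caught  <- test_set[test_set$isFraud == 1 & test_set$predicted == 1, ]
altered <- perturb_fraud(caught[sample(nrow(caught), 100), ], reduction = 1)
```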

It turns out that the true positive rate (TPR) is 0 for models such as decision trees. This result is obvious when looking at the generated model: the first split node is maximum_transfer, and on the false side of this split the tree predicts legitimate transactions only.

Results of testing 200 altered fraudulent transactions

Results are similar for logistic regression and even for ensembles such as random forests. This bleak outcome stems from the fact that the most important feature for the models represents whether, in a given transaction, the amount equals the original balance of the origin account. In consequence, even a minuscule difference between amount and oldbalanceOrg makes the maximum_transfer condition false. Simply put, the models are too dependent on a feature that is easy to modify.

Let's review the issue at hand: from what we've seen in the previous article of the series, it is possible to train almost perfect models that can detect fraud; however, the performance of these models is highly susceptible to minimal changes in the most important features (keep in mind that here we are only describing one such feature, maximum_transfer, in detail). Clearly, the objective of the client is not to have a working model for only a short period of time. One method that could work is developing new models once many undetected frauds have gone by, yet such a reactive methodology is doomed if all models can be made easily fallible, as we have seen here. In addition, following Juan's discussion in the previous article, this approach could lead to an ultimately ineffective cat and mouse game. A better approach is to look for more robust models, flexible enough to accommodate variation in the attackers' modus operandi.

Following this train of thought, we tested four different ideas:

  1. Changing models. We'll use different, slightly more complex models that still include the feature maximum_transfer, but that we hope do not rely too much on it: models that include feature subsampling, like random forests or XGBoost, or models that constrain the parameters dependent on that feature (regularized logistic regression).
  2. No maximum_transfer. A very simple model with only maximum_transfer is no longer possible. We try to develop models without the dependency on the maximum_transfer feature by removing it and replacing it with its original feature, amount. This also implies adding features to the models.
  3. Making maximum_transfer more robust. Change the definition of our star feature so that it becomes more robust, or replace it with a similar feature that closely relates to it.
  4. Fraudster shift simulation in train. Simulate changes in the behavior of attackers in the training data, as a sort of data augmentation.

Approach 1: Changing models

When training a simple decision tree on this data, it turns out that the maximum amount is always selected for the first split. This of course has the issue that a change to maximum_transfer completely derails what comes next in the tree. Even for logistic regression a similar issue persists: the weight assigned to the coefficient accompanying the maximum amount term is too big, and a change in maximum_transfer heavily affects the estimated probability of fraud.
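
For reference, fitting and printing a simple tree makes this visible. The snippet below is an illustrative re-creation, assuming the engineered feature set described in this series; it is not the exact model in the repository.

```r
library(rpart)

# Fit a plain classification tree on the engineered features.
tree <- rpart(isFraud ~ maximum_transfer + amount + oldbalanceOrg + newbalanceOrig +
                oldbalanceDest + newbalanceDest,
              data = train_set, method = "class")

# Printing the tree shows maximum_transfer picked for the very first split, with the
# FALSE branch predicting the legitimate class almost exclusively.
print(tree)
```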

One such approach is ensemble models that use subsampling of features, such as random forests. A random forest is a group of decision trees, each trained with a different, independent subset of the data. A single tree can receive a subset of the data points (subsampling/bagging) and also a subset of the features of the model (feature subsampling). The latter is key to reducing the strong dependency of the model on a single feature. Applied to the fraud case, one tree of the forest, say A, receives the maximum_transfer feature for training while another, say B, does not. Tree B will not have any information from the maximum transfer feature and is expected to form completely different patterns to determine whether a transaction is fraudulent or legitimate. This multiplicity in ways to arrive at the answer gives random forests some resilience to changes in features.

However, the results defy expectations. We tested a group of fraudulent transactions that only had the maximum_transfer feature altered, from TRUE to FALSE of course. When these observations were presented to the random forest, the true positive rate (fraud detection performance) dropped close to 0. From this it follows that feature subsampling and ensembles are clearly not enough to make the predictions robust.
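
The test itself is simple to reproduce in spirit; here is a hedged sketch, assuming `rf_model` is a randomForest classifier already trained with isFraud as a factor.

```r
library(randomForest)

# Flip the engineered flag on the known frauds of the test set, nothing else.
frauds <- test_set[test_set$isFraud == 1, ]
frauds$maximum_transfer <- FALSE

# Score the altered frauds and measure how many are still caught.
preds <- predict(rf_model, newdata = frauds, type = "response")
tpr   <- mean(preds == "1")   # true positive rate on the altered frauds
tpr                           # drops close to 0 in our experiments
```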

Results for random forest.

Approach 2: No maximum_transfer

Another strategy we can try to ameliorate the problem with maximum_transfer is to remove the feature completely. This feature was engineered to encapsulate a slightly complex pattern into a single, easily explainable feature. However, we can just remove it and trade simpler explanatory models for slightly more complex alternatives that keep superb levels of predictive accuracy, that is, as long as the data remains unchanged. Such was the situation we found when training our first models with little feature engineering. Nonetheless, this idea ignores the underlying problem: the distribution of data for fraudulent transactions could be purposefully altered. These complex models may well be learning the same boundary determined by maximum_transfer through the variables used to create it: the old balance of the origin account and the amount involved in the transaction.

In a test where we reduced the amount of fraudulent transactions in the test set by different values, ranging from 1 to 10M, we found that for various models, relatively slight reductions in amount (which imply changes in balances) effectively reduce the AUC from 0.99 to around 0.75. Some models, such as decision trees, could still handle changes in amount of less than 1k. However, an attacker that just got access to an account with more than 10M in balance would gladly give up those 1k from the amount, given that the alternative is being detected and transferring nothing. On the chart we plot the total money lost in the test set due to fraud, whose construction can be seen in the repository.
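
The sweep can be sketched as follows, reusing the perturb_fraud helper from earlier; pROC is used for the AUC, and the model object and the exact grid of reductions are assumptions.

```r
library(pROC)

reductions <- c(1, 10, 100, 1e3, 1e4, 1e5, 1e6, 1e7)

auc_by_reduction <- sapply(reductions, function(r) {
  # Alter only the fraudulent rows, keep the legitimate ones untouched.
  altered <- perturb_fraud(test_set[test_set$isFraud == 1, ], reduction = r)
  scored  <- rbind(altered, test_set[test_set$isFraud == 0, ])
  # Probability of the fraud class, assuming a classifier with a "prob" predict type.
  probs <- predict(rf_model, newdata = scored, type = "prob")[, "1"]
  as.numeric(auc(scored$isFraud, probs))
})
```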

Performance of conditional inference tree with reductions in amount for fraudulent transactions

Approach 3: Making maximum_transfer more robust

What if we still want to retain the explainability provided by the maximum transfer feature while having a model robust to changes in the modus operandi of the attackers? Accordingly, we could use an attribute that is similar to, or represents, the maximum transfer, or we could even modify the definition of what it means for a transaction to fulfill the maximum transfer condition. Certainly, the notorious downside of the boolean feature maximum transfer is that a minimal change in amount is enough to flip the value of the feature. If we consider the actions of a fraudster, it makes sense to think that they will try to keep that reduction to a minimum; in other words, they would try to extract an amount as close as possible to the maximum without making it exactly the maximum transfer amount possible. This small difference can still be quantified. Thus, we can construct a feature that encodes the difference between the amount extracted and the maximum amount allowed. There are of course various ways to encode this relationship.

One of these ways to encode "how close to the maximum allowed transfer" is the percentage of the amount extracted with respect to the maximum transfer value allowed. However, this attribute has a few problems: as the majority of transactions for some reason have amounts greater than the old balance of the origin, the relevant percentage values below one are obscured. However, a small modification to the feature is possible: what if we send all the values greater than 1 to 0? It is clear that fraud is implausible for values greater than 1, or for values where the amount is 0. We should stress that here we are making the assumption that this won't happen in the future, and we have yet to confirm this with the business.
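
A sketch of the feature, including the "values greater than 1 go to 0" adjustment (the function name is ours; the repository may construct it differently):

```r
# perc_max: how close the transferred amount is to the maximum transferable balance.
add_perc_max <- function(df) {
  perc <- ifelse(df$oldbalanceOrg > 0, df$amount / df$oldbalanceOrg, 0)
  # Amounts above the available balance are implausible for fraud in this data,
  # so we collapse them to 0 to keep the interesting (0, 1] range visible.
  df$perc_max <- ifelse(perc > 1, 0, perc)
  df
}

train_set <- add_perc_max(train_set)
test_set  <- add_perc_max(test_set)
```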

Chart of the proportion of legitimate/fraudulent transactions by values of perc_max, which could be interpreted as the conditional probability P(fraud|perc_max).

Another issue is that, once again, when the percentage for a given transaction is not exactly one, the model will predict the transaction as legitimate. For a tree, we get the same behavior as when the feature encoding the maximum amount was binary. A test with logistic regression indicates that even though the attribute is numeric, the feature behaves as if it were just the maximum amount boolean.

Let us shift focus to logistic regression, first without the perc_max feature. Despite keeping the AUC extremely high, relatively small reductions in amount of around 10k for new test observations hit the model hard, reducing the true positive rate to only 57%. Again, this indicates that the threshold for classification varies with the amount reduced, which is an issue if we need hard classes, because we don't know beforehand the amount that is going to be reduced in order to set the detection threshold properly. In production, a scheme that only cares about ranking (say, give me the 100 most likely fraudulent transactions) might work; hard classification with specific thresholds, however, becomes complicated.
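
A minimal sketch of this ranking-only use of the logistic model follows; the feature list here is an assumption, and the real preprocessing lives in the repository.

```r
# Logistic regression without perc_max or maximum_transfer.
logit <- glm(isFraud ~ amount + oldbalanceOrg + newbalanceOrig +
               oldbalanceDest + newbalanceDest,
             data = train_set, family = binomial)

probs <- predict(logit, newdata = test_set, type = "response")

# Ranking scheme: "give me the 100 most likely fraudulent transactions",
# no hard classification threshold required.
top_100 <- test_set[order(probs, decreasing = TRUE), ][1:100, ]
```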

To get details on the features and preprocessing used for this logistic model, please refer to the code.

Performance of logistic regression without perc_max feature.

Now, with perc_max included, the AUC and TPR are more stable up to reductions of around 10k, where the model behaves quite similarly to what we had before. On the downside, this model turns out to be worse when the amount the fraudster is willing to give up is greater than 1M, which is an issue if we are trying to protect huge accounts. We also still have the issue that the TPR falls abruptly after some amount, indicating that the threshold changes with the amount the fraudster is willing to sacrifice. For these experiments, the model without the feature encoding the percentage of maximum protects better against fraud cases, a result that could potentially ease the processing of fraud claims.

A slightly similar yet different approach is to use the difference between the amount and the maximum transfer allowed; on top of that, we can take any function of that difference, for example the difference squared, which could even make more sense as a quantification of the distance to the maximum transfer allowed. This follows from the observation that for high-balance accounts it is even more unlikely that the owners will extract all of their money, and it is also the case that fraudsters have more wiggle room to reduce the amount they are going to extract. Therefore, we can penalize larger differences in amount harder by squaring the difference. We did not test these alternatives with the squared distance, but please feel free to fork our project and try them yourself; a possible starting point is sketched below.
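
The two variants could look like this (untested, and the names are ours):

```r
# Difference to the maximum transfer allowed, and its squared version, which
# penalizes larger gaps harder. Note the raw difference is negative whenever the
# amount exceeds the available balance.
add_diff_features <- function(df) {
  df$diff_to_max    <- df$oldbalanceOrg - df$amount
  df$diff_to_max_sq <- df$diff_to_max^2
  df
}
```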

So far we have used numeric attributes to encode the concept of being close to the maximum transfer; however, we have lost the simplicity of the first attribute, maximum_transfer, a simple yet powerful indicator of fraud. To accommodate the numerical features, models such as trees have to be a little more complex. Interpretation of the results also gains some complexity, going from having or not having the condition to how large the difference is. One option that may work to keep the original boolean feature is to alter the definition of the maximum percentage so that it becomes broader; we can even discretize the numeric attributes that we just tested to generate this modified version of `maximum_transfer`. For example, we can choose a certain threshold on the maximum percentage feature, perc_max. Of course, the choice of this threshold requires careful thought. We can even devise a scheme with varying thresholds to binarize the variable depending on the amount: for small amounts the threshold could be close to 1, while for large amounts the maximum percentage threshold can be relaxed, to accommodate the extra room that an attacker has when tampering with large accounts.
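
A sketch of such a scheme, with made-up cut points purely for illustration:

```r
# Broadened boolean version of maximum_transfer: the perc_max threshold is relaxed
# as the balance at stake grows, since attackers on large accounts can afford to
# leave more behind. The cut points below are arbitrary examples, not tuned values.
near_max_transfer <- function(perc_max, oldbalanceOrg) {
  threshold <- ifelse(oldbalanceOrg < 1e5, 0.99,
               ifelse(oldbalanceOrg < 1e6, 0.95, 0.90))
  perc_max >= threshold & perc_max <= 1
}

train_set$near_max <- near_max_transfer(train_set$perc_max, train_set$oldbalanceOrg)
test_set$near_max  <- near_max_transfer(test_set$perc_max,  test_set$oldbalanceOrg)
```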

Approach 4: Fraudster shift simulation in train

The approaches until now have a small caveat: during training, the model does not "experience" any of the changes that happen later in the test data. After model deployment, there will likely be a change in the behavior of attackers when they notice that their transactions are being blocked by the system, and this change in behavior is not represented in the training data. At the beginning, an attacker might not know what features the model is looking at; however, it would be imprudent to assume that the model can be kept secret from attackers, especially with the original model involving maximum_transfer being this simple. Soon enough attackers will notice that the critical factor for the detection of their illegitimate endeavors is whether they take all the money from the account.

We could then think of letting the model see examples of future fraudulent behavior, as current models overfit in a sense: they only account for training examples that transfer the maximum possible. It makes sense to include different examples in the model creation that present frauds which do not extract the maximum amount, as a sort of data augmentation. However, we don't know what the distribution of change in amount that attackers may try will be; still, we can start with simple ones. Let's assume that the distribution of the amount an attacker is willing to give up is uniform between some number k and the maximum amount. This means that random amounts are subtracted from the fraudulent training data.
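
One possible reading of that augmentation is sketched below; the floor k, the uniform draw on the transferred amount, and the helper itself are assumptions, and the repository may implement the reduction differently.

```r
# For fraudulent training rows, redraw the transferred amount uniformly between a
# floor k and the full origin balance, then propagate the change to the balances
# and the maximum_transfer flag.
augment_frauds <- function(df, k = 1000) {
  idx     <- which(df$isFraud == 1)
  max_amt <- df$oldbalanceOrg[idx]
  floor_k <- pmin(k, max_amt)                 # never ask for more than the balance
  new_amt <- runif(length(idx), min = floor_k, max = max_amt)

  df$amount[idx]           <- new_amt
  df$newbalanceOrig[idx]   <- max_amt - new_amt
  df$newbalanceDest[idx]   <- df$oldbalanceDest[idx] + new_amt
  df$maximum_transfer[idx] <- new_amt == max_amt   # almost never TRUE after augmentation
  df
}

set.seed(42)
train_augmented <- augment_frauds(train_set)
```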

Under this approach, the results seem promising: the models have high AUC under different amount reductions and also maintain relatively constant TPR and TNR even when the amount is greatly reduced. This means that even attackers willing to give up a lot of their bounty will be detected by the model. The results are even more encouraging considering that we tested with a logistic regression, which under the previous approaches, such as the attribute changes, behaved poorly, reducing the TPR to values close to 57% depending on the amount reduction, a situation that no longer occurs in this model.

There are nonetheless some issues with this approach. First, the model assumes a certain distribution of the change in amount that attackers may take, yet the real distribution of change in amount is unknown and will remain so until the attackers start changing their modus operandi. To test this, we could try different distributions for the change in amount in the test set to see what works. Second, the model still has some issues regarding interpretability, despite being a simple logistic regression. For the data augmentation shown, no engineered features such as perc_max were added; maybe by using these features the interpretation could be made easier while retaining high performance metrics. Keep in mind that we also only used logistic regression; it remains to be seen whether other models, such as decision trees, behave nicely with data augmentation. Regardless, this is the best approach we found in our exploration, as it gives high performance and small amounts of money lost.

What have we done?

Evaluating all the different approaches was an interesting journey. In conclusion, we observe that data augmentation in train goes a long way toward keeping the model robust, even though it is not the definitive solution to adversarial changes. Still, it is interesting to see the possibilities we have for improving the robustness of models. Maybe the data augmentation approach can even be combined with the others we have seen in order to obtain robust models that still use easily explainable variables, an experiment that you could try.

We still have to think of the shortcomings and limitations of the different approaches:

We tried to anticipate the fraudsters taking into consideration just a single change in the modus operandi: changing a single variable (amount). Even though this change is a highly plausible one, the attackers have many variables at their disposal that they could change, or other ways they could alter their modus operandi, such as the original balance of the destination account. Further analysis needs to account for changes in other variables, maybe by reaching solutions similar to the ones we have proposed. In a wider analysis, the simulation approaches may not work as well when many variables change at once, but this remains to be tested.

Another shortcoming of the proposed approaches is that the models are static; even though they consider future actions, the models do not change to accommodate shifts in the pattern of fraudulent transactions. Of course, this implies a very different approach to model building, considering that some sort of retraining or online training and evaluation needs to be performed regularly. How would these models accommodate change? Sadly, these approaches are too broad to cover in this article, but we can think of analyzing these cases in a future article.

An interesting question related to the previous idea is how to actually detect changes in the fraudsters' pattern. Maybe evaluation of the performance of the model in production could be used; that is, if there are again many complaints of fraud over time, that could signify that the fraudsters are fooling the model. Nonetheless, there are other approaches that, for example, look at the distribution of the data over time to determine whether the data has changed patterns and the model has to change.
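
As a toy example of that second idea, one could compare the distribution of a monitored feature in recent production data against the training data; a two-sample Kolmogorov-Smirnov test is just one possible check, and `recent_transactions` is an assumed object here.

```r
# Flag a possible behavior shift when the recent perc_max distribution no longer
# matches the one seen at training time.
drift <- ks.test(train_set$perc_max, recent_transactions$perc_max)
if (drift$p.value < 0.01) {
  message("perc_max distribution has shifted; consider retraining the model")
}
```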

Overall, we saw that despite some shortcomings, many approaches are available to tackle an issue such as a lack of robustness in machine learning models, and we have only skimmed the surface of the possibilities! We hope that you keep experimenting on this dataset with innovative approaches that take the models (and the fraudsters) to the limit.
