Multi-Domain Fraud Detection While Reducing Good User Declines — Part II

Nitin Sharma
The PayPal Technology Blog
8 min read · Oct 15, 2020

In Part I of this two-part series, we outlined approaches for multi-domain fraud detection while also optimizing for feature stability in post-deployment scenarios. This post demonstrates approaches that incrementally reduce declines of good users on top of the existing model architecture.

Photo by Brooke Lark on Unsplash

Fraud detection can involve multiple objectives: not only do fraudulent transactions need to be declined with a high catch rate, but good users also need to be accurately identified and approved for the best customer experience. Treating a false positive as an instance of a good-user decline, traditional machine learning can reduce such declines through cost-function adjustment: following a first round of optimization, a second-stage training process is introduced that optimizes for false-positive reduction by increasing the penalty on such instances using a weight w, as follows:
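A representative form of such a weighted objective (shown here as a sketch, assuming a generic per-example loss ℓ over fraud labels y_i ∈ {0, 1}; the exact production formulation may differ) is

$$\mathcal{L}(\theta) = \sum_{i:\, y_i = 1} \ell\big(f(x_i; \theta), y_i\big) + w \sum_{i:\, y_i = 0} \ell\big(f(x_i; \theta), y_i\big), \qquad w > 1$$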

The above equation is simply one of several possible weighting schemes, wherein the second term is weighted higher (w > 1) so the learning process optimizes for a reduction in false positives. This approach might involve iterative/incremental training of the model to fine-tune the weight and find the right balance between fraud catch rate and false-positive rate.
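As a minimal illustration of this second-stage re-weighting (a sketch using scikit-learn with synthetic data and an illustrative grid of weights, not the production setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical training data: X (features), y (1 = fraud, 0 = good user).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=5000) > 1.2).astype(int)

def false_positive_rate(y_true, y_pred):
    """Share of good users (y_true == 0) that are declined (y_pred == 1)."""
    good = y_true == 0
    return (y_pred[good] == 1).mean()

# Sweep candidate penalties w on good-user (negative-class) errors.
for w in [1.0, 2.0, 5.0, 10.0]:
    # Up-weighting negatives makes declining a good user more costly.
    sample_weight = np.where(y == 0, w, 1.0)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y, sample_weight=sample_weight)
    y_pred = model.predict(X)
    print(f"w={w:5.1f}  catch rate={recall_score(y, y_pred):.3f}  "
          f"good-user decline rate={false_positive_rate(y, y_pred):.3f}")
```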

With the AI/ML state of the art evolving rapidly in computer vision, text/NLP, and speech, adapting such methods to the problem of fraud detection often involves re-formulation while drawing meaningful analogies and connections. In the subsequent sections, we demonstrate how the two methods listed below are adapted to improve identification of good users in the fraud detection context, after the associated model has been sufficiently optimized for fraud catch rate:

  1. Online Hard Example Mining For Mini-Batch Generation
  2. Transfer Learning Using Generative Modeling Contexts

Online Hard Example Mining

The core idea of Online Hard Example Mining (OHEM) originates from the training of region-based object detectors in computer vision¹ ². We will first introduce the general problem and its applicability to the object detection domain, and then draw the analogy with the good-user identification problem in fraud modeling, while also describing the algorithm/training process.

Background and Context from Object Detection

Images comprise multiple objects of interest, in either the background or the foreground, that might need to be correctly identified.

(Image Source: Jin et al., 2018²)

In dynamically evolving video frames, one way the problem of object detection (for example, detecting a face or a walking pedestrian) is approached is by dividing each frame into a large number of regions and learning to identify each region separately. Objects of interest are annotated in a training set of images, and traditional image classification models are trained to detect them. Such classification is challenging because the sparse regions depicting the object of interest might be surrounded by a large number of regions containing starkly different background objects with vaguely resembling characteristics (such as shape or color). Thus, training datasets are characterized by a large class imbalance between the annotated objects of interest and those in the background. At inference time, the classifier can incorrectly detect a hand as a face, or a parked motorbike as a walking pedestrian, as shown in the cohort of video frames above.

Several bootstrapping schemes have been suggested to improve the performance of region-based object detectors such as Fast R-CNN, R-FCN (Region-based Fully Convolutional Networks), and FPN (Feature Pyramid Networks), but they involve a large number of hyperparameters and heuristics. Shrivastava et al. (2016)¹ introduce a novel online bootstrapping approach that mines "hard examples" (false positives, such as incorrectly identified facial profiles) and significantly reduces the number of hyperparameters while maintaining consistent precision. With a bootstrapping scheme that works well with gradient-descent approaches, the method also demonstrates increased effectiveness on growing and increasingly complex training sets (where the object of interest is obfuscated by a large number of similar background objects). In effect, OHEM modifies traditional SGD by replacing uniform random sampling of mini-batches with sampling from a non-uniform, non-stationary distribution that favors high-loss examples.
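A minimal, framework-agnostic sketch of this sampling modification (assuming PyTorch and a per-example loss criterion; names are illustrative and not taken from the cited papers' code):

```python
import torch

def ohem_sgd_step(model, optimizer, criterion, x, y, top_k=64):
    """One SGD step that backpropagates only through the hardest examples.

    Instead of a uniformly sampled mini-batch, per-example losses are computed
    on a larger candidate pool and only the top_k highest-loss ("hard")
    examples are kept for the backward pass. `criterion` is assumed to return
    one loss value per example (e.g., reduction="none").
    """
    model.eval()
    with torch.no_grad():                                  # score the pool
        per_example_loss = criterion(model(x), y).view(-1)
    k = min(top_k, per_example_loss.numel())
    hard_idx = torch.topk(per_example_loss, k).indices     # hardest examples

    model.train()                                          # learn from them only
    optimizer.zero_grad()
    loss = criterion(model(x[hard_idx]), y[hard_idx]).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```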

Applicability to Identification of Good Users

Just as hard examples are incorrectly identified facial profiles in the object detection problem, in the fraud modeling context they are good-user declines, or false positives. We extend the modeling process described in Part I of this blog series with an exhaustive gap analysis to identify sub-populations or cohorts with higher declines of good users.

Following the first-pass training of the multi-task model described in Part I, two passes are made through the network. The first pass involves freezing the network and conducting a forward pass on the sampled data set to compute fraud risk propensity scores as well as custom-defined losses. After rank-ordering the examples by fraud risk propensity, we identify a decline region/rule using business-defined metrics that balance risk appetite against approval volume for different segments. In the second pass, a mini-batch of false positives, i.e., good users who were declined under the decline-region rule, is identified. Such mini-batches are further weighted by segmentation, domain type, and thresholds to optimize for higher accuracy of dollar-weighted fraud detection. Subsequent backpropagation fine-tunes the weights based only on these mini-batches. Effectively, the two passes first identify potential good users who are likely to be declined and then fine-tune the weights to improve classification performance for good users.
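A simplified sketch of this two-pass scheme (hypothetical function and threshold names; the production pipeline additionally applies segment-, domain-, and dollar-based weighting to the mined mini-batch):

```python
import torch

def mine_good_user_declines(model, optimizer, criterion, x, y,
                            decline_threshold=0.90):
    """Two-pass update that fine-tunes only on declined good users.

    y == 1 marks fraud and y == 0 marks good users; `decline_threshold` is a
    stand-in for the business-defined decline region on the risk propensity
    score. `criterion` returns one loss value per example.
    """
    # Pass 1: freeze the network and score the sampled data set.
    model.eval()
    with torch.no_grad():
        risk_score = torch.sigmoid(model(x)).view(-1)      # fraud risk propensity

    # Good users falling inside the decline region are the "hard" examples.
    mined = (risk_score >= decline_threshold) & (y.view(-1) == 0)
    if not mined.any():
        return None

    # Pass 2: backpropagate only through the mined mini-batch of false positives.
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x[mined]).view(-1), y.view(-1)[mined].float()).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```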

Online Hard Example Mining For Good User Decline Reduction

Transfer Learning Using Generative Modeling

We introduce another approach for accurate identification of good users, using a generative modeling framework. In a typical discriminative fraud detection model, the idea is to separate good users from fraudsters, i.e., to predict the conditional probability P(Y = 1 | X). In contrast, a generative model attempts to learn the multivariate feature distribution (or behavior) of a specific cohort of users (say, good users), capturing the joint probability P(X, Y), or simply P(X) for a known population of good users. The idea here is to characterize what type of data/behavior (X) is generated by good users (or fraudsters).

We adapt the generative context by first identifying the population of good users declined by the multi-domain modeling framework, based on the score threshold defined for the decline regions. We then compare this population with the one declined by a competing champion ensemble of models and identify the overlap. These are instances of good users who were declined by the broader consensus of multiple models with similar objective functions but different training methods/algorithms. The goal is to learn the distribution of those good users who consistently get declined by multiple models; these users are the most prone to erroneous classification, likely because their feature distribution lies in close proximity to that of fraudsters.
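As a hedged illustration of the overlap step (column names, scores, and thresholds below are hypothetical stand-ins, not the production schema):

```python
import pandas as pd

# Hypothetical frame of labeled transactions with scores from both systems.
df = pd.DataFrame({
    "txn_id":             [1, 2, 3, 4, 5],
    "is_fraud":           [0, 0, 0, 1, 0],
    "multi_domain_score": [0.95, 0.40, 0.92, 0.97, 0.91],
    "champion_score":     [0.93, 0.35, 0.20, 0.99, 0.94],
})

MULTI_DOMAIN_DECLINE = 0.90   # stand-ins for the business-defined
CHAMPION_DECLINE = 0.90       # decline-region thresholds

# Good users declined by the multi-domain framework ...
declined_by_new = (df.is_fraud == 0) & (df.multi_domain_score >= MULTI_DOMAIN_DECLINE)
# ... and also declined by the champion ensemble: the consensus false positives.
declined_by_champion = (df.is_fraud == 0) & (df.champion_score >= CHAMPION_DECLINE)

consensus_good_user_declines = df[declined_by_new & declined_by_champion]
print(consensus_good_user_declines.txn_id.tolist())   # e.g. [1, 5]
```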

After identifying these users, the problem is formulated as multivariate feature-distribution learning using a cascade of autoencoders. Even though variational autoencoders can be used to learn mixtures of data distributions, the standard training approach assumes that the training instances of financial transaction data are independent and identically distributed. Thus, to account for covariances between samples and to induce post-deployment robustness in classifying good users, a cascade of a stacked de-noising autoencoder and a Gaussian-process-prior variational autoencoder⁴ is trained on good-user declines. The generative cascade learns the distribution of good users who are most likely to be declined by a strong cohort of discriminators. We utilize this cascade in two distinct ways (a minimal code sketch follows the list below):

  1. At inference time, the incoming transaction is scored by both the discriminative model (which outputs a risk propensity score) and the generative model (which provides a reconstruction error). If the reconstruction error for the transaction is below a pre-defined threshold (auto-tuned through several rounds of hyperparameter tuning and based on business-defined objectives), the incoming transaction is likely a good user with the potential to be declined by the discriminator. This is further verified by assessing whether the risk propensity score from the discriminator exceeds the decline threshold. In our experience, it is important to pick a low-enough threshold for the reconstruction error, so that the bar to identify a good-user decline is high and an override of the discriminator is effected only in a small proportion of cases.
  2. At training time, the learned representations from the generative model are utilized as features to further improve good-user identification accuracy in the ensemble of neural networks and the multi-domain frameworks described in the preceding sections.
Transfer Learning Using Generative Frameworks For Good User Decline Reduction
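
To make the two uses concrete, below is a minimal sketch assuming PyTorch: a de-noising autoencoder (the first stage of the cascade; the Gaussian-process-prior VAE stage⁴ is omitted for brevity) trained on the consensus good-user declines, followed by the inference-time combination rule from item 1 above. Layer sizes, thresholds, and names are illustrative, not the production values.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """First stage of the cascade: learns to reconstruct good-user features."""
    def __init__(self, n_features, hidden=32, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x):
        # Corrupt the input during training so the model learns robust structure.
        if self.training:
            x = x + self.noise_std * torch.randn_like(x)
        return self.decoder(self.encoder(x))

def train_generative_model(x_good_declines, n_epochs=50, lr=1e-3):
    """Fit the autoencoder on features of consensus good-user declines."""
    model = DenoisingAutoencoder(x_good_declines.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(n_epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x_good_declines), x_good_declines)
        loss.backward()
        optimizer.step()
    return model

def score_transaction(discriminator, generative_model, x,
                      decline_threshold=0.90, recon_error_threshold=0.05):
    """Inference-time override: decline only if risky AND not a likely good user.

    `x` is a single-transaction feature tensor of shape (1, n_features).
    """
    generative_model.eval()
    with torch.no_grad():
        risk_score = torch.sigmoid(discriminator(x)).item()
        recon_error = torch.mean((generative_model(x) - x) ** 2).item()

    looks_like_good_user_decline = recon_error < recon_error_threshold
    if risk_score >= decline_threshold and not looks_like_good_user_decline:
        return "decline"
    return "approve"
```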

Conclusion

Fraud detection is a complex problem characterized by the diversity and obfuscation of fraud patterns, the heterogeneity of e-commerce (hallmarked by the variety of purposes for which transactions are conducted), and the temporal evolution of the underlying populations. In our experience, cross-domain learning not only facilitates learning multiple fraud patterns simultaneously, which enriches the underlying detection context, but also allows discovery of representations otherwise unexplored in silo-oriented, one-domain-at-a-time learning paradigms. Utilizing the notion of constraint-based clustering, we control for over-fitting on a specific type of fraud pattern that might otherwise dominate due to its greater contribution to the training and evaluation datasets. Further, cross-stitch units allow tighter fine-tuning and control, letting the model parameters adjust in a limited way based on performance gaps revealed in iterative stages of the gap analysis. Robust feature discovery is further augmented by noise-induced representation learning and by representations learned over different time durations. Lastly, the methods of online hard example mining (where a good user who is declined is treated as a "hard" example) and generative modeling are integrated into the multi-task learning framework to reduce declines of good users, thereby facilitating swift approvals.

Overall, the combination of techniques described above yields a robust relative increase in catch rate of about 5%, as well as a relative reduction of more than 10% in the false-positive ratio (effectively, the good-user decline rate). The models were also tested for temporal stability through confidence-interval-based bootstrapping methods and meet the robustness criteria as well. Broadly, such criteria are defined by metrics such as the proportion of times (over different time durations, such as weekly or monthly) model performance drops below an acceptable, business-defined threshold, as well as several measures of variability in model performance.

Please subscribe to our blog on Medium or reach out to us at ai-blog@paypal.com with any questions.

References

[1] Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training Region-based Object Detectors with Online Hard Example Mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[2] Jin, S., RoyChowdhury, A., Jiang, H., Singh, A., Prasad, A., Chakraborty, D., & Learned-Miller, E. (2018). Unsupervised Hard Example Mining From Videos for Improved Object Detection.

[3] Uijlings, J., & van de Sande, K. (2013). Selective Search for Object Recognition.

[4] Casale, F., Dalca, A., Saglietti, L., Listgarten, J., & Fusi, N. (2018). Gaussian Process Prior Variational Autoencoders.
