Deep learning for fraud detection in retail transactions

An overview of challenges and deep learning-based solutions

Ahmad Khodayari
Walmart Global Tech Blog
14 min read · Nov 11, 2020


With the ever-growing trend of online shopping, e-commerce and e-retail, an unprecedented number of new online users and new channels for online shoppers have created more opportunities for fraud and abuse. To understand the scope and complexity of the problem, consider an example: given the possibility of identity theft, account takeover or a stolen credit card, how can we determine whether a new or existing customer making a purchase, or returning an item and getting a refund (either online or in-store), is truly legitimate?

Fraud detection is an example of anomaly detection, a broader topic in machine learning and artificial intelligence (AI), and it suffers from the uncertainty in defining an anomaly (or outlier) and the difficulty of outcome verification and performance monitoring. Most anomaly detection methods can be cast as unsupervised or semi-supervised learning problems; however, if we have enough labeled or verified data in our database to learn from, then supervised learning can be used to build the detection model.

Apart from the enormous data volume and complexity of financial transactions, fraud detection has a few more challenges to overcome, summarized in the next section. In this post, we focus on artificial neural network (ANN) or deep learning-based solutions for fraud detection in retail transactions and look into a few topics in the field. As in many other machine learning problems, there is the promise of performing better with deep learning (DL) methods. The DL benefits include a much lower need for feature engineering and better learning and performance as we accumulate more and more data.

What are the challenges in detecting fraud?

As can be seen from the example above, solving most fraud and abuse detection problems requires access to a lot of current as well as historical information about the customer and the transactions, including the customer’s shopping profile, the relationship between the customer making the purchase and other customers, and where and how they made any purchases or returns in the past. Then there is the issue of how and when to know for certain that a purchase is fraudulent. Another issue is that classical solutions for fraud detection require a lot of data preprocessing and feature engineering steps. Specifically,

  • Building a fraud detection model is a machine learning task that deals with a high-dimensional feature space. After preprocessing, quantization and feature embeddings, we end up with a large number of input attributes or features of various kinds. We get these features from the current transaction combined with attributes generated from historical data about the customer and all related past transactions and relevant events.
  • One of the main challenges in financial fraud detection is that only a very small percentage (much less than 1%) of transactions are fraudulent: an imbalanced machine learning problem. As a consequence, it is difficult to learn to identify fraud cases with high accuracy while at the same time limiting the false positive rate (FPR). Merely focusing on increasing the fraud detection (or true positive) rate results in a higher FPR. If a normal transaction is predicted by mistake to be fraud (i.e., a false positive), this results in lost sales, but more importantly it leads to customer dissatisfaction followed by further involvement of customer service (i.e., a higher labor cost) to investigate the case and help the customer, who is upset and feels insulted. Therefore, a detection model with a higher FPR translates to a higher customer insult rate and possibly results in losing some legitimate customers. Note that not all customer insults are followed up by a call to customer service.
  • Another challenge in building or developing a fraud detection model is the data labeling problem (required for supervised or semi-supervised learning), i.e., determining if a transaction was actually fraud.
  • The other issue regarding data labeling is that fraud comes in various types, forms and degrees of complexity. Furthermore, fraudsters continuously find new ways to commit fraud, so our historical database might lack such fraud or abuse patterns. In such cases, an unsupervised or semi-supervised anomaly detection model might help.
  • When performing fraud detection, the output of the detection model for a given transaction could be a continuous ‘fraud score’ or fraud probability, and we usually need to pick a threshold in order to convert it into a final decision (a small threshold-picking sketch follows this list).
  • Currently deployed or working systems for financial fraud detection include a combination of several rule-based decisions, several steps of data preprocessing including extraction of useful attributes and network information from historical data, and finally a significant amount of feature engineering before feeding data to detection models based on machine learning (popular unsupervised clustering methods include ‘k-means’, ‘isolation forest’ and ‘subspace clustering’; popular supervised methods include ‘logistic regression’, ‘random forest’, ‘gradient boosting machine’ and ‘XGBoost’). In general, this means a high maintenance cost.
  • Even though a deep learning-based solution still needs data preprocessing and feature embeddings, its need for feature engineering is lower than that of classical (non-deep learning) solutions. The problem with deep learning detection models is the difficulty of training or building them, and usually a higher latency in generating a prediction, which makes it harder to scale in operation or production mode to serve a huge number of transactions per second.
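As a small illustration of the thresholding step mentioned above, the sketch below picks a score threshold that would flag roughly a target fraction of known-normal validation transactions. The helper name, placeholder scores and target FPR value are hypothetical, not taken from any deployed system:

import numpy as np

def pick_threshold_for_fpr(scores_normal, target_fpr=0.005):
    """Pick a score threshold so that roughly `target_fpr` of known-normal
    validation transactions would be flagged (hypothetical helper)."""
    # The (1 - target_fpr) quantile of normal scores: anything above it is flagged.
    return np.quantile(scores_normal, 1.0 - target_fpr)

# Toy example: fraud scores from a validation set of normal transactions.
rng = np.random.default_rng(0)
val_normal_scores = rng.normal(loc=0.1, scale=0.05, size=10_000)

threshold = pick_threshold_for_fpr(val_normal_scores, target_fpr=0.005)

# At serving time, a transaction is flagged when its score exceeds the threshold.
new_scores = np.array([0.08, 0.12, 0.35])
decisions = new_scores > threshold
print(threshold, decisions)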

Deep learning-based solutions

We will review a few methods for fraud detection in more detail in the remainder of this blog, including encoder-decoder structure, generative adversarial networks, other semi-supervised methods, supervised methods, and transfer learning.

Encoder-decoder structure or auto-encoder

Due to the abundance of unlabeled data as well as the difficulty and uncertainty in labeling data (fraud vs. normal), it is not unusual to cast fraud detection as an unsupervised or self-supervised anomaly detection problem, and an auto-encoder (AE) might be a good solution for it.

  • The basic objective in auto-encoders is to learn a compressed representation of data or to learn a generative model of ‘normal’ data. Then we can detect ‘abnormal’ inputs by checking whether the reconstruction error is beyond a threshold. An AE has an encoder part which compresses the input features into a bottleneck or latent-space vector, followed by a decoder part which generates an output similar to the input based on the compressed latent vector.
  • Since the AE learns a compact generating model from a large set of normal data, the output will most likely differ from the input when abnormal data is given as input. The reconstruction error (i.e., the difference between input and output) can be used as a fraud score for our purpose; a minimal sketch follows this list.
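As a rough illustration of this idea (not a production model), here is a minimal PyTorch sketch of a fully-connected auto-encoder whose per-transaction reconstruction error serves as the fraud score. Layer sizes, epochs and the random placeholder data are purely illustrative:

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Minimal fully-connected auto-encoder; sizes are illustrative."""
    def __init__(self, n_features=64, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Train on (mostly) normal transactions to reconstruct their feature vectors.
x_normal = torch.randn(256, 64)          # placeholder for preprocessed features
for _ in range(5):
    recon = model(x_normal)
    loss = nn.functional.mse_loss(recon, x_normal)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Fraud score = per-transaction reconstruction error; higher means more anomalous.
with torch.no_grad():
    x_new = torch.randn(4, 64)
    fraud_score = ((model(x_new) - x_new) ** 2).mean(dim=1)
print(fraud_score)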

There are a number of auto-encoder methods, and one of the most promising ones is the variational auto-encoder (VAE) [Kingma and Welling; see this as well].

Fig. 1. Basic structure of a variational auto-encoder (VAE). It includes probabilistic encoder and decoder structures. The model learns to reconstruct the input, and the output is controlled by the random latent vector Z, which is a low-dimensional representation of input.

VAE is a probabilistic generative extension of the initial AE structure. The VAE’s encoder estimates two compressed vectors: a mean and a standard deviation, representing the probability distribution of the data in the compressed low-dimensional space. The latent vector Z is sampled from this distribution to generate an output, which is very ‘similar’ to the input X when the data is normal, i.e., similar to the majority of data seen during VAE training.
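A minimal PyTorch sketch of this structure follows. The layer sizes and the standard ELBO loss (reconstruction error plus a KL divergence to a standard normal prior) are illustrative choices, not the exact configuration from the cited works:

import torch
import torch.nn as nn

class VAE(nn.Module):
    """Sketch of a variational auto-encoder for tabular transaction features."""
    def __init__(self, n_features=64, latent_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.to_mu = nn.Linear(32, latent_dim)       # mean of q(z|x)
        self.to_logvar = nn.Linear(32, latent_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, n_features),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * epsilon, epsilon ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    recon_err = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_err + kl

model = VAE()
x = torch.randn(128, 64)                  # placeholder for normal transactions
recon, mu, logvar = model(x)
loss = vae_loss(x, recon, mu, logvar)
loss.backward()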

  • VAE has been used for anomaly detection and fraud detection on several datasets [see, e.g., the works by Jinwon, Alazizi, Xu and their co-authors]. Note that each of the encoder and decoder parts could be a convolutional neural network (ConvNet, or CNN), a recurrent neural network (RNN), a fully-connected network (also called a feedforward network or multi-layer perceptron), or a combination of these, depending on the type of input data and the designer’s choice.
  • Another AE structure used for anomaly detection is the ‘robust deep auto-encoder’ [Zhou, et al., 2017], which promises to handle training data that are not cleanly normal, i.e., that might contain abnormal or outlier samples. The method is based on ‘robust PCA’ and works by splitting the input data X into two parts, X = L + S, where L can be efficiently reconstructed by the auto-encoder and S models the noise and outlier components, which are difficult to reconstruct, and then minimizing a regularized objective (or loss) function with two terms; a rough sketch follows this list.
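The following is a rough, hedged sketch of that alternating X = L + S decomposition, our own simplification rather than the authors’ exact algorithm. The shrinkage constant, training schedule and placeholder auto-encoder are illustrative assumptions:

import torch
import torch.nn as nn

def shrink(a, lam):
    # Soft-thresholding: the proximal operator of the l1 norm.
    return torch.sign(a) * torch.clamp(a.abs() - lam, min=0.0)

def robust_ae_fit(X, autoencoder, lam=0.1, n_outer=10, n_inner=50, lr=1e-3):
    """Rough sketch of the alternating X = L + S decomposition: L is the part the
    auto-encoder can reconstruct, S absorbs outliers and noise."""
    opt = torch.optim.Adam(autoencoder.parameters(), lr=lr)
    S = torch.zeros_like(X)
    for _ in range(n_outer):
        L = X - S                                  # part the AE should explain
        for _ in range(n_inner):                   # ordinary AE training on L
            loss = nn.functional.mse_loss(autoencoder(L), L)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            S = shrink(X - autoencoder(L), lam)    # residual's small entries -> 0
    return S                                       # large |S| entries flag outliers

# Usage with a placeholder auto-encoder and random data.
ae = nn.Sequential(nn.Linear(64, 8), nn.ReLU(), nn.Linear(8, 64))
S = robust_ae_fit(torch.randn(500, 64), ae)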

There are some challenges, though, in using auto-encoders, and they do not always work as expected. One limitation is that they might not work well when the input data is very noisy or includes lots of irrelevant or mostly random features or attributes, because the auto-encoder will have a hard time learning to reconstruct those attributes. Also, the decoding or data generation process is lossy and not perfect, mostly due to the compression used at the bottleneck, which might remove some useful pieces of information. Auto-encoders might also result in a sub-optimal solution to the problem, since the training of auto-encoders is usually done independently from the final decision stage where we report the detection (whether that is applying a simple threshold, or another machine learning model applied over the reconstruction error to come up with the final fraud decision).

Generative adversarial network (GAN)

GANs are an extension of VAE but do not require explicit probability density estimation. A GAN is made up of a generator and a discriminator competing with each other: the generator creates synthetic or fake data from a random and compact latent-space vector and tries to make it as similar to real data as possible, while the discriminator learns to identify whether its input data is real or fake, challenging the generator. The generator’s latent-space vector determines the distribution or properties of the generated data, [Goodfellow; see this as well].

  • GAN training usually comprises simultaneous optimization of multiple objectives (e.g., in the form of minimizing multiple weighted loss functions). For example, training may include minimization of a loss function that is a weighted or regularized combination of both the generation error and the discrimination error, but several GAN versions have been developed by researchers, which vary in how such optimization is done and may include other error or loss terms; a minimal training-loop sketch follows this list.
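For concreteness, here is a minimal sketch of the basic alternating GAN training loop on tabular feature vectors. The network sizes, learning rates and placeholder data are illustrative only:

import torch
import torch.nn as nn

n_features, latent_dim = 64, 16
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))
D = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

x_real = torch.randn(128, n_features)     # placeholder for real transactions
for _ in range(5):
    # 1) Discriminator step: push real samples toward label 1, fakes toward 0.
    z = torch.randn(x_real.size(0), latent_dim)
    x_fake = G(z).detach()
    d_loss = bce(D(x_real), torch.ones(x_real.size(0), 1)) + \
             bce(D(x_fake), torch.zeros(x_real.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator step: try to make the discriminator output 1 on fakes.
    z = torch.randn(x_real.size(0), latent_dim)
    g_loss = bce(D(G(z)), torch.ones(x_real.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()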

If we have some labeled data, we can use it toward better training of the discriminator model. In terms of using data labels in the training procedure, there are unsupervised, semi-supervised and supervised versions of GAN. In the supervised version, the discriminator part of the GAN learns to output class labels for the input data as well, or uses the labeled data as an additional (supervised) loss term during training.

Fig. 2. A semi-supervised GAN-based model for anomaly detection. The generator and discriminator networks are learned using a training dataset by optimizing a loss function that combines unsupervised and supervised loss terms, [He, et al., 2017].

For the data generator in a GAN, it is common to use the decoder part of a VAE structure, as we discussed before. Note that VAE is an unsupervised learning task for data generation, and GAN adds a supervised learning component, the discriminator, in order to challenge the VAE and help improve its performance by simultaneous learning of both the generator and the discriminator parts.

Fig. 3. Structure of a VAE-GAN model: The encoder and decoder parts make up a variational auto-encoder, which is trained in an unsupervised way using all normal data. The discriminator can be trained using normal data as well as any available labeled abnormal (fraud) data, [Kimura, et al., 2020].

In another method of using GAN for anomaly detection, called AnoGAN, for an input data sample X the latent space is searched for a sample Z so that the corresponding generated synthetic data and X are similar. After training, since the latent space covers the distribution of normal data, there will be a big difference between the real data and the generated fake data when the input is abnormal.
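Below is a simplified sketch of that latent-space search. It keeps only the residual term of the AnoGAN score (the discriminator feature-matching term of the original method is omitted), and the untrained placeholder generator and step counts are for illustration only:

import torch
import torch.nn as nn

def anogan_score(x, generator, latent_dim=16, n_steps=200, lr=0.01):
    """Simplified AnoGAN-style scoring: search the latent space for a z whose
    generated sample matches x; the remaining residual is the anomaly score."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):
        loss = (generator(z) - x).abs().sum()      # residual ||G(z) - x||_1
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()                             # large residual => likely abnormal

# Usage with a placeholder (untrained) generator; in practice G comes from GAN training.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 64))
score = anogan_score(torch.randn(1, 64), G)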

Fig. 4. GANomaly structure: G, E, Disc and A denote the generator, encoder, discriminator, and ‘anomaly score’ respectively. The ‘anomaly score’ in this method is the encoder loss in the ‘latent space’, i.e., the difference between Z and its reconstructed version, E(G(X)). The loss function used to optimize or train this structure is a combination of several parts, including the reconstruction loss, encoder loss and adversarial loss, [Akcay et al., 2018]. This structure is similar to the bidirectional GAN (BiGAN).

There are other versions of GAN which might be useful here. The conditional GAN (CGAN) model learns to generate samples conditioned on an extra input, such as the data label. In CGAN the data label vector is given to both the generator and the discriminator as a second input. After training, CGAN gives us the capability to generate the specific types of data we want.

Wasserstein GAN (WGAN) is an extension of GANs with the goals of interpretability and more stable training (see also this). WGAN uses the Wasserstein distance (or earth mover’s distance) to measure the distance between distributions. The earth mover’s distance is interpreted as the cost of transporting the probability mass of one distribution until it matches the other, and the two distributions do not need to overlap. A useful property of WGAN here is that it is able to generate categorical data, since some of the input features or attributes for fraud detection are categorical.
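As an illustration of the Wasserstein objective, here is a sketch of a single critic update in the original (weight-clipping) formulation of WGAN; gradient-penalty variants replace the clipping. The placeholder networks and hyperparameters are illustrative assumptions:

import torch
import torch.nn as nn

def wgan_critic_step(critic, G, x_real, opt_c, latent_dim=16, clip=0.01):
    """One critic update: maximize E[D(real)] - E[D(fake)] (written as a
    minimization), then clip weights to keep the critic roughly 1-Lipschitz."""
    z = torch.randn(x_real.size(0), latent_dim)
    x_fake = G(z).detach()
    loss = -(critic(x_real).mean() - critic(x_fake).mean())
    opt_c.zero_grad()
    loss.backward()
    opt_c.step()
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-clip, clip)
    return -loss.item()    # estimate of the Wasserstein distance (up to a constant)

# Usage with placeholder networks.
critic = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 64))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
wgan_critic_step(critic, G, torch.randn(128, 64), opt_c)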

Apart from using the ‘generator’ in GAN to learn the data generation model and the underlying compact latent space, once trained, the generator can also be used to create additional synthetic data to augment the training set. Note that, overall, real data is more valuable than synthetic data created by a GAN, but GANs can help augment data for certain conditions.

In general, compared to auto-encoders, GANs have more loss components to be optimized during the training process, can generate more realistic-looking synthetic data, and can be used to detect complex abnormalities. However, GANs are more difficult to train, and the training might not converge to an acceptable solution. Finally, since the basic version of GAN is trained with the main objective of data generation from a compact latent space rather than anomaly detection, the outcome might be suboptimal for our purpose here.

Other semi-supervised methods

There are several other methods for building semi-supervised as well as weakly supervised anomaly detection models:

  • Deviation networks [Pang, et al., 2019] use a large number of unlabeled data along with a small number of labeled anomalies and a Gaussian prior over anomaly scores to train the anomaly detection model. The idea is to use the prior probability information of normal data to make sure that anomalous data get anomaly scores that deviate significantly from those of normal data, i.e., scores located in the tails of the distribution (a loss sketch follows this list).
  • The semi-supervised anomaly detection method by Ruff and colleagues [2020] is based on minimizing an integrated or combined loss function that includes learning a representation of the unlabeled normal data as well as learning anomaly scores for a sparse or limited number of labeled anomaly samples, making sure anomalies end up further away from normal data. It builds on the idea that the entropy of the latent distribution should be lower for normal data than for abnormal data, so the method minimizes the entropy of the latent distribution for normal data while simultaneously maximizing it for abnormal data. The authors conclude that when some labeled data is available, a semi-supervised method employing those labels works better than building a completely unsupervised anomaly detection model (and ignoring all labels).
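To make the deviation-network idea concrete, here is a hedged sketch of a deviation-style loss: scores of unlabeled (mostly normal) data are pulled toward a Gaussian reference, while labeled anomalies are pushed at least a margin above it. The margin, reference-sample count and placeholder scores are illustrative assumptions, not the exact formulation from the paper:

import torch

def deviation_loss(scores, labels, margin=5.0, n_ref=5000):
    """Deviation-style loss in the spirit of [Pang et al., 2019]."""
    ref = torch.randn(n_ref)                       # Gaussian prior over scores
    dev = (scores - ref.mean()) / (ref.std() + 1e-8)
    normal_term = (1 - labels) * dev.abs()                     # label 0: stay near the prior
    anomaly_term = labels * torch.clamp(margin - dev, min=0.0) # label 1: deviate by >= margin
    return (normal_term + anomaly_term).mean()

# Usage: `scores` come from any scoring network applied to transaction features.
scores = torch.randn(32)
labels = torch.randint(0, 2, (32,)).float()        # mostly 0 in practice
loss = deviation_loss(scores, labels)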

Using a supervised method for fraud detection

If we have a large enough set of labeled historical data, where we know with high confidence for which data samples and when chargebacks happened, which transactions were fraudulent, when we had customer insult events, when return abuse cases happened, etc., then a supervised deep learning-based method for fraud detection can provide high accuracy and performance.

Since there are various kinds of input features or attributes (including point data as well as events and sequential data), two approaches can be taken regarding the feature extraction process:

  1. We could directly feed sequential data and event time series to a ConvNet or RNN deep learning model, in which case the feature extraction task will be done automatically by the deep learning model; this is an example of what is called an end-to-end machine learning design.
  2. Apart from information regarding the current transaction, we could also derive or extract various relevant attributes or features from sequential historical information as well as from past network or interaction information. Then we could feed all these extracted quantitative features into a deep learning model. Compared to traditional machine learning methods, deep learning models behave better on high-dimensional data. (A model sketch combining both kinds of input follows this list.)
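For illustration, here is a minimal PyTorch sketch of such a supervised model: a GRU summarizes the customer’s recent transaction sequence, a feed-forward branch handles current-transaction features, and a joint head outputs a fraud logit. The branch types, layer sizes and feature counts are placeholders, not an actual production architecture:

import torch
import torch.nn as nn

class FraudClassifier(nn.Module):
    """Sketch of a supervised fraud model with sequence and point-feature branches."""
    def __init__(self, n_seq_features=16, n_point_features=32, hidden=32):
        super().__init__()
        self.gru = nn.GRU(n_seq_features, hidden, batch_first=True)
        self.point = nn.Sequential(nn.Linear(n_point_features, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, seq, point):
        _, h = self.gru(seq)                 # h: (1, batch, hidden) summary of history
        joint = torch.cat([h.squeeze(0), self.point(point)], dim=1)
        return self.head(joint)             # logit; apply sigmoid for a fraud probability

model = FraudClassifier()
seq = torch.randn(8, 20, 16)                 # 8 customers x 20 past events x 16 features
point = torch.randn(8, 32)                   # current-transaction features
logits = model(seq, point)
loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.zeros(8, 1))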

To tackle the problem of an imbalanced training dataset, over-sampling of the fraud samples and under-sampling of the normal samples can be very helpful.
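One simple way to do this in PyTorch is a weighted sampler that oversamples fraud during mini-batch construction (an equivalent alternative is class weighting in the loss). The dataset below is a random placeholder with roughly 0.5% positives:

import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

torch.manual_seed(0)

# Placeholder dataset: roughly 0.5% fraud, as in a typical transaction stream.
X = torch.randn(10_000, 64)
y = (torch.rand(10_000) < 0.005).float()

# Give each sample a weight inversely proportional to its class frequency,
# so that mini-batches contain roughly as many fraud as normal examples.
n_fraud = max(int(y.sum().item()), 1)
n_normal = len(y) - n_fraud
weights = y * (1.0 / n_fraud) + (1.0 - y) * (1.0 / n_normal)

sampler = WeightedRandomSampler(weights, num_samples=len(y), replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=256, sampler=sampler)

xb, yb = next(iter(loader))
print(yb.mean())   # close to 0.5 instead of 0.005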

Transfer learning

If we have a DL-based model originally trained for a fraud detection problem or application with a large dataset, then we could use the model for another similar problem or application with only minor tuning of its parameters, even if the new problem has a small dataset. This is called transfer learning, transfer of learning, or ‘domain adaptation’. In other words, suppose our new fraud detection problem has insufficient data to train a new detection model from scratch, but it uses the same input data attributes. If we have a good, well-trained fraud detection model that works well for a different but similar problem, we can use that pre-trained model as the initial network and perform a limited amount of additional training or fine-tuning using the small dataset we have in the new problem. By doing so, we are generalizing the experience and transferring the knowledge learned from solving one problem (sometimes called the ‘source domain’) to another problem (the ‘target domain’) with insufficient training data, resulting in an improved performance that would not be possible otherwise [see, e.g., Zhuang and Lebichot].

  • There are various ways we can use transfer learning, ranging from unsupervised to supervised methods. In supervised methods, for example, since the initial layers of a DL model perform the feature extraction task, one way is to keep the initial layers of the original pre-trained DL model intact and only fine-tune or adapt the top layers (i.e., the layers close to the output) using our new but small dataset. Another way is to allow tuning all parameters or network layers of the original DL model but limit the amount of tuning by reducing the learning rate, by using different learning rates for each layer (in which layers close to the input get a very small learning rate), or by limiting the number of training iterations (a fine-tuning sketch follows this list).
  • Another area where transfer learning can be helpful is when we do data augmentation and use synthetic data, particularly for time series or sequential data. Note that anomaly events are very sparse in real life, and therefore generating synthetic anomalies can help the training process. For example, Wen and Keyes pre-trained a ConvNet model for anomaly detection on a large set of synthetic sequential data, then fine-tuned the model parameters on a small set of real data.
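To make the fine-tuning options above concrete, here is a small PyTorch sketch. The pre-trained network is a placeholder, and the layer split and learning rates are illustrative assumptions:

import torch
import torch.nn as nn

# Assume `pretrained` is a fraud model trained on a large source-domain dataset.
pretrained = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),    # early layers: generic feature extraction
    nn.Linear(128, 32), nn.ReLU(),
    nn.Linear(32, 1),                 # top layer: task-specific decision
)

# Option 1: freeze the early layers, fine-tune only the layers near the output.
for param in pretrained[:2].parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(
    (p for p in pretrained.parameters() if p.requires_grad), lr=1e-4)

# Option 2: tune everything, but with much smaller learning rates near the input.
optimizer = torch.optim.Adam([
    {"params": pretrained[0].parameters(), "lr": 1e-6},
    {"params": pretrained[2].parameters(), "lr": 1e-5},
    {"params": pretrained[4].parameters(), "lr": 1e-4},
])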

Concluding remarks

Deep learning models can help reduce the false positive rate (FPR) (which means less customer insult) and reduce chargebacks and fraud, when compared to traditional machine learning methods. Note that even a small reduction in the FPR and chargeback rates (say, a 5% to 10% reduction) might translate to a noticeable increase in business profit or performance when we are dealing with large-scale and competitive e-commerce.

A key benefit of deep learning-based models is their learning capacity, meaning their capability to automatically extract more complex features, learn more complex detection models, and improve performance with more or bigger training data as our database grows. Another key benefit is their capability to handle lots of input features of various kinds, without having to worry much about the statistical properties of the attributes, their dependence or correlations, and the feature engineering work. Many applications of deep learning methods are end-to-end solutions for very complex problems, meaning less challenge regarding feature extraction or the curse of dimensionality. Still, given the inherent complexity and challenges of fraud detection, as discussed in this post, there is room for improvement and a need for research: we need to design deep learning methods suited to anomaly detection, to optimally exploit the rich information in unlabeled data, to improve how various kinds of input data and attributes are processed and then fed into the deep learning model, and at the same time to lower the computational complexity, so that the model can scale in production to process a huge number of transactions per second with a reasonable detection latency.

I would like to thank my Walmart colleagues James Tang, Henry Chen, Julia Albath, Camilo Rivera, Jan Johnson, and Jing Xia for reading the draft and providing their thoughtful comments, which helped to improve this post.
