Feedzai Techblog

Welcome to Feedzai Techblog, a compilation of tales on how we fight villainous villains through data science, AI and engineering.

Casting Deep Nets on Financial Crime

8 min read · Jan 16, 2019

Deep Learning (DL) has revolutionized a number of fields in the last decade, ranging from computer vision to natural language and speech processing. At Feedzai, most of the datasets we work with don’t involve unstructured data such as images, natural language, or audio, but tabular data: tables where rows are individual transactions, each labelled as fraud or not fraud, and columns are attributes, such as amount, merchant, or hour of the day. Deep Learning isn’t as popular on tabular data as it is on unstructured data, but the DL community has been evolving at a fast pace, with huge numbers of scientific papers published and lots of resources poured into research by some of the biggest tech companies. Can we somehow benefit from this to improve our fraud detection models?

The typical transactional case: tabular data

Imagine we want to train a model to classify transactions as fraud or not fraud using this (dummy) dataset:

[Figure: dummy dataset of card transactions, each labelled as fraud or not fraud]

The last three transactions are fraudulent and they do indeed seem suspicious: three 200 USD transactions with the same card on three different websites in 3 minutes. However, if we feed the dataset with these features to a classifier, each transaction will be looked at independently. A model trained on this dummy dataset could not learn that the suspicious activity here arises from the fact that the same card made three payments on different websites over a very short time; it would probably (wrongly) learn that these three transactions are fraudulent because 200.00 USD is a particularly risky amount, when that is likely not the case.

Usually, our fraud detection system handles this by enriching each transaction, or row in the table, with profiles: new columns which convey past information about the entities involved in the transaction. Depending on the use case, these entities could include not only the credit cards, but also merchants, emails, or device IDs (any field whose history over time may help us detect fraud). These profiles typically involve aggregations over sliding windows of X minutes, days, or even months; some examples would be:

  • the number of transactions or total amount spent by this credit card over the last X hours or days,
  • the number of distinct emails used with this shipping address in online transactions over the last X days or weeks,
  • the ratio of the amount of the current transaction against the mean amount per transaction by this credit card over the last X months.

In our dummy example, computing the count of transactions and the sum of transaction amount per card over 5-minute windows would go like this:

[Figure: computing the count of transactions and the sum of transaction amounts per card over 5-minute windows]
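
To make the idea concrete, here is a rough pandas sketch of this kind of sliding-window aggregation. The column names (timestamp, card_id, amount) and the window size are assumptions for illustration, not our actual schema or implementation.

```python
import pandas as pd

def add_card_profiles(df: pd.DataFrame, window: str = "5min") -> pd.DataFrame:
    """Enrich each transaction with the count and total amount of transactions
    made with the same card inside the trailing time window (current one included)."""
    df = df.sort_values("timestamp")
    # Per-card rolling aggregations over a time-based window.
    rolled = (
        df.set_index("timestamp")
          .groupby("card_id")["amount"]
          .rolling(window)
    )
    enriched = df.set_index(["card_id", "timestamp"])
    enriched["tx_count_5min"] = rolled.count()   # aligns on (card_id, timestamp)
    enriched["amount_sum_5min"] = rolled.sum()
    return enriched.reset_index()

transactions = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-01-16 10:00", "2019-01-16 10:01",
                                 "2019-01-16 10:02", "2019-01-16 10:07"]),
    "card_id":   ["1234", "1234", "1234", "9876"],
    "amount":    [200.0, 200.0, 200.0, 35.0],
})
print(add_card_profiles(transactions))
```

In a real system these aggregations would run incrementally over a stream rather than a static DataFrame, but the resulting columns play the same role.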

After adding these new columns, an enriched dataset could look like:

[Figure: the dummy dataset enriched with the new profile columns]

In the end, our hope is that these new columns convey useful information about the history of the entities (in this case, the card) involved. If we now train a Random Forest on this enriched dataset, each row is still considered independently, but the information about the past is now contained in it and the model can detect the kind of shady sequences of transactions we’re looking for.
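
For completeness, here is a hedged sketch of that last step, continuing from the pandas snippet above; the is_fraud labels and the hyperparameters are made up for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

enriched = add_card_profiles(transactions)       # from the sketch above
enriched["is_fraud"] = [1, 1, 1, 0]              # dummy labels for illustration

features = ["amount", "tx_count_5min", "amount_sum_5min"]
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(enriched[features], enriched["is_fraud"])

# Each row is still scored independently, but the profile columns carry its history.
fraud_scores = clf.predict_proba(enriched[features])[:, 1]
```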

Recurrent Neural Networks

Profiles turn out to work very well in most use cases and frequently boost our detection results. Still, they have two downsides:

  • even with Feedzai’s AutoML ability, it takes time to compute hundreds of profiles before training a model, and
  • in production, maintaining and updating profiles and related state consumes plenty of memory.

Deep Learning allows us to take a different direction: instead of addressing the issues that pop up when we try to enrich transactions with profiles, we look for solutions that completely forgo explicit profile computations. Could we achieve the same, or even better, results? Could there be a way to embed our belief that history is important in the model rather than in the features, and let the model figure out how to use that history instead of us explicitly computing profiles?

How it works

Recurrent Neural Networks (RNNs) are neural networks that work on sequences of inputs, as opposed to classifiers such as Random Forests or feedforward neural networks which look at individual inputs independently. RNNs revolutionized areas such as natural language processing (where we typically deal with sequences of words), speech recognition (sequences of short sound clips), and video classification (sequences of frames).

You might have guessed: RNNs also apply very naturally to the problem we’re trying to solve. If we know that the history of an entity over time (say, a credit card for the sake of this example) is important, we can move away from classifying a transaction independently and instead classify the sequence of transactions that led to the one we want to classify. To achieve this, we encode knowledge of the history of a card up to the current point in time in a state vector, and then use that state to predict fraud. During training, we want to make the model simultaneously learn:

  • how to classify transactions based on the state, and
  • how to update the state at each step of the sequence to make it capture relevant information.

There are several variants of RNNs, including LSTMs and GRUs; we’ve mostly been using GRUs since they are slightly easier to train and the results are roughly the same. The figure below shows how this works when we train the model with historical data.

[Figure: training the RNN on batches of per-card transaction sequences]

  1. We start with a state vector initialized to zero for each card.
  2. Then, we take a sequence of transactions of a card, feed it through the RNN and, for each transaction, predict if it’s fraudulent or legitimate and update the state that is passed on to the next transaction.
  3. We do this not with one, but with a batch of cards.
  4. After we have predictions for all transactions for the cards in the batch, we compare them with the labels and compute the loss for each transaction (a measure of how far our predictions are from the correct labels) and adjust the model through Stochastic Gradient Descent.
  5. Rinse and repeat — keep doing this with different batches of cards and watch the model improve over time.

Note that each card is a different sequence with its own state vector evolving over time, but the learnable blocks in the model (the blocks we’re adjusting gradually to make the average error smaller during the training process) are shared across sequences and steps within each sequence:

  • the “GRU cell” block, in blue — the block that produces a new state vector for a card, given its previous state vector and an incoming transaction;
  • the “classifier” block, in orange — the block that produces a prediction based on the state vector and a transaction.

We skipped most of the details, but in fact each of these two blocks is a neural network. Sounds complicated? Not anymore — with tools such as TensorFlow, Keras, and PyTorch around, this is actually easy to implement. Most of these frameworks offer implementations for a number of RNN variants.
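
As a rough illustration, here is what a minimal version of this kind of model could look like in PyTorch. Layer sizes, names, and the random batch are assumptions for illustration, not our production model; real sequences also need padding and masking, which we skip here.

```python
import torch
import torch.nn as nn

class SequenceFraudModel(nn.Module):
    """A GRU that updates a per-card state at every transaction, plus a small
    classifier head that scores each transaction from that state."""

    def __init__(self, num_features: int, hidden_size: int = 64):
        super().__init__()
        # "GRU cell" block: new state from the previous state and the incoming transaction.
        self.gru = nn.GRU(num_features, hidden_size, batch_first=True)
        # "classifier" block: fraud score from the state and the transaction itself.
        self.classifier = nn.Linear(hidden_size + num_features, 1)

    def forward(self, x):
        # x: (cards in batch, transactions per card, features per transaction).
        # The initial state defaults to zeros, matching step 1 above.
        states, _ = self.gru(x)
        logits = self.classifier(torch.cat([states, x], dim=-1))
        return logits.squeeze(-1)            # one fraud score per transaction

# One (illustrative) training step on a batch of card sequences.
model = SequenceFraudModel(num_features=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

batch = torch.randn(32, 20, 16)                  # 32 cards, 20 transactions each
labels = torch.randint(0, 2, (32, 20)).float()   # fraud / not fraud per transaction

optimizer.zero_grad()
loss = loss_fn(model(batch), labels)             # loss over every transaction
loss.backward()
optimizer.step()
```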

But can we use this in a real-time setting?

It may look like this is impossible to use in a real-time setting, considering the latency and memory constraints. After all, to get a prediction for the 100th transaction of a card, we would need to feed the model the entire sequence of 100 transactions — and this number could get much bigger! The trick here is that the information we keep about a sequence is completely encoded in its state vector. If we keep a table in memory with the most recent state for each credit card, when a new transaction arrives we just need to:

  1. Fetch the current state for the given card number from memory.
  2. Feed the current transaction and the state we just fetched to a GRU cell, yielding the new state for this card number.
  3. Feed the new state to the final block to get a prediction.
  4. Update the state table with the new state for the given card.

In practice, the predictions are the same as if we fed entire sequences every time we wanted to score a transaction. By keeping track of the states, we can expect the model to take a similar time to score each transaction, regardless of whether it’s the first or the millionth one for that card. Memory-wise, we just need to keep a record of the most recent state per card.

[Figure: real-time scoring, keeping the most recent state per card in memory]
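
Sticking with the toy PyTorch model from the previous sketch, the real-time loop could look roughly like this. The names card_states and score_transaction are made up, and a production system would keep the state table in a proper store rather than a Python dict.

```python
import torch

card_states = {}   # most recent state per card (card_id -> torch.Tensor)

@torch.no_grad()
def score_transaction(model: SequenceFraudModel, card_id: str,
                      features: torch.Tensor) -> float:
    # 1. Fetch the card's current state (zeros if we've never seen this card).
    prev = card_states.get(card_id, torch.zeros(1, 1, model.gru.hidden_size))
    x = features.view(1, 1, -1)                  # a "sequence" of one transaction
    # 2. One GRU step: previous state + transaction -> new state.
    states, new_state = model.gru(x, prev)
    # 3. Classify using the new state and the transaction itself.
    logit = model.classifier(torch.cat([states, x], dim=-1))
    # 4. Store the new state for this card and return the fraud score.
    card_states[card_id] = new_state
    return torch.sigmoid(logit).item()

score = score_transaction(model, card_id="1234", features=torch.randn(16))
```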

Take-away

So far, in every case where we tried similar RNN models, they outperformed our previous best models (Random Forests, XGBoost, or LightGBM). The table below shows the improvements in transaction recall at a 1% false positive rate, a common baseline metric, for two different datasets. In dataset 1, for example, this improvement would save an estimated 1 million USD in fraud per year. These models didn’t need much tuning, so we expect the results could improve further with more thorough hyperparameter optimization.

[Table: improvement in transaction recall at a 1% false positive rate on two datasets]
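
For reference, this is one way that metric can be computed from a model’s scores with scikit-learn; y_true and y_score here are just placeholders.

```python
import numpy as np
from sklearn.metrics import roc_curve

def recall_at_fpr(y_true, y_score, max_fpr: float = 0.01) -> float:
    """Best recall (true positive rate) achievable while keeping the
    false positive rate at or below `max_fpr`."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(tpr[fpr <= max_fpr].max())

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])                    # placeholder labels
y_score = np.array([0.1, 0.3, 0.2, 0.9, 0.7, 0.4, 0.8, 0.05])  # placeholder scores
print(recall_at_fpr(y_true, y_score))
```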

The main takeaway here, though, is that these results were obtained without any profiles or explicit feature engineering. Without profiles, this setup would require a fraction of the memory typically necessary to run in production.

Next steps

Not everything is easier with this approach. Firstly, explaining model decisions becomes harder than with Random Forests or Gradient Boosting Machines. Secondly, the underlying assumption that we should consider sequences of transactions grouped by one entity seems to fit the use case of transaction monitoring for issuers, where we have access to the entire history of transactions of each card, so the card becomes the most important (and easiest) entity to profile. It may not work as well when our model benefits from profiles over a number of different entities, such as user, device ID, IP address, or shipping address, as is usually the case with transaction monitoring for online merchants. We’ve already started working on both of these issues, so stay tuned for news sometime soon. But overall, it’s really exciting that this solution is able to achieve better results while completely side-stepping all hand-crafted features and profiles, which are costly both in terms of human and computational resources.

This work was done in collaboration with Mariana Almeida and Bernardo Branco.
