Fighting Fraud with Machine Learning at Remitly

Feb 9, 2022

Author(s): Jake Weholt, Fangyan Chen, Matthew Drury

Remitly customers send billions of their hard-earned dollars home each year. Hidden amongst these customers are bad actors attempting to defraud Remitly and the customers we work hard to protect. To help combat fraudulent activity, Remitly deploys a variety of methods, some of which rely on sophisticated machine learning models. This blog post will focus on the machine learning tools and techniques used by Remitly’s machine learning team to protect our customers.

Framing the problem

While many machine learning tools are available to fight fraud, we’ve chosen to frame fraud mitigation as a binary classification problem. Transactions are labeled either fraud (1) or not fraud (0), and a variety of features are used for machine learning models to learn the differences between the two.

Fraud mitigation at Remitly is an entire system with many layers, one of which contains a machine learning model that produces the probability a transaction is fraudulent, i.e.

P(y = 1 | X) = f(X)

where X is the feature vector describing each transaction, y is the binary target representing whether the transaction is fraudulent, and f is the machine learning model. The output of this model is then used as input into our mitigation decisions.

Clean Fraud Labels

Clean labels are the key to developing valuable fraud machine learning models. Unfortunately, gathering clean labels can be quite challenging, because most of the time we will never know how a transaction should truly be labeled. For example, some fraudulent transactions slip through and are never labeled as fraudulent, depriving our model of valuable signal. Additionally, any transaction that Remitly declines is never actually sent, which means we never observe whether it was truly fraudulent.

Noisy labels introduce a unique challenge to the equation introduced above: uncertainty around the correctness of the target variable y itself.

Thankfully, there are ways to navigate these challenges. One helpful, albeit unintuitive, observation is that our certainty is not uniform across all labels: we are more confident in the correctness of some positive labels than others. We can therefore use a small dataset of labels we are highly confident about to inform us about the larger dataset we are less certain about. This process, known as stratified sampling, works when you can ensure that your small dataset is an unbiased representation of the larger one. Below is an example of this process in action.
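The sketch below is illustrative, with hypothetical column names and made-up proportions rather than Remitly’s actual data: we draw a small audit sample whose strata match the full dataset, verify its labels by hand, and compare the per-stratum fraud rates against the noisy labels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical transactions: 'risk_band' is the stratification key and
# 'noisy_label' is the uncertain fraud label (1 = fraud, 0 = not fraud).
transactions = pd.DataFrame({
    "risk_band": rng.choice(["low", "medium", "high"], size=50_000, p=[0.8, 0.15, 0.05]),
    "noisy_label": rng.binomial(1, 0.01, size=50_000),
})

# Sample 1% from each stratum, so the audit sample keeps the same strata
# proportions as the full dataset, i.e. an unbiased representation of it.
audit_sample = transactions.groupby("risk_band").sample(frac=0.01, random_state=0)

# In practice, each audited row would be manually reviewed to produce a
# high-confidence label; those per-stratum fraud rates can then be used
# to correct or reweight the noisy labels in the full dataset.
print(audit_sample.groupby("risk_band")["noisy_label"].mean())
```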

Feature Engineering

Feature engineering is the process of transforming raw data into model features that increase model performance. The goal is to build features that help our models learn the underlying structure of the problem we are trying to solve. In our case, that means finding/creating features that separate our classes (“is fraud” vs. “is not fraud”) so that our models can successfully distinguish between the two.

Feature engineering can be simple, like taking the square root of a particular series, or very complex, like combining data from multiple sources or imputing missing values. Regardless of how features are engineered, the overall goal is the same: create features that help our models catch more fraudulent transactions.

Below is an example of how simple feature transforms can improve model performance:

Plotting the labels as a function of raw date, there seems to be some sort of pattern, but it isn’t clear what the pattern is.

By plotting the labels as a function of month instead of raw date, the pattern becomes much clearer.

If we go a step further and build individual models using these features (one model for “raw date” only, and another model for “month” only), we can see that using “month” as a feature instead of “raw date” gives us a considerable performance lift.
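To make this concrete, here is a self-contained sketch on synthetic data (standing in for the plots above, not Remitly’s actual features) comparing a model trained on the raw date against one trained on the extracted month:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data over two years: fraud is more likely in certain months
# (a seasonal pattern hidden inside the raw date).
dates = pd.to_datetime("2020-01-01") + pd.to_timedelta(rng.integers(0, 730, 10_000), unit="D")
month = dates.month.to_numpy()
y = rng.binomial(1, np.where(np.isin(month, [6, 7]), 0.06, 0.01))

# Feature 1: raw date, encoded as days since the start of the series.
raw_date = np.asarray((dates - dates.min()).days).reshape(-1, 1)
# Feature 2: month extracted from the date, one-hot encoded.
month_onehot = pd.get_dummies(month, dtype=float).to_numpy()

for name, X in [("raw date", raw_date), ("month", month_onehot)]:
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, y, scoring="roc_auc", cv=5).mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```

On this synthetic data, the one-hot month feature lets the model capture the seasonal pattern, while a single monotonic raw-date feature cannot.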

Feature engineering is one of the powerful tools we use to capture complicated customer dynamics at Remitly, where we serve a broad customer base whose behavior naturally varies. Fraudsters tend to camouflage themselves among legitimate customers and rapidly shift their behaviors to avoid detection. These factors create unique challenges for feature engineering:

  1. It’s difficult to find patterns in readily available data that can distinguish fraudsters from legitimate customers.
  2. Even when such patterns are found, they are non-stationary: fraudsters shift their behavior, so the patterns move over time.

Looking back at the equation introduced at the beginning of this post, these challenges center on finding the best feature vector X to distinguish fraudulent transactions from legitimate ones. To address them, Remitly invests heavily in feature engineering.

Remitly’s machine learning engineers spend the majority of their time rapidly developing, simulating, and experimenting to improve upon the current suite of features. Rapid development is the key to capturing emerging fraud trends, and thorough experimentation ensures we are developing features that provide lift without overfitting. Additionally, through feature experimentation and simulation we can estimate how our models will behave in production, which gives us better insight into the tradeoffs we are making when deploying a new model version.

Metrics

Understanding and measuring how our customers are impacted by machine learning is the key to managing the tradeoffs associated with deploying our models.

In fraud mitigation, there is a tradeoff between a smooth customer experience and fraud losses. Stricter fraud enforcement increases false positives, pushing too many non-fraudsters into a potentially bad experience; these experiences can lead customers to cancel their transactions or leave Remitly. Less strict fraud enforcement increases false negatives, which cause headaches for the customers who were defrauded and increase fraud losses for Remitly.

To measure these tradeoffs we use traditional metrics such as precision (what proportion of the flagged transactions are actually fraudulent?) and recall (of all fraudulent transactions, what proportion did we catch?).

In general, we ignore accuracy as a model metric. Using accuracy to measure the success of a binary classification model on imbalanced data gives a misleading view of model performance. For instance, if our dataset includes 16 transactions, 15 of which are non-fraudulent, simply predicting “not fraud” for every transaction gives us 93.75% accuracy. This sounds great, but we didn’t actually catch any fraudsters. Below, we show this example in greater detail, including the precision and recall scores that can be used to catch this issue.
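Here is that example as a runnable sketch, using scikit-learn’s metric functions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 16 transactions: 15 legitimate (0) and 1 fraudulent (1).
y_true = [0] * 15 + [1]
# A degenerate "model" that predicts "not fraud" for every transaction.
y_pred = [0] * 16

print(accuracy_score(y_true, y_pred))                    # 0.9375, which looks great
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0, nothing was flagged
print(recall_score(y_true, y_pred))                      # 0.0, no fraud was caught
```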

The Rare Event Problem

Fraudulent activity is rare in comparison to non-fraudulent activity. This disparity leads to class imbalance, a term that simply means negative (non-fraudulent) cases outnumber positive (fraudulent) cases in the data by a very large amount.

When datasets are distributed this way, models have far fewer data points of comparison to find distinguishing patterns between negative and positive classes. Imagine a dataset with 100 fraudulent transactions and 50,000 non-fraudulent transactions. The ability to find distinguishing patterns is limited by the amount of information contained in the 100 fraudulent data points; adding more non-fraudulent points wouldn’t help when searching for patterns useful to a model.

Rare classes of interest make for difficult machine learning problems, because small probabilities are difficult to estimate accurately. Most (but not all) machine learning classifiers, when trained on very imbalanced data, will bias toward underestimating the probability of the rare class (see King and Zeng: Logistic Regression in Rare Events Data). This model behavior manifests as a model that predicts “is not fraud” a lot more than it predicts “is fraud”. To make matters trickier, if we are doing our jobs properly and providing a strong defensive barrier between Remitly and fraudsters, observed fraud will get rarer over time because (a) as our models improve, fewer fraudulent transactions get past our system, and (b) fraudsters gravitate toward easy paydays, so strong defenses force fraudsters to commit fraud elsewhere (a problem called adverse selection in some disciplines).

Several steps can be taken to mitigate the risks from class imbalance:

  • Consider models that optimize proper scoring rules (gradient boosting is particularly useful here).
  • Resample or reweight your data to even out the count of positive and negative labels (sketched below).
  • Choose appropriate model evaluation metrics.
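For the reweighting bullet, here is a minimal sketch on synthetic data; HistGradientBoostingClassifier stands in for whatever model is actually used in production. Positives are upweighted so both classes contribute equally to the training loss:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced dataset with a low positive (fraud) rate.
X = rng.normal(size=(50_000, 10))
y = rng.binomial(1, 1 / (1 + np.exp(-(2 * X[:, 0] - 5.0))))

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Upweight the rare positive class so both classes contribute equally to
# the (proper) log-loss that gradient boosting optimizes.
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
weights = np.where(y_train == 1, pos_weight, 1.0)

model = HistGradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train, sample_weight=weights)

# Average precision (area under the precision-recall curve) is far more
# informative than accuracy when the positive class is rare.
print(average_precision_score(y_test, model.predict_proba(X_test)[:, 1]))
```

One caveat: reweighting distorts the calibration of the predicted probabilities, so if downstream decisions consume the scores as probabilities, they may need recalibration.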

Summary

In this post we explored the challenges of using machine learning to fight fraud: the difficulty of generating clean labels for training, the value of feature engineering in building better models, the importance of choosing the right evaluation metrics, and the rare event problem that makes building valuable models difficult. Thank you for following along, and we hope this post inspires you to fight fraud in your own domain!
