Statistical problems in fraud detection

Published in Walmart Global Tech Blog, Jul 17, 2020

Every day, credit card credentials are stolen by fraudsters who immediately try to purchase merchandise from many websites, including ours. This happens before the credit card owner and the bank can notice it, which means we need a fraud detection team fully dedicated to mitigating this risk.

At Walmart eCommerce, we process thousands of transactions every day and we need a team of Data Scientists, Engineers, and Product Managers to build, deploy, and maintain fraud detection algorithms.

How does Data Science come to the table?

From a product point of view, a perfect fraud detection solution needs to achieve the following.

Allow a frictionless customer experience, i.e., accept all transactions from legitimate customers.
Stop all transactions from fraudsters.
Measure success to align with business expectations.
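One way to make these goals measurable is to track two rates: the share of fraud that is stopped and the share of legitimate customers who are wrongly declined. The sketch below is a minimal illustration, assuming the true outcome of each transaction is known; the function name and the dictionary keys are hypothetical and do not describe any actual system.

```python
def fraud_metrics(y_true, y_pred):
    """Summarise decisions against known outcomes.

    y_true: 1 if the transaction was fraud, 0 if legitimate.
    y_pred: 1 if the transaction was declined, 0 if accepted.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud stopped
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud accepted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good customer declined
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # good customer accepted
    return {
        # Goal 2: stop fraud (recall on the fraud class).
        "fraud_stopped": tp / (tp + fn) if tp + fn else 0.0,
        # Goal 1: frictionless experience (false positive rate, lower is better).
        "good_customers_declined": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy data, invented for illustration only.
metrics = fraud_metrics(y_true=[1, 1, 0, 0, 0, 0], y_pred=[1, 0, 1, 0, 0, 0])
print(metrics)
```

A perfect solution would score 1.0 on the first rate and 0.0 on the second; in practice the two trade off against each other.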

This is the ultimate goal of fraud detection, but we can’t know for sure who is a fraudster. The second-best option is to estimate, with high confidence, who is a fraudster. This is where Statistics comes to the table, and where we start to explore the different types of problems we face as Data Scientists.

What type of statistical problems are involved in fraud detection?

From a Statistics point of view, fraud detection can be modeled as a binary classification problem, where each record is a transaction and the response variable is fraud vs non_fraud. Nonetheless, fraud detection involves a lot more than programming and deploying your favorite binary classification algorithm. It touches many areas of Statistics, from classic to modern:

1. Class imbalance problem:

The vast majority of our customers are good customers who want to purchase merchandise from our website, so fraudulent transactions make up only a small fraction of the data. This translates into a classic problem in Statistics called the class imbalance problem.
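One common way to handle imbalance is to weight each class inversely to its frequency, so that errors on the rare fraud class count more during training. The sketch below is illustrative only: the function name is hypothetical, the label counts are invented, and it shows one possible remedy rather than the approach used in production.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Invented example: 2 fraudulent transactions out of 1000.
labels = ["fraud"] * 2 + ["non_fraud"] * 998
weights = balanced_class_weights(labels)
print(weights["fraud"])      # 250.0
print(weights["non_fraud"])  # roughly 0.5
```

With these weights, each class contributes the same total weight to the loss, which keeps a model from simply predicting non_fraud for everything.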

2. High dimensional model selection problem:

We constantly brainstorm ideas for new features that could carry a signal of fraud. Over the years, the database of candidate features grows, and we need to select a subset of them for our models. This is a high dimensional model selection problem, for which recent methods have been developed.
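The lasso is one well-known method of this kind: an L1 penalty shrinks the coefficients of uninformative features to exactly zero, which amounts to selecting features. The sketch below is a toy coordinate-descent implementation under simplifying assumptions (squared-error loss, no intercept, centered data). A real fraud model would more likely use a penalized logistic loss, and nothing here is claimed to be the team's method. All names and data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta


# Toy data: only the first of three features drives the response.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2, 2, -2, -2]
print(lasso_coordinate_descent(X, y, lam=0.5))  # [1.5, 0.0, 0.0]
```

The two uninformative features end up with coefficients of exactly zero, while the informative one is kept (shrunk from 2 to 1.5 by the penalty).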

3. Missing data imputation problem:

Every time we decline a transaction, we never learn whether it was really fraudulent or legitimate. Thus, the next time we refit the models, the label for that transaction is missing. This is a classic problem which can be found here

4. Change-point detection for multivariate time series:

We have to build detection algorithms that constantly monitor traffic for anomalous patterns, staying vigilant 24/7 for any sudden change. A classic approach is to run change-point detection on several important signals. This literature goes back many decades, and many companies have built tools for the task: a good example is Twitter’s anomaly detection tool.
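As a concrete illustration, CUSUM is one of the classic change-point methods. The sketch below is a minimal one-sided version for a single signal, assuming the normal level of the signal is known; the function name, parameters, and numbers are hypothetical. A multivariate setup could, as a first approximation, run one such detector per signal.

```python
def cusum_alarm(series, target_mean, slack, threshold):
    """Return the index of the first upward-shift alarm, or None.

    The statistic accumulates deviations above target_mean + slack and
    resets to zero whenever the signal falls back to normal.
    """
    s = 0.0
    for t, x in enumerate(series):
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            return t
    return None


# Invented signal: declined orders per minute, jumping from about 10 to about 16.
signal = [10, 11, 9, 10, 10, 15, 16, 15, 17]
print(cusum_alarm(signal, target_mean=10, slack=1, threshold=8))  # 6
```

The slack controls how small a shift is ignored, and the threshold trades detection delay against false alarms.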

5. Dealing with noisy labels:

For some transactions, many websites have a team of manual review agents who call customers and label transactions as fraud or non_fraud. Because humans make mistakes, these labels are noisy. Measuring the level of noise and deciding how to use noisy labels in your next model is a modern problem in Statistics.
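To see why the noise level matters, consider the simplest setting, where labels are flipped at fixed rates that depend only on the true class. The sketch below shows how the observed fraud rate relates to the true one and how to invert that relation. The flip rates are assumed to be known (for example, estimated from a re-audited sample), which is itself a strong assumption, and all names and numbers are hypothetical.

```python
def observed_fraud_rate(true_rate, flip_legit_to_fraud, flip_fraud_to_legit):
    """Fraud rate seen in the labels after class-dependent label flips."""
    return true_rate * (1 - flip_fraud_to_legit) + (1 - true_rate) * flip_legit_to_fraud


def corrected_fraud_rate(observed_rate, flip_legit_to_fraud, flip_fraud_to_legit):
    """Invert the relation above to recover the true fraud rate."""
    denom = 1 - flip_legit_to_fraud - flip_fraud_to_legit
    if denom <= 0:
        raise ValueError("flip rates too large: labels carry no usable signal")
    return (observed_rate - flip_legit_to_fraud) / denom


# Invented numbers: 5% true fraud, 1% of legit labelled fraud, 20% of fraud labelled legit.
seen = observed_fraud_rate(0.05, flip_legit_to_fraud=0.01, flip_fraud_to_legit=0.20)
print(round(seen, 4))                                   # 0.0495
print(round(corrected_fraud_rate(seen, 0.01, 0.20), 4)) # 0.05
```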

6. What about solving a binary classification problem in a graph?

Have you ever wondered how a “People you may know” product works? In a social network, you can use the topology of the graph to guess who you might know or to infer features about your friends. In fraud detection, we can do the same by linking entities: for example, customers with the hashes of their payment instruments.
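The linking idea can be sketched with a union-find structure: customers and payment-instrument hashes are nodes, and any customers connected through a shared hash end up in the same group. This is only a minimal illustration of building the graph, with hypothetical names and invented data; a real system would go on to classify nodes using features of these groups.

```python
from collections import defaultdict


def linked_groups(customer_to_hashes):
    """Group customers that are connected through shared payment hashes."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    for customer, hashes in customer_to_hashes.items():
        find(("customer", customer))  # register customers with no hashes too
        for h in hashes:
            union(("customer", customer), ("hash", h))

    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == "customer":
            groups[find(node)].add(node[1])
    return sorted(sorted(group) for group in groups.values())


# Invented data: bob shares one card hash with alice and another with carol.
accounts = {"alice": {"h1"}, "bob": {"h1", "h2"}, "carol": {"h2"}, "dave": {"h3"}}
print(linked_groups(accounts))  # [['alice', 'bob', 'carol'], ['dave']]
```

If one member of a group is a confirmed fraudster, the other members become natural candidates for closer scrutiny.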

7. Deep learning and adversarial machine learning:

Deep learning models have flourished in the last decade due to their high predictive power, and fraud detection is no exception. Deep learning techniques for adversarial problems are of particular interest. Fraudsters are smart, and some of them have Data and Engineering backgrounds. They continuously change their strategies, which means our models must also evolve continuously, since any pattern we have seen in historical data is prone to change. Moreover, it would be great if we could predict the next step fraudsters might take, or change the algorithms in a smart way rather than just refitting on the latest data. This is what modern methods like GANs try to model, and they are a promising way to keep our models evolving.

What is the future of fraud detection and statistics?

As the number of card-not-present transactions keeps growing year over year, and the number of fraudulent transactions grows with it, we can expect continued development of statistical tools to mitigate fraud risk. Add to that the many FinTech startups created every year, and we can expect new statistical tools we haven’t even thought of yet.

Written by Camilo Rivera