Data Science Journeys: Fraud Detection

7 min read · Jan 21, 2019

There is a common chasm that we, as Data Science practitioners working with companies, face in virtually every project: how to go from a business problem to a data science solution, or to a data science product. This gap is somewhat obvious, yet elusive and easily overlooked, and this duality seems to stem from the way we start learning DS. In most tutorials we have an initial dataset and a specific goal for that dataset. The data and the problem definition are airtight. For example, in the famous House Pricing regression data the task is to predict the price of a house based on numerous attributes, and the success of our model is evaluated under some error metric, say RMSE or something of the sort. From there, we can use our ML expertise to solve the problem.

However, note that these tasks are already stated in an ML context. In a real-world setting, we do not have such immutable problem definitions. Usually we have something very different: a problem the client has, and an expectation of what will happen once that problem is solved, that is, a business objective. Looking back at the housing problem, we could ask: what is the real problem that whoever gave us this data was trying to solve? Maybe we are talking about a business that buys homes, improves them and sells the improved houses for a profit, and maybe their objective is to determine, based on data, which factors add the most value to a house so that they can focus on those to maximize their profits. With this new knowledge, we see that merely predicting the price of a house is hardly the end goal we should keep in mind, and we are facing a gap.

Bridging the chasm between business objectives and data science objectives marks the difference between projects that end well and projects that don’t. The reason is that a DS solution that is not aligned with the business objectives, and does not directly attack the problem the business has, is doomed to be little more than a failed DS project, no matter how good the data science/machine learning development is.

This small series of articles aims to give a practical look at data science under the umbrella of a business use case. We want to explore with you how to go from a business problem to a data science product, and what considerations have to be taken into account. This case is adapted from and based on the data provided by Edgar Alonso Lopez-Rojas, Ahmad Elmir, and Stefan Axelsson in their paper Paysim: A Financial Mobile Money Simulator for Fraud Detection¹. The data can be downloaded directly through Kaggle.

The story starts like this…

Our client is a widely recognized mobile wallet company based in an undisclosed African country. This company handles around 6 million transactions a month and takes seriously the protection of its customers’ assets stored in their virtual wallets, which is why fraudulent transactions are a serious concern for our client. Our client believes that the modus operandi of an attacker is simple: the attacker seizes control of the victim’s wallet and transfers money to another wallet; after a series of one or more fraudulent transactions, the money ends up being withdrawn and is effectively stolen. This situation occurs often enough that in a single month the staggering amount of 1200M is lost to fraud.

Figure: Effective withdrawals in fraudulent transactions.

Keeping this problem in mind, the client has come to us with two main concerns.

They would like to automate the process of fraud detection, as the client has no such process at the moment. In the status quo, frauds are reported after they have been committed and, after some time, the reports are analyzed to determine whether each claim was a real fraud, a process that is struggling with the current volume of fraud claims. Based on the automatic flagging of fraud, the client wants to implement a process that allows them to quickly review possible frauds, holding the transaction in the meantime while an expert decides whether to approve or cancel the pending transaction. Because the experts who review cases are scarce, it is paramount that the system does not flag too many legitimate transactions as fraud.
With a similar objective, their fraud analysis department wants us to help them understand what factors characterize a fraudulent transaction, factors that would help them develop measures or policies to preemptively reduce fraud.
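
The constraint in the first concern (scarce reviewers, so few legitimate transactions should be flagged) can be made concrete with two standard metrics: precision, the share of flagged transactions that are truly fraudulent, and recall, the share of frauds that get flagged. The following is a minimal sketch in Python; the function name and the toy labels are hypothetical and only illustrate the idea, they are not the client's data.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # frauds correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate but flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # frauds that slipped through
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example with made-up labels: 2 real frauds, 3 flagged transactions.
y_true = [0, 0, 1, 0, 1, 0, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Low precision means reviewers spend most of their time on legitimate transactions, which is exactly what the client wants to avoid.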

The client has given us data describing transactions performed on their platform, which they think will reveal valuable information about the patterns of fraudulent and legitimate transactions.

This series is (tentatively) divided into three articles, each with a particular objective under the umbrella of the broad business case.

Give me reasons (link here), where we present the solution we obtained in tackling the client’s objectives. This means developing a model for automating fraud detection and explaining what may characterize a fraudulent or non-fraudulent transaction. In addition, we describe what actionable insights or policies can be derived from the analysis in terms of the client’s objectives and, alongside these insights, we evaluate how difficult it is for an attacker to elude these new measures. Details about the different models and techniques used are provided as well. We also emphasize the reasons to keep the client’s objectives and the business context in mind whilst developing the models.
Anticipating the fraudsters (link here) follows the story from the first chapter, and starts from the notion that fraudsters will react to the implementation of the model, and that this new behavior will impact the model’s performance. So, how can we develop a more robust model?
What to do with the model? (in progress), where we discuss different ways to deploy the models developed in the first chapter, contingent upon the client’s requirements and existing technological infrastructure.

Although these articles are by no means an exhaustive picture of all the challenges that a real-world mobile fraud detection case would encompass, they cover a good portion of the issues, and may prove useful for anyone looking to integrate business and technical considerations into their problem-solving toolbox.

As a final note to the introduction, this work is a collaboration between Juan Diego Bermeo and me, so some of the articles will be posted from his account and some from mine. Regardless, you can find the links to the rest in this article.

A little bit of data context

For those of you who are not familiar with the data, we would first like to introduce the context and data that we use throughout this case. The dataset represents the transactions, over a one-month period, of a mobile money service or electronic wallet in an undisclosed African country. We say “represents” because in reality the dataset comes from a simulation, called Paysim, which is nonetheless intended to be representative of real transactions. In addition to the paper mentioned some paragraphs back, we also used Edgar’s PhD dissertation². We will use information from these documents as a proxy for the domain knowledge available in a normal business context, which is essential to understand the appropriateness and efficacy of the data products built.

For our practical purposes, we will treat the data as real, and try to establish a setting as close to real-world practice as possible.

Now, on to the data: we have a file with approximately 6 million records corresponding to a month of data capture. Each record characterizes a single transaction, including details of the origin and destination accounts, together with their respective balances. All records are labelled as either a fraud or a legitimate transaction. In addition, these transactions can occur between two types of roles in the system, customers and merchants. Think of customers as ordinary people using the platform and of merchants as ATMs that allow customers to deposit or withdraw money. Let us briefly describe each of the fields that we will encounter in the main data file.

step describes the hour of the month; it starts at 1 and goes up to 720, which corresponds to 30 days.
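
As a small illustration, an hourly step can be converted into a day of the month and an hour of the day. This sketch assumes that step 1 is the first hour of day 1; the text does not say which clock hour a step corresponds to, so the hour value is only relative to the start of the simulation. The function name is hypothetical.

```python
def step_to_day_hour(step):
    """Map a 1-based hourly step to (day of month, hour within that day)."""
    if step < 1:
        raise ValueError("step starts at 1")
    day = (step - 1) // 24 + 1  # days are counted from 1
    hour = (step - 1) % 24      # hours are counted from 0 to 23
    return day, hour


for step in (1, 24, 25, 720):
    print(step, step_to_day_hour(step))
```

With this convention the last step, 720, falls in the final hour of day 30.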

type denotes one of the five varieties of transactions that occur in the mobile platform: CASH_OUT, CASH_IN, DEBIT, PAYMENT and TRANSFER. Each of these types warrants its own explanation, quoting the author of the data directly:

CASH-IN is the process of increasing the balance of account by paying in cash to a merchant, like doing a deposit to the account.
CASH-OUT is the opposite process of CASH-IN, it means to withdraw cash from a merchant which decreases the balance of the account.
DEBIT is similar process than CASH-OUT and involves sending the money from the mobile money service to a bank account.
PAYMENT is the process of paying for goods or services to merchants which decreases the balance of the account and increases the balance of the receiver.
TRANSFER is the process of sending money to another user of the service through the mobile money platform.

amount is the money transferred in the transaction; the amount goes from an origin account to a destination account.

nameOrig identifies the account originating the transaction. IDs are prefixed by a C or an M for customers and merchants respectively.
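
The prefix convention makes it easy to recover the role of an account from its ID. A minimal sketch follows; the function name and the example IDs are made up for illustration and are not taken from the dataset.

```python
def account_role(account_id):
    """Return 'customer' or 'merchant' based on the ID prefix (C or M)."""
    if account_id.startswith("C"):
        return "customer"
    if account_id.startswith("M"):
        return "merchant"
    raise ValueError(f"unexpected account ID prefix: {account_id!r}")


# Hypothetical IDs, invented for this example.
print(account_role("C0000001"))
print(account_role("M0000002"))
```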

oldbalanceOrg is the balance of the origin account before the transaction.

newbalanceOrig is the balance of the origin account after the transaction.

Along the same lines, the data also contains the fields nameDest, oldbalanceDest and newbalanceDest for the destination account.
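
To make the record layout concrete, here is a minimal sketch that parses one record into a typed object using only the Python standard library. It uses the field names given in this article, including the two label columns isFraud and isFlaggedFraud described next. The class name, the helper name and the sample row are hypothetical; the row is invented for illustration and is not taken from the dataset, and the column order in the real file is an assumption.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Transaction:
    step: int
    type: str
    amount: float
    nameOrig: str
    oldbalanceOrg: float
    newbalanceOrig: float
    nameDest: str
    oldbalanceDest: float
    newbalanceDest: float
    isFraud: int
    isFlaggedFraud: int


def parse_transactions(csv_file):
    """Yield Transaction objects from a CSV file object with a header row."""
    for row in csv.DictReader(csv_file):
        yield Transaction(
            step=int(row["step"]),
            type=row["type"],
            amount=float(row["amount"]),
            nameOrig=row["nameOrig"],
            oldbalanceOrg=float(row["oldbalanceOrg"]),
            newbalanceOrig=float(row["newbalanceOrig"]),
            nameDest=row["nameDest"],
            oldbalanceDest=float(row["oldbalanceDest"]),
            newbalanceDest=float(row["newbalanceDest"]),
            isFraud=int(row["isFraud"]),
            isFlaggedFraud=int(row["isFlaggedFraud"]),
        )


# Hypothetical sample row, invented for illustration.
sample = io.StringIO(
    "step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,"
    "nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud\n"
    "1,TRANSFER,500.0,C0000001,1500.0,1000.0,C0000002,0.0,500.0,0,0\n"
)
transactions = list(parse_transactions(sample))
print(transactions[0])
```

The same function would work on a real file opened with `open(path, newline="")`, assuming the header uses these column names.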

isFraud is the indicator of illegal activity: 1 for a fraudulent transaction and 0 for a legitimate one. Only 0.12% of transactions, or about 1 in 833, are identified as fraud, a scarcity that brings additional complexity to this case.
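
A quick back-of-the-envelope calculation shows why this scarcity matters. The sketch below uses only the two figures quoted in this article (a 0.12% fraud rate and roughly 6 million records); the variable names are ours, and the record count is approximate.

```python
fraud_rate = 0.0012             # 0.12%, as stated in the text
total_transactions = 6_000_000  # approximate record count from the text

one_in_n = 1 / fraud_rate                          # one fraud every N transactions
expected_frauds = fraud_rate * total_transactions  # rough number of fraud records
baseline_accuracy = 1 - fraud_rate                 # always predicting "legitimate"

print(f"about 1 in {one_in_n:.0f} transactions is a fraud")
print(f"roughly {expected_frauds:.0f} fraudulent records in the month")
print(f"a model that never flags anything is {baseline_accuracy:.2%} accurate")
```

A model that never flags a single transaction would still score very high on accuracy while catching no fraud at all, so accuracy alone is a poor yardstick for this problem.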

isFlaggedFraud marks whether a transaction tries to move an amount greater than 200,000, a policy implemented by the client to identify and discourage fraud. However, there is no guarantee that this flagging rule actually works.
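
One way to check whether such a rule works is to compare it against the isFraud label. The following is a minimal sketch under the rule as described here (flag when the amount exceeds 200,000); the function names and the toy records are hypothetical and do not come from the dataset.

```python
FLAG_THRESHOLD = 200_000  # threshold stated in the text


def is_flagged(amount, threshold=FLAG_THRESHOLD):
    """Apply the amount-based flagging rule as described in the article."""
    return amount > threshold


def flag_rule_summary(records):
    """Count how the rule compares with the fraud label.

    `records` is an iterable of (amount, is_fraud) pairs, is_fraud being 0 or 1.
    """
    counts = {
        "flagged_fraud": 0,    # rule fires on a real fraud
        "flagged_legit": 0,    # rule fires on a legitimate transaction
        "missed_fraud": 0,     # fraud that the rule does not catch
        "unflagged_legit": 0,  # legitimate and not flagged
    }
    for amount, is_fraud in records:
        flagged = is_flagged(amount)
        if flagged and is_fraud:
            counts["flagged_fraud"] += 1
        elif flagged:
            counts["flagged_legit"] += 1
        elif is_fraud:
            counts["missed_fraud"] += 1
        else:
            counts["unflagged_legit"] += 1
    return counts


# Made-up (amount, isFraud) pairs for illustration only.
toy_records = [(250_000, 1), (300_000, 0), (1_000, 1), (50, 0), (200_000, 0)]
summary = flag_rule_summary(toy_records)
print(summary)
```

A large share of missed frauds or flagged legitimate transactions in such a summary would suggest that the threshold alone is not a reliable detector.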

[1] Lopez-Rojas, Edgar Alonso & Elmir, Ahmad & Axelsson, Stefan. (2016). PAYSIM: A FINANCIAL MOBILE MONEY SIMULATOR FOR FRAUD DETECTION. link here

[2] Lopez-Rojas, E. A. (2016). Applying Simulation to the Problem of Detecting Financial Fraud (PhD dissertation). Karlskrona. Retrieved from http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932

Written by David Gamba