Cost-sensitive classification in fraud prevention

Emanuele Luzio, Ph.D.
Mercado Libre Tech
Nov 4, 2019

If there is one topic that can make a difference in your career as a data scientist, it is cost-sensitive classification. In machine learning, this is the process of making decisions by considering both the likelihood of an outcome and the costs of making errors.

Let’s clarify this with an example. Say you are walking alone in the savannah, trying to reach the next village. Suddenly, you hear a noise coming from behind. It is very likely caused by the wind, but what if it is a lion lurking in the bushes? If you decide it is a mortal danger and grab your weapon, two outcomes are possible: if you are wrong, nothing happens except a spike of adrenaline in your body; if you are right, you have a chance to survive. However, if you decide it is not a mortal danger and keep walking, and you are wrong, you are dead. So we have a classification problem whose type I and type II errors have very different consequences.

Of course, it’s highly unlikely you will face a mortal danger while walking in your city (well, kind of), but this is exactly the conceptual dilemma we face whenever we try to spot a fraud during a customer’s purchase on our platform here at Mercado Libre. If you classify a purchase as a fraud and you are wrong, you have lost the commission fee and caused a bad experience for a good customer. If you classify a purchase as legitimate and you are wrong, you have lost the entire amount of the transaction.

How do you handle this tough dilemma optimally? This article is about how we try to solve it in our fraud detection department. But first, we need to define the concepts of utility and the utility matrix.

The utility and the utility matrix

Let’s make a conceptual experiment. Suppose we receive a payment for a transaction and we can take only two actions, either accepting the payment or rejecting it. If we accept it and it is legitimate, we earn, say, +2. If it is a fraud we have a loss of -100. If we reject the payment and it is a fraud we earn 0. If we reject the payment and the payment was genuine, the result is a loss in which case we consider -2 because of the opportunity cost and the bad experience we are causing to our customer. We can sum up this information in the following utility matrix:

Utility matrix:

                      Legitimate payment    Fraudulent payment
Accept the payment            +2                   -100
Reject the payment            -2                     0

Now imagine we can repeat the same classification quite a number of times, say N times, where N is a large number. How many points have we accumulated? According to our confusion matrix

Confusion matrix:

                      Fraudulent payment    Legitimate payment
Reject the payment            Tp                    Fp
Accept the payment            Fn                    Tn

where Tp, Fp, Fn, and Tn are respectively the numbers of true positives, false positives, false negatives, and true negatives, we have accumulated a score S:

S = u_tp*Tp + u_fp*Fp + u_fn*Fn + u_tn*Tn

where u_x is the utility of outcome x from the utility matrix.

If we divide both sides of the previous equation by N and take the limit N to infinity, we get:

<S> = u_tp*P_tp + u_fp*P_fp + u_fn*P_fn + u_tn*P_tn

where each P is the limiting frequency of the corresponding outcome. In other words, the average score <S> is the probability-weighted average of the scores of all possible outcomes. We call the average score the “utility function” U and the general formula is:

U = u_1*P_1 + u_2*P_2 + … + u_k*P_k (utility general formula)

where k is the number of possible outcomes and u_i and P_i are respectively the score and the probability of the outcome i. We can use the utility function in order to calibrate our classifiers.
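As a minimal sketch (the function name and the outcome probabilities below are illustrative, not from the article), the general formula amounts to a probability-weighted sum:

```python
def expected_utility(utilities, probabilities):
    """Expected utility U = sum over the k outcomes of u_i * P_i."""
    assert abs(sum(probabilities) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(u * p for u, p in zip(utilities, probabilities))

# Outcomes: accept/legit, accept/fraud, reject/fraud, reject/legit,
# with the utilities from the example matrix above.
u = [2, -100, 0, -2]
# Hypothetical outcome probabilities, for illustration only.
p = [0.90, 0.02, 0.03, 0.05]
U = expected_utility(u, p)  # 0.9*2 - 0.02*100 + 0 - 0.05*2 = -0.3
```

A negative U here signals that, under these (made-up) frequencies, the current decision policy loses money on average.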

Bayesian minimum risk method

The simplest approach we can take is the Bayesian minimum risk method. If we apply it to the previous example, we can split the utility function into two parts, U+ and U-: respectively, the utility of classifying a payment as a fraud and the utility of classifying it as genuine. Namely, according to the previous example:

U+ = 0*P + (-2)*(1 - P) = -2*(1 - P)
U- = (-100)*P + 2*(1 - P)

If we plot the utility functions U+ and U- in the (P, U) plane, we can see that at the point P = 4/104, U+ becomes greater than U-. That defines our rejection zone: if P > 4/104 ≈ 0.038, we reject the payment as a fraud; otherwise, we accept it as a genuine payment. This is very far from the 0.5 threshold that a standard classifier would use. The main takeaway is that our model provides P, the likelihood that a payment is a fraud, while the utility matrix gives us the threshold at which we start rejecting payments.
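The threshold can be computed directly from the utility matrix. A small sketch (the function name is mine), obtained by setting U+(P) = U-(P) and solving for P:

```python
def bmr_threshold(u_tp, u_fp, u_fn, u_tn):
    """Probability threshold where rejecting starts to beat accepting.

    U+ = u_tp*P + u_fp*(1-P)   (classify as fraud, i.e. reject)
    U- = u_fn*P + u_tn*(1-P)   (classify as genuine, i.e. accept)
    Setting U+ = U- and solving for P gives:
    P* = (u_tn - u_fp) / ((u_tn - u_fp) + (u_tp - u_fn))
    """
    return (u_tn - u_fp) / ((u_tn - u_fp) + (u_tp - u_fn))

# With the article's utility matrix: (2-(-2)) / (4 + 100) = 4/104
threshold = bmr_threshold(u_tp=0, u_fp=-2, u_fn=-100, u_tn=2)  # ≈ 0.0385
```

Plugging in the values of the example reproduces the 4/104 threshold derived above.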

Utility lines in the U, P plane

The Bayesian minimum risk method is easy to generalize to many possible actions. For example, if we want to add an extra option, such as a manual review of the payment, we do the following: for every action we can take, we compute the utility function, sort the resulting utilities in descending order, and pick the action associated with the highest utility.
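This generalization can be sketched as a simple argmax over expected utilities. The "review" action and its utilities below are hypothetical, added only to illustrate a third option:

```python
def best_action(p_fraud, action_utilities):
    """Pick the action with the highest expected utility.

    action_utilities maps each action name to (utility_if_fraud, utility_if_legit).
    """
    def eu(utilities):
        u_fraud, u_legit = utilities
        return u_fraud * p_fraud + u_legit * (1 - p_fraud)

    return max(action_utilities, key=lambda a: eu(action_utilities[a]))

# Utilities from the article's matrix, plus an assumed "review" action
# with a small fixed cost of -1 regardless of the true label.
actions = {"accept": (-100, 2), "reject": (0, -2), "review": (-1, -1)}
```

With these numbers, low-risk payments are accepted, high-risk ones rejected, and an intermediate band goes to manual review.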

However, the Bayesian minimum risk method has some drawbacks. The probability P we get from the model is actually a ranking that we are interpreting as a probability. Let’s analyze a simple case. We have a problem to solve: the fraud classification of an incoming payment. We build a dataset as best we can to map our problem onto the real world and we train the classifier. A lot of things can be inaccurate; for instance, censoring bias may limit our representation of the real world due to unlabeled rejected payments. In other words, our model gives us only an estimate of the true probability. Relying on this estimate as if it were exact can give us a suboptimal solution. Can we do any better?

The ROC curve and the utility

If we look at the utility function, we can rewrite it in the following way:

U = [u_tp*tpr*F + u_fn*(1 - tpr)*F + u_fp*fpr*L + u_tn*(1 - fpr)*L] / (F + L)

where tpr is the true positive rate, fpr is the false positive rate, F is the number of frauds in our dataset, and L is the number of legitimate payments. We can solve the previous equation for tpr as a function of fpr and we get:

tpr = [(u_tn - u_fp)*L*fpr + (F + L)*U - u_fn*F - u_tn*L] / [(u_tp - u_fn)*F]
This equation represents a straight line in the (fpr, tpr) plane, the same plane where we measure the ROC area. If we plot the previous equation setting U = 0, we get the calibration line. All the points of the ROC curve lying on this line yield zero utility; all points above it yield a positive utility, while all points below yield a negative one. Therefore, our optimal cut is where the ROC curve is at the maximal distance from the U = 0 calibration line. In the following image, the U = 0 calibration line is shown in red.

Calibration of a toy model with ROC and utility

The green line is where the distance between the ROC curve and the U = 0 calibration line is maximal, i.e. where U = U_max. This point gives us the optimal threshold for our classifier. This optimization technique provides a useful new insight about the ROC curve: both its area and its shape are important factors to take into account. We could have models with a suboptimal AUC ROC that perform better cost-sensitive classification thanks to a better shape of the ROC curve relative to the calibration line. This means that, when evaluating a model, a better option is to look at the U = U_max value rather than at the ROC area.
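A minimal sketch of this calibration (the function name and the toy scores are mine): sweep every score threshold, compute the resulting confusion counts, and keep the threshold that achieves U = U_max:

```python
def best_threshold(scores, labels, u_tp=0.0, u_fp=-2.0, u_fn=-100.0, u_tn=2.0):
    """Return (threshold, max utility) by sweeping the ROC operating points.

    labels: 1 = fraud, 0 = legitimate. A payment is rejected when score >= threshold.
    Default utilities are the article's example matrix.
    """
    F = sum(labels)            # number of frauds
    L = len(labels) - F        # number of legitimate payments
    best = (None, float("-inf"))
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn, tn = F - tp, L - fp
        u = (u_tp * tp + u_fp * fp + u_fn * fn + u_tn * tn) / (F + L)
        if u > best[1]:
            best = (t, u)
    return best

# Toy data: one fraud with a high score, three legitimate payments.
t_star, u_max = best_threshold([0.9, 0.8, 0.2, 0.1], [1, 0, 0, 0])
```

This O(n²) sweep is only for clarity; in practice one would compute the confusion counts incrementally along the sorted scores, which is what a ROC-curve routine does.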

One last takeaway: the ROC curve and the U = 0 calibration line avoid the limitation of assuming that P is the actual probability of an outcome, because the ROC curve is insensitive to the relative weights between positives and negatives. Yet we still rely on the fact that our utility matrix, together with the ratio of positives and negatives in our dataset, is representative of the real problem we are trying to solve. Evidently, this is a simpler problem than obtaining a labeled test set that is a faithful representation of the real world.

So how do we solve the lion dilemma with our method? Well, let’s think for a moment. A possible utility matrix could be

Utility Matrix of the lion problem

So the utility of predicting lion is:

and the utility of predicting wind is:

For this utility matrix, we can see that there is no value of P for which u- > u+. So this case is a no-brainer: just in case, run!
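With illustrative numbers (the values in the original matrix are not shown here, so these are assumptions): surviving a lion after grabbing your weapon is worth +1, a false alarm costs 0, mistaking a lion for the wind costs -1000, and correctly ignoring the wind is worth 0. Then u+ = P and u- = -1000*P, so u+ >= u- for every P:

```python
def u_plus(p):
    """Utility of predicting "lion" (grab your weapon): +1 if lion, 0 if wind."""
    return 1 * p + 0 * (1 - p)

def u_minus(p):
    """Utility of predicting "wind" (keep walking): -1000 if lion, 0 if wind."""
    return -1000 * p + 0 * (1 - p)

# u+ never drops below u-: the rejection zone covers the whole [0, 1] interval.
always_run = all(u_plus(p / 100) >= u_minus(p / 100) for p in range(101))  # True
```

As long as the false-alarm cost is not negative, the two utility lines never cross on (0, 1], which is why the decision requires no probability estimate at all.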

For the brave ones who made it this far, here are some enlightening articles to dig further:

Correa Bahnsen, A., Stojanovic, A., Aouada, D., & Ottersten, B. (2013). Cost-Sensitive Credit Card Fraud Detection Using Bayes Minimum Risk. Proceedings of the 12th International Conference on Machine Learning and Applications (ICMLA 2013), 333–338. doi:10.1109/ICMLA.2013.68. Retrieved from https://www.researchgate.net

Kruchten, N. (2016, January 27). Machine Learning Meets Economics. Retrieved from https://blog.mldb.ai



Emanuele is a physicist by training. He works as a TL data scientist at Mercado Libre. His passions are machine learning, financial markets, and sailing.