Understanding — Solving the “false positives” problem in fraud prediction.

Anil Kemisetti
It’s all about feature engineering
4 min read · Mar 19, 2018
Original paper: https://arxiv.org/abs/1710.07709

The key idea of the paper is to use a technique called “Deep Feature Synthesis” (DFS). Pg[1]

Deep Feature Synthesis:

So, what is “deep” in DFS?
Features are created using mathematical operations (“primitives”) that leverage the relationships between the tables of the dataset; these are the primitive features. Deep features are then generated by composing these primitive features, stacking them to greater depths. The technique was built by developers at MIT and open sourced as Featuretools: https://www.featuretools.com/
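To make this concrete, here is a minimal sketch of running DFS with Featuretools on a made-up two-table schema (cards and transactions). The table and column names are my own illustration, not the paper’s actual schema, and the API names below are from recent Featuretools releases, so they may differ from older versions.

```python
import featuretools as ft
import pandas as pd

# Hypothetical schema: a parent "cards" table and a child "transactions" table.
cards = pd.DataFrame({"card_id": [1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12],
    "card_id": [1, 1, 2],
    "amount": [25.0, 310.0, 12.5],
    "transaction_time": pd.to_datetime(
        ["2018-01-01 10:00", "2018-01-03 14:30", "2018-01-02 09:15"]),
})

# Build an EntitySet that captures the card -> transactions relationship.
es = ft.EntitySet(id="fraud")
es = es.add_dataframe(dataframe_name="cards", dataframe=cards, index="card_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id", time_index="transaction_time")
es = es.add_relationship("cards", "card_id", "transactions", "card_id")

# Deep Feature Synthesis: primitives (mean, sum, count, ...) are applied
# across the relationship and composed up to max_depth to form "deep"
# features, e.g. MEAN(transactions.amount) per card.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="cards",
    agg_primitives=["mean", "sum", "count", "std"],
    trans_primitives=["day", "weekday"],
    max_depth=2,
)
print(feature_defs)
```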

Here is how the paper describes the technique.

Pg[6,7]

The mathematical intuition behind DFS is explained in this earlier paper: http://www.jmaxkanter.com/static/papers/DSAA_DSM_2015.pdf

Next, the data and data preparation:

The rest of the information about the dataset can be obtained from the paper. Pg[4]

Learning the model:

The authors used a random forest to train the model, following the usual process for the algorithm. Pg[7]
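The paper does not tie us to a particular library, but a straightforward way to reproduce this step is scikit-learn’s RandomForestClassifier on the DFS feature matrix. The synthetic data, split, and hyperparameters below are only illustrative, not the authors’ settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the DFS feature matrix and fraud labels (1 = fraud).
# Fraud data is heavily imbalanced, so simulate roughly 1% positives.
X, y = make_classification(n_samples=20_000, n_features=30,
                           weights=[0.99, 0.01], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# A plain random forest, as in the paper; class weighting is one common
# way to handle the imbalance, not necessarily what the authors did.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)

# Keep probabilities rather than hard labels, so a decision threshold
# can be chosen later (see the financial evaluation below).
scores = clf.predict_proba(X_test)[:, 1]
```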

Evaluating the model:

The authors were interested in evaluating performance with respect to false positives. They used precision and F1 scores and presented the results. Pg[9]
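For illustration, continuing the sketch above (with `scores` and `y_test` from the hypothetical random-forest step), the same metrics can be computed with scikit-learn:

```python
from sklearn.metrics import precision_score, f1_score

# Convert scores to hard predictions at a chosen threshold
# (0.5 here purely for illustration).
threshold = 0.5
y_pred = (scores >= threshold).astype(int)

# Precision directly penalises false positives: precision = TP / (TP + FP);
# F1 balances it against recall.
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("F1:", f1_score(y_test, y_pred, zero_division=0))
```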

For the rest of the results, see the original paper.

Evaluating the financial loss due to false positives:

In addition to the standard evaluation, the authors performed a financial evaluation and used it to compare the model. For the calculation, they treated 50% of all false positives as failed (lost) transactions and estimated the resulting cost. These results were then used to decide on the prediction threshold. Pg[10]
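Here is a minimal sketch of that idea, continuing the earlier snippets and using the 50% assumption. The way the loss is tallied here (missed fraud costing the full transaction amount, half of the blocked legitimate transactions counted as lost revenue) is my own simplification, not the paper’s exact formula.

```python
import numpy as np

def financial_loss(y_true, scores, amounts, threshold, fp_loss_rate=0.5):
    """Estimate loss at a given threshold: fraud we let through costs the
    full amount; a blocked legitimate transaction is assumed lost 50% of
    the time (the fp_loss_rate assumption)."""
    y_pred = scores >= threshold
    false_neg = (~y_pred) & (y_true == 1)   # fraud we let through
    false_pos = y_pred & (y_true == 0)      # legitimate txns we blocked
    return amounts[false_neg].sum() + fp_loss_rate * amounts[false_pos].sum()

# Toy transaction amounts; sweep thresholds and pick the cheapest one.
amounts = np.random.default_rng(0).gamma(2.0, 50.0, size=len(y_test))
thresholds = np.linspace(0.05, 0.95, 19)
losses = [financial_loss(y_test, scores, amounts, t) for t in thresholds]
best = thresholds[int(np.argmin(losses))]
print("threshold with lowest estimated loss:", best)
```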

I did not find a proper justification for the 50% figure in the paper, but I like the idea of arriving at a financial cost for false-positive predictions: it makes the choice of threshold more interpretable. Here are the results.

The model’s inference performance for real-time predictions:

One more approximation the authors make is that “the loss of precision using old data for derived aggregate features would be in the acceptable range.”

So, during inference, the real-time transactional features are merged with older aggregate features that are calculated by a batch process. This is key to the authors’ claim that the model can be used for real-time inference. Here is the outline of this approach from the paper. Pg[8]
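As a rough sketch of how serving could look, assume the card-level aggregates are precomputed by a nightly batch DFS job and keyed by card_id. Everything below (names, columns, the scoring helper) is hypothetical, not from the paper.

```python
import pandas as pd

# Precomputed by a nightly batch DFS job, keyed by card_id (hypothetical).
card_aggregates = pd.DataFrame({
    "card_id": [1, 2],
    "MEAN(transactions.amount)": [167.5, 12.5],
    "COUNT(transactions)": [2, 1],
}).set_index("card_id")

def score_transaction(txn, model, feature_columns):
    """Merge cheap real-time transaction features with the stale,
    batch-computed card aggregates, then score with the trained model."""
    realtime = {"amount": txn["amount"], "hour": txn["time"].hour}
    stale = card_aggregates.loc[txn["card_id"]].to_dict()
    row = pd.DataFrame([{**realtime, **stale}])
    return model.predict_proba(row[feature_columns])[0, 1]
```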

Here is the justification:

The authors claim that the loss of precision when 35-day-old aggregate data was used was only 0.005. They used an option called “approximate” in Featuretools to simulate aggregates that are 1, 7, 21, and 35 days old. Pg[12]
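The approximate option is a real Featuretools parameter: when computing a feature matrix at given cutoff times, it groups nearby cutoff times into buckets so that the expensive aggregate features are computed once per bucket and reused. A sketch, reusing the hypothetical `es` and `feature_defs` from the earlier snippet:

```python
import pandas as pd
import featuretools as ft

# One cutoff time per prediction instance: features are computed using only
# data available before that moment.
cutoff_times = pd.DataFrame({
    "card_id": [1, 2],
    "time": pd.to_datetime(["2018-02-10 12:00", "2018-02-11 08:30"]),
})

# approximate="35 days" groups cutoff times into 35-day buckets, so the
# costly aggregate features are computed once per bucket and reused,
# i.e. they can be up to 35 days stale -- the scenario the authors test.
fm_35d = ft.calculate_feature_matrix(
    features=feature_defs,
    entityset=es,
    cutoff_time=cutoff_times,
    approximate="35 days",
)
```

Larger buckets mean cheaper computation but staler aggregates, which is exactly the precision trade-off the authors measured at 1, 7, 21, and 35 days.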

The authors assume that transactions reach a level of stationary transactional behavior. This may be true for customers with a good amount of history, but I doubt the technique would be as successful with new customers.

Below is a picture from the paper that depicts where this approach would make a difference. Pg[5]

Finally, the paper briefly mentions selecting important features based on the number of training examples a feature separates. It would have been good to give more information on the quality of the generated synthetic features and details of any additional dimensionality-reduction techniques that were used.
