Confidence Splitting Criterions Can Improve Precision And Recall in Random Forest Classifiers
By Alok Gupta
The Trust and Safety Team maintains a number of models for predicting and detecting fraudulent online and offline behaviour. A common challenge we face is attaining high confidence in the identification of fraudulent actions. Both in terms of classifying a fraudulent action as a fraudulent action (recall) and not classifying a good action as a fraudulent action (precision).
A classification model we often use is a Random Forest Classifier (RFC). However, by adjusting the logic of this algorithm slightly, so that we look for high confidence regions of classification, we can significantly improve the recall and precision of the classifier’s predictions. To do this we introduce a new splitting criterion (explained below) and show experimentally that it can enable more accurate fraud detection.
Traditional Node Splitting Criterions
A RFC is a collection of randomly grown ‘Decision Trees’. A decision tree is a method for partitioning a multi-dimensional space into regions of similar behaviour. In the context of fraud detection, identifying events as ‘0’ for non-fraud and ‘1’ for fraud, a decision tree is binary and tries to find regions in the signal space that are mainly 0s or mainly 1s. Then, when we see a new event, we can look at which region it belongs to and decide if it is a 0s region or a 1s region.
Typically, a Decision Tree is grown by starting with the whole space, and iteratively dividing it into smaller and smaller regions until a region only contains 0s or only contains 1s. Each final uniform region is called a ‘leaf’. The method by which a parent region is partitioned into two child regions is often referred to as the ‘Splitting Criterion’. Each candidate partition is evaluated and the partition which optimises the splitting criterion is used to divide the region. The parent region that gets divided is called a ‘node’.
Confidence Splitting Criterion
After these tweaks to the algorithm we find an insignificant change to the runtime of the Scikit-Learn routines. The Python code with the new criterion looks something like this:
from sklearn.ensemble import RandomForestClassifier
# using [C_0,C_1] = [0.95,0.95]
rfc = RandomForestClassifier(n_estimators=1000,criterion='conf',conf=[0.95,0.95])
pred = rfc.predict_proba(x_test)
For more details on the Machine Learning model building process at Airbnb you can read previous posts such as Designing Machine Learning Models: A Tale of Precision and Recall and How Airbnb uses machine learning to detect host preferences. And for details on our architecture for detecting risk you can read more at Architecting a Machine Learning System for Risk.
To test the improvements the Confidence splitting criterion can provide, we use the same dataset we used in the previous post Overcoming Missing Values In A Random Forest Classifier, namely the adult dataset from the UCI Machine Learning Repository. As before the goal is predict whether the income level of the adult is greater than or less than $50k per annum using the 14 features provided.
We tried 6 different combinations of [C0,C1] against the baseline RFC with Gini Impurity and looked at the changes in the Precision-Recall curves. As always we holdout a training set and evaluate on the unused test set. We build a RFC of 1000 trees in each of the 7 scenarios.
Observe that C0=0.5 (yellow and blue lines) offers very little improvement over the baseline RFC, modest absolute recall improvements of 5% at the 95% precision level. However, for C0=0.9(green and purple lines) we see a steady increase in recall from at precision levels of 45% and upwards. At 80% precision and above, C0=0.9 improves recall by an absolute amount of 10%, riing to 13% at 95% precision level. There is little variation between C1=0.9(green line) and C1=0.99 (purple line) for C0=0.9 although [C0,C1]=[0.9,0.9] (green line) does seem to be superior. For C0=0.9 (pale blue and pink lines), the improvement is not so impressive or consistent.
It would be useful to extend the analysis to compare the new splitting criterion against optimising existing hyper-parameters. In the Scikit-Learn implementation of RFCs we could experiment with min_samples_split or min_samples_leaf to overcome the scaling problem. We could also test different values of class_weight to capture the asymmetry introduced by non-equal C0 and C1.
More work can be done on the implementation of this methodology and there is still some outstanding analytical investigation on how the confidence thresholds Cj tie to the improvements in recall or precision. Note however that the methodology does already generalise to non binary classifiers, i.e. where j=0,1,2,3,…. It could be useful to implement this new criterion into the Apache Spark RandomForest library also.
For the dataset examined, the new splitting criterion seems to be able to better identify regions of higher density of 0s or 1s. Moreover, by taking into account the size of the partition and the probability of such a distribution of observations under the null hypothesis, we can better detect 1s. In the context of Trust and Safety, this translates into being able to more accurately detect fraudulent actions.
The business implications of moving the Receiver Operating Characteristic outwards (equivalently moving the Precision-Recall curve outwards) have been discussed in a previous post. As described in the ‘Efficiency Implications’ section of Overcoming Missing Values In A Random Forest Classifier post, even decimal percentage point savings in recall or precision can lead to enormous dollar savings in fraud mitigation and efficiency respectively.
Check out all of our open source projects over at airbnb.io and follow us on Twitter: @AirbnbEng + @AirbnbData
Originally published at nerds.airbnb.com on October 20, 2015.