Overcoming Missing Values In A Random Forest Classifier

AirbnbEng
The Airbnb Tech Blog
Apr 7, 2015

By Alok Gupta

No Strangers

Airbnb is trying to build a world where people can belong anywhere and there are no strangers. This helps hosts feel comfortable opening their homes and guests feel confident traveling around the globe to stay with people they have never met before.

While almost all members of the Airbnb community interact in good faith, there is an ever shrinking group of bad actors that seek to take advantage of the platform for profit. This problem is not unique to Airbnb: social networks battle attempts to spam or phish users for their details; e-commerce sites try to prevent the use of stolen credit cards. The Trust and Safety team at Airbnb works tirelessly to remove bad actors from the Airbnb community and to help make the platform a safer and more trustworthy place to experience belonging.

Missing Values In A Random Forest

We can train machine learning models to identify new bad actors (for more details see the previous blog post Architecting a Machine Learning System for Risk). One particular family of models we use is Random Forest Classifiers (RFCs). An RFC is a collection of trees, each independently grown using labeled and complete input training data. By complete we explicitly mean that there are no missing values, i.e. no NULL or NaN values. In practice, however, the data often has (many) missing values. In particular, very predictive features do not always have values available, so they must be imputed before a random forest can be trained.

Typically, random forest methods/packages encourage two ways of handling missing values: a) drop data points with missing values (not recommended); b) fill in missing values with the median (for numerical values) or mode (for categorical values). While a) does not use all the available information because it drops data points, b) can paint with too broad a brush for data sets with many gaps and significant structure.
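As a concrete illustration of option b), here is a minimal pandas sketch; the column names are hypothetical and not from the post:

import pandas as pd

# Hypothetical feature matrix with gaps in both feature types.
df = pd.DataFrame({
    "account_age_days": [10.0, None, 250.0, 31.0],       # numerical
    "payment_method":   ["card", "card", None, "wire"],  # categorical
})

# Option b): fill numerical gaps with the median, categorical gaps with the mode.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])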

There are alternative techniques for dealing with missing values, but most of these are computationally expensive e.g. repeated iteration of random forest training to compute proximities. What we propose in this post is a one-step pre-computation method which normalises features to construct a distance metric for filling in missing values with the median of their k-nearest neighbors.

All of the features in our fraud prediction models fall into two types: a) numerical and b) categorical. Boolean features can be thought of as a special case of categorical features. Since we work in the business of fraud detection, our labels are binary: 0 if the data point is not fraud and 1 if the data point is fraud. Below are some feature transformations we wish to compare for missing value treatment.

Transformations

Interpretation

The aim of the scaling transforms Fn and Fc is twofold. First, to make the transformation invertible so that no information is lost. Second, to uniformly distribute the data points with fraud in the interval [0,1] for each feature. If we think of data as points in N-dimensional space, where N is the number of features, then the distance in each dimension between two data points becomes comparable. By comparable we mean that a distance of 0.4 in the first dimension contains twice as many fraud data points as a distance of 0.2 in the second dimension. This enables better construction of distance metrics to identify fraud.
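The post's exact formulas for Fn and Fc are not reproduced here. As a rough, hypothetical sketch of transforms with the two properties described above (invertible, fraud spread uniformly over [0,1]), one could use the empirical CDF of each feature computed over the fraud-labeled points; the helper names are illustrative:

import numpy as np
import pandas as pd

def fit_numerical_transform(x, y):
    """Empirical CDF of a numerical feature over fraud-labeled points (y == 1).

    Maps raw values into [0, 1] so that fraud points are spread uniformly;
    being monotone, it is invertible on the observed support.
    """
    fraud_vals = np.sort(x[y == 1])
    def transform(v):
        # Fraction of fraud values <= v.
        return np.searchsorted(fraud_vals, v, side="right") / len(fraud_vals)
    return transform

def fit_categorical_transform(x, y):
    """Map each category to a cumulative fraud share in (0, 1].

    Categories are ordered (here by fraud rate) and each is assigned the
    cumulative fraction of fraud points up to and including it, which keeps
    the mapping one-to-one. Assumes at least one fraud point is present.
    """
    stats = pd.DataFrame({"x": x, "y": y}).groupby("x")["y"].agg(["mean", "sum"])
    stats = stats.sort_values("mean")
    cum_share = stats["sum"].cumsum() / stats["sum"].sum()
    return cum_share.to_dict()  # category -> value in (0, 1]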

Imputation Using K-Nearest Neighbors
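The post's original formula for this step is not reproduced here. Below is a minimal sketch of the idea described earlier: with features already mapped into [0,1] by the transforms above, distances between rows become meaningful, and each missing entry is filled with the median of that feature among its k nearest neighbours. The helper is hypothetical:

import numpy as np

def knn_impute(X, k=100):
    """Fill each NaN with the median of that feature among the k nearest rows.

    X is assumed to already live in the transformed [0, 1] feature space, so a
    plain Euclidean distance over the dimensions observed in row i is used.
    """
    X = X.copy()
    missing_rows = np.where(np.isnan(X).any(axis=1))[0]
    for i in missing_rows:
        obs = ~np.isnan(X[i])                       # features observed for row i
        diffs = X[:, obs] - X[i, obs]               # compare on those dimensions only
        dists = np.sqrt(np.nansum(diffs ** 2, axis=1))
        dists[i] = np.inf                           # exclude the row itself
        neighbours = np.argsort(dists)[:k]
        for j in np.where(~obs)[0]:                 # each missing feature of row i
            vals = X[neighbours, j]
            vals = vals[~np.isnan(vals)]
            if len(vals):
                X[i, j] = np.median(vals)
    return X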

Experiment

In order to see the effect of the above feature transforms, we use the adult dataset from the UCI Machine Learning Repository and assess the performance of the model under different feature transformations and proportions of missing values. The dataset contains 32,561 rows and 14 features, of which 8 are categorical and the remaining 6 are numerical. The boolean labels correspond to whether the income level of the adult is greater than or less than $50k per annum. We divide the dataset into a training and test set in the ratio of 4 to 1 respectively.
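A minimal sketch of this setup; the download URL, column names, and split seed are assumptions for illustration, not taken from the post:

import pandas as pd
from sklearn.model_selection import train_test_split

# UCI "adult" dataset; URL and column names are assumed for illustration.
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
cols = ["age", "workclass", "fnlwgt", "education", "education_num",
        "marital_status", "occupation", "relationship", "race", "sex",
        "capital_gain", "capital_loss", "hours_per_week", "native_country",
        "income"]
df = pd.read_csv(URL, names=cols, na_values="?", skipinitialspace=True)

X = df.drop(columns="income")
y = (df["income"] == ">50K").astype(int)   # 1 if income > $50k, else 0

# 4:1 train/test split, as in the experiment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)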

For the first experiment we compare different models using the methodology:

Observe that M and λ are unspecified parameters; we will loop over different values of these during experimentation. The performance of each model will be judged using Area Under Curve (AUC) scores, which measure the area under the Receiver Operating Characteristic (ROC) graph (a plot of the true positive rate vs the false positive rate). We will test the following nine models:
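As a quick illustration of the scoring metric, a hedged scikit-learn sketch; the model settings and variable names are illustrative, and the feature matrices are assumed to already be imputed and numerically encoded:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical: X_train/X_test are already imputed and encoded numerically.
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

scores = model.predict_proba(X_test)[:, 1]         # probability of the positive class
auc = roc_auc_score(y_test, scores)                # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, scores)   # points of the ROC graph
print(f"AUC = {auc:.3f}")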

Results

Model Comparison

First we consider how the models perform for different values of M and λ with a fixed number of trees (100) in the RFC training process and a fixed number of nearest neighbours (100) for NNMV imputation.

[Figure: AUC scores for the nine models across different values of M and λ]

The graphs above display interesting patterns, some intuitive and some surprising:

Robustness Checks

Having observed the outperformance of model 8) over the other candidates, we next check how model 8) compares to the baseline as we vary i) the number of trees in the RFC training and ii) the number of nearest neighbours in the NNMV imputation. We take the fourth scenario above, where 60% of values are missing, and we choose λ=0.5.
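A sketch of what such a sweep could look like, reusing the hypothetical knn_impute helper above; the grids of values and the transformed arrays are illustrative:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Sweep the two knobs independently: k for NNMV imputation, trees for the RFC.
for k in [5, 25, 100, 250]:
    X_train_imp = knn_impute(X_train_transformed, k=k)   # hypothetical arrays
    X_test_imp = knn_impute(X_test_transformed, k=k)
    for n_trees in [10, 50, 100, 200]:
        rfc = RandomForestClassifier(n_estimators=n_trees).fit(X_train_imp, y_train)
        auc = roc_auc_score(y_test, rfc.predict_proba(X_test_imp)[:, 1])
        print(f"k={k:>3}  trees={n_trees:>3}  AUC={auc:.3f}")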

[Figure: left, AUC of model 8) and the baseline versus number of nearest neighbours and number of trees; right, RFC training time for the baseline and transformed feature sets]

The left-hand plot does not suggest that the performance of model 8) improves with the number of nearest neighbours used in the NNMV imputation. However, there is a consistent pattern of improved performance as the number of trees increases, plateauing after about 100 trees. The right-hand plot shows how much faster it is to train an RFC with the transformed feature set. This is to be expected: instead of exploding categorical features into many binary features as in the baseline model, we keep the number of features fixed in model 8).
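The feature-count difference behind that speed-up can be seen directly. A small illustrative sketch, assuming the baseline one-hot encodes the categorical columns while model 8) keeps one transformed column per original feature (X_train is the hypothetical frame from the earlier sketch):

import pandas as pd

categorical = X_train.select_dtypes(include="object").columns

# Baseline: one-hot encoding turns each categorical into many binary columns.
baseline_features = pd.get_dummies(X_train, columns=list(categorical))
print(baseline_features.shape[1])   # roughly 100 columns for the adult dataset

# Model 8): one transformed column per original feature.
print(X_train.shape[1])             # stays at 14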

Efficiency Implications

Consider the ROC curves for one of the scenarios above, say, where the number of trees is 100 and the number of nearest neighbors used is 100.

[Figure: ROC curves for the baseline model and model 8), with 100 trees and 100 nearest neighbours]

The improvement of the ROC curve suggests, for example, that holding recall fixed at 80%, the false positive rate falls from 26% to 24%. Suppose each day we are scoring 1 million events, 99% of which are non-fraud, each flagged event needs to be manually reviewed by a human, and each review takes 10 seconds. Then the aforementioned decrease in the false positive rate saves reviewing 1,000,000 x 0.99 x 0.02 = 19,800 events per day; at 6 reviews per minute, that is 19,800 / (6 x 60) = 55 hours of reviewing per day! This is why even improvements of a single percentage point, or a fraction of one, in the AUC score of an RFC can have a dramatic effect on a department's efficiency.
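The same back-of-the-envelope arithmetic as a tiny script:

events_per_day = 1_000_000
non_fraud_rate = 0.99
fpr_drop = 0.26 - 0.24          # false positive rate reduction at 80% recall
seconds_per_review = 10

saved_reviews = events_per_day * non_fraud_rate * fpr_drop       # 19,800 events
saved_hours = saved_reviews * seconds_per_review / 3600          # 55 hours
print(saved_reviews, saved_hours)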

Check out all of our open source projects over at airbnb.io and follow us on Twitter: @AirbnbEng + @AirbnbData

Originally published at nerds.airbnb.com on April 7, 2015.
