New Kaggle Competition and Imputing NA’s
Last week, I formed a team with a fellow Kaggle-r to tackle the BNP Paribas Cardif Claims Management competition. Having worked in the insurance industry for almost 10 years, I thought that I might have a slight edge.
That thought ended as soon as I downloaded the data. Unlike data in the real world, the data that we are given for the BNP competition is thoroughly anonymized. In other words, any value from industry expertise has been completely and intentionally wiped out. Doh!
The training data consists of 131 obfuscated features labeled v1 through v131. Most of the features are numeric and, I suppose, could be anything from a claimant’s age to the number of claimants involved to the initial reserve amount. For the handful of categorical features, we are only told that they are not ordered variables and are labeled as “A”, “B”, “C”, …
However, similar to data in the real world, the data for BNP has A LOT of missing values and blanks.
Some of the early scripts shared on Kaggle use -1 to impute NA’s for all of the available features, turn categorical values into numbers, and then dump the mostly unprocessed data into advanced machine learning methods. This could be a good initial attempt to use as a benchmark. Going deeper into the data exploration piece, there are probably better ways of imputing NA’s.
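The benchmark approach described above can be sketched roughly as follows. This is my own minimal reconstruction with pandas, not a copy of any particular shared script; the function name `baseline_prepare` and the toy columns are made up for illustration.

```python
import numpy as np
import pandas as pd


def baseline_prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Integer-encode categorical columns and replace every NA with -1."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # pd.factorize assigns codes 0, 1, 2, ... in order of appearance
            # and marks missing values with -1.
            codes, _ = pd.factorize(out[col])
            out[col] = codes
    # Remaining NA's are in the numeric columns.
    return out.fillna(-1)


# Toy stand-in for the competition data (hypothetical values).
toy = pd.DataFrame({"v1": [1.5, np.nan, 3.0], "v3": ["A", None, "C"]})
baseline = baseline_prepare(toy)
print(baseline)
```

The output of something like this would go straight into the model, which is what makes it a convenient benchmark.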
For numerical features, -1 is an arbitrary value, and we probably don’t want its size relative to the real values to change the model fit. For categorical features, turning them into numbers assumes an ordinal relationship that, according to the competition’s data description, does not exist. More generally speaking, there is no one best way to impute missing values. The imputing method should probably be different depending on the specific feature being considered.
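One common way to avoid implying an order among categories is one-hot encoding. That is a technique I am swapping in here for illustration, not something taken from the shared scripts. A small sketch with `pd.get_dummies`, using a made-up column, where `dummy_na=True` gives missing values their own indicator column:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the competition data (hypothetical values).
toy = pd.DataFrame({"v1": [1.5, np.nan, 3.0], "v3": ["A", None, "C"]})

# Each category of v3 becomes its own 0/1 column, so no order is implied.
# dummy_na=True adds an extra indicator column for missing values.
encoded = pd.get_dummies(toy, columns=["v3"], dummy_na=True)
print(encoded)
```

Each row ends up with exactly one of the v3 indicator columns switched on, including the rows where v3 was missing.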
As an example, it might make sense to assume the median for a missing value that represents age. At the same time, it might also make sense to assume maximum values for empty features that represent initial reserve amounts if claims that have NA’s here tend to be extremely large and complicated. Because we have no idea what any of the features are, it might be worth trying out a few different imputing methods on each feature.
For the numerical features, the script can be fairly simple. The typical methods of imputing use the feature’s minimum value, maximum value, median (or mean), or 0.
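A minimal sketch of such a script, assuming pandas. The helper name `impute_numeric` and the strategy labels are my own, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def impute_numeric(s: pd.Series, strategy: str = "median") -> pd.Series:
    """Fill NA's in a numeric column with a single summary value."""
    # pandas skips NA's by default when computing these statistics.
    fillers = {
        "min": s.min,
        "max": s.max,
        "median": s.median,
        "mean": s.mean,
        "zero": lambda: 0.0,
    }
    if strategy not in fillers:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return s.fillna(fillers[strategy]())


# Hypothetical feature with one missing value.
feature = pd.Series([1.0, np.nan, 3.0, 8.0])
for name in ["min", "max", "median", "mean", "zero"]:
    print(name, impute_numeric(feature, name).tolist())
```

Since the right choice probably differs by feature, a helper like this could be called with a different strategy for each column. The fill value would presumably be computed on the training set and reused on the test set.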
For categorical features, the concept is more complicated. Missing values can be replaced by the most frequent category (mode), a random category drawn in proportion to the frequencies in the non-missing training sample, a random category drawn from a different distribution, or a completely separate category (“-1”).
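The categorical options can be sketched in the same way. Again, this is only an illustration under my own assumptions: the helper name `impute_categorical`, the strategy labels, and the choice of a uniform draw as the "different distribution" are all hypothetical.

```python
import numpy as np
import pandas as pd


def impute_categorical(s: pd.Series, strategy: str = "mode", seed: int = 0,
                       missing_label: str = "-1") -> pd.Series:
    """Fill NA's in a categorical column using one of a few simple strategies."""
    missing = s.isna()
    n_missing = int(missing.sum())
    # Observed category frequencies (NA's are excluded by value_counts).
    freqs = s.value_counts(normalize=True)
    categories = freqs.index.to_numpy()
    rng = np.random.default_rng(seed)
    out = s.copy()

    if strategy == "mode":
        # mode() returns ties in sorted order; take the first one.
        out[missing] = s.mode().iloc[0]
    elif strategy == "sample":
        # Draw in proportion to the observed frequencies.
        out[missing] = rng.choice(categories, size=n_missing, p=freqs.to_numpy())
    elif strategy == "uniform":
        # One example of a different distribution: every category equally likely.
        out[missing] = rng.choice(categories, size=n_missing)
    elif strategy == "new_category":
        # Treat "missing" as its own level.
        out[missing] = missing_label
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return out


# Hypothetical feature with two missing values.
feature = pd.Series(["A", "B", "A", None, "C", None])
for name in ["mode", "sample", "uniform", "new_category"]:
    print(name, impute_categorical(feature, name).tolist())
```

The random strategies take a seed so that a run can be reproduced when comparing imputing methods.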
In this competition, with so many missing values and a very small spread in leaderboard scores, the imputing method used might make a difference.