A financial dataset is challenging in many ways. The data is usually anonymized to protect customers' privacy. Sometimes even the column names of the tabular data are encoded, which prevents feature engineering based on domain knowledge. As required by financial regulations and laws, the models must often be interpretable, like logistic regression or tree classifiers, so that the decision process can be monitored and reviewed. And there is always noise hiding deep in the data.
The Santander Customer Transaction Prediction challenge is a classic example of financial data modeling. In the most popular competition in Kaggle's history, 8,802 teams competed to build better models identifying which customers will make a specific transaction in the future. The RAPIDS.ai team placed 17th in the contest. In this blog, we will demonstrate how to use RAPIDS data science tools to uncover hidden patterns, extract meaningful features, and construct models that are useful for both the competition and real-world applications.
The data is fully anonymized, containing 200 numeric feature variables, from `var_0` to `var_199`, a binary target column, and an ID column `ID_code`. Submissions are evaluated on area under the ROC curve (AUC); higher is better, with 1 being the perfect score. Despite being anonymous, the data has the same structure as the real data Santander has available to solve the customer transaction prediction problem.
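For readers less familiar with the metric, ROC AUC equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. A tiny NumPy sketch of this rank-based identity (illustrative, not part of the original solution):

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC AUC via the rank identity: P(score of a positive > score of a negative)."""
    scores = np.asarray(scores)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)  # 1-based ranks of each score
    pos = np.asarray(y_true) == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # Mann-Whitney U statistic, normalized to [0, 1]
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```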
We use RAPIDS cuDF to read the CSV file of 200,000 rows and 202 columns in under 0.5 seconds, about 10x faster than the pandas counterpart. To understand the data, we extract some key statistical metrics of the dataframe. For example, we compute the pairwise correlation of columns to study feature interactions. We make the following observations:
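Because cuDF mirrors the pandas API, these checks can be sketched with pandas on a small synthetic stand-in for the competition file (column names illustrative; swap in `import cudf as pd` to run on the GPU):

```python
import numpy as np
import pandas as pd  # cuDF mirrors this API: `import cudf as pd` runs the same calls on the GPU

# Small synthetic stand-in for the 200,000 x 202 competition file (illustrative only)
rng = np.random.default_rng(42)
df = pd.DataFrame({f"var_{i}": rng.normal(size=1000) for i in range(5)})
df["target"] = rng.integers(0, 2, size=1000)

missing = int(df.isnull().sum().sum())            # total missing values: expect 0
corr = df[[f"var_{i}" for i in range(5)]].corr()  # off-diagonal entries near zero
```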
- There are no missing values in the dataframe.
- The correlation between any two `var` columns is very small. Further study shows that the columns are actually independent.
- The correlation between the `var` columns and the `target` is around 0.006.
- The `var` columns follow gaussian or bimodal distributions.
For example, the kernel density estimation (KDE) plot below shows the distribution of column `var_0` for both positive and negative samples. The distribution is largely gaussian, but a second small bump is also notable, which makes it bimodal. It is also visually evident that the probability of `target=1` is low when `6 < var_0 < 13`, high when `14 < var_0 < 20`, and hard to tell in other regions. This suggests that a tree classifier, which can find splits like these, is promising.
We can also measure the significance of this feature by computing the Kolmogorov-Smirnov (KS) statistic. Although the two curves have a similar trend, the p-value is 0.000, so we can reject the null hypothesis that the positive and negative samples are drawn from the same continuous distribution. In other words, the feature already has strong predictive power regarding the target.
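The two-sample KS test is available in SciPy. A sketch with hypothetical stand-ins for `var_0` split by target (the real data is anonymized, so the distributions below are invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for var_0 split by target (real values are anonymized)
rng = np.random.default_rng(0)
pos = rng.normal(loc=11.0, scale=3.0, size=2000)  # samples with target == 1
neg = rng.normal(loc=10.0, scale=3.0, size=2000)  # samples with target == 0

ks_stat, p_value = stats.ks_2samp(pos, neg)
# A tiny p-value rejects the null hypothesis that both samples come from
# the same continuous distribution, i.e. the feature separates the classes.
```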
We feed the data as is to an XGBoost model and train it with NVIDIA GPUs. Because the columns are independent, we intentionally set `max_depth=1` so that each tree considers one column at a time and avoids learning spurious feature interactions.
The validation AUC is 0.911 (leaderboard score 0.90), which is a high score for a simple model with no feature engineering. An interesting fact about the competition is that nearly every team reached this score within a day and then was stuck there for two months, until a few teams (luckily including us) broke into 0.93+ AUC. Let's dig deeper and see how RAPIDS can make a breakthrough.
Given that the columns are anonymized and independent, it is next to impossible to engineer new features based on domain knowledge or to explore feature interactions between different columns. Consequently, we focused on extracting more information from each single column. In the following discussion, we again use the column `var_0` as an example for analysis and then apply the transformation to all other columns.
One of the most common transformations of a single column is count encoding. We used cuDF to group the dataframe by the column `var_0` and count the size of each group. The count for each distinct value of `var_0` is then merged back into the dataframe.
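The group-count-merge step can be sketched as follows (pandas shown; cuDF mirrors this API, so `import cudf as pd` runs the same code on the GPU; values illustrative):

```python
import pandas as pd  # cuDF mirrors this API: `import cudf as pd` for the GPU version

df = pd.DataFrame({"var_0": [8.1, 9.2, 8.1, 10.3, 8.1, 9.2]})

# Group by the value of var_0, count each group's size, and merge the count back
counts = df.groupby("var_0").size().rename("var_0_count").reset_index()
df = df.merge(counts, on="var_0", how="left")
# df["var_0_count"] is now 3 for 8.1, 2 for 9.2, and 1 for 10.3
```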
To verify the predictive power of this count encoding, we further calculate the mean `target` value of each group of rows that share the same count value. As shown in the following figure, there is a notable trend: groups with larger count values have a lower mean `target` rate.
The count encoding of `var_0` contains information that is orthogonal to `var_0` itself and hence should improve our model when the two are used together. We applied count encoding to all columns and ran the XGBoost model with both the original columns and the new count-encoding columns. The end-to-end running time is less than five minutes on a single GPU, and the validation AUC improves from the 0.911 baseline to 0.918.
Can We Make Further Progress?
Yes, but to achieve that, we need to look deeper into the count groups. In the following figures, we plot the KDE of `var_2` for different count groups. An interesting pattern is that for all three `var` columns shown, the `count==1` group is significantly different from the other groups and from the variable as a whole. In fact, this pattern can be found in most of the `var` columns. There are several hypotheses to explain it: for example, the `count==1` group may be noisier in nature, or it may be an artifact of data imputation.
Another important observation is that the original ups and downs are more pronounced if we simply remove the `count==1` group, as shown in the rightmost column of the plots (the `count>1` group). This pattern is even more obvious when we plot the conditional likelihood, which can be found in my Kaggle kernel. It suggests that we can add new features by replacing the values in the `count==1` group with `None` and letting XGBoost learn how to impute these values optimally from the data. Consequently, we came up with the following implementation.
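The idea can be sketched in a few lines (pandas shown, cuDF mirrors the API; the notebook's actual implementation differs in details, and the values here are illustrative):

```python
import numpy as np
import pandas as pd  # cuDF mirrors this API: `import cudf as pd` for the GPU version

df = pd.DataFrame({"var_0": [8.1, 9.2, 8.1, 10.3]})
# Per-row count of how often each value occurs in the column
cnt = df["var_0"].map(df["var_0"].value_counts())

# Keep a value only where it repeats; unique (count==1) values become NaN,
# which XGBoost treats as missing and learns to route optimally at each split.
df["var_0_not_unique"] = df["var_0"].where(cnt > 1, np.nan)
# rows with 9.2 and 10.3 (each appearing once) become NaN
```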
The validation AUC improves to 0.934 within 11 minutes, which would place in the top 1% of the competition. The full notebook can be found here. This solution constitutes our best single model. In our full solution, we built an ensemble with a customized neural network model and utilized data augmentation. Our final score is only 0.002 less than that of the first-place winner.
cuDF doesn't yet support parallel processing of independent columns of a dataframe. As shown in the notebook, we apply the same transformation to each column in an inefficient `for` loop. Even if we launch these functions in parallel using `multiprocessing`, they are still serialized on the GPU side, since the cuDF calls are queued in the same null CUDA stream. In the future, we will expose the CUDA stream ID in the cuDF API so that these calls can go to different CUDA streams and run fully in parallel.
Financial data modeling is special: it is easy to get a good baseline with a simple model, but difficult to make significant improvements under all these constraints. In this article, we used RAPIDS data science packages to build new features for a fully anonymized dataset based solely on its statistical patterns, and achieved a state-of-the-art result.