Can you identify who will make a transaction?

Financial data modeling with RAPIDS.

Jiwei Liu
Published in RAPIDS AI · Jul 3, 2019


A financial dataset is challenging in many ways. The data is usually anonymized to protect customers’ privacy. Sometimes even the column names of the tabular data are encoded, which prevents feature engineering based on domain knowledge. Financial regulations and laws often require that models be interpretable, such as logistic regression or tree classifiers, so that the decision process can be monitored and reviewed. And there is always noise hiding deep in the data.

The Santander Customer Transaction Prediction challenge is a classic example of financial data modeling. In the most popular competition in Kaggle’s history, 8,802 teams competed to build better models for identifying which customers will make a specific transaction in the future. The RAPIDS.ai team placed 17th in the contest. In this blog, we demonstrate how to use RAPIDS data science tools to uncover hidden patterns, extract meaningful features, and construct models that are useful for both the competition and real-world applications.

Data Exploration

The data is fully anonymized, containing 200 numeric feature columns, var_0 through var_199, a binary target column, and an ID column ID_code. Submissions are evaluated on the area under the ROC curve (AUC); the higher the better, with 1 being a perfect score. Despite being anonymized, the data has the same structure as the real data Santander has available to solve the customer transaction prediction problem.

We use RAPIDS cuDF to read the CSV file of 200,000 rows and 202 columns in about 0.5 seconds, roughly 10x faster than the pandas counterpart. To understand the data, we extract some key statistics from the dataframe. For example, we compute the pairwise correlation of columns to study feature interactions (a sketch follows the list below). We make the following observations:

  • There are no missing values in the dataframe.
  • The correlation between any two var columns is very small. Further study shows that the columns are effectively independent.
  • The correlation between target and most var columns is around 0.006.
  • The var columns follow either a Gaussian or a bimodal distribution.
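
A rough sketch of this exploration step is below. It assumes the Kaggle training file is named train.csv and a cuDF release where DataFrame.corr() is available; otherwise the correlation can be computed after converting to pandas with .to_pandas().

```python
import cudf

# Load the Santander training data on the GPU.
# Columns: ID_code, target, var_0 ... var_199.
train = cudf.read_csv("train.csv")
feature_cols = [c for c in train.columns if c.startswith("var_")]

print(train.shape)                       # (200000, 202)
print(int(train.isnull().sum().sum()))   # 0 -> no missing values

# Pairwise correlations between the features and the target.
corr = train[feature_cols + ["target"]].corr()
print(corr["target"].abs().sort_values(ascending=False).head())
```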

For example, the kernel density estimation (KDE) plot below shows the distribution of column var_0 for both positive and negative samples. The distribution is largely Gaussian, but a second small bump is also notable, which makes it bimodal. It is also visually evident that when 6 < var_0 < 13 the probability of target=1 is low, when 14 < var_0 < 20 the probability of target=1 is high, and in other regions it is hard to tell. This suggests that a tree classifier, which can find splits like these, is promising.
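
A minimal plotting sketch of this comparison is below; the plotting happens on the CPU, so the relevant columns are moved to pandas first.

```python
import matplotlib.pyplot as plt
import seaborn as sns

pdf = train[["var_0", "target"]].to_pandas()

# KDE of var_0 for positive and negative samples.
sns.kdeplot(pdf.loc[pdf.target == 1, "var_0"], label="target = 1")
sns.kdeplot(pdf.loc[pdf.target == 0, "var_0"], label="target = 0")
plt.xlabel("var_0")
plt.legend()
plt.show()
```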

We can also measure the significance of this feature by computing the Kolmogorov–Smirnov (KS) statistic. Although the two curves have a similar trend, the p-value is effectively 0, so we can reject the null hypothesis that positive and negative samples are drawn from the same continuous distribution. In other words, the feature already has strong predictive power with respect to the target column.
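
The test itself is a one-liner with SciPy’s two-sample KS test; the exact statistic value is not reported here, so the printed output is illustrative only.

```python
from scipy.stats import ks_2samp

pdf = train[["var_0", "target"]].to_pandas()
pos = pdf.loc[pdf.target == 1, "var_0"]
neg = pdf.loc[pdf.target == 0, "var_0"]

# Null hypothesis: both samples come from the same continuous distribution.
stat, p_value = ks_2samp(pos, neg)
print(f"KS statistic: {stat:.4f}, p-value: {p_value:.3g}")
```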

We feed the data as-is to an XGBoost model and train it with NVIDIA GPUs. Because the columns are independent, we intentionally set colsample_bytree=0.05 and max_depth=1 so that each tree considers one column at a time and avoids learning spurious feature interactions.
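
A sketch of this baseline is below, using the xgboost Python API with GPU training via tree_method="gpu_hist". Only colsample_bytree and max_depth come from the text above; the learning rate, number of rounds, and validation split are illustrative choices.

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

pdf = train.to_pandas()
X_tr, X_va, y_tr, y_va = train_test_split(
    pdf[feature_cols], pdf["target"],
    test_size=0.2, stratify=pdf["target"], random_state=42)

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "tree_method": "gpu_hist",   # train on the GPU
    "max_depth": 1,              # one split per tree: no feature interactions
    "colsample_bytree": 0.05,    # each tree samples ~10 of the 200 columns
    "eta": 0.1,                  # illustrative learning rate
}

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dvalid = xgb.DMatrix(X_va, label=y_va)
model = xgb.train(params, dtrain, num_boost_round=2000,
                  evals=[(dvalid, "valid")],
                  early_stopping_rounds=100, verbose_eval=200)
```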

The validation AUC is 0.911 (leaderboard score 0.90), which is a pretty high score for a simple model with no feature engineering. An interesting fact about the competition is that teams reached this score within a day and then were stuck there for two months until someone (luckily including us) broke into 0.93+ AUC. Let’s dig deeper and see how RAPIDS can make a breakthrough.

Feature Engineering

Given that the columns are anonymized and independent, it is next to impossible to engineer new features based on domain knowledge or to explore feature interactions between different columns. Consequently, we focused on extracting more information from each single column. In the following discussion, we again use the column var_0 as an example for analysis and apply the same transformation to all other columns.

One of the most common transformations of a single column is count encoding. We used cuDF to group the dataframe by the column var_0 and count the size of each group. The count for each distinct value of var_0 is then merged back into the dataframe.
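
A sketch of count encoding for a single column with cuDF is below; aggregating on the target column is just a convenient way to get group sizes, and the name var_0_count is our own convention.

```python
# Group by the distinct values of var_0 and count the group sizes.
counts = train.groupby("var_0").agg({"target": "count"}).reset_index()
counts = counts.rename(columns={"target": "var_0_count"})

# Merge the counts back so every row carries the frequency of its var_0 value.
# Note: cuDF merges do not guarantee row order; sort by ID_code if order matters.
train = train.merge(counts, on="var_0", how="left")
```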

To verify the predictive power of this count encoding, we further calculate the mean target value for each group of rows that shares the same count value. As shown in the following figure, there is a notable trend: groups with larger count values have a lower mean target rate.
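
The check itself is a second group-by, this time over the count values (a sketch, reusing the var_0_count column created above):

```python
# Mean target rate for each distinct count value of var_0.
rate = train.groupby("var_0_count").agg({"target": "mean"}).reset_index()
print(rate.sort_values("var_0_count").to_pandas())
```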

The count encoding of var_0 contains information that is orthogonal to var_0 itself and hence should improve our model when used together with it. We applied count encoding to all columns and ran the XGBoost model with both the original columns and the new count-encoded columns. The end-to-end running time is less than five minutes on a single GPU, and the validation AUC improves to 0.918 from the 0.911 baseline.

Can We Make Further Progress?

Yes, but to achieve that, we need to look deeper into the count groups. In the following figures, we plot the KDE of var_0, var_1, and var_2 for different count groups. An interesting pattern is that for all three vars, the count==1 group is significantly different from the other groups and from the variable as a whole. As a matter of fact, this pattern can be found in most of the vars. There are several hypotheses that could explain it, such as the count==1 group being noisier in nature, or it being an artifact of data imputation, and so forth.
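
A sketch of how these per-group KDEs can be drawn, reusing the var_0_count column from the encoding step above and shown here only for var_0:

```python
pdf = train[["var_0", "var_0_count"]].to_pandas()

# Compare the count==1 group against the rest of the column and the whole column.
sns.kdeplot(pdf.loc[pdf.var_0_count == 1, "var_0"], label="count == 1")
sns.kdeplot(pdf.loc[pdf.var_0_count > 1, "var_0"], label="count > 1")
sns.kdeplot(pdf["var_0"], label="all")
plt.xlabel("var_0")
plt.legend()
plt.show()
```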

Another important observation is that the original ups and downs become more pronounced if we simply remove the count==1 group, as shown in the rightmost column of the plots (the count>1 group). This pattern is even more obvious when we plot the conditional likelihood, which can be found in my Kaggle kernel. It suggests that we can add new features by replacing the count==1 values with None and letting XGBoost learn how to impute them optimally from the data. Consequently, we came up with the implementation sketched below.
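
The original implementation is embedded in the Medium post as a gist; below is a minimal sketch of the idea, producing for every column a copy in which values that occur only once are masked out. The column suffix _no_noise is our own naming, in the competition solution the counts were computed over train and test combined, and cuDF merges do not guarantee row order, so sorting by ID_code afterward may be needed.

```python
import numpy as np

for col in feature_cols:
    # Frequency of each distinct value in this column.
    counts = train.groupby(col).agg({"target": "count"}).reset_index()
    counts = counts.rename(columns={"target": "cnt"})
    train = train.merge(counts, on=col, how="left")

    # Copy of the column with the noisy count==1 values set to NaN,
    # letting XGBoost learn how to handle the missing entries.
    train[col + "_no_noise"] = train[col].where(train["cnt"] > 1, np.nan)
    train = train.drop(columns=["cnt"])
```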

The validation AUC improves to 0.934 within 11 minutes, which would place in the top 1% of the competition. The full notebook can be found here. This solution constitutes our best single model. In our full solution, we built an ensemble with a customized neural network model and utilized data augmentation. Our final score is only 0.002 below that of the first-place winner of the competition.

Further Improvement

Currently, cuDF doesn’t support parallel processing of independent columns of a dataframe. As shown in the notebook, we apply the same transformation to each column with an inefficient for loop. Even if we launch these functions in parallel using multiprocessing, they are still serialized on the GPU side, since the cuDF calls are queued in the same null CUDA stream. In the future, we will expose the CUDA stream ID in the cuDF API so that these calls go to different CUDA streams and run fully in parallel.

Conclusion

Financial data modeling is special: it is easy to get a good baseline by applying a simple model, but it is difficult to make significant improvements given all the constraints. In this article, we used RAPIDS data science packages to build new features for a fully anonymized dataset based solely on its statistical patterns and achieved a state-of-the-art result.
