Financial datasets are challenging in many ways. The data is usually anonymized to protect customers’ privacy. Sometimes even the column names of the tabular data are encoded, which prevents feature engineering based on domain knowledge. As required by financial regulations and laws, the models must often be interpretable, like logistic regression or tree classifiers, so that the decision process can be monitored and reviewed. And there is always noise hiding deep in the data.

The Santander customer transaction prediction challenge is a classic example of financial data modeling. As the most popular competition in Kaggle’s history, it drew 8,802 teams competing to build better models identifying which customers will make a specific transaction in the future. The RAPIDS.ai team placed **17th** in the contest. In this blog, we demonstrate how to use RAPIDS data science tools to uncover hidden patterns, extract meaningful features, and construct models that are useful both for the competition and for real-world applications.

# Data Exploration

The data is fully anonymized, containing 200 numeric feature variables, from `var_0` to `var_199`, a binary `target` column, and an ID column `ID_code`. Submissions are evaluated on area under the ROC curve (AUC), where higher is better and 1 is a perfect score. Despite being anonymous, the data has the same structure as the real data Santander has available to solve the customer transaction prediction problem.
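Since submissions are scored on ROC AUC, it is handy to compute the metric locally. A minimal sketch using scikit-learn, with toy labels and scores (illustrative values only):

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the probability that a random positive is ranked above
# a random negative; here 3 of the 4 (neg, pos) pairs are ordered
# correctly, giving 0.75.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```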

We use RAPIDS cuDF to read the `csv` file of 200,000 rows and 202 columns in under 0.5 seconds, roughly **10x faster** than its pandas counterpart. To understand the data, we extract some key statistical metrics from the dataframe. For example, we compute the pairwise correlation of columns to study feature interactions. We make the following observations:

- There are no missing values in the dataframe.
- The correlation between any two `var` columns is very small. Further study shows that the columns are actually independent.
- The correlation between `target` and most `var` columns is around 0.006.
- The `var` columns have gaussian or bimodal distributions.
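The exploration steps above can be sketched as follows. cuDF mirrors the pandas API, so with RAPIDS you would swap `import pandas as pd` for `import cudf`; the data here is a synthetic stand-in of independent gaussian columns, since the real columns are anonymized anyway:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: independent gaussian columns, as observed in the data.
df = pd.DataFrame({f"var_{i}": rng.normal(10, 3, 2000) for i in range(5)})

assert df.isna().sum().sum() == 0   # no missing values
corr = df.corr()                    # pairwise correlation of columns
off_diag = corr.values[~np.eye(len(corr), dtype=bool)]
# Off-diagonal correlations are tiny: columns look independent.
print(np.abs(off_diag).max())
```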

For example, the kernel density estimation (KDE) plot below shows the distribution of column `var_0` for both positive and negative samples. The distribution is largely gaussian, but a second small bump is also notable, which makes it bimodal. It is also visually evident that when `6 < var_0 < 13` the probability of `target=1` is low, when `14 < var_0 < 20` the probability of `target=1` is high, and in other regions it is hard to tell. This indicates that **a tree classifier**, which finds splits like these, is promising.

We can also measure the significance of this feature by computing the Kolmogorov-Smirnov (KS) statistic. Although the two curves have a similar trend, the `p_value` is 0.000, so we can reject the null hypothesis that the positive and negative samples are drawn from the same continuous distribution. In other words, the features already have strong predictive power with respect to the `target` column.
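A two-sample KS test is available in SciPy as `ks_2samp`. The sketch below uses hypothetical stand-ins for `var_0` split by `target` (synthetic gaussians with shifted means), not the actual competition data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Hypothetical stand-ins for var_0 split by target value.
negatives = rng.normal(loc=10.0, scale=3.0, size=5000)
positives = rng.normal(loc=11.5, scale=3.0, size=500)

# A tiny p-value lets us reject the null hypothesis that both
# samples come from the same continuous distribution.
stat, p_value = ks_2samp(negatives, positives)
print(stat, p_value)
```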

We feed the data as is to an XGBoost model and train it with NVIDIA GPUs. Based on the fact that the columns are independent, we intentionally set `colsample_bytree=0.05` and `max_depth=1` so that each tree considers one column at a time and avoids learning spurious feature interactions.

The validation AUC is 0.911 (leaderboard score 0.90) which is a pretty high score with a simple model and no feature engineering. An interesting fact of the competition is that all teams got this score within a day and then stuck there for two months until someone (luckily including us) broke into 0.93+ AUC. Let’s dig deeper and see how RAPIDS can make a breakthrough.
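A minimal sketch of this baseline configuration; the values other than `colsample_bytree` and `max_depth` are illustrative assumptions, not the exact competition settings:

```python
# Baseline XGBoost configuration, as described above. Only max_depth
# and colsample_bytree come from the text; the rest are assumptions.
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "tree_method": "gpu_hist",   # train on an NVIDIA GPU
    "max_depth": 1,              # depth-1 trees: one split per tree
    "colsample_bytree": 0.05,    # ~10 of 200 columns sampled per tree
    "eta": 0.1,
}

# Training would then look like:
# import xgboost as xgb
# dtrain = xgb.DMatrix(X_train, label=y_train)
# model = xgb.train(params, dtrain, num_boost_round=1000)
```

With `max_depth=1`, every tree is a single split on a single column, so the ensemble cannot encode interactions between columns even by accident.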

# Feature Engineering

Given that the columns are anonymized and independent, it is next to impossible to engineer new features based on domain knowledge or to explore feature interactions between different columns. Consequently, we focused on extracting more information from a single column. In the following discussion, we again use the column `var_0` as an example for analysis and apply the transformation to all other columns.

One of the most common transformations of a single column is `count` encoding. We used `cudf` to group the dataframe by the column `var_0` and count the size of each group. The count of each distinct value of `var_0` is then merged back into the dataframe.
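The groupby-and-merge pattern can be sketched as follows; pandas is used as a portable stand-in for cuDF (the APIs match), and the values are illustrative:

```python
import pandas as pd

# cuDF mirrors the pandas API; with RAPIDS, replace this import
# with `import cudf` and construct a cudf.DataFrame instead.
df = pd.DataFrame({"var_0": [8.5, 8.5, 11.2, 14.3, 8.5, 11.2]})

# Count how many times each distinct value of var_0 occurs...
counts = df.groupby("var_0").size().rename("var_0_count").reset_index()
# ...and merge the counts back onto the original rows.
df = df.merge(counts, on="var_0", how="left")
print(df["var_0_count"].tolist())  # [3, 3, 2, 1, 3, 2]
```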

To verify the predictive power of count encoding, we further calculate the mean `target` value over each group of rows that share the same count value. As shown in the following figure, there is a notable trend: groups with larger count values have a lower mean `target` rate.

The count encoding of `var_0` contains information that is orthogonal to `var_0` itself, and hence should improve our model when the two are used together. We applied count encoding to all columns and ran the XGBoost model with both the original columns and the new count-encoded columns. The end-to-end running time is **less than five minutes** on a single GPU, and the validation `AUC` improves to 0.918 from the 0.911 baseline.

# Can We Make Further Progress?

Yes, but to achieve that, we need to look deeper into the `count` groups. In the following figures, we plot the KDE of `var_0`, `var_1`, and `var_2` for different count groups. An interesting pattern is that for all three `vars`, the `count==1` group is significantly different from the other groups and from the variable as a whole. As a matter of fact, this pattern can be found in most of the `vars`. There are several hypotheses to explain it: the `count==1` group may be noisier in nature, it may be an artifact of data imputation, and so on.

Another important observation is that the original ups and downs are more pronounced if we simply remove the `count==1` group, as shown in the rightmost column of the plots (the `count>1` group). This pattern is even more obvious when we plot the conditional likelihood, which can be found in my Kaggle kernel. It suggests that we can add new features by replacing the values in the `count==1` group with `None` and letting XGBoost learn how to impute them optimally from the data. Consequently, we come up with the following implementation.
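A minimal sketch of that masking step, again using pandas as a stand-in for cuDF with illustrative values; XGBoost treats `NaN` as a missing value and learns a default split direction for it:

```python
import numpy as np
import pandas as pd

# Illustrative data; with RAPIDS, use cudf instead of pandas.
df = pd.DataFrame({"var_0": [8.5, 8.5, 11.2, 14.3, 8.5, 11.2]})

# Per-row occurrence count of each var_0 value.
counts = df["var_0"].map(df["var_0"].value_counts())
# Copy the column, masking unique values (count == 1) with NaN so
# XGBoost can learn how to impute them.
df["var_0_masked"] = df["var_0"].where(counts > 1, np.nan)
print(df["var_0_masked"].tolist())  # [8.5, 8.5, 11.2, nan, 8.5, 11.2]
```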

The validation AUC improves to `0.934` **within 11 minutes**, which would place in the top 1% of the competition. The full notebook can be found here. This solution constitutes our best single model. In our full solution, we built an ensemble with a customized neural network model and utilized data augmentation. Our final score is only 0.002 below that of the first-place winner of the competition.

# Further Improvement

Currently, `cudf` doesn’t support parallel processing of independent columns of a dataframe. As shown in the notebook, we apply the same transformation to each column with an inefficient `for` loop. Even if we launch these functions in parallel using `multiprocessing`, they are still serialized on the GPU side, since the `cudf` calls are queued in the same null CUDA stream. In the future, we will expose the CUDA stream id in the `cudf` API so that these calls go to different CUDA streams and run fully in parallel.

# Conclusion

Financial data modeling is special: it is easy to get a good baseline by applying a simple model, but difficult to make significant improvements under all the constraints. In this article, we used RAPIDS data science packages to build new features for a fully anonymized dataset based solely on its statistical patterns, and achieved a state-of-the-art result.