MACHINE LEARNING

Boost the Model Performance with Handcrafted Features

Feature Engineering: The Hidden Gems of Credit Default Prediction

Priyanshu Chaudhary
Published in CueNex
11 min read · Feb 27, 2023

“From raw iron comes the plough that tills the fields and feeds the people, and so too can raw data be transformed into insights that drive business growth and innovation through the process of feature engineering.”

In our previous blog, we took a deep dive into the world of credit default prediction, exploring the foundational concepts and strategies needed to build a reliable model. Now, it’s time to take things to the next level with the powerful technique of feature engineering. By transforming raw data into insightful features, we can unlock new levels of accuracy and performance in our models, paving the way for more precise predictions and smarter business decisions. With feature engineering as your secret weapon, you can optimize your models like never before and elevate your business’s success.

Raw data is like a jigsaw puzzle with no picture — but with feature engineering, we can put the pieces together to reveal a stunning image of your business’s future! While it’s true that having large amounts of data is a treasure trove for financial institutions seeking to build machine learning models, it’s equally important to acknowledge that not all data is informative.

But feature engineering is not just about selecting the best features. It’s also about reducing the noise and redundancies in the data to improve the generalization of the models. This is crucial in a world where new data is constantly being generated, and models need to perform well on unseen data to be truly useful.

In short, while having tons of data is great, feature engineering is the artistic process of sculpting that data into something beautiful and useful for credit default detection. So join me in this blog as we explore the fascinating world of feature engineering and its role in building high-performance ML models.

Dataset

Our latest study builds upon our previous blog, diving into the world of credit default analysis. We’ll be working with a massive dataset of over half a million customers’ credit statements spanning an entire year. While all features have been anonymized, we do have access to information about the type of each feature, including risk, balance, delinquency, and payment features.

However, the raw dataset isn’t without its flaws. We’re faced with a significant amount of noise that can be attributed to a large number of null values and missing statements, possibly due to customers being unable to pay their credit bills or joining the program later. To address these issues, we’ll need to create informative features that can help our model generalize well.

If you’re working with large datasets, you’ve likely experienced the frustration of waiting for hours or even days for Pandas to finish calculating features. This is where Rapids.ai comes in, providing a powerful suite of open-source libraries that leverage GPU acceleration to supercharge your data processing and feature engineering.

With Rapids.ai, you can perform data filtering, grouping, and aggregation at lightning-fast speeds, unlocking the potential of your GPU to accelerate even the most complex data manipulation tasks. This can significantly reduce the time required for feature engineering, enabling you to work with larger datasets and iterate more quickly on your models.

# LOAD LIBRARIES
import pandas as pd, numpy as np # CPU libraries
import cudf # GPU libraries
import matplotlib.pyplot as plt, gc, os
print('RAPIDS version',cudf.__version__)
df = cudf.read_parquet('./data.parquet')

But it’s not just about speed — Rapids.ai also provides a Pandas-like API for data manipulation, making it easy to transition from traditional CPU-based methods. You can use familiar functions and syntax to perform complex data transformations, all while taking advantage of GPU acceleration for maximum efficiency.

In short, if you’re looking to take your data processing and feature engineering to the next level, Rapids.ai is the perfect tool for the job. It’s fast, easy to use, and unlocks the full potential of your GPU to help you build better machine-learning models.
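As a quick illustration, here is a minimal sketch (my own, not from the original workflow) of how the DataFrame loaded above can be manipulated with the same calls you would use in pandas, moving a result back to the CPU only when needed:

# Hypothetical examples of pandas-style calls running on the GPU via cudf
n_statements = df.groupby('customer_ID').size()  # number of statements per customer
df_sorted = df.sort_values(by='customer_ID')     # same API as pandas sort_values
n_statements_cpu = n_statements.to_pandas()      # copy a result back to the CPU if needed

The point of the sketch is simply that the syntax is identical to pandas; only the execution happens on the GPU.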

Feature Generation Methods

There are hundreds of ideas that can be used to generate features; however, we also have to make sure that a feature actually contributes to an improvement in the model’s performance, rather than simply adding noise and consuming our machine’s memory.

The sections below describe the main ideas used for feature engineering.

Aggregation features

Aggregation features are the secret ingredient to making sense of complex data. By computing summary statistics or aggregations of numerical variables over a categorical grouping variable, like customer_ID (C_ID) or product category, we can uncover patterns and trends that might not be visible at first glance. With summary statistics such as mean, max, min, std, and median, we can build more accurate predictive models and extract meaningful insights from customer data, transactional data, or any other numerical data.

These statistical attributes can be calculated per customer as follows:

cat_features = ["B_1","B_2","D_1","D_2","D_10","P_21","D_126","D_3","D_42","R_66","R_68"]
num_features = [col for col in all_cols if col not in cat_features] # all features except categorical features
test_num_agg = df.groupby("customer_ID")[num_features].agg(['mean', 'std', 'min', 'max', 'last','median']) #grouping by customerID
test_num_agg.columns = ['_'.join(x) for x in test_num_agg.columns]
  • Mean: the average value of a numerical variable, which gives a general sense of the central tendency of the data. The mean can capture:
  1. The average bank balance the customer holds.
  2. Average customer spending.
  3. The average time between two credit statements (i.e., the time between credit payments).
  4. The average risk of lending money to the customer.
  • Standard deviation (Std): a measure of the spread of the data around the mean, which gives insight into the degree of variability in the data. A high variability in the balance, for example, suggests that a customer’s spending fluctuates.
  • Minimum and maximum: these values can capture the wealth of a customer and also carry information about customer spending and risk.
  • Median: when the data is highly skewed, the mean is not a good summary, so the median (the middle value) can be used instead.
  • Last: the last values are perhaps the most important features, as was also suggested in our previous blog, since they carry information about the latest known credit statement issued to the customer.

One Hot encoding features

Using the above statistical attributes for the categorical variables is not sensible, since calculating the minimum, maximum, or standard deviation of a category code doesn’t give us any useful information. So what should we do?

Well, we can use features such as the count, the number of unique values, and the last value, which can be calculated as follows:

cat_features = ["B_1","B_2","D_1","D_2","D_10","P_21","D_126","D_3","D_42","R_66","R_68"]
test_cat_agg = df.groupby("customer_ID")[cat_features].agg(['count', 'last', 'nunique'])
test_cat_agg.columns = ['_'.join(x) for x in test_cat_agg.columns]

However, this information doesn’t capture whether a customer was ever assigned to a particular category or not, which can be of interest for modeling.

To capture this information, we one-hot encode the variables and then take aggregations such as mean, sum, and last.

The mean captures the ratio of the number of times a customer falls into the category to the total number of bank statements. The sum is simply the total number of times a customer falls into the category.

import pickle
from cuml.preprocessing import OneHotEncoder

df_categorical = df[cat_features].astype(object)
ohe = OneHotEncoder(drop='first', sparse=False, dtype=np.float32, handle_unknown='ignore')
ohe.fit(df_categorical)
with open("ohe.pickle", 'wb') as f:
    pickle.dump(ohe, f) # save the encoder so that it can be reused for the test data
df_categorical = pd.DataFrame(ohe.transform(df_categorical).astype(np.float16),
                              index=df_categorical.index).rename(columns=str)
df_categorical['customer_ID'] = df['customer_ID']
df_categorical.groupby('customer_ID').agg(['mean', 'sum', 'last'])

Ranking-based features

Ranking-based features can be a game-changer when it comes to predicting customer behavior. By ranking customers based on specific attributes like income or expenses, we can gain valuable insights into their financial habits and better manage risk.

With cudf’s rank function, we can easily calculate these features and use them to inform our predictive models. For example, we could rank customers based on their spending patterns, debt-to-income ratio, or credit score. These features could then be used to predict the likelihood of default or identify customers who are at risk of falling behind on payments.

In addition, rank-based features can also be used to identify high-value customers, target marketing efforts, and optimize loan offers. For example, we could rank customers based on their likelihood to accept a loan offer, and then target those with the highest rank.

A simple line of code to calculate the rank is:

df[feat+'_rank']=df[feat].rank(pct=True, method='min')

where pct specifies whether to compute a percentile rank. The rank of a customer can also be calculated within groups defined by a categorical feature:

df[feat+'_rank'] = df.groupby([cat_feat])[feat].rank(pct=True, method='min')

We will see one of the use cases of categorical-based ranking in time-based features.

Combined/Composite Features

One popular method for combining features is linear or non-linear combinations. This involves taking two or more existing features and combining them in a way that creates a new, composite feature. This composite feature can then be used to identify patterns, trends, and correlations that might not be visible when looking at the individual features on their own.

For example, imagine that we’re analyzing a dataset of customer spending habits. We might start by looking at individual features like age, income, and location. But by combining these features in a linear or non-linear way, we can create new composite features that tell us even more about our customers. For instance, we might combine income and location to create a composite feature that tells us about the average spending of customers in a certain area.

Of course, it’s important to keep in mind that not all combinations of features will be equally useful. The key is to identify which combinations are most relevant to the problem we’re trying to solve. This requires a deep understanding of the data and the problem domain, as well as careful analysis of the correlations between the composite features we create and the target variable we’re trying to predict.

The code below shows how we can combine features and keep only the useful ones: a difference feature is added only if its correlation with the target is greater than both 0.9 and the correlations of the two features being combined.

features = [col for col in train.columns if col not in ['customer_ID', target] + cat_features]
for feat1 in features:
    for feat2 in features:
        # threshold: the larger of the two individual correlations with the target
        th = max(np.corrcoef(df[feat1], Y)[0, 1], np.corrcoef(df[feat2], Y)[0, 1])
        feat3 = df[feat1] - df[feat2] # difference feature
        corr3 = np.corrcoef(feat3, Y)[0, 1]
        if corr3 > max(th, 0.9): # if the correlation is greater than max(th, 0.9), add it as a feature
            df[feat1 + '_' + feat2] = feat3

Time/Date-based features

When it comes to data analysis, time-based features can be a game-changer. By grouping data based on time attributes like months or days of the week, we can create powerful features that reveal valuable insights about our data. These features can range from simple averages like income and spending to more complex attributes like credit score changes over time.

With time-based features, we can identify patterns and trends that might not be visible when looking at the data in isolation. And by combining these features, we can unlock even more powerful insights. The code below illustrates how we can create some useful features using timestamps.

First, we calculate the mean of each feature over the month (we could also use the day of the month, week of the month, etc.), merge the resulting data frame back onto the original data, and take the difference between the respective features.

features = [col for col in train.columns if col not in ['customer_ID', target] + cat_features]
month_Agg = df.groupby('month')[features].agg('mean') # grouping based on the month feature
month_Agg.columns = [f'{col}_month_mean' for col in month_Agg.columns]
month_Agg.reset_index(inplace=True)
df = df.merge(month_Agg, on='month') # merge the monthly means back onto the original data
for feat in features: # create composite features by taking the difference
    df[feat + '_' + feat + '_month_mean'] = df[feat] - df[feat + '_month_mean']

We can also create rank-based features by using time as the grouping variable, as illustrated below:

features = [col for col in train.columns if col not in ['customer_ID', target] + cat_features]
month_Agg = df.groupby('month')[features].rank(pct=True) # percentile rank within each month
month_Agg.columns = [f'{col}_month_rank' for col in month_Agg.columns]
df = pd.concat([df, month_Agg], axis=1) # concatenate back onto the original dataframe

Shift/Lag features

Lag features are an essential tool for effective prediction in financial data. These features involve calculating the difference between a current value and a previous value in a time series. By incorporating lag features into our analysis, we can better understand the patterns and trends in our data and make more accurate predictions.

A lag feature captures the change in a customer’s payment behavior over time. By calculating the difference between a current payment and the previous payment, we can determine if the customer’s payment behavior is improving or deteriorating.

For example, if the lag feature shows that a customer has consistently paid their credit card bill on time for several months, we might predict that they are less likely to default in the future. Conversely, if the lag feature shows that a customer has been consistently late or missing payments, we might predict that they are more likely to default.

# The difference function calculates the lag difference for numerical features
# between the last value and the shift-th last value.
def difference(groups, num_features, shift):
    data = (groups[num_features].nth(-1) - groups[num_features].nth(-1 * shift)) \
        .rename(columns={f: f"{f}_diff{shift}" for f in num_features})
    return data

# calculate diff features for last - 2nd last, last - 3rd last, last - 4th last
def get_difference(data, num_features):
    print("diff features...")
    groups = data.groupby('customer_ID')
    df1 = difference(groups, num_features, 2).fillna(0)
    df2 = difference(groups, num_features, 3).fillna(0)
    df3 = difference(groups, num_features, 4).fillna(0)
    df1 = pd.concat([df1, df2, df3], axis=1)
    df1.reset_index(inplace=True)
    df1 = df1.sort_values(by='customer_ID')
    del df2, df3
    gc.collect()
    return df1

train_diff = get_difference(df, num_features)

Rolling window-based features

These features are nothing but aggregations over the last 3 (or 4, 5, …, x) statements, depending on what works best for our data, since the most recent values carry the latest information about a customer’s status.

xth = 3 # define the window size
# number each customer's statements from most recent (0) to oldest
df["cumulative"] = df.sort_values(by=['time'], ascending=False).groupby('customer_ID').cumcount()
last_info = df[df["cumulative"] <= xth] # keep only the latest statements per customer
last_info = last_info.groupby("customer_ID")[num_features].agg(['mean', 'std', 'min', 'max', 'last', 'median']) # grouping by customer_ID
last_info.columns = ['_'.join(x) for x in last_info.columns]

Miscellaneous Features

So far, we have created enough features to build a strong default detection model. However, many more features can be created depending on the nature of the data we have. For example:

We can create features like null counts, which calculate the total number of null values present for a customer and thus capture the pattern of missing values, something tree-based algorithms cannot infer on their own.

def calc_nan(df, features):
    print("calculating nan_info...")
    df_nan = df[features].isnull().astype(np.int8) # mark null values as 1 and non-null values as 0
    df_nan['customer_ID'] = df['customer_ID']
    nan_sum = df_nan.groupby("customer_ID").sum().sum(axis=1)   # total unknown values for a customer
    nan_last = df_nan.groupby("customer_ID").last().sum(axis=1) # how many of the last values are unknown
    del df_nan
    gc.collect()
    return nan_sum, nan_last

Instead of using a plain mean, we can use modified mean values such as a time-based weighted average or the Hull moving average (HMA).
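As an illustration of the first idea, here is a minimal sketch (my own, not from the original solution) of a recency-weighted mean per customer. It assumes a column named time that orders the statements, reuses the num_features list defined earlier, and is written against the pandas API (the cudf equivalent is analogous); the linear-by-recency weighting is just one possible choice.

# Hypothetical sketch: recency-weighted mean of each numerical feature per customer
def weighted_mean(data, num_features):
    data = data.sort_values(by=['customer_ID', 'time'])
    # weight statements linearly by recency: the oldest gets weight 1, the latest gets weight n
    data['w'] = data.groupby('customer_ID').cumcount() + 1
    weighted = data[num_features].mul(data['w'], axis=0)
    weighted['customer_ID'] = data['customer_ID']
    agg = weighted.groupby('customer_ID').sum().div(
        data.groupby('customer_ID')['w'].sum(), axis=0)
    agg.columns = [f'{col}_wmean' for col in agg.columns]
    return agg

train_wmean = weighted_mean(df, num_features)

Exponential or HMA-style weights can be substituted simply by changing how the w column is computed.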

Conclusion

In this article, we’ve covered some of the most common handcrafted feature engineering strategies that are used in the real world to predict default risk. However, there are always new and innovative ways to engineer features, and we can even let our neural networks do the feature engineering work for us!

In the next part of this blog series, we will explore the exciting world of automatic feature engineering, where we will use cutting-edge techniques like neural networks to let our models generate features on their own. So, stay tuned and keep an eye out for the next part of the blog series! Until then, happy feature engineering!


Priyanshu Chaudhary
Writer for CueNex
Competitions Master @Kaggle.com, Machine Learning @Expedia