Credit Risk Modeling Handbook

Credit Scoring: Prepare Your Data Right (Part 3)

Essentials, techniques, and tools for effective data preparation and exploration.

Natasha Mashanovich · Published in DataDrivenInvestor · 6 min read · May 15, 2023


Image by juicy_fish on Freepik

Garbage in, garbage out.

This commonly used axiom in computer science is also a threat to a project's success: the quality of the output is largely determined by the quality of the input. Data preparation is therefore a key aspect of any data mining project, including the development of a credit scorecard. It is the most challenging and time-consuming phase of the CRISP-DM cycle; at least 70%, and sometimes more than 90%, of total project time is dedicated to this activity. It involves data collection, combining multiple data sources, aggregations, transformations, data cleansing, and "slicing and dicing" the data in breadth and depth to gain a clear understanding, turning a quantity of data into quality data so that we can move confidently to the next phase: model building.

The previous article in this series discussed the importance of model design and identified its main components: unit of analysis, population frame, sample size, criterion variable, modeling windows, data sources, and data collection methods. Careful consideration of each component is imperative for successful data preparation. The final product of this stage is a mining view encompassing the right level of analysis, the modeling population, and the independent and dependent variables.

Model design components (image by author)

Data Sources

The more, the merrier

As part of the CRISP-DM data understanding step, external and internal data sources should provide both quantity and quality. Data must be relevant, accurate, timely, consistent, and complete, and of sufficient and diverse volume to yield useful analytical results. For application (origination) scorecards, where internal data is limited, external data predominates. In contrast, behavior scorecards rely more on internal data and are typically superior in terms of predictive power. The common data sources required for customer verification, fraud detection, or credit granting are outlined below.

Data sources diversity (image by author)

The Process

The data preparation process starts with data collection, commonly known as the ETL process (extract-transform-load). Data integration combines different sources using data merging and concatenation. Typically, it requires the manipulation of relational tables using a number of integrity rules such as entity, referential, and domain integrity. Using one-to-one, one-to-many, or many-to-many relationships, the data is aggregated to the desired level of analysis so that a unique customer signature is produced.

Data preparation process (image by author)
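To make the integration step concrete, here is a minimal, tool-agnostic sketch in Python/pandas (the screenshots in this article use Altair tools; the table and column names below, such as customer_id and txn_amount, are purely hypothetical). It joins a one-to-many pair of tables and aggregates them to one row per customer, i.e. the customer signature.

```python
import pandas as pd

# Hypothetical source tables: a customer master and account-level transactions.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "date_of_birth": ["1980-01-01", "1975-06-15", "1990-03-30"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "txn_amount": [120.0, 80.0, 250.0, 60.0, 45.0, 300.0],
})

# One-to-many join on the entity key (referential integrity assumed).
merged = customers.merge(transactions, on="customer_id", how="left")

# Aggregate to the unit of analysis: one row (customer signature) per customer.
signature = (
    merged.groupby("customer_id")
          .agg(n_txn=("txn_amount", "size"),
               total_spend=("txn_amount", "sum"),
               avg_spend=("txn_amount", "mean"))
          .reset_index()
)
print(signature)
```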

Data exploration and data cleansing are mutually iterative steps. Data exploration includes both univariate and bivariate analysis and ranges from univariate statistics and frequency distributions to correlations, cross-tabulation, and characteristic analysis.

Univariate view with Altair Analytics Workbench (image by author)
Data visualization with Altair RapidMiner (image by author)
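For readers who prefer code to a GUI, the same univariate and bivariate checks can be sketched in pandas. The frame below is synthetic, and its columns (income, employment_status, default_flag) are only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the mining view; column names are hypothetical.
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "income": rng.lognormal(10, 0.5, n),
    "employment_status": rng.choice(["employed", "self-employed", "unemployed"], n),
    "default_flag": rng.binomial(1, 0.1, n),
})

# Univariate: summary statistics and frequency distributions.
print(df.describe(include="all"))
print(df["employment_status"].value_counts(dropna=False))

# Bivariate: correlations and a cross-tabulation of a characteristic against the target.
print(df.select_dtypes("number").corr())
print(pd.crosstab(df["employment_status"], df["default_flag"], normalize="index"))
```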

Following exploratory data analysis (EDA), the data is treated to improve its quality. Data cleansing requires good business and data understanding so that the data can be interpreted correctly. It is an iterative process that removes irregularities by replacing, modifying, or deleting them as appropriate. The two major issues with unclean data are missing values and outliers; both can heavily affect model accuracy, so careful intervention is imperative.

Missing Data

Before deciding how to treat missing values, we need to understand why the data is missing and how the missing values are distributed. Missingness is commonly categorized as one of the following:

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Missing not at random (MNAR)

Missing data treatment typically assumes MCAR or MAR, while MNAR is more difficult to deal with. The list below provides the common treatments, ordered by complexity; a minimal code sketch of the simpler ones follows the figure.

Missing data treatments (image by author)
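As a minimal sketch of the simpler treatments (using hypothetical columns income and employment_status), the snippet below flags missingness, imputes a median for a numeric field, and assigns an explicit category to a missing categorical field; model-based imputation sits at the complex end of the scale.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values; columns are hypothetical.
df = pd.DataFrame({
    "income": [35_000, np.nan, 52_000, 41_000, np.nan],
    "employment_status": ["employed", None, "self-employed", "employed", "unemployed"],
})

# Flag missingness before imputing, so the model can still use it as a signal.
df["income_missing"] = df["income"].isna().astype(int)

# Simple imputations: median for a numeric field, explicit category for a categorical one.
df["income"] = df["income"].fillna(df["income"].median())
df["employment_status"] = df["employment_status"].fillna("Unknown")

# More complex options (not shown): model-based imputation such as k-NN or regression
# imputation, fitted on the training partition only to avoid leakage.
print(df)
```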

Outliers

Outliers are another “beast” in our data, as their presence can violate statistical assumptions under which we develop a model. Once identified, it is important to understand the reasons for having outliers before applying any treatment. For example, outliers could be a valuable source of information in fraud detection; hence, it would be a bad idea to replace them with a mean or median value.

Outliers should be analyzed using both univariate and multivariate analysis. For detection, we can use visual methods such as histograms, box plots, or scatter plots; statistical methods such as the mean and standard deviation, clustering (examining distant clusters), or the Mahalanobis distance; and machine learning methods such as small decision tree leaf nodes, Local Outlier Factor (LOF), Isolation Forest, and one-class Support Vector Machines. Judging what should be considered an outlier is not as straightforward as identifying missing values. The decision should be based on a specified criterion: for example, any value outside ±3 standard deviations of the mean, more than 1.5 × IQR beyond the quartiles, or outside the 5th–95th percentile range would be labeled an outlier.

Outlier detection using LOF with Altair RapidMiner (image by author)
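Outside a visual workbench, the same two families of methods can be sketched with pandas and scikit-learn: a simple 1.5 × IQR rule for univariate screening and Local Outlier Factor for multivariate detection. The data and column names below are synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data; column names are hypothetical.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "income": rng.lognormal(10, 0.5, 500),
    "utilisation": rng.beta(2, 5, 500),
})

# Rule-based flag: values more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income_outlier"] = ~df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Multivariate detection with Local Outlier Factor (-1 marks an outlier).
lof = LocalOutlierFactor(n_neighbors=20)
df["lof_outlier"] = lof.fit_predict(df[["income", "utilisation"]]) == -1
print(df[["income_outlier", "lof_outlier"]].sum())
```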

Outliers can be treated in a similar way to missing values. Other transformations can also be applied, including binning, weight assignment, conversion to missing values, logarithm transformations, and Winsorization, all of which reduce the influence of extreme values.
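Two of these treatments, Winsorization and a logarithm transformation, are easy to illustrate on a hypothetical, right-skewed income column:

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed income values; purely illustrative.
rng = np.random.default_rng(1)
df = pd.DataFrame({"income": rng.lognormal(10, 0.5, 500)})

# Winsorization: cap extreme values at the 1st and 99th percentiles.
lo, hi = df["income"].quantile([0.01, 0.99])
df["income_capped"] = df["income"].clip(lower=lo, upper=hi)

# Logarithm transformation to dampen the right skew (log1p handles zero values).
df["income_log"] = np.log1p(df["income_capped"])
```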

As discussed above, data cleansing may involve a range of statistical and machine learning techniques. Even though these transformations can produce superior scorecard models, the practicality of implementation must be considered: complex data manipulations can be difficult to implement, costly, and can slow down model processing performance.

Special attention should be paid to any data transformation, including imputation and outlier treatment, as these can introduce data leakage during the modeling process and produce misleading model validation results. Such transformations should be fitted on the training data only and then applied unchanged to the validation and test sets.
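One practical way to avoid this kind of leakage is to wrap the transformations and the model in a single pipeline, so that imputation and scaling are fitted on the training partition only and merely applied to the held-out data. The sketch below uses scikit-learn and synthetic data; it is an illustration, not the article's own implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic features and target; names are hypothetical.
rng = np.random.default_rng(2)
n = 2_000
X = pd.DataFrame({
    "income": rng.lognormal(10, 0.5, n),
    "utilisation": rng.beta(2, 5, n),
})
X.loc[rng.random(n) < 0.1, "income"] = np.nan   # inject some missing values
y = rng.binomial(1, 0.1, n)                     # stand-in default flag

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Imputation and scaling are fitted on the training data only inside the pipeline,
# then re-used unchanged on the test data, keeping leakage out of validation results.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```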

Data Transformation and Feature Engineering

Once the data is clean, we are ready for the more creative part: data transformation and feature engineering. The ultimate aim is to improve the performance of the predictive model by creating additional (hypothesized) variables that are then tested for significance. Data transformation usually refers to converting data from one format or representation into another, whilst feature engineering involves creating new features from existing ones based on domain knowledge, and as such is the more challenging of the two.

The most common transformations include binning and optimal binning, standardization, scaling, one-hot encoding, interaction terms, mathematical transformations (from non-linear into linear relationships and from skewed data into normally distributed data), and data reduction using clustering and factor analysis.
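As one illustration, binning followed by Weight of Evidence (WOE) coding is a transformation widely used in scorecard development, and one-hot encoding handles categorical characteristics. The sketch below uses synthetic data and hypothetical column names; in practice the binning would be optimized, and a small adjustment applied to any bin with zero goods or bads.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the mining view; column names are hypothetical.
rng = np.random.default_rng(3)
n = 2_000
df = pd.DataFrame({
    "income": rng.lognormal(10, 0.5, n),
    "employment_status": rng.choice(["employed", "self-employed", "unemployed"], n),
    "default_flag": rng.binomial(1, 0.1, n),
})

# Equal-frequency binning of a continuous characteristic into five bins.
df["income_bin"] = pd.qcut(df["income"], q=5, duplicates="drop")

# Weight of Evidence per bin: ln(% of goods / % of bads).
grp = df.groupby("income_bin", observed=True)["default_flag"]
bad = grp.sum()
good = grp.count() - bad
woe = np.log((good / good.sum()) / (bad / bad.sum()))
df["income_woe"] = df["income_bin"].map(woe).astype(float)

# One-hot encoding of a categorical characteristic.
df = pd.get_dummies(df, columns=["employment_status"], drop_first=True)
```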

Apart from some general recommendations on how to tackle this task, it is the responsibility of the data scientist to suggest the best approach to transforming the customer data signature into a powerful information artifact: the mining view. This is probably the most creative and most challenging aspect of the data scientist's role, as it requires solid business understanding in addition to statistical and analytical skills. Very often, the key to creating a good model is not the power of a specific modeling technique but the breadth and depth of the derived variables, which represent a higher level of knowledge about the phenomenon under examination.

The rest is the art of feature creation…

