A Data Scientist's Guide to Building Credible Risk Models — Part 2

Sumit Arya
GAMMA — Part of BCG X
10 min read · Oct 19, 2020

Written by Deep Narayan Mukherjee and Sumit Arya

In Part 1 of this series (https://medium.com/bcggamma/a-data-scientists-guide-to-building-credible-risk-models-part-1-9dcf8f41dd3c), we described how ML practitioners can build models while keeping fundamental risk considerations at the forefront. We also provided a brief description of the concepts of Model Design and Holistic Model Validation. In Part 2 of the series, we do a deep dive into important aspects of model design, focusing in particular on why model design requires special focus and which model design characteristics are the most important.

Why is Model Design Important?

The question of what exactly a model is trying to predict is not merely rhetorical. Despite the fact that some analytical teams may not give this question the thought and attention it deserves, it remains one of the most fundamental questions in the process of model design. To be fair, most analytical teams do spend a significant amount of time defining the specific event to be predicted. What they often fail to consider, however, are the explicit trade-offs that arise in the business utility of choosing one event definition over a competing definition. To some extent, as long as the event definition makes broad business sense, statistical considerations and analytical convenience tend to supersede business considerations. For example, a borrower-level default definition is more comprehensive, but in the absence of comprehensive data on the borrower, only product-level defaults are sometimes considered, with the risk that the structural risk of the borrower is wrongly estimated. In short, modelling becomes less of a business problem-solving exercise and more of a statistical one. Nevertheless, best-in-class model design should always have as its goal finding the optimal balance between business value and statistical relevance. Factors that can affect this balance include:

Statistical soundness versus business benefit: Let’s start with the example of an application risk score built for an unsecured consumer term-loan product. In India, such loans are called personal loans, and these tend to have a maturity of 3 to 4 years. The risk score that was developed in this case had an event definition of 60 days-past-due (DPD) in the 12 months following loan disbursement. It was a statistically sound model with a very robust prediction accuracy for the event. Even during production, the Gini was very comparable to the development Gini, with both in the 0.4 to 0.45 range.

The business objective of the risk model was to reduce portfolio losses. So, one would expect the model to be able to reasonably predict the number of defaults over the life of the loan. If a model fails to do so, modelers may argue that what happens beyond the defined performance window is not their responsibility. Risk and business teams often address a model’s insufficient predictability outside a narrowly defined window by creating credit policy rules.

For loans such as this, the peak number of defaults tends to occur 18 to 24 months after loan origination. When the predictive power of the above model was tested for events such as 60 DPD in 18 or 24 months, the Gini dropped dramatically to 0.2 to 0.25. When the model was tested for 60 DPD or actual write-offs over the entire life of the portfolio, the Gini dropped to single digits. Clearly, the Gini was largely irrelevant to the business outcome because the scope of the model was so narrowly defined. The model failed because of a lack of clarity about what it was supposed to predict, not because it was statistically incorrect.
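For illustration, here is a minimal sketch of how such a window comparison could be run, assuming a loan-level dataset that holds one application score and 60 DPD flags observed at different horizons. The file name and column names are hypothetical.

```python
# Minimal sketch: how the Gini of a fixed application score decays as the
# event window is lengthened. All file and column names are hypothetical.
import pandas as pd
from sklearn.metrics import roc_auc_score

def gini(y_true: pd.Series, score: pd.Series) -> float:
    """Gini coefficient = 2 * AUC - 1 (assumes a higher score means higher risk)."""
    return 2 * roc_auc_score(y_true, score) - 1

# One row per loan: the application score and 60+ DPD flags over several horizons.
df = pd.read_csv("loans.csv")  # columns: app_score, dpd60_12m, dpd60_24m, dpd60_life

for flag in ["dpd60_12m", "dpd60_24m", "dpd60_life"]:
    print(f"{flag}: Gini = {gini(df[flag], df['app_score']):.2f}")
```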

The Business Use Case of the Risk Score: In the previous era of limited data availability and processing power, most risk models had just one use case: to make go-no-go decisions and provide input to the underwriter. The exception was the most advanced banks, whose use of risk models was not limited to go-no-go decisions. Given today's emphasis on personalization, risk scores are now often used to place loan applicants into specific risk-based swim lanes where they receive differentiated treatment. Apart from predictive accuracy, model calibration has also become increasingly important. The risk score must provide a well-calibrated and granularly representative risk measurement over the entire population. This requires models to move beyond simplistic variable selection approaches. Variable processing needs to move beyond the choice of dummy variables and coarse or fine binning. Economically intuitive nonlinear transformations and composite variables, not new approaches in themselves, must become the norm as the focus moves toward differentiated underwriting customer journeys.
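As a concrete illustration of the calibration point, the sketch below tabulates mean predicted PD against the observed default rate by score band. The dataframe layout and column names are assumptions for illustration, not a prescribed implementation.

```python
# Minimal sketch: checking score calibration by score band, i.e. whether the
# predicted PD matches the observed default rate in each band. Column names
# are hypothetical.
import pandas as pd

def calibration_table(df: pd.DataFrame, pd_col: str = "predicted_pd",
                      target_col: str = "defaulted", n_bins: int = 10) -> pd.DataFrame:
    out = df.copy()
    out["band"] = pd.qcut(out[pd_col], q=n_bins, duplicates="drop")
    return (out.groupby("band", observed=True)
               .agg(mean_predicted_pd=(pd_col, "mean"),
                    observed_default_rate=(target_col, "mean"),
                    n_accounts=(target_col, "size")))

# A well-calibrated score shows mean_predicted_pd close to observed_default_rate
# in every band, not merely in aggregate.
```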

Data-analytical Capacity: Robust model building depends on the availability of historical data, along with an organization’s efficiency in building and deploying risk models. In new lending institutions, or institutions in which data capabilities are underdeveloped, high-quality historical data is usually unavailable. In such cases, a 12-month model may well be justified. Such institutions argue that the through-the-door (TTD) population of applicants continues to evolve and, thus, a longer-period model may not be useful. This is a pragmatic argument, but it means that the models must be rebuilt every 12–15 months to support this approach. In addition, if the TTD population rapidly evolves, there may be a stationarity issue with the data itself, which, in turn, would call for a fundamentally different risk-assessment approach.
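One simple way to quantify the stationarity concern is a Population Stability Index (PSI) comparison of the development sample against the recent through-the-door population, as in the sketch below. The inputs are assumed to be two series of the same score or variable; the thresholds in the closing comment are common rules of thumb rather than fixed standards.

```python
# Minimal sketch: Population Stability Index (PSI) to check whether the
# through-the-door population has drifted away from the development sample.
import numpy as np
import pandas as pd

def psi(expected: pd.Series, actual: pd.Series, n_bins: int = 10) -> float:
    # Bin edges come from the development (expected) distribution.
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    # Guard against empty bins before taking logs.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Common rule of thumb: PSI < 0.1 is stable, 0.1-0.25 a moderate shift, and
# > 0.25 a material shift that puts the stationarity assumption in question.
```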

It is for these reasons that data scientists should dedicate a substantial amount of time at the start of the design process to understand the precise risk problem they expect the model to solve.

What are the aspects of Model Design?

The design process comprises several aspects that must be considered in conjunction with the risk problem. A prudent risk model should be correctly configured with the following optimally designed parameters:

1. Segmentation: What kind of segmentation schema should be adopted?

2. Event Definition: What is the appropriate definition of a bad event?

3. Gating Rules: How should high-risk populations be treated?

4. Enhanced Use of Risk Score: What considerations should be taken when deploying a model?

1. Segmentation: Deciding the appropriate segmentation schema is a critical step because it influences the overall accuracy of the risk algorithm and, at the same time, can affect algorithm adoption among business stakeholders. Three common ways in which segmentation schema can be decided include:

Event-rate-based segmentation: The popularity of this mode of segmentation is driven by its intuitive appeal, along with the fact that, if one uses a decision-tree algorithm such as CHAID, event-rate-driven segments can be easily identified. If the end use case of the risk score is simply a go-no-go decision, this approach will suffice. It is analogous to de-averaging the event rate across multiple segments. An example of this type of segmentation is a geography-based schema wherein certain geographies with a considerably higher event rate are modelled separately.
Predictor-based segmentation: This approach focuses on identifying fundamental default drivers. Each segment has different default drivers, or the same drivers with significantly different intensity. This means that populations with similar event rates may be assigned to different segments because the structural drivers of their risk are different. For example, these segments may be divided into salaried borrowers versus self-employed borrowers, or into segments driven by industry categories. The business intuition behind the segments needs to be validated with methods such as ANOVA to ascertain whether the drivers of delinquency are indeed statistically different. Such segments tend to be adopted more readily by the users of the risk score.

Information-based segmentation: The richness of information may not be uniform across the entire modelling population. Alternatively, certain variables may have rich behavioral information about a certain type of borrower, while other variables may not. For example, customers can be segmented according to whether they have credit cards, since features such as credit card utilization, credit card vintage, or number of enquiries can serve as sound risk predictors.

The appropriate segmentation schema should be decided by performing a thorough exploratory analysis of all three methodologies. The final schema may eventually be a hybrid of the approaches described above. A suitable segmentation schema helps ensure that the defined risk objectives are met and that the resultant algorithm is accepted by the business stakeholders. Irrespective of the segmentation schema, it is preferable for segmentation variables to be categorical. When ordinal or continuous variables are used as key segmentation variables, one must be sure of their stationarity; otherwise the segmentation choice may lead to model underperformance.
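A lightweight exploratory pass over these ideas might look like the sketch below. CHAID is not available in scikit-learn, so a shallow CART tree is used purely as a stand-in for surfacing event-rate-based candidates, while a one-way ANOVA checks whether a candidate driver differs across proposed business segments. All file and column names are hypothetical.

```python
# Minimal sketch: exploratory checks for candidate segmentation schemas.
import pandas as pd
from scipy.stats import f_oneway
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("portfolio.csv")  # columns: region, occupation, utilization, defaulted

# 1) Event-rate-based candidates: a shallow tree on a few coarse attributes
#    (a stand-in for CHAID, which scikit-learn does not provide).
X = pd.get_dummies(df[["region", "occupation"]], drop_first=True)
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=1000)
tree.fit(X, df["defaulted"])
print(export_text(tree, feature_names=list(X.columns)))

# 2) Predictor-based check: does a driver such as utilization behave differently
#    across proposed segments (e.g. salaried versus self-employed)?
groups = [g["utilization"].dropna() for _, g in df.groupby("occupation")]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA on utilization across segments: F={f_stat:.1f}, p={p_value:.4f}")
```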

2. Event Definition:

With everything else being equal, a model with a shorter prediction window is likely to have higher statistical accuracy (as measured by Gini) than a model with a longer performance window. As a general rule of thumb, the peak delinquency level of a cohort tends to coincide with amortization of 30% to 40%. By the time delinquency peaks, 50% to 80% of all default events that will be experienced during the tenure of the cohort have typically revealed themselves. Ideally, the model's performance window should cover the period of peak delinquency of any cohort. The performance period should be as close as possible to the peak delinquency period, but constraints may arise from data availability and the nature of the product. Home loans, for example, tend to have peak defaults between 24 and 48 months, but the data may not support models with a window of more than 36 months. Optimality is then decided by roll rates and capture rates.

The event definition should be chosen not only on the basis of ease of statistical modelling, but also on its ability to capture the majority of defaults. Only then will it have real business value. Vintage curves can be used to choose the event definition by plotting the cumulative number of defaults or the outstanding amount of defaulting loans. A more widely adopted approach is to choose the definition that maximizes both the roll rate and the capture rate.
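To make the capture-rate idea concrete, the following sketch computes, for several candidate performance windows, the share of lifetime 60 DPD events already visible within the window. The input layout (one row per loan with the month on book of its first 60 DPD event) is an assumption for illustration.

```python
# Minimal sketch: capture rate by candidate performance window, i.e. the share
# of lifetime 60+ DPD events already visible within each window.
import pandas as pd

# One row per loan; months_on_book_at_first_60dpd is NaN for loans that never hit 60 DPD.
df = pd.read_csv("defaults.csv")

lifetime_defaults = df["months_on_book_at_first_60dpd"].notna().sum()

for window in [12, 18, 24, 36]:
    captured = (df["months_on_book_at_first_60dpd"] <= window).sum()
    print(f"{window}-month window captures "
          f"{captured / lifetime_defaults:.0%} of lifetime defaults")
```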

The point of choosing an apt definition is to clearly articulate the model's business purpose, one that holds true even when measuring temporal decay (i.e., the model rank-orders outside the event-definition period as well). Hence, the event definition should serve the dual objectives of statistical accuracy and business risk considerations. Doing so greatly enhances the value of the model in the real world.

3. Gating Rules:

In any loan portfolio there are always customers on the brink of defaulting. Such high-risk populations can be easily identified through the use of “gating rules”:

Gating rules identify small pockets of the population (roughly 0.5% to 1% of the complete portfolio) with extremely high default rates. These rules are formed using composite features involving multiple variables. They tend to capture dimensions such as heightened recent credit hunger or a clear deterioration of credit behavior short of actual default. An exhaustive set of candidate gating rules or composite features is usually the outcome of a curated decision-tree algorithm. The most relevant, important rules are then selected from among hundreds, or sometimes thousands, of such rules using optimization algorithms.

If these high-risk populations are included in the modelling population, they will inflate the Gini of the model, but the model will be of less value in the real world. To be of greater benefit to business users, these small pockets of high-risk populations should be excluded from the modelling population.

Customers flagged by gating rules should be filtered out at the start of the model run. Borrowers whose risk profile is palpably impaired should be removed from the scoring queue early, since the effort to rate them has limited business value. In addition, removing such high-risk profiles from the modelling sample forces the model parameter estimation process to work harder to identify the "non-obvious" defaults. Of course, removing such easily identifiable high-risk profiles from the modelling sample tends to moderate model-accuracy measures such as Gini. However, the temptation to see a higher Gini must be resisted in order to achieve a more robust predictive model.
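A simplified version of this workflow might look as follows: a shallow decision tree proposes candidate pockets, tiny leaves with extreme event rates are kept as gating rules, and the flagged accounts are removed before the main model is built and scored. Feature names and thresholds here are hypothetical and purely illustrative.

```python
# Minimal sketch: mining candidate gating rules from a shallow tree and removing
# the flagged accounts before the main model is built and scored.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("portfolio.csv")
features = ["recent_enquiries", "months_since_last_delinquency", "utilization"]

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=500)
tree.fit(df[features], df["defaulted"])

# Event rate and population share of each leaf of the tree.
leaf_id = tree.apply(df[features])
stats = (pd.DataFrame({"leaf": leaf_id, "defaulted": df["defaulted"].values})
           .groupby("leaf")
           .agg(event_rate=("defaulted", "mean"), share=("defaulted", "size")))
stats["share"] /= len(df)

# Candidate gating leaves: tiny pockets (<1% of the book) with extreme event rates.
gating_leaves = stats[(stats["share"] < 0.01) & (stats["event_rate"] > 0.5)].index
gated = np.isin(leaf_id, gating_leaves)

modelling_sample = df[~gated]  # the main model is built and scored on this population
auto_refer = df[gated]         # handled by policy rules rather than by the score
```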

4. Enhanced Use of Risk Scores:

As noted above, risk models were, until recently, built using logistic regression (LR) for go-no-go credit decisions. Amid the rising demand for personalization and underwriting treatments differentiated by risk profile, models must now be able to deliver more granular results:

A well-designed risk model must support allocation of customers to specific underwriting workstreams. This is achieved by integrating risk scores with customer journeys, thereby helping to place customers in the appropriate treatment buckets. This approach typically makes the algorithm widely acceptable to business stakeholders.

Risk scores are used as input in other critical use cases such as pricing algorithms, credit-line estimation, and loan-tenure decisions. As mentioned in Part 1 of this series, there is scope for the judicious application of ML. ML model scores may be ensembled over foundational LR scores. While the improvement in predictive accuracy may not be large, the impact becomes significant for high-volume players. Increased accuracy also translates into higher business relevance by enabling differentiated treatment of customers.
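One simple way to realize such an ensemble, sketched here under assumed data and column names, is to feed the log-odds of the foundational LR model into a gradient-boosted model alongside the raw features and compare the Gini of the two. This is only one of several possible ensembling schemes.

```python
# Minimal sketch: layering a gradient-boosted model on top of a foundational
# logistic-regression score. Assumes a numeric, pre-processed modelling sample
# with a binary "defaulted" target; all names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("modelling_sample.csv")
features = [c for c in df.columns if c != "defaulted"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["defaulted"], test_size=0.3, random_state=42)

# Foundational, interpretable LR score.
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ML layer: the LR log-odds becomes an input alongside the raw features.
X_train_ens = X_train.assign(lr_score=lr.decision_function(X_train))
X_test_ens = X_test.assign(lr_score=lr.decision_function(X_test))
gbm = GradientBoostingClassifier().fit(X_train_ens, y_train)

for name, p in [("LR only", lr.predict_proba(X_test)[:, 1]),
                ("LR + GBM ensemble", gbm.predict_proba(X_test_ens)[:, 1])]:
    print(f"{name}: Gini = {2 * roc_auc_score(y_test, p) - 1:.3f}")
```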

Conclusion

By spending substantial time at the start of the design process striking a balance between business value and statistical relevance, data scientists can greatly enhance a model's eventual business value. Choosing the appropriate segmentation schema, chalking out a suitable event definition, gating high-risk populations upfront, and deploying differentiated treatment all enhance the acceptance of risk models among business stakeholders. Fixating solely on model-accuracy measures, to the exclusion of everything else, may not achieve optimal business value. By asking the questions posed in this article before venturing into building credit risk models, data scientists can increase the likelihood that the models will have significant value in the real world.

In Part 3 in this series, we will address holistic model validation, examining the range of validation techniques, what they imply for the model performance, and their bottom-line business impact.
