A Data Scientist’s Guide to Building Credible Risk Models — Part 1

Sumit Arya
GAMMA — Part of BCG X
6 min read · Jun 17, 2020

Written by Deep Narayan Mukherjee and Sumit Arya

Advanced machine learning (ML) algorithms are making steady inroads into critical aspects of banking such as fraud detection and collections. Meanwhile, predictive modeling in credit risk management continues to rely primarily on traditional approaches such as Logistic Regression (LR) and, to a lesser extent, Generalized Additive Models (GAM). ML has the potential to significantly enhance the performance of LR-based risk scores, but its use in credit risk modeling has been limited by hesitancy among risk managers. Several factors may explain this reluctance:

· Regulatory discomfort: Low comfort levels among regulators with ML models, attributable mostly to their limited interpretability and explainability

· Limited understanding of advanced ML: Lack of understanding among business stakeholders of the analytical aspects of ML and how it differs from conventional techniques

· Algorithm focus over risk-outcome focus: Overemphasis among data scientists on utilizing the latest ML techniques at the cost of fully understanding risk aspects

In this first installment of a three-part series, we address the third reason and explain how ML practitioners can develop predictive models that keep fundamental risk considerations at the forefront.

To do so, we focus on two key facets of risk models that can help make ML models more acceptable to risk practitioners: Model Design and Holistic Model Validation.

Aspects of Model Design

A substantial amount of time at the start of the design process should be spent understanding the precise risk problem the model is expected to solve. How exactly the model will be deployed and used also needs careful consideration. All too often, though, the choice of ML algorithm is driven by a “newer is better” credo or by a myopic focus on higher accuracy. Overemphasizing accuracy as the ultimate measure of model performance is sub-optimal and potentially risky, and at times it can reduce the business benefit derived from the model.

While accuracy is a critical facet of a risk model, it is not the only one. The following model design features should be considered, balancing accuracy against other risk considerations:

· Event Definition: Vital to credit risk modeling is the process of “defining the event.” Because “higher GINI” is often the predominant factor in model selection, modelers sometimes choose a shorter performance period when building the risk model. With other factors held constant, a model designed to predict over a 9–12-month performance period is likely to have a much higher GINI than a model designed for a 15–18-month period. But if 50–60% of defaults happen outside the shorter, model-designated event observation period, the model will have limited business value: its temporal decay will be poor, making it worth less in the real world.

Alternatively, the event and the event observation period may be determined mathematically, by calculating roll rate (precision) and capture rate (recall) across candidate windows and finding the optimal intersection point between the two; a minimal sketch of this search appears after this list. Over and above the mathematical optimization, the performance period ultimately chosen should be agreed with the underwriting and collections teams.

· Handling imminent defaults and high-risk populations: In any lending portfolio there are always borrowers on the brink of default. Such high-risk populations can be identified easily, without models or algorithms, and they can just as easily be excluded from the modeling population using gating rules.

Most modelers apply such exclusions, removing, for example, customers who defaulted in the immediate past or who are currently in default; a sketch of these gating rules also follows the list. Some modelers, however, keep obviously high-risk populations in the modeling sample, which tends to inflate GINI but reduces business value. To be of greater benefit to business users, models should excel at predicting the defaults that are less obvious.

· Enhanced use cases of risk scores: Until recently, most risk models supported only go/no-go credit decisions. With rising demand for personalization and for underwriting treatments differentiated by risk profile, models are now expected to do much more. For example, a well-designed risk model must support the allocation of customers to different underwriting workstreams. This can be ensured by creating swim lanes that integrate risk scores with customer journeys, thereby placing customers in appropriate treatment buckets; an illustrative swim-lane sketch appears below as well. Risk scores must also support pricing and credit-limit allocation.
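
For the event-definition trade-off above, here is a minimal sketch of the roll-rate/capture-rate intersection search, run on synthetic data. The 30+ DPD trigger, the charge-off outcome, and every threshold are illustrative assumptions, not a prescription:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for account-level history over a 36-month horizon:
#   first_30dpd_month : month the account first went 30+ days past due (NaN if never)
#   charged_off       : 1 if the account eventually charged off, else 0
n = 20_000
first_30dpd = np.where(rng.random(n) < 0.15, rng.integers(1, 37, n), np.nan)
charged_off = (~np.isnan(first_30dpd) & (rng.random(n) < 0.5)).astype(int)
hist = pd.DataFrame({"first_30dpd_month": first_30dpd, "charged_off": charged_off})

def roll_and_capture(hist: pd.DataFrame, window: int):
    """Precision/recall of the event definition '30+ DPD within `window` months',
    judged against the lifetime outcome (eventual charge-off)."""
    flagged = hist["first_30dpd_month"] <= window   # event under this definition
    bad = hist["charged_off"] == 1                  # lifetime defaulters
    hits = (flagged & bad).sum()
    roll_rate = hits / max(flagged.sum(), 1)        # precision: flags that roll on to default
    capture_rate = hits / max(bad.sum(), 1)         # recall: defaults caught in the window
    return roll_rate, capture_rate

windows = range(3, 25, 3)
curves = pd.DataFrame([(w, *roll_and_capture(hist, w)) for w in windows],
                      columns=["window_months", "roll_rate", "capture_rate"])

# Roll rate typically falls and capture rate rises as the window lengthens;
# the window where the two curves cross is a natural starting candidate.
curves["gap"] = (curves["roll_rate"] - curves["capture_rate"]).abs()
print(curves)
print("candidate window:", int(curves.loc[curves["gap"].idxmin(), "window_months"]))
```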
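
The gating rules for imminent defaulters are typically plain boolean filters applied before any model sees the data. A sketch with invented column names and thresholds:

```python
import pandas as pd

# Tiny illustrative portfolio snapshot; columns and cut-offs are invented.
portfolio = pd.DataFrame({
    "account_id":        [1, 2, 3, 4],
    "days_past_due":     [0, 75, 10, 0],
    "defaults_last_12m": [0, 0, 1, 0],
})

# Gating rules: deterministic filters that route obviously high-risk accounts
# out of the modeling sample before any algorithm is trained or scored.
gated = (portfolio["days_past_due"] >= 60) | (portfolio["defaults_last_12m"] > 0)

rule_managed = portfolio[gated]    # handled by rules / collections, not the model
model_sample = portfolio[~gated]   # population the model is trained and scored on

print(f"gated out {gated.mean():.0%} of accounts before modeling")
```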
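
And a minimal swim-lane sketch; the PD bands, lane names, and limit-multiplier rule are invented for illustration and would in practice be set jointly with the underwriting and collections teams:

```python
import pandas as pd

# Illustrative scored applications; pd_score is the model's predicted PD.
scored = pd.DataFrame({"app_id": range(6),
                       "pd_score": [0.01, 0.03, 0.05, 0.12, 0.25, 0.60]})

# Swim lanes: map score bands to underwriting treatments.
bands = [0.0, 0.02, 0.08, 0.20, 1.0]
lanes = ["auto-approve", "standard underwriting", "manual review", "decline/refer"]
scored["swim_lane"] = pd.cut(scored["pd_score"], bins=bands, labels=lanes,
                             include_lowest=True)

# The same score can feed pricing and limit allocation, e.g. an illustrative
# credit-limit multiplier that shrinks as predicted risk rises.
scored["limit_multiplier"] = (1.0 - 2.0 * scored["pd_score"]).clip(lower=0.25)

print(scored)
```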

Holistic Model Validation

As noted above, the quest for strong models often focuses on increasing “accuracy.” Accuracy is often described as a model’s “power,” in this case its power to differentiate between “events” and “non-events,” and it is measured by metrics such as GINI (ROC, CAP), KS, and precision/recall; a small helper for GINI and KS appears after the list below. Accuracy is, of course, a prominent attribute of a model, but it should be balanced with model stability and interpretability. The efficacy of a risk model can be assessed more holistically if the following incremental measures of model performance are also considered:

· Rank ordering of events with score: This is standard practice in risk management, though it is often overlooked by ML enthusiasts. Rank ordering is a good indicator of risk differentiation across the predicted score buckets, and when it holds across test and out-of-time samples it also helps assess a model’s performance stability; see the decile-table sketch after this list. When a model is used in risk decision-making, rank ordering determines the specific operating thresholds set on the risk score.

· Temporal decay: The risk differentiation of a score should ideally hold over the entire maturity of the exposure. The ultimate reason for building a model is to reduce loss over the life of the loan, so the model should also be tested for predictive power on events that fall outside the model-defined performance period; a horizon-by-horizon sketch follows this list.

· Reverse bi-variate: Many data scientists consider opaque “black-box” models poor work products and turn instead to “explainable AI” schemes. The transparency of even these schemes is a matter of debate, but it can be improved using simple approaches such as plotting reverse bi-variates: observing the pattern of each individual x-variable (predictor) against the probability predicted by the ML algorithm lets data scientists gauge the business intuitiveness of each predictor. A binned sketch closes the list below.
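
First, the promised helper for the accuracy metrics named above: GINI computed as 2 × AUC - 1, and KS as the maximum gap between the cumulative score distributions of events and non-events. It assumes binary labels and a score where higher means riskier:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gini_and_ks(y_true, y_score):
    """GINI (2*AUC - 1) and the KS statistic for a binary risk score."""
    y = np.asarray(y_true)
    s = np.asarray(y_score)
    gini = 2.0 * roc_auc_score(y, s) - 1.0

    # KS: maximum gap between the cumulative score distributions of events
    # and non-events, scanning thresholds from the lowest score upward.
    order = np.argsort(s)
    y_sorted = y[order]
    cum_events = np.cumsum(y_sorted) / y_sorted.sum()
    cum_nonevents = np.cumsum(1 - y_sorted) / (len(y_sorted) - y_sorted.sum())
    ks = float(np.max(np.abs(cum_events - cum_nonevents)))
    return gini, ks

# Illustrative usage on random data; real labels and scores would come from
# your own pipeline.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
score = 0.3 * y + 0.7 * rng.random(1000)   # loosely correlated with the label
print(gini_and_ks(y, score))
```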
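
Next, a sketch of the rank-ordering check: bucket accounts by predicted score and inspect whether event rates rise monotonically. The random data stands in for a real model’s test or out-of-time output:

```python
import numpy as np
import pandas as pd

def rank_order_table(y_true, y_score, n_buckets: int = 10) -> pd.DataFrame:
    """Event rate per score bucket: a well-behaved risk score shows event
    rates rising monotonically from the lowest-score bucket to the highest."""
    df = pd.DataFrame({"y": y_true, "score": y_score})
    df["bucket"] = pd.qcut(df["score"], q=n_buckets, labels=False, duplicates="drop")
    return (df.groupby("bucket")
              .agg(accounts=("y", "size"), event_rate=("y", "mean")))

# Illustrative usage. Run the same table on test and out-of-time samples and
# compare: broken monotonicity, or buckets whose event rates reshuffle between
# samples, are early warnings on performance stability.
rng = np.random.default_rng(0)
score = rng.random(5000)
y = (rng.random(5000) < score * 0.3).astype(int)  # event probability grows with score
table = rank_order_table(y, score)
print(table)
print("monotonic:", table["event_rate"].is_monotonic_increasing)
```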
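
A sketch of a temporal-decay check: hold the origination score fixed and re-evaluate it against default-by-month-H labels for horizons beyond the design window. The panel here is synthetic; real inputs would be your scores and a long outcome history:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in: a score fixed at origination plus the month of first
# default observed over 36 months (NaN = never defaulted in the window).
n = 10_000
pd_score = rng.random(n)
default_month = np.where(rng.random(n) < pd_score * 0.3,
                         rng.integers(1, 37, n), np.nan)
panel = pd.DataFrame({"pd_score": pd_score, "default_month": default_month})

def gini_by_horizon(panel: pd.DataFrame, horizons=(12, 18, 24, 36)) -> pd.DataFrame:
    """Re-score the same model against default-by-month-H labels for horizons
    beyond the design window, to measure temporal decay."""
    rows = []
    for h in horizons:
        y = (panel["default_month"] <= h).astype(int)  # NaN compares False -> 0
        if 0 < y.mean() < 1:                           # need both classes for AUC
            rows.append({"horizon_months": h,
                         "gini": 2 * roc_auc_score(y, panel["pd_score"]) - 1})
    return pd.DataFrame(rows)

# A gentle GINI decline with horizon is expected; a cliff just past the design
# window suggests the model learned short-horizon artefacts.
print(gini_by_horizon(panel))
```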
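
Finally, a sketch of the reverse bi-variate itself: mean predicted probability per predictor bin. In practice you would run this per predictor, plot the result, and review the shapes with the business team; the predictor and scores below are synthetic stand-ins:

```python
import numpy as np
import pandas as pd

def reverse_bivariate(df: pd.DataFrame, predictor: str,
                      pd_col: str = "pd_score", n_bins: int = 10) -> pd.DataFrame:
    """Mean predicted probability per predictor bin (the 'reverse bi-variate').
    If, say, higher utilization does not map to a broadly rising predicted PD,
    either the model or the business intuition needs a second look."""
    bins = pd.qcut(df[predictor], q=n_bins, duplicates="drop")
    return (df.groupby(bins, observed=True)[pd_col]
              .mean()
              .rename("mean_predicted_pd")
              .reset_index())

# Illustrative usage; 'utilization' stands in for a real predictor.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"utilization": rng.random(5000)})
demo["pd_score"] = 0.05 + 0.2 * demo["utilization"] + 0.02 * rng.random(5000)
print(reverse_bivariate(demo, "utilization"))
```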

Push ML Intervention for the Benefit of All

The current hesitancy among credit risk managers and regulators to rely solely on ML models is likely to endure for the foreseeable future, so the foundation or anchor score in any risk-assessment framework will continue to rely on logistic regression output. The framework’s performance can be enhanced, however, by judiciously superimposing advanced ML models on foundational LR model scores. Doing so improves performance at the margins, but by meaningful amounts, especially for large-volume lenders. ML intervention should therefore be standard practice, at least among larger lenders.
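
One common way to implement such superimposition, offered here as an assumption rather than a description of any specific production setup, is simple stacking: keep the interpretable LR score as the anchor and let a boosted-tree overlay consume its log-odds alongside the raw predictors:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in for a real lending dataset.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.9],
                           random_state=0)

# Anchor: the interpretable LR score risk teams already trust.
anchor = LogisticRegression(max_iter=1000).fit(X, y)
lr_log_odds = anchor.decision_function(X).reshape(-1, 1)

# Overlay: a boosted-tree model that consumes the LR log-odds alongside the
# raw predictors and refines the score at the margins.
overlay = GradientBoostingClassifier(random_state=0).fit(
    np.hstack([X, lr_log_odds]), y
)

def blended_score(X_new: np.ndarray) -> np.ndarray:
    """Score new accounts: anchor LR first, then the ML overlay."""
    log_odds = anchor.decision_function(X_new).reshape(-1, 1)
    return overlay.predict_proba(np.hstack([X_new, log_odds]))[:, 1]

print(blended_score(X[:5]))
```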

To ensure that the practice of risk assessment reaps the benefits of advanced ML, data scientists should work to enhance the credibility and usability of their output. If advanced ML risk models are to pass the justified scrutiny of risk managers, ML practitioners must make the effort to address, in totality, the risk aspects of predictive solutions. By making model design and holistic model performance assessment central facets of their modeling efforts, data scientists can give end users tools that deliver better risk prediction, more tailored customer service, and a stronger bottom line. Stay tuned for parts 2 and 3 of this series, where we will explore model design and model performance assessment in greater depth.
