Data transformations for modeling loan default

Peter Oliver Caya
Peter Caya’s Blog
Sep 27, 2022


I’ve been doing some work in my free time to create models which forecast loan performance for banks using macro-economic variables. Beyond obtaining bank-level data on loan performance, the most crucial part of this has been choosing which economic data to include as features, and more specifically, what steps need to be taken to ensure the stability and out-of-sample performance of the model.

Macro-economic data is free, widely available, comes from reputable sources, and is clean, but the trends and distributions of the data mean each series needs to be evaluated in the context of the modeling method to decide if transformations are necessary. Ignoring the underlying theory of the modeling method can lead to a false sense of security, degradation of the model over time, or counterintuitive model behavior.

This post will highlight the effects that can come from using untransformed data and review some common transformations. Finally, fractional differencing will be introduced, along with its motivation and its potential advantages over more common transformations.

Motivation for transforming data series

To start, we can consider the home price index (HPI) and the unemployment rate as key drivers of the performance of residential real estate loans. These features are both easily accessed and intuitively connected to events like loan delinquency. I’ve downloaded the economic data series from FRED using the fredr library, and the share of residential loans that are 90+ days past due (90+ DPD) from the FHFA’s National Mortgage Database page. I’ve plotted the results below for comparison. Because these three series have very different scales, I’ve normalized the data so that it fits on a common scale for the purpose of comparison.
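
As a rough sketch, the FRED pulls and the normalization might look something like the following in R (the series IDs and the min-max rescaling used here are assumptions on my part; the FHFA 90+ DPD series comes from a separate CSV download):

    library(fredr)

    # A FRED API key is assumed to be available in the environment
    fredr_set_key(Sys.getenv("FRED_API_KEY"))

    # Illustrative series IDs: all-transactions HPI and the unemployment rate
    hpi   <- fredr(series_id = "USSTHPI", frequency = "q")
    unemp <- fredr(series_id = "UNRATE", frequency = "q")

    # Min-max rescaling so that all series fit on a common 0-1 scale for plotting
    rescale01 <- function(x) (x - min(x, na.rm = TRUE)) /
      (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))

    hpi$scaled   <- rescale01(hpi$value)
    unemp$scaled <- rescale01(unemp$value)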

Comparison of 90+ DPD with relevant economic data

We can see from the chart that there is a clear relationship: as unemployment climbs and HPI falls, the rate of 90+ DPD increases. This matches our theoretical expectations. The clear relationship makes fitting a linear regression model a tempting solution, but this approach also allows the structure of the data to influence the regression coefficients for our model.

Consider HPI:

Plot of historical HPI

We can see that the HPI values vary, but there is almost always an upwards trend. Fitting a model on this data will likely cause it to be overfit to the time period, and the quality of the results is prone to degrading outside the training range. This is especially likely if there is a decline in HPI, which is problematic for risk models because the model accuracy would degrade exactly where it would be needed most!

Linear regression will only assess the average impact of the HPI on the rate of 90+ DPD and will lack sensitivity to rises or drops. This is also a problem for theory, since in the past the trigger for high rates of 90+ DPD loans has been a sharp drop in HPI, not the level of the variable itself.

Theoretical issues for macro-economic data

The untransformed data may also be problematic for reasons that conflict with model theory and degrade performance. Linear regression makes several assumptions about the data, including:

  1. Linearity between the variables used for the model and the dependent variable.
  2. Full rank — none of the features is composed of a linear combination of one or more other features.
  3. The error terms for the model are not a function of the independent variables used to construct it.
  4. The values of the data can’t be explained using earlier data points, i.e., I can’t predict the current HPI value using last quarter’s value.
  5. A constant variance is maintained in the data.

Two important issues that often lead to all the other problems in macro-economic data are autocorrelation and nonstationarity, which are described in points 4 and 5.

If we fit a simple model on an autocorrelated time series, we will have an important component of the input data that we aren’t accounting for, which will impact accuracy and model behavior. The theory of many models also assumes that the data is stationary: that the data has a constant variance over time and no trend.

I’ve created two charts below to show that the HPI series contains both of these elements. The first chart plots the correlation of the HPI data with itself over a set of different lags. The second chart plots the variance over the life of the HPI data using a rolling window. We can easily see that as time goes on, the variance rises.

ACF plot of HPI data
Plot of HPI series variance using a window of 8 quarters
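
For reference, a minimal sketch of how these two diagnostics can be produced in R (using the zoo package for the rolling window; hpi$value refers to the HPI series pulled in the earlier sketch):

    library(zoo)

    # Autocorrelation of the HPI series over a range of lags
    acf(hpi$value, lag.max = 20, main = "ACF of HPI")

    # Rolling variance over an 8-quarter window
    rolling_var <- rollapply(hpi$value, width = 8, FUN = var,
                             align = "right", fill = NA)
    plot(rolling_var, type = "l",
         main = "HPI variance over an 8-quarter rolling window")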

There is a rich and well-developed literature on detecting and interpreting autocorrelation and nonstationarity so I won’t perform a deep dive on the subject. To read further, I recommend A Very Short Course on Time Series Analysis and Econometric Analysis.

Now that we’ve explored these considerations, it’s time to determine how we can transform this data to create a more accurate and interpretable model.

Commonly used data transformations

We’ve discussed the impact that untransformed data can have on a model. The next step is to discuss how we can remove these structural issues from the data before modeling. We’ll start by reviewing the most common variable transformations and the impact that they have.

The most commonly used transformation is simple differencing of the data:
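
With x_t denoting the value of the series at time t:

    Δx_t = x_t − x_{t−1}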

A simple alternative would be to use the percentage change of the data:
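
In the same notation:

    (x_t − x_{t−1}) / x_{t−1}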

A common example is the transformation of the home price index into home price appreciation (HPA):
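
Using quarter-over-quarter changes (the exact convention is an assumption here; HPA is sometimes quoted year-over-year or annualized):

    HPA_t = (HPI_t − HPI_{t−1}) / HPI_{t−1}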

Comparison of HPI, differenced HPI, and HPA

We can see from the comparison above that there is a lower level of autocorrelation in the HPA data than in the untransformed data.

Fractional differencing

The transformations described in the previous section are effective at removing problematic qualities like autocorrelation and at making the data series stationary, but they also cause the transformed data to lose its resemblance to the original series. This is typically not problematic, but it may influence the accuracy of the model that’s ultimately generated.

To arrive at the concept of fractional differencing, we start with the formula for 1 period differencing shown earlier:
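
    x_t − x_{t−1}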

Let’s define the backshift (lag) operator B below:
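
    B x_t = x_{t−1},   and more generally   B^k x_t = x_{t−k}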

The differencing formula from earlier becomes
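
    (1 − B) x_t = x_t − x_{t−1}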

Defining B allows us to treat it algebraically. We can raise (1 − B) to an arbitrary power d and expand it with the binomial series:
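
    (1 − B)^d = Σ_{k=0}^{∞} (d choose k) (−B)^k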

This series can be used to create our list of weights for lags:
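
    (1 − B)^d x_t = Σ_{k=0}^{∞} ω_k x_{t−k},   where ω_k = (−1)^k (d choose k)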

The weights can be simplified to the equation below:

Recursive equation for lag weights
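
    ω_0 = 1,   ω_k = −ω_{k−1} · (d − k + 1) / k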

We can leverage the framework above to difference data in a manner which isn’t restricted to integer orders (a d of 1 or 2). This strategy is referred to as fractional differencing, and it allows users to balance between simply differencing the data and leaving it untransformed.

Consider a trivial example where we use a difference of 1. In this case the weights become:
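
    ω_0 = 1,  ω_1 = −1,  ω_k = 0 for k ≥ 2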

We apply these values as weights to transform the original data, which yields:
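
    x_t − x_{t−1}

In other words, we recover the simple difference from before.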

A more useful example is letting d = 0.4, which produces the following weights:

Weights to be used when d = .4
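
Using the recursion above, the first few weights for d = 0.4 work out to approximately:

    ω_0 = 1,  ω_1 = −0.4,  ω_2 = −0.12,  ω_3 = −0.064,  ω_4 ≈ −0.042, ...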

I’ve implemented this rule in the code below (adapted to R from Marcos Lopez de Prado’s Python code):

Code to generate the weights to be used in fractional differencing
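
A minimal sketch of that adaptation (the function name frac_weights and its arguments are illustrative choices of mine):

    # Generate the first n_weights fractional-differencing weights for order d
    # using the recursion w_k = -w_{k-1} * (d - k + 1) / k, with w_0 = 1.
    frac_weights <- function(d, n_weights) {
      w <- numeric(n_weights)
      w[1] <- 1
      if (n_weights > 1) {
        for (j in 2:n_weights) {
          k <- j - 1  # R is 1-indexed; position j holds weight w_k
          w[j] <- -w[j - 1] * (d - k + 1) / k
        }
      }
      w
    }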

Using the code above, we can demonstrate the way the lag weights change as the parameter d is varied:
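
A comparison along those lines can be produced with something like the following (the choice of d values is illustrative):

    d_values <- c(0.1, 0.4, 0.7, 1)
    weight_table <- sapply(d_values, frac_weights, n_weights = 10)
    colnames(weight_table) <- paste0("d = ", d_values)

    matplot(0:9, weight_table, type = "l", lty = 1,
            xlab = "Lag", ylab = "Weight")
    legend("bottomright", legend = colnames(weight_table), lty = 1, col = 1:4)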

Note that the weights trail off as the lag values get larger. This demonstrates that data from the past is still being factored in. The lower the d value, the more relative weight is given to observations in the past compared to recent observations.

For practical purposes, we truncate the weights; this is referred to as a fixed fractional window. You can find my code implementing the fixed fractional window and applying it to data below. This strategy is again adapted to R from de Prado’s book.
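
A sketch of that fixed-window approach, building on the frac_weights function above (the truncation threshold tau is an illustrative choice):

    # Fractionally difference a series with a fixed-width window:
    # drop weights whose absolute value falls below tau, then roll
    # the remaining window of weights across the series.
    frac_diff_ffd <- function(x, d, tau = 1e-4) {
      w <- frac_weights(d, n_weights = length(x))
      w <- w[abs(w) >= tau]
      width <- length(w)

      out <- rep(NA_real_, length(x))
      for (t in width:length(x)) {
        # w[1] (= w_0) multiplies the current observation,
        # w[width] multiplies the oldest observation in the window
        out[t] <- sum(w * x[t:(t - width + 1)])
      }
      out
    }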

Procedure

The discussion above raises the question: how do we select d? Here, I’ve implemented and presented results based on Marcos Lopez de Prado’s book, which suggests the following framework:

  1. Create alternate sets of fractionally differenced data using a range of differencing orders, d.
  2. Use the Augmented Dickey-Fuller (ADF) test over each set of transformed data.
  3. Select the lowest d-value that allows us to reject the null hypothesis of non-stationarity.

I’ve implemented these steps in the function below:
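
A sketch of that search, using adf.test from the tseries package together with the frac_diff_ffd function above (the candidate grid for d and the 5% significance cutoff are choices of mine):

    library(tseries)

    # Return the smallest d in a candidate grid whose fractionally
    # differenced series rejects the ADF null of non-stationarity.
    find_min_d <- function(x, d_grid = seq(0, 2, by = 0.05), alpha = 0.05) {
      for (d in d_grid) {
        fd <- frac_diff_ffd(x, d)
        fd <- fd[!is.na(fd)]
        p_value <- suppressWarnings(adf.test(fd)$p.value)
        if (p_value <= alpha) {
          return(list(d = d, adf_p_value = p_value))
        }
      }
      NULL  # no candidate in the grid achieved stationarity
    }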

While the framework shown above is straightforward, it isn’t the final word on how much fractional differencing to use. For instance, we could use the KPSS test instead of ADF. I’ve plotted an example of the output when used on a monthly series of S&P 500 data below:

I used this heuristic to select values of d for monthly HPI and S&P 500 data. The results are presented in the charts below along with the untransformed data and the data after simple differencing.

We can see two separate outcomes in the charts above. For HPI, it actually took a fractional difference greater than 1 to achieve stationarity, and we arrive at a value which is very similar to simple differencing. For the S&P 500, d = 0.34. When we use this value of d in the fractional differencing function, we arrive at a series which is stationary but also contains the general trend information originally present in the untransformed data.

Conclusion

The most important component of building any model is to ensure that quality data is used appropriately. High quality macro-economic data can easily be found, but there are practical and technical barriers to using it directly to create forecasts. At best, nonstationary or autocorrelated data can create models that are missing valuable information. At worst, it can create models which give a false sense of security and aren’t fit to forecast outside of the training range.

The econometric literature provides a wide range of alternative methods that can be used by a practitioner with domain knowledge to generate features which are intuitive and effective at explaining the phenomenon being modeled. I discussed a few of the more basic strategies, like differencing.

While strategies like differencing and taking the percentage change are intuitive and often effective at removing nonstationarity and autocorrelation, they also sacrifice information about the general trend of the data series that might be useful in the model. Fortunately, Hosking introduced a strategy called fractional differencing which allows a modeler to strike a balance between preserving the general trend of the data and removing problematic elements. In this post, I used notes from Marcos Lopez de Prado’s book to create a practical implementation of this strategy and compare it to more standard strategies.

The results showed that in some cases, it may help the modeler preserve the general shape of the feature while removing autocorrelation and enforcing stationarity. Using the Augmented Dickey-Fuller test to determine what parameter value balances between transforming the data and preserving memory provides a useful heuristic, but other strategies, like examining autocorrelation or using other tests of stationarity, may also be useful.

Resources and References

All code used to generate analysis for this post is located in this post’s repository.

I consulted the following books, articles, and blog posts for this article:

  1. Advances in Financial Machine Learning — Marcos Lopez de Prado
  2. Macroeconomic Forecasting with Fractional Factor Models — Tobias Hart
  3. Preserving Memory in Stationary Time Series — Simon Kuttruf
  4. Fractional Differencing — J.R.M. Hosking
