Advantages of Models Based on Probabilistic Rules for Small Data Sets
by Tobias Schaefer, Norvard Khanaghyan, and Dmitry Lesnik
How can you make accurate predictions in times of unprecedented change? Statistical machine learning works well if we have sufficient statistics — but what if we need to make decisions based on models that have been developed and calibrated on a small data set? This is particularly important in times of crisis where conditions can change rapidly and, therefore, historical data becomes unreliable when going too far back in time. Therefore, when choosing an algorithm to develop predictive models, it is crucial to pick the most suitable to solve the problem at hand and, obviously, different approaches or algorithms have different strengths: In this post, we present a comparison of models based on probabilistic rules to the performance of decision trees and random forests based on an case study featuring Fannie Mae mortgage data. In fact, we have observed similar results for a variety of different cases, but, to best demonstrate these results, we will focus on this data set due to its relevance to problems facing many of our customers today.
An Introduction of Probabilistic Rules vs Decision Trees and Random Forests
Decision trees can be thought of as a set of mutually exclusive rules. They are popular because of their transparency, in particular for data sets with a small number of attributes. For larger problems, however, the exponential growth of the number of branches makes them less attractive. Also, for some cases, they are known to be prone to overfitting and lack of accuracy.
Random forests are ensembles of decision trees and the classification problem is solved via averaging over the ensemble. This usually helps to reduce the overfitting and improve on accuracy, but for random forests to work well, usually a larger data set for training is required. Also, random forests are much less transparent than decision trees.
Models based on probabilistic rules combine properties of decision trees and random forests: The rules are entirely transparent and interpretable and they are combined using weights and the laws of probabilistic logic. For more background on how to create such models, we refer to our previous posts. They are, in some sense more complex to build, but, once created, they combine the advantages of both decision trees and random forest by being transparent, accurate, and much less vulnerable to overfitting. Moreover, which is a very important aspect for practical applications, the number of rules scales linearly with the number of variables or training set size.
A Case Study on Mortgage Data
In order to illustrate these ideas using a concrete example, we developed models for mortgage defaults using the data from the second quarter of 2010 of the Fannie Mae mortgage dataset. Out of 23440 data points and 14 variables, 5 subsets of the data were retrieved, namely 500, 1000, 2000, 5000, and 10000 rows. Aside from probabilistic rules — implemented via Stratyfy’s Probabilistic Rule Engine (PRE) — we trained decision trees, random forests, and also logistic regression for different data sizes. The following table summarizes the results in terms of the AUC, together with overfitting performance in parenthesis. We measure overfitting as percentages of the difference of the AUC between training sets and test sets.
As expected, for small data sizes up to 2000 rows, the probabilistic rules implemented via the PRE routinely outperform all of the other algorithms. Once we enter the regime of 5000 and 10,000 rows, the performance of the algorithms levels out and becomes comparable.
An additional important aspect of the PRE is the fact that the rule-based approach offers the possibility to incorporate expert knowledge via additional rules to the decision algorithm. In particular when times are uncertain and past performance is not a good predictor of future performance, inclusion of domain knowledge in the decision process is crucial. We will discuss details and examples related to this topic in a future post.
When dealing with rapidly changing market conditions, where the past is not always indicative of the future, probabilistic rules offer a highly attractive alternative to decision trees and random forests. Probabilistic Rules combine the advantages of both of them leading to superior performance. This can in particular be an advantage in changing market conditions, where the reliable part of the historical data is small.
CONTACT INFO: If you’d like to learn more about how Stratyfy can help you develop transparent rule-based models, please reach out to me or email@example.com to see how we can help.