Everything but the…: Understanding the impact of proxies in risk modeling
Executive Summary for Managers and Leaders:
What is it?: An academic study of risk models (i.e., models that predict the likelihood of some future adverse event). The authors probed the validity of the so-called “kitchen sink” approach to data science: that is, if you throw all available data in, it works out better, because more data is always better. Does this actually result in more effective prediction? The study's findings indicated: not necessarily. Or worse, the kitchen sink approach can lead to inappropriate conclusions.
Why should you care?: Risk modeling frequently relies on “proxies” (i.e., substitutes) for an outcome of interest that may be hard to measure or record in data. For instance, the paper's authors studied models that predict the risk of criminal recidivism in an effort to increase public safety. The problem: what the models actually predicted was the likelihood of future arrest, which is not the same thing as the likelihood of future criminal activity.
The impact of this distinction was lost in part because of the kitchen sink approach: with 150 input features in play, the authors noted that correlations could be identified between those features and the likelihood of future arrest. The effect observed in the study likely stems from multiple, well-documented statistical issues, including the multiple comparisons problem and spurious correlation.
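To make the multiple comparisons point concrete, here is a minimal Python sketch (the sample size, feature count, and threshold below are illustrative assumptions, not the paper's setup): when you screen 150 pure-noise features against an unrelated target, several will look “significant” by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_features = 500, 150

X = rng.normal(size=(n_samples, n_features))  # candidate features: pure noise
y = rng.integers(0, 2, size=n_samples)        # target: unrelated coin flips

# Correlate each feature with the target and count nominally significant hits.
p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_features)])
hits = int((p_values < 0.05).sum())
print(f"{hits} of {n_features} noise features are 'significant' at p < 0.05")
# Expect roughly 0.05 * 150 ≈ 7 false positives, despite zero real signal.
```

A kitchen-sink model fit to all 150 features will happily absorb exactly that kind of noise.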
Despite the study's findings, proxies can still be useful, but they should be studied carefully, chosen intentionally, and accompanied by documented assumptions for future review and testing.
What questions should you be asking your DS/ML folks?:
- What do you understand to be the outcome of interest for this effort?
- How is that outcome represented in the data that we have available today?
- Is the model relying on any proxy features? If so, which, and why were those chosen?
- What assumptions are you making with respect to the proxies that might prove untrue or work against our goal outcome?
Summary for Data Scientists/ML Engineers/The-Technically-Curious:
What is it?: An academic study of the use of proxy features in predictive risk modeling, specifically in criminology and medicine. The authors found that the use of proxy target variables and an overabundance of input features could (not surprisingly) obscure underlying biases, introduce spurious correlations, and drive incorrect inferences. While the use of proxies is prevalent (and frequently necessary) in practical data science, the study's results are a keen reminder that more data is not a corrective for bad data when developing a predictive model.
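To see how a biased proxy label can leak into a risk score, here is a minimal, hypothetical simulation (the group labels, catch rates, and data-generating process are assumptions made for illustration; this is not the paper's data or method). The true outcome of interest is reoffense, but the model is trained on arrests, and one group is policed more heavily:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000
group = rng.integers(0, 2, size=n)  # 0 = group A, 1 = group B (hypothetical groups)
risk = rng.normal(size=n)           # latent risk, identically distributed in both groups

# True outcome of interest: reoffense, driven only by latent risk.
reoffend = rng.random(n) < 1 / (1 + np.exp(-risk))

# Proxy label: arrest requires reoffending AND being caught, and group B is
# policed more heavily (the 0.9 vs. 0.5 catch rates are illustrative assumptions).
caught = rng.random(n) < np.where(group == 1, 0.9, 0.5)
arrest = reoffend & caught

X = np.column_stack([risk, group])
model = LogisticRegression().fit(X, arrest)  # trained on the proxy, not the truth

# The group coefficient comes out clearly positive even though true risk is
# identical across groups: label bias has leaked into the "risk" score.
print("coefficients [risk, group]:", model.coef_[0].round(2))
```

The model is not wrong about arrests; it is wrong about what we wanted the score to mean.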
What is cool about it?: It’s cool to see a structured, principled, scientific study on the impact of the “kitchen sink” approach to data science. To quote one of the paper’s authors, Professor Julian Nyarko:
“Researchers should be mindful of — and be diligent — when they don’t have the data that they really care about but instead only a proxy,” Nyarko counsels. “And, when we have only a proxy, being mindful in our choice of proxy and making the models less complex by excluding certain data can improve both the accuracy and equity of risk prediction.”
A secondary cool thing is the validation that proxies can be useful (if chosen intentionally).
Questions I am thinking about:
- As data scientists, how can we determine whether the proxies that we are choosing for outcomes of interest, while useful, are sufficiently precise or accurate?
- When have you seen feature selection or cross-validation reveal an issue like the ones in this study? (One way scoring only against the proxy can mask the gap is sketched after this list.)
- How well are we helping data scientists develop process understanding alongside methodological understanding, given that process understanding is what helps in identifying useful and accurate proxies?
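On the cross-validation question above, here is a hypothetical continuation of the simulation sketched earlier (again, all data-generating choices are assumptions): held-out evaluation scored only against the proxy label can look reassuring while masking degradation against the true outcome.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 20_000
risk = rng.normal(size=n)
group = rng.integers(0, 2, size=n)
noise = rng.normal(size=(n, 150))  # kitchen-sink features: pure noise

reoffend = rng.random(n) < 1 / (1 + np.exp(-risk))                    # true outcome
arrest = reoffend & (rng.random(n) < np.where(group == 1, 0.9, 0.5))  # biased proxy

X = np.column_stack([risk, group, noise])
X_tr, X_te, arr_tr, arr_te, reo_tr, reo_te = train_test_split(
    X, arrest, reoffend, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, arr_tr)  # fit to the proxy
scores = model.predict_proba(X_te)[:, 1]

# Evaluating against the proxy looks fine; evaluating against the true outcome
# (rarely observable in practice) is what exposes the gap.
print("AUC vs. proxy (arrest):  ", round(roc_auc_score(arr_te, scores), 3))
print("AUC vs. truth (reoffend):", round(roc_auc_score(reo_te, scores), 3))
```

The uncomfortable part is that the true outcome is usually the thing we could not measure in the first place, which is why the choice of proxy deserves so much scrutiny.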
Zanger-Tishler, M., Nyarko, J., and Goel, S. “Risk scores, label bias, and everything but the kitchen sink.” Science Advances 10 (2024). https://www.science.org/doi/epdf/10.1126/sciadv.adi8411