Missing Data? Use Explainable AI to Fill the Gaps (Correctly)
Random Forests, XGBoost, and Neural Networks for interpolation without the black box problem
I used to work as a scientist in theoretical particle physics. It was super data-heavy, but I generated most of that data myself with Monte Carlo simulations. If a data point was missing, I just generated a new one.
These days, I work with real-world data in finance. I process tons of data points that companies produce, and figure out what they mean for a company’s ability to make money.
You might be surprised to hear this, but even blue-chip companies can be incredibly sloppy about reporting consistent data. This complicates my job, and I know that for many other data professionals out there the situation is even worse.
If you’ve ever built a model from imperfect data and found that it only returned garbage, then you know exactly what I mean. Garbage-in-garbage-out is true. But incomplete-in-garbage-out is true, too. Sadly.
Many data scientists respond to this challenge with simple tools. Techniques like mean imputation and forward fill are easy to implement. But these techniques are just band-aids, not a cure. In many cases, they improve models but still lead to distorted results.
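To make the distortion concrete, here is a minimal sketch using pandas. The `revenue` series and its values are made up for illustration: a steadily growing quarterly figure with two gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly revenue with two missing reports (illustrative values only).
revenue = pd.Series([100.0, 110.0, np.nan, 130.0, np.nan, 150.0])

# Mean imputation: every gap gets the average of the observed values.
mean_filled = revenue.fillna(revenue.mean())

# Forward fill: every gap repeats the last observed value.
forward_filled = revenue.ffill()

print(mean_filled.tolist())     # gaps replaced by the observed mean
print(forward_filled.tolist())  # gaps replaced by the previous value

# Mean imputation adds points with zero deviation from the mean,
# so the spread of the filled series is smaller than that of the observed data.
print(revenue.std(), mean_filled.std())
```

Neither filled series follows the upward trend through the gaps: mean imputation drops the same average value into both holes, and forward fill flattens each gap to the previous quarter. Mean imputation also reduces the standard deviation, which is one way these quick fixes can skew whatever model is trained on the result.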