Detecting leakage in machine learning pipelines using NaNs/complex numbers
A simple and precise way to detect data leakage
Data leakage in machine learning pipelines can wreak havoc on your model. In this post, I’m going to share an amazingly simple way to detect data leakage using NaNs and complex numbers while treating your ML pipeline as a black box. I’ll talk very briefly about what data leakage is. I’ll also talk about leak-detect, a Python package I’m releasing to do all this in one line of code.
A quick intro to data leakage
The most precise way to describe data leakage could be this:
Data leakage in an ML model occurs when data used to create predictor variables during training time is unavailable at the time of inference.
Clearly, training on data (features) that is unavailable at inference time leads to the model underperforming in production. This underperformance could mean millions of lost dollars, depending on the scale of your company!
An example of leakage
What are some ways feature creation pipelines can introduce data leakage?
- Using the target, or data used to create the target, for feature engineering.
- Using data from future periods for feature engineering.
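To make the second bullet concrete, here is a minimal sketch, assuming pandas and a made-up daily sales series. The names `leaky_feature`, `safe_feature`, and `uses_future_data` are hypothetical illustrations and are not the leak-detect API. The check overwrites every row after time `t` with NaN and then looks at whether the feature value at `t` changes. If it does, the feature must have read from the future.

```python
import numpy as np
import pandas as pd


def leaky_feature(sales: pd.Series) -> pd.Series:
    # Centered 3-step mean: the value at t uses t-1, t, and t+1 (future data).
    return sales.rolling(window=3, center=True, min_periods=1).mean()


def safe_feature(sales: pd.Series) -> pd.Series:
    # Trailing 3-step mean: the value at t uses only t-2, t-1, and t.
    return sales.rolling(window=3, min_periods=1).mean()


def uses_future_data(make_feature, series: pd.Series, t: int) -> bool:
    """Return True if the feature at position t changes when rows after t are NaN."""
    # Feature computed on the untouched series.
    baseline = make_feature(series).iloc[t]

    # Poison everything strictly after t with NaN (hypothetical check, not leak-detect).
    poisoned = series.astype(float).copy()
    poisoned.iloc[t + 1:] = np.nan
    perturbed = make_feature(poisoned).iloc[t]

    # Any difference, including a value turning into NaN, signals a dependence on the future.
    return bool(not np.isclose(baseline, perturbed, equal_nan=True))


# Made-up data for illustration only.
sales = pd.Series([10.0, 12.0, 9.0, 15.0, 11.0, 14.0])
print(uses_future_data(leaky_feature, sales, t=2))  # True
print(uses_future_data(safe_feature, sales, t=2))   # False
```

Note that the feature code is treated as a black box here: the check only looks at inputs and outputs, never at how the feature is computed.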