Too good to be true…

So, Thursday last week, after a whole day of struggling, my partner and I created a model with pretty good performance. I was excited to show it to a friend, who is a Machine Learning Engineer. After looking at just the dataset for 30 seconds, she said it was not going to work. And I was like, "What do you mean? My R-squared is 0.96." And she was like, "Yeah, it looks good, but it's not working because you have data leakage!"
What is Data Leakage?
Data Leakage happens when the data you are training on contains information about what you are trying to predict.
Data Leakage usually results in an unrealistic level of performance on the test data, because the model has already somewhat seen that data beforehand and learned from it, and from there can easily deliver an almost perfect prediction. The model appears to work great in evaluation but will not generalize to unseen data.
Another definition of Data Leakage is when data used at training time is unavailable at inference time. In other words, it is the problem that results from using variables in your prediction model that are known only after the event of interest.
When and How Can Data Leakage Happen?
The easiest way to tell that your model has leaking data is when its performance is too good to be true.
Data Leakage has many causes and can occur at any point in your pipeline. In general, you should not perform any transformation on your training set that involves knowledge of your test set.
- Pre-processing: when the data is improperly split into training, validation, and test sets. If the same data is used both to train and to evaluate a model, the error metrics will underestimate the true generalization error. This is quite similar to overfitting.
- Creating features from data that are related to the target: Data Leakage can appear if the input data and the target are in some way accidentally or purposely connected.
- Temporal Data:
- Data that has time-related characteristics
- Data points distributed sequentially over a period of time. For example, suppose a dataset consists of a training set of two points, A and C, and a test set of one point, B, where the ordering of these points is A → B → C. By training on point C and testing on point B, we create a data leakage situation: we train the model on information that doesn't exist at inference time, or that only exists in the future.
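The A → B → C example above can be sketched in a few lines of pandas. The timestamps and values are made up for illustration; the point is that the leaky split trains on C, which lies in B's "future", while the correct split keeps all training data strictly before the test data.

```python
import pandas as pd

# Hypothetical time-stamped dataset: points A, B, C arrive in that order.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
        "feature": [1.0, 2.0, 3.0],
        "target": [10.0, 20.0, 30.0],
    },
    index=["A", "B", "C"],
)

# Leaky split: train on A and C, test on B.
# The model gets to see point C, which only exists in B's future.
leaky_train, leaky_test = df.loc[["A", "C"]], df.loc[["B"]]

# Correct split: everything in the training set precedes the test set.
cutoff = pd.Timestamp("2023-01-03")
train = df[df["timestamp"] < cutoff]
test = df[df["timestamp"] >= cutoff]

print(train.index.tolist(), test.index.tolist())  # ['A', 'B'] ['C']
```

The same idea scales up: pick a cutoff date and never let any training row fall on or after it.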
How to avoid Data Leakage?
- Properly split your data: This will help prevent overfitting.
- Use Cross-Validation when pre-processing your data, so that transformations are fit only on the training portion of each fold
- Be careful with time-related features: forecasting problems like predicting the weather or stock prices often require time series analysis. You can't just simply remove the time-related features, because handling them is a complicated task and you might end up removing data that is valuable to your model.
- If you must work with a time-related dataset, split your data across time so that all of your training data occurs before your test data.
These tips seem pretty obvious, but when dealing with a huge amount of complicated data, you can easily introduce Data Leakage into your model.
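The "don't pre-process with knowledge of the test set" rule has a standard remedy in scikit-learn: wrap the preprocessing and the model in a Pipeline, so cross-validation re-fits the scaler on each fold's training portion only. This is a minimal sketch on synthetic data; the commented-out lines show the leaky version for contrast.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, just for demonstration.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Leaky approach: the scaler is fit on ALL the data, so statistics
# from the test folds bleed into training.
# scaler = StandardScaler().fit(X)
# X_scaled = scaler.transform(X)
# scores = cross_val_score(Ridge(), X_scaled, y, cv=5)

# Safe approach: inside each CV fold, the Pipeline fits the scaler
# only on that fold's training portion before scoring on the held-out part.
model = make_pipeline(StandardScaler(), Ridge())
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())
```

On a small dataset with simple scaling the difference is often tiny, but with heavier transformations (imputation, target encoding, feature selection) fitting outside the CV loop can inflate scores dramatically.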
Example

In this project, we were trying to predict the lowest price of products based on certain characteristics, like the number of reviews, the categories the products belong to, the rating, and the current and highest prices of the products.
This violates the first definition above: the highest price and the current price are clearly related to the target, the lowest price.

We also violated the second definition: we would not know this information (the highest and current prices) at the time we try to predict the lowest price in real life.
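The fix for our project was simply to drop the leaky columns before training. The column names below are hypothetical stand-ins for the features described above, not the project's actual schema:

```python
import pandas as pd

# Hypothetical product data mirroring the features in this project.
products = pd.DataFrame({
    "num_reviews": [120, 45, 300],
    "rating": [4.5, 3.8, 4.9],
    "current_price": [19.99, 9.99, 49.99],   # leaky: tied to the target
    "highest_price": [24.99, 14.99, 59.99],  # leaky: tied to the target
    "lowest_price": [14.99, 7.99, 39.99],    # the target
})

# Drop features that encode the target or are unknown at inference time.
leaky = ["current_price", "highest_price"]
X = products.drop(columns=leaky + ["lowest_price"])
y = products["lowest_price"]

print(list(X.columns))  # ['num_reviews', 'rating']
```

Expect the R-squared to drop after removing these columns; the lower score reflects how well the model actually generalizes.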
Source:
https://towardsdatascience.com/data-leakage-in-machine-learning-10bdd3eec742
