Beware of Data Leakage

Don’t say it will never happen to you

YellowRoad · YellowBlog · Jun 26, 2017

Every data scientist, I believe, knows the feeling. You are working on a supervised learning task. You learn all of the business nuances, collect data, extract features, set aside an independent test set, train a model, and then evaluate it on the test set. Boom! 99.6% accuracy. After a split second of pure happiness and satisfaction, you realize that, as tempting as such an outcome might be, something went wrong in the process. In many cases, the reason is data leakage.

Data leakage is the unintended presence of information in the training data that will not be available in production, allowing the machine learning algorithm to perform better in the research environment than it ever could in reality. It is a very common pitfall that appears in many different forms, some of which are incredibly hard to catch. Its effect on the actual performance of production models, and on the total effort spent on data modeling, can be huge.

In this post, we will review some of the more common data leakage patterns you should be aware of, and recommend best practices that will help you avoid this insidious pitfall, or at least minimize its impact.

1. Unfair inclusion of proxy variables

In statistics, a proxy variable is a variable that is highly correlated with the target, so highly that we can use it as a surrogate for the target. When a “legitimately” available proxy variable is at hand, the supervised learning task should be relatively easy. For example, the annual expenses of a household are often a good proxy for that household's income; if we have the expenses available, we can predict the income quite well.

In many cases, however, a proxy variable is not a real predictor but a side effect of the target itself, one that exists in the training set only because the target values are also there. For example, assume that at a given company a churning customer must call the contact center in order to churn. Naturally, churning calls tend to be longer than other calls. In this case, every churning customer will have a relatively long recent call to the contact center, and the duration of that call is an unfair proxy variable. It is unfair because it is not a prior signal of churn but a direct outcome of it. At prediction time we will not have that indication at hand; it becomes available only when it is too late.

When an unfair proxy affects all of the training and testing examples, it is typically easier to catch. Its impact can be trickier when it affects only some of the examples.
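
One simple sanity check is to score each candidate feature on its own: a feature that separates the classes almost perfectly by itself deserves a second look as a possible unfair proxy. Below is a minimal sketch of that check on synthetic data; the column names (`churned`, `last_call_duration`) are purely illustrative and not from any real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data, purely illustrative: churners tend to have one long recent call.
rng = np.random.default_rng(0)
churned = rng.integers(0, 2, size=500)
df = pd.DataFrame({
    "churned": churned,
    "tenure_months": rng.integers(1, 60, size=500),
    "last_call_duration": np.where(churned == 1,
                                   rng.normal(20, 3, 500),   # long churn calls
                                   rng.normal(5, 3, 500)),   # ordinary calls
})

# Cross-validated AUC of a shallow tree trained on each feature alone.
y = df["churned"]
for col in df.columns.drop("churned"):
    auc = cross_val_score(DecisionTreeClassifier(max_depth=2),
                          df[[col]], y, cv=5, scoring="roc_auc").mean()
    print(f"{col}: AUC = {auc:.2f}")
# last_call_duration scores near 1.0 on its own, a strong hint that it is
# an outcome of churn rather than a predictor of it.
```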

2. Data that is available for research but will not be available for production

Building an initial supervised learning model is a research effort. Before building such a model, you typically collect and prepare the input data. This preparation is a one-time effort (or at least an effort that you make only once in a while). As such, we often see companies gathering as much information as possible (e.g., attaching information from external sources) and running many ad hoc data enhancement efforts (e.g., manual data cleansing).

Production is a totally different environment. Data in production moves fast and typically comes from production systems. It is often the case that some of the data used for training is not maintained in production. If that unavailable data carries significant information, its leakage into the training set will result in a poorly performing production model.

A specific example of data that will not be available in production is a proxy variable. In the income prediction task, for instance, we might have expense data only for the households in the training set (thanks to a local expense study), for which we also have the income, but we will not have that data for future households.

3. Time series leakage

Many supervised learning tasks are time series tasks: predicting the future based on information from the past. Analyzing a time series is challenging for many reasons, and data leakage is one of them.

When modeling a time series, we sometimes see explanatory features that describe the future. For example, a country's GDP per capita correlates with many economic measures. Using GDP to make a prediction can be tricky, because the estimated GDP for a specific period typically gets published some time after that period ends, while predictions about that period typically have to be made before it begins. Using the period's GDP as a predictor is therefore simply impossible: it is information from the future.

Another form of leakage that can occur in a time series is mixing the past and the future. The training set includes records with different timestamps, and we want to use past observations to predict future ones, not vice versa. If we randomly split the input set into train and test sets, we mix the past and the future, which can lead to misleading performance estimates.
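
To make the split concrete, here is a minimal sketch on synthetic data (column names are illustrative) contrasting a random split with a chronological one; for cross-validation, scikit-learn's TimeSeriesSplit serves the same purpose.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic daily observations, purely illustrative.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2017-01-01", periods=365, freq="D"),
    "feature": rng.normal(size=365),
    "target": rng.integers(0, 2, size=365),
}).sort_values("timestamp")

# Leaky: a random split scatters future rows into the training set.
leaky_train, leaky_test = train_test_split(df, test_size=0.2, random_state=0)

# Safer: cut on time, so every test row comes after every training row.
cut = int(len(df) * 0.8)
train_df, test_df = df.iloc[:cut], df.iloc[cut:]
```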

4. Train to test leakage

The common supervised learning practice involves setting aside an independent subset of data for testing. Train to test leakage occurs when some of the knowledge needed for training makes use of the test set as well, whether it is data standardization (or any other kind of normalization), imputation of missing values, or tuning of the algorithm. Whenever the training flow involves supplying values to parameters, we must make sure that only the training set determined those parameters; otherwise, we add information to the process that will not be available at prediction time. Being careful with an algorithm's parameters is relatively common practice. We see significantly less caution around pre-processing procedures, yet these pre-processing procedures can have a dramatic impact on the outcome.
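
As a concrete illustration of the pre-processing point, here is a minimal scikit-learn sketch (the bundled dataset is just a placeholder): fitting a scaler on the full dataset lets test-set statistics leak into training, while fitting it inside a pipeline on the training data only keeps the test set untouched.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Leaky: the scaler sees the test rows, so their mean and variance
# influence how the training data is transformed.
scaler = StandardScaler().fit(X)  # fitted on ALL the data
leaky_model = LogisticRegression(max_iter=5000).fit(
    scaler.transform(X_train), y_train)

# Safe: the scaler is fitted on the training set only, as part of a pipeline,
# so the test set plays no role until evaluation.
safe_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
safe_model.fit(X_train, y_train)
print(safe_model.score(X_test, y_test))
```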

5. Multiple levels of generalization

The objective of supervised learning is to generalize from the training samples, and sometimes there are several possible levels of generalization. Consider, for example, the task of gesture recognition based on inertial sensors. The training set includes the sensory signals of several users, along with an annotation of the specific gesture recorded in each signal. The purpose is to generalize from these examples, and there are two possible levels of generalization: a) generalizing from the recorded gestures of a specific user to other gestures of the same user, and b) generalizing from the recorded gestures of specific users to gestures of any other user. Data leakage due to multiple levels of generalization occurs when we use data that supports generalization at a lower level to claim generalization at a higher level. In the gesture recognition task, if signals from the same users appear in both training and testing, we let user-specific patterns inflate the performance, even though such data will not be available for new users.
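
If the goal is to generalize to unseen users, every signal from a given user should fall entirely on one side of the split. Here is a minimal sketch with synthetic stand-in data (array names and sizes are illustrative), using scikit-learn's GroupKFold:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Toy stand-ins: 200 gesture windows from 20 users, 16 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = rng.integers(0, 4, size=200)          # 4 gesture classes
user_ids = np.repeat(np.arange(20), 10)   # 10 windows per user

# GroupKFold keeps each user's signals entirely in train or entirely in test,
# so the score estimates generalization to new users rather than to new
# gestures of an already-seen user.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, groups=user_ids, cv=GroupKFold(n_splits=5),
)
print(scores.mean())
```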

Spotting Data Leakage

Spotting data leakage is one of the more deceptive challenges. The concept seems clear and straightforward: in your research process, only use data for predictions that you will have at hand in the real prediction scenario. But beyond a few simple cases, leakage can take complex forms. Here are a few practices that will help you detect data leakage and hopefully avoid (or at least minimize) its impact.

  • Decide on a benchmark model first and assess its performance: a benchmark model is a simple yet better-than-random prediction model. Thereafter, compare the performance of every supervised learning model to that of the benchmark model. Performance measures that are simply too good to be true often imply data leakage (see the benchmark sketch after this list).
  • Be careful with training, validation, and testing: setting aside an independent subset of the data for testing and validation is a recommended practice, of course. Split your dataset with extra care: once you let your test set affect the trained model, you have data leakage. If you need an independent set for optimizing some parameters, use an additional independent set (train, validation, and test).
  • Pay attention to the features that appear to be the most important for your prediction: whether you are using feature selection techniques or your model includes an inherent indication of feature importance (e.g., decision trees or random forests), look for a good (or at least reasonable) explanation of why each top feature is important. If no such explanation exists, be suspicious.
  • Examine several performance measures: in too many projects we see a naive choice of performance measure, such as accuracy alone (or minimum MSE). Naive performance measures may hide local data leakages. For example, assume that in a classification task we have a brutal data leakage that affects only 10% of the population, and that the honest accuracy (without the leakage) is 80%. The overall accuracy with the leakage will rise only to roughly 82% (0.9 × 80% + 0.1 × 100%), which is only slightly better and might look like a modest success. Plotting an ROC curve for that example, however, might reveal a perfect section, and we should always be suspicious of perfection.
  • Create variation in the set of explanatory features. Specifically, examine what happens if you eliminate each one of the input features (see the feature-elimination sketch after this list). If eliminating a single feature from the input set causes a dramatic reduction in performance, there is a high chance that the eliminated feature contains leakage. When possible, try to eliminate a few features at a time, keeping in mind that the more features you artificially eliminate, the lower the performance you can expect.
  • In a time series, try to learn at different time resolutions and to project onto different horizons. Be suspicious of any surprising variations in performance.
  • Strive to get additional validation data: the train and test sets you are using probably describe the database as it was before the project kickoff. In the time that passes between that kickoff and the moment you obtain a model, new data has most likely accumulated, and by then the flow of data in the production system is also probably better understood. Whenever possible, spare the time to collect another validation set, and do everything you can to make that set represent reality as it will be at prediction time.
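
To illustrate the benchmark idea from the first bullet, here is a minimal sketch (the bundled scikit-learn dataset is only a placeholder) comparing a trivial most-frequent-class baseline with a real model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Benchmark: always predict the most frequent class.
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"),
                           X, y, cv=5).mean()
model = cross_val_score(RandomForestClassifier(random_state=0),
                        X, y, cv=5).mean()

print(f"baseline accuracy: {baseline:.3f}")
print(f"model accuracy:    {model:.3f}")
# A model that beats the baseline by an implausible margin deserves a closer
# look for leakage before any celebration.
```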
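
And here is a sketch of the feature-elimination check from the list above (again with a placeholder dataset): drop one feature at a time, retrain, and watch for a dramatic change in the score.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
full = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

# Drop one feature at a time and re-score; a dramatic drop tied to a single
# feature suggests the model leans heavily on it -- possibly a leak.
for i in range(X.shape[1]):
    reduced = cross_val_score(RandomForestClassifier(random_state=0),
                              np.delete(X, i, axis=1), y, cv=5).mean()
    print(f"without feature {i}: {reduced:.3f} (full model: {full:.3f})")
```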

Data leakage is more common than one might think, it can affect the bottom line significantly, and spotting it can be difficult. Knowing the common leakage patterns and following the practices recommended above can save you from huge implementation efforts and from disappointing results in production.

YellowRoad is a big data and Machine Learning agency. We excel in Machine and Deep Learning algorithms and are experts in a wide range of Big Data technologies.