Machine Learning Plumbery: Data Leakage

We review what data leakage is and its different flavors, and we show a case of leakage that most people are not aware of.

Martín Villanueva
spikelab
Jan 14, 2021


At Spike, we empower large organizations with the adoption of AI and data-driven innovation, building tailor-made solutions to complex problems using large data volumes and machine learning techniques, across multiple industries.

We build different kinds of machine learning models to solve industry problems. Whenever one of these models is intended to go to production, it is mandatory to assess its performance on an evaluation/test set.

Data leakage occurs when you [inadvertently] provide information about the evaluation set to any step of the model building pipeline.

Leakage is dangerous because it produces an overly optimistic evaluation score, which can lead to a model that performs poorly in production, bad decisions, and loss of trust from stakeholders.

Unfortunately, leakage is often subtle, indirect, and can take many forms, making it hard to detect and eliminate.

In the next sections of this article, we review some of the most common types of leakage and discuss a lesser-known one that might surprise you.

Flavors of leakage

Leaky features

This happens when, at training time, you use features that will not be available at inference time. At Spike, we call these future features.

Consider a sales forecasting dataset in which you have to predict the total sales of each store, with features like date attributes, temperature, and some others. Can you see the leaky feature?
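Since the original table is not reproduced here, below is a tiny, hypothetical version of such a dataset (column names and values are illustrative only):

```python
import pandas as pd

# Hypothetical sales forecasting data: the target is `sales`,
# and the model must predict it for future dates.
df = pd.DataFrame({
    "date":        pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "store_id":    [1, 1, 1],
    "day_of_week": [2, 3, 4],
    "month":       [1, 1, 1],
    "temperature": [28.5, 30.1, 27.9],  # temperature observed on the sales day
    "sales":       [1520, 1748, 1433],  # target
})
```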

It’s the temperature! At inference time you will not have the real temperature, so training and evaluating a model with actual temperature values results in leakage.

Fortunately, this type of leakage is very easy to recognize, as you will have missing features when trying to predict on new data.

Leakage at preprocessing

Another common case of leakage happens when we perform preprocessing on the full dataset before splitting it into a training set and an evaluation set.

In the example below, the first block shows leaky standard scaling, while the second block shows how to do it the right way.
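Here is a minimal sketch of both versions, with synthetic data standing in for the original snippet:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# Leaky version: the scaler is fit on the full dataset before the split,
# so its mean/std are computed using the evaluation samples as well.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_eval, y_train, y_eval = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Correct version: split first, then fit the scaler only on the training set
# and apply the learned transformation to the evaluation set.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_eval_scaled = scaler.transform(X_eval)
```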

The problem is that we computed the statistics of the first scaler on the full dataset, including the samples that will be used to evaluate the model, i.e., information from the evaluation set influences the scaling of the training samples… leakage!

The same problem can occur with any type of preprocessing: target scaling, target encoding, imputation of missing values, dimensionality reduction, data augmentation, and so on.

As a general rule, to avoid this type of leakage, first split the dataset, and then fit the preprocessing objects using only the training subset.

Leaky evaluation framework

When setting up your evaluation framework, it should mirror the conditions under which the model will be evaluated and deployed.

As an example, let’s analyze this work by Andrew Ng’s research group:

The paper presented a novel ML algorithm to detect pneumonia from chest X-rays, with outstanding state-of-the-art results.

The dataset contained 112,120 images from 30,805 patients, so there were roughly 3.6 images per patient. However, the data was split randomly at the image level:

Taken from https://arxiv.org/pdf/1711.05225v1.pdf

Do you see the problem? Because of the random split, images of a single patient could appear in both the training and validation sets, so the network’s performance would be overestimated: the model is validated partly on patients it has already seen.

This model is meant to be used on new and unseen patients, so validating it with patients already seen is a leaky validation split.

This can easily be fixed by randomly partitioning the dataset by patient_id, which can be done with sklearn.model_selection.GroupShuffleSplit. Furthermore, scikit-learn implements sklearn.model_selection.GroupKFold, which ensures that no samples of the same group appear in more than one fold.
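For instance, a grouped split by patient could look like the sketch below, where X, y, and patient_ids are hypothetical placeholders for the image data, labels, and per-image patient identifiers:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Synthetic stand-ins: 100 "images", binary labels, 30 distinct patients.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = rng.randint(0, 2, size=100)
patient_ids = rng.randint(0, 30, size=100)

# Single train/validation split where no patient appears on both sides.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, valid_idx = next(gss.split(X, y, groups=patient_ids))

# K-fold variant: all images of a given patient fall in exactly one fold.
gkf = GroupKFold(n_splits=5)
for train_idx, valid_idx in gkf.split(X, y, groups=patient_ids):
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[valid_idx])
```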

Wrong evaluation setups are also common in time series problems if you perform random sampling rather than a time-aware split. It is much easier to fill in the values of the target in time gaps than to predict the target for a consecutive sequence of time steps into the future.
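One way to get a time-aware split is sklearn.model_selection.TimeSeriesSplit, where every validation fold lies strictly after its training fold in time; a minimal sketch with a synthetic series:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# A synthetic time series, already ordered by timestamp.
X = np.arange(100).reshape(-1, 1)
y = np.sin(np.arange(100) / 5.0)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, valid_idx in tscv.split(X):
    # Each validation fold comes entirely after the training observations.
    assert train_idx.max() < valid_idx.min()
```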

For an in-depth review of the common cases of data leakage, we recommend watching Target Leakage in Machine Learning by Yuriy Guts.

Leakage by early stopping

There is a variety of model classes that are trained iteratively, i.e., at each iteration the model is presented with the training data and adjusts its structure in small steps until reaching a local optimum. Neural networks and gradient boosting machines are two examples.

For this class of models, it is fundamental to stop training at a certain iteration to avoid overfitting the training set. A common technique to prevent overfitting is early stopping:

You train the model on the training set and keep track of the error on a validation set. When the validation error stops decreasing for a given number of iterations, the learning algorithm stops.

It is very common to use early stopping jointly with K-fold cross validation to estimate the generalization capacity of the model. Let’s see how to do it on a house price prediction dataset.

In the example below, we use LightGBM to train models with early stopping in a 5-fold split scheme. For each fold, we save the error at the best iteration, and the reported error is the average of these errors across the folds.
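The original snippet is not reproduced here, so the sketch below shows the same procedure using LightGBM’s scikit-learn API, with scikit-learn’s California housing data standing in for the post’s house price dataset (the exact numbers quoted next come from the original data, so this stand-in will not reproduce them):

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Stand-in for the house price dataset used in the post.
X, y = fetch_california_housing(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_rmses, best_iterations = [], []

for train_idx, valid_idx in kf.split(X):
    model = lgb.LGBMRegressor(n_estimators=10000, learning_rate=0.01)
    model.fit(
        X[train_idx], y[train_idx],
        eval_set=[(X[valid_idx], y[valid_idx])],
        callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)],
    )
    # With early stopping, predict() uses the best iteration found on the fold.
    preds = model.predict(X[valid_idx])
    fold_rmses.append(np.sqrt(mean_squared_error(y[valid_idx], preds)))
    best_iterations.append(model.best_iteration_)

print("CV RMSE:", np.mean(fold_rmses))
print("Best iteration per fold:", best_iterations)
```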

In our run, the CV RMSE was 250.98 and the best iterations differed a lot between the folds: [653, 157, 75, 62, 2751].

If you report this error as the generalization capacity of the model, then there would be data leakage! Can you see it?

In this example, for each fold we used the evaluation data to decide at which iteration to stop, i.e., the evaluation data influenced the training process, so there is leakage! Because in each fold the model stopped training at the point where it scored best on the evaluation data, this evaluation score is overestimated and unrealistic.

In reality, you will not know at which iteration the model will score best on new, unseen data.

The problem here is not early stopping itself, but how we use it to report the model’s capacity. Next, we present two ways to keep using early stopping while avoiding this type of leakage.

Nested cross validation

Nested cross-validation is a technique to perform hyperparameter tuning and model evaluation that overcomes the problem of overfitting the training dataset.

The idea is fairly simple and is summarized in the image below. You first make a standard k-fold split of the dataset, but this time the test folds are completely hidden from the model; this is called the outer loop. Then, for each set of training folds, you make another k-fold split (in the example, a 2-fold split); this is called the inner loop.

The hyperparameters of the model are tuned in the inner loop, while the outer loop is used to train the model (with optimal hyperparameters) on the full training folds and then assess it on the hidden test folds.

Image taken from: https://sebastianraschka.com/faq/docs/evaluate-a-model.html

In the example below, we apply nested cross validation to the house price prediction problem. The purpose of the inner loop is to tune the number of iterations. In the outer loop, the model is trained on the full training folds with the tuned number of iterations and without early stopping, so there is no leakage when calculating the RMSE on the test folds.
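A sketch of this procedure, reusing the X and y arrays from the earlier early-stopping snippet (again, not the original code, so the exact numbers will differ):

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

outer_rmses = []
outer_kf = KFold(n_splits=5, shuffle=True, random_state=42)

for outer_train_idx, test_idx in outer_kf.split(X):
    X_tr, y_tr = X[outer_train_idx], y[outer_train_idx]

    # Inner loop: tune the number of iterations with early stopping,
    # using only the training folds of the outer split.
    inner_best_iters = []
    inner_kf = KFold(n_splits=2, shuffle=True, random_state=0)
    for inner_train_idx, valid_idx in inner_kf.split(X_tr):
        model = lgb.LGBMRegressor(n_estimators=10000, learning_rate=0.01)
        model.fit(
            X_tr[inner_train_idx], y_tr[inner_train_idx],
            eval_set=[(X_tr[valid_idx], y_tr[valid_idx])],
            callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)],
        )
        inner_best_iters.append(model.best_iteration_)
    n_iterations = int(np.mean(inner_best_iters))

    # Outer loop: retrain on the full training folds with the tuned number
    # of iterations, no early stopping, and evaluate on the hidden test fold.
    final_model = lgb.LGBMRegressor(n_estimators=n_iterations, learning_rate=0.01)
    final_model.fit(X_tr, y_tr)
    preds = final_model.predict(X[test_idx])
    outer_rmses.append(np.sqrt(mean_squared_error(y[test_idx], preds)))

print("Nested CV RMSE:", np.mean(outer_rmses))
```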

As expected, the RMSE has increased from 250.98 to 284.08. Our evaluation methodology is now leak-free and will be closer to the error of the model on new and unseen data.

Independent cross validation

Nested cross validation is a robust option for assessing the generalization capacity of a model; however, it has a drawback: it is computationally expensive. For a k-fold split scheme you need to train k x k models (h x k x k if you try h hyperparameter configurations), which can be really costly for a big model on a huge dataset.

One computationally cheaper alternative is independent cross validation. The idea is really simple:

You first make a k-fold split to find an approximately optimal value for the number of iterations, and then evaluate the models on a different k-fold split.

Below you can see how to apply independent cross validation to our problem. The first k-fold split is used only to find the best iteration on each fold and obtain an approximate value for it. In the second k-fold split, the models are trained on each set of training folds for a fixed number of iterations and without early stopping. Again, our evaluation procedure is now leak-free.
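A sketch of this approach, again reusing the X and y arrays from the earlier snippet:

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# First k-fold split: only used to estimate a good number of iterations.
best_iters = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    model = lgb.LGBMRegressor(n_estimators=10000, learning_rate=0.01)
    model.fit(
        X[train_idx], y[train_idx],
        eval_set=[(X[valid_idx], y[valid_idx])],
        callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)],
    )
    best_iters.append(model.best_iteration_)
n_iterations = int(np.mean(best_iters))

# Second, independent k-fold split (different seed): models are trained for a
# fixed number of iterations without early stopping, so the evaluation folds
# never influence training.
rmses = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=2).split(X):
    model = lgb.LGBMRegressor(n_estimators=n_iterations, learning_rate=0.01)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[valid_idx])
    rmses.append(np.sqrt(mean_squared_error(y[valid_idx], preds)))

print("Independent CV RMSE:", np.mean(rmses))
```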

Note that to obtain different k-fold splits, the only thing you need to change is the random seed!

The RMSE has increased from 250.98 to 270.39. If you require a more robust error estimate, you can repeat this procedure N times (changing the seeds) and average the RMSE obtained in each repetition. This methodology is known as repeated cross validation.

Conclusion

Data leakage is a serious issue when assessing ML models before going to production. When present, it produces an overestimation of the generalization capacity of the model. How big the overestimation is depends on the type and severity of the leakage.

In this article we showed how using early stopping can lead to a leaky assessment of the model. We also presented two workarounds that we use at Spike to keep early stopping while still getting a robust evaluation score for the model.

For more details and the full code examples, visit our GitHub repository.
