How to Prevent the Data Leakage Problem: Illustrated with Practical Examples

Bakhtiyarsalah · Published in CodeX · 7 min read · May 21, 2024

Data leakage occurs when knowledge of the hold-out test set leaks into the data used to train the model, leading to inaccurate performance estimates. Essentially, it happens when information from outside the intended training dataset helps to create the model.

For example, a student who inadvertently has the answers while studying for a test may ace the exam without genuinely understanding the material. Similarly, a model affected by data leakage will perform exceptionally well during training but may fail in real-world applications. For businesses, this can result in misplaced confidence, unexpected outcomes, and potential financial losses.

So, when you test your model, it might perform really well because it has essentially memorized the answers from the training data rather than truly learned the relationship between the input factors and the outcome. In real-world scenarios, this leads to inaccurate predictions and unreliable models.

In this blog, you will discover how to implement data preparation without data leakage.

Note that the purpose here is not to build better models, but to obtain a more accurate estimate of how a model will perform on unseen data.

How Does Data Leakage Happen?

Data leakage in machine learning can happen in various ways during the data handling and preparation stage.

  1. Information from the Future

Let’s consider a scenario where a bank is developing a machine learning model to predict credit risk for loan applicants. The bank has historical data on past loan applications, including information about applicants’ demographics, credit scores, employment status, loan amounts, and whether the applicants defaulted on their loans.

In an attempt to improve the model’s performance, a data scientist at the bank decides to include a new feature: the current bank account balance of each applicant. The rationale is that applicants with higher bank balances might be less likely to default on their loans.

However, the data scientist makes a critical mistake by including the current bank account balance as a feature without considering the timing of when this information becomes available in the loan application process.

The bank account balance is often only assessed after the loan is approved. In other words, it’s a post-decision variable rather than a predictive variable. As a result, including this feature in the model introduces data leakage.
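
As a minimal, hypothetical sketch of the fix (the column names below are made up for illustration), restrict the feature set to information that is actually available at application time:

import pandas as pd
# Hypothetical loan-application data; 'current_balance' is only measured after approval
loans = pd.DataFrame({
    'credit_score': [640, 720, 580, 690],
    'loan_amount': [12000, 30000, 8000, 15000],
    'current_balance': [1500, 9800, 300, 4200],  # post-decision variable
    'defaulted': [1, 0, 1, 0],
})
# Keep only features known at application time; drop post-decision variables
features_known_at_application = ['credit_score', 'loan_amount']
X = loans[features_known_at_application]
y = loans['defaulted']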

2. Data Transformation: by scaling or normalizing the entire dataset before splitting it, you risk unintentionally mixing information.

A common approach is to first apply one or more transforms to the entire dataset. Then the dataset is split into train and test sets, or k-fold cross-validation is used to fit and evaluate a machine learning model.

  1. Prepare Dataset
  2. Split Data
  3. Evaluate Models

Although this is a common approach, it is dangerously incorrect in most cases. We get data leakage by applying data preparation techniques to the entire dataset. This is not a direct type of data leakage, where we would train the model on the test dataset. Instead, it is an indirect type, where some knowledge about the test dataset, captured in summary statistics, is available to the model during training.

For example, consider the case where we want to normalize data, that is, scale input variables to the range 0–1. Normalizing the input variables requires that we first calculate the minimum and maximum values of each variable, then use these values to scale the variables. If the dataset is only split into train and test sets afterwards, the examples in the training dataset know something about the data in the test dataset: they have been scaled by the global minimum and maximum values, so they know more about the global distribution of each variable than they should.
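
A minimal sketch of this mechanism (the toy values are made up): fitting the scaler on all rows bakes a test-set extreme into the statistics used to scale the training rows.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
x = np.array([[1.0], [2.0], [3.0], [100.0]])  # suppose 100.0 ends up in the test set
train, test = x[:3], x[3:]
# Leaky: min/max computed over ALL rows, including the test row
leaky = MinMaxScaler().fit(x)
print(leaky.data_max_)  # [100.] -- training rows are scaled using test information
# Correct: min/max computed on the training rows only
clean = MinMaxScaler().fit(train)
print(clean.data_max_)  # [3.]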

We get the same type of leakage with almost all data preparation techniques.

The solution is straightforward. Data preparation must be fit on the training dataset only.

  1. Split Data.
  2. Fit Data Preparation on Training Dataset.
  3. Apply Data Preparation to Train and Test Datasets.
  4. Evaluate Models.

More generally, the entire modeling pipeline must be prepared only on the training dataset to avoid data leakage. This might include data transforms, but also other techniques such as feature selection, dimensionality reduction, feature engineering, and more. For example, creating new features from the complete dataset before splitting can embed insights from the test data into the training data, potentially leading to data leakage.
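
For instance, here is a minimal sketch of leakage-free feature selection using scikit-learn's SelectKBest; the same fit-on-train-only pattern applies to any such transform:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# Fit the selector on the training data only, so its scores never see the test set
selector = SelectKBest(score_func=f_classif, k=5)
selector.fit(X_train, y_train)
# Apply the same fitted selector to both datasets
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)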

So, let’s evaluate a logistic regression model using train and test sets on a synthetic binary classification dataset where the input variables have been normalized.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
# Generate synthetic data
X, y = make_classification(n_samples=10000, n_features=5, n_classes=2, random_state=42)
# Define the scaler
scaler = MinMaxScaler()
# Fit and transform on the ENTIRE dataset before splitting,
# causing data leakage
X = scaler.fit_transform(X)
# Split the scaled data into train and test sets
X_train_leakage, X_test_leakage, y_train_leakage, y_test_leakage = train_test_split(X, y, test_size=0.33, random_state=1)
# Fit the model with the leaked data
model_leakage = LogisticRegression()
model_leakage.fit(X_train_leakage, y_train_leakage)
# Evaluate the model with the leaked data
yhat_leakage = model_leakage.predict(X_test_leakage)
# Evaluate predictions
accuracy_leakage = accuracy_score(y_test_leakage, yhat_leakage)
print('Model Performance with Data Leakage - Accuracy: %.3f' % (accuracy_leakage*100))
print(classification_report(y_test_leakage, yhat_leakage))

Running this prints the accuracy and classification report for the model trained with data leakage.

Now, we will define the MinMaxScaler and call the fit() function on the training set only, then apply the transform() function to the train and test sets to create a normalized version of each dataset.

# Generating synthetic data
X, y = make_classification(n_samples=10000, n_features=5, n_classes=2, random_state=42)
# Splitting into train and test sets without leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# Define the scaler
scaler = MinMaxScaler()
# Fit and transform on the training dataset only
scaler.fit(X_train)
# Scale the training dataset
X_train = scaler.transform(X_train)
# Scale the test dataset
X_test = scaler.transform(X_test)
# Fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate the model
yhat = model.predict(X_test)
# Evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('\nModel Performance without Data Leakage - Accuracy: %.3f' % (accuracy*100))
print(classification_report(y_test, yhat))

In this scenario, we observe that the model’s accuracy and precision are slightly higher with data leakage than without it. Data leakage typically leads to an overestimation of model performance, presenting an overly optimistic view.

3. Data Leakage in the Train/Test Split Method

There are various methods for dividing data into training and testing sets, including random train-test splitting, time-based splitting, and stratified sampling. While random splitting is the most commonly used, it can result in data leakage: when the rows of a dataset are not truly independent, randomly splitting them means related records can end up on both sides of the split, letting test data influence the training process.

Indeed, in real life there is almost always some dimension (call it a stratifying variable) along which new data differs from the training data, for example:

  • time,
  • space,
  • some business dimension,
  • customer ID,
  • often, a combination of the previous ones.

For example, imagine you’re working with bank data where each customer can have multiple associated records. If we randomly split this data into training and testing sets, there’s a risk that the same customer ID may appear in both sets. This introduces data leakage, where information from the testing set inadvertently influences the training process, leading to overly optimistic performance estimates and unreliable models.

To prevent data leakage in such cases, it’s essential to implement alternative splitting strategies:

  1. Group-Based Splitting: Instead of randomly dividing the data, split it based on groups or categories. For instance, ensure that all data related to a particular customer ID resides either in the training set or in the testing set, but not in both. This way, you maintain the independence between the training and testing data, preventing data leakage (see the sketch after this list).
  2. Temporal Splitting: If your data has a temporal aspect, such as time series data or transactional data over time, use temporal splitting. Train your model on data from earlier time periods and evaluate it on data from later time periods. This ensures that the model’s performance is assessed on its ability to generalize to future data, mimicking real-world scenarios more accurately.
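
A minimal sketch of both strategies with scikit-learn and pandas; the customer IDs and dates below are made up for illustration:

import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
# Group-based splitting: 10 hypothetical records from 4 customers
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
customer_id = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])
# Every record of a given customer lands on exactly one side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=1)
train_idx, test_idx = next(splitter.split(X, y, groups=customer_id))
assert set(customer_id[train_idx]).isdisjoint(customer_id[test_idx])
# Temporal splitting: train on earlier periods, evaluate on later ones
dates = pd.to_datetime(['2023-01-05', '2023-02-10', '2023-03-15', '2023-04-20',
                        '2023-05-25', '2023-06-30', '2023-07-04', '2023-08-08',
                        '2023-09-12', '2023-10-16'])
cutoff = pd.Timestamp('2023-07-01')
X_train_t, X_test_t = X[dates < cutoff], X[dates >= cutoff]
y_train_t, y_test_t = y[dates < cutoff], y[dates >= cutoff]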

4. Data Leakage in Cross-Validation

Both incorrect transformations and the traditional train-test split method can pose issues when applying the cross-validation technique.

As highlighted earlier, randomly splitting the dataset into folds can lead to data leakage. A solution to this is group k-fold cross-validation, where each fold comprises data from distinct groups or categories. This ensures independence across folds, preventing leakage.
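
For instance, here is a minimal, self-contained sketch using scikit-learn's GroupKFold (the data and customer IDs below are made up):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
# Hypothetical data: 12 records belonging to 4 customers
rng = np.random.RandomState(1)
X = rng.rand(12, 3)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
customer_id = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
# Each fold holds out whole customers, so no customer's records appear
# in both the training and validation folds
cv = GroupKFold(n_splits=3)
scores = cross_val_score(LogisticRegression(), X, y, groups=customer_id, cv=cv)
print(scores)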

However, performing data transformation without leakage when using cross-validation is slightly more challenging. It requires that the data preparation method is fit on the training folds and then applied to both the train and test folds within each iteration of the cross-validation procedure. We can achieve this by defining a modeling pipeline: a sequence of data preparation steps ending in the model to fit and evaluate.

This can be achieved using the Pipeline class. This class takes a list of steps that define the pipeline.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
# define the pipeline
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)
# pass the configured pipeline to the cross_val_score() function for evaluation
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of the accuracy scores
print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))

Running the example normalizes the data correctly within the cross-validation folds of the evaluation procedure to avoid data leakage.

