Avoiding Leakage with Pipelines

Svitlana Glibova · Published in Analytics Vidhya · 6 min read · Feb 16, 2021

What is a pipeline, anyways?

In the realm of data science, a pipeline is a process for standardizing and transforming your data in a reproducible sequence of steps, although it is more of a concept than a formula. In general, pipelines are a workflow. Specifically, however, Sci-Kit Learn’s Pipeline (in the sklearn.pipeline module) is an efficient way of automating good practices of model validation and pre-processing in Python. If you write code in R, the caret package serves a similar purpose, although the code examples here may not be as useful to you. Pipeline allows the user to package many of the valuable methods found in Sci-Kit Learn’s other modules into a workflow that can be implemented and repeated on many datasets, which reduces the potential for errors that lead to inaccurate training and modeling, especially when working with large sets. Not only can Pipelines streamline your work, they can also serve as a preventive measure against data leakage.

Sci-Kit Learn (also known as sklearn) is a project started in 2007 by David Cournapeau, with contributing work by Matthieu Brucher, who joined the project later that year. It was first officially released on February 1st, 2010, and is currently a community-developed library with an international user and contributor base. Sklearn is designed to operate on top of the SciPy and NumPy Python libraries, but it also works well alongside Pandas, Matplotlib, and Plotly. Currently, it is one of the standards for predictive analysis.

What is data leakage and why is it bad?
Data leakage “occurs when information that would not be available at prediction time is used when building the model,” which “results in optimistic performance estimates … thus poorer performance when the model is used on actually novel data, for example during production” (Sklearn Common Pitfalls). The cardinal rule is to never expose your test data to the feature extraction and transformation steps that you fit on your training data. As a corollary, performing your train/test split (or other form of model validation) before feature selection and model estimation helps ensure that information from the test set never influences training (as nice as optimistic estimates sound, they are still inaccuracies, and in predictive modeling I would much rather be a precise skeptic). A few of the included links detail model validation further, but the general idea is to split your data into a training set and a testing set: one is the basis for fitting the model, and the other is used to validate whether the model can accurately predict unknown outcomes.
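
To make the rule concrete, here is a minimal sketch (using a hypothetical random feature matrix rather than any real dataset) contrasting a leaky scaling workflow with a safe one, where the scaler is fit only on the training rows:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)  # hypothetical feature matrix for illustration

# Leaky: the scaler computes its mean and variance from ALL rows,
# so information about the future test set sneaks into preprocessing.
X_scaled = StandardScaler().fit_transform(X)
X_train_leaky, X_test_leaky = train_test_split(X_scaled, random_state=42)

# Safe: split first, then fit the scaler on the training rows only.
X_train, X_test = train_test_split(X, random_state=42)
scaler = StandardScaler().fit(X_train)    # statistics come from training data alone
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # test rows are only transformed, never fit on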

Introducing testing data into the transformations fit on your training data undermines your ability to measure how well the algorithm predicts unseen data, because the test set has already been exposed to the structure of the training process. Information bleeds between the training and testing sets, biasing validation scores toward the training data and making the model look better than it really is. In turn, predictions on genuinely unknown information become less and less reliable, taking the model further and further from accurate prediction. Suddenly, your model has spiraled out of control and you are left to start all over again. On a small scale, data leakage can lead to inaccurate predictive models; on a large scale, it can invalidate them altogether. And in a professional setting, it can delay deployment and create serious security breaches.

Because this may be challenging to visualize in a conceptual way, below is a brief example of data leakage in the wild:

“A feature is used to train the model that would not be available in production at the time of prediction. An example of this might be using the number of oral medications a patient is currently taking to predict length of stay at admission when medication reconciliation may not take place for up to 24 hours following admission.”

In this healthcare example, the leakage comes from training on information that will not exist at the time of prediction: the medication count used as a feature is not actually recorded until up to 24 hours after admission, so the model relies on not-yet-accessible information to determine future outcomes.

So how can a Pipeline solve this?
The Pipeline module allows you to package feature selection, extraction, and estimation into a single object and helps ensure that transformations are fit only on training data; in the above example, a pipeline combined with careful feature selection could have restricted the model to data that is generated before the time of prediction.

Pipelines also let you call .fit() and .predict() just once on the whole selection/standardization/estimation sequence rather than repeating the process for each step individually. Aside from streamlining data-mining workflows, this provides an extra layer of protection against leakage by removing the potential for human error that comes with hard-coding each independent instance.
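
As a rough sketch of what that looks like (using sklearn’s built-in iris dataset and an arbitrary scaler-plus-classifier combination as stand-ins), the split happens first and a single fit/predict cycle runs the whole sequence on the correct data:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])

pipe.fit(X_train, y_train)         # one call: scaling statistics and model come from training data only
print(pipe.score(X_test, y_test))  # at scoring time, the test set is transformed with those training statistics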

When would constructing a pipeline be beneficial?

Preprocessing
A particularly useful module to begin with when constructing pipelines is sklearn.preprocessing: its scaling and normalization features help standardize data, which can improve model accuracy. Once you have cleaned your data, standardizing rescales each feature to a mean of zero and unit variance; it does not make the data Gaussian (normal), but many machine-learning algorithms assume features look roughly like standard normally distributed data and perform poorly otherwise. Without scaling, features measured on larger scales can overshadow other features and prevent the algorithm from accurately predicting the testing subset. The preprocessing module is replete with useful tools beyond scaling, such as binarization and centering; visit the documentation page of sklearn.preprocessing for more features.
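
As a small illustration (with a toy two-feature array rather than a real dataset), StandardScaler learns per-feature means and standard deviations from whatever data it is fit on and rescales each column to roughly zero mean and unit variance:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]])   # two features on very different scales

scaler = StandardScaler().fit(X_train)
print(scaler.mean_)                  # per-feature means learned from the training data
print(scaler.scale_)                 # per-feature standard deviations
print(scaler.transform(X_train))     # each column now has mean 0 and unit variance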

Constructing a pipeline for standardization:
Constructing a pipeline involves passing a list of (key, value) pairs, where key is a string name for the estimator step and value is an estimator object. An example could look like this:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import LinearRegression
>>> estimators = [('scale', StandardScaler()), ('linreg', LinearRegression())]  # a list of scaling and estimator objects
>>> pipe = Pipeline(estimators)  # constructs a pipeline using the objects
>>> pipe  # the output is a pipeline object containing the above steps
Pipeline(steps=[('scale', StandardScaler()), ('linreg', LinearRegression())])

A shorthand way of accomplishing this (if you really don’t feel like naming your own objects) is by calling make_pipeline():

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.preprocessing import Binarizer
>>> make_pipeline(Binarizer(), MultinomialNB())  # initializes a Pipeline object with auto-generated step names
Pipeline(steps=[('binarizer', Binarizer()), ('multinomialnb', MultinomialNB())])

Constructing a pipeline for feature selection:
For selecting features that meet certain statistical criteria, sklearn.feature_selection tools can be used as part of the data pre-processing step and integrated into a pipeline. For example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),  # the l1 penalty requires dual=False
    ('classification', RandomForestClassifier())
])
clf.fit(X, y)  # X and y are your feature matrix and target

This sklearn code snippet initializes a pipeline object called clf containing a feature_selection step and a classifier that will be fit on the selected features. Calling clf.fit(X, y) then runs the whole sequence: features are selected from X, and the random forest is trained on the reduced feature set.
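
To see the leakage-prevention benefit, one option (a sketch, assuming a synthetic dataset from make_classification rather than real data) is to hand the whole pipeline to cross_val_score, so that feature selection is re-fit inside each training fold instead of ever seeing the held-out fold:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20, random_state=42)

clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(penalty='l1', dual=False))),
    ('classification', RandomForestClassifier())
])

# Each fold fits the selector and the forest on its own training split only,
# so the held-out fold never influences which features are chosen.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())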

When it comes to training machine learning algorithms, the process can become overwhelming, convoluted, and imprecise without the right set of tools to organize your workflow. Using a pipeline to structure how information is both processed and modeled can help reduce the chance of making a careless error and safeguard against creating poorly trained models. Managing large sets of data is already a challenging enough task without the many roadblocks that can be hit, so it is important to understand the options for avoiding them without sacrificing your sanity.

For more information on inferential and predictive analysis, visit:

https://scikit-learn.org/
https://towardsdatascience.com
https://machinelearningmastery.com/
