Getting Started with scikit-learn Pipelines for Machine Learning

Building a pipeline from the ground up

Erin Hoffman
Analytics Vidhya
5 min read · Mar 26, 2020


(All code in this post is also included in this GitHub repository.)

Why Use Pipelines?

The typical overall machine learning workflow with scikit-learn looks something like this:

  1. Load all data into X and y
  2. Use X and y to perform a train-test split, creating X_train, X_test, y_train, and y_test
  3. Fit preprocessors such as StandardScaler and SimpleImputer on X_train
  4. Transform X_train using the fitted preprocessors, and perform any other preprocessing steps (such as dropping columns)
  5. Create various models, tune hyperparameters, and pick a final model that is fit on the preprocessed X_train as well as y_train
  6. Transform X_test using the fitted preprocessors, and perform any other preprocessing steps (such as dropping columns)
  7. Evaluate the final model on the preprocessed X_test as well as y_test

Here is an example code snippet that follows these steps, using an antelope dataset (“antelope.csv”) from a statistics textbook. The goal is to predict the number of spring fawns based on the adult antelope population, annual precipitation, and winter severity. This is a very tiny dataset and should only be used for example purposes! This example skips any hyperparameter tuning, and simply fits a vanilla linear regression model on the preprocessed training data before evaluating it on the preprocessed testing data.

An example without pipelines
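
A sketch of this workflow is below; the column names other than “annual_precipitation”, the precipitation threshold, and the encoder options are assumptions for illustration (and sparse_output needs scikit-learn 1.2 or later).

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder

    # 1. Load all data into X and y ("spring_fawn_count" is an assumed target column name)
    df = pd.read_csv("antelope.csv")
    X = df.drop("spring_fawn_count", axis=1)
    y = df["spring_fawn_count"]

    # 2. Perform a train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # 3. Fit preprocessors on the training data only
    #    ("winter_severity" is an assumed categorical column)
    encoder = OneHotEncoder(categories="auto", handle_unknown="ignore", sparse_output=False)
    encoder.fit(X_train[["winter_severity"]])

    # 4. Transform X_train and perform the other preprocessing steps
    X_train_processed = X_train.copy()
    X_train_processed["low_precipitation"] = (
        X_train_processed["annual_precipitation"] < 12  # illustrative threshold
    ).astype(int)
    train_encoded = pd.DataFrame(
        encoder.transform(X_train_processed[["winter_severity"]]),
        index=X_train_processed.index,
        columns=encoder.get_feature_names_out(["winter_severity"]),
    )
    X_train_processed = pd.concat(
        [X_train_processed.drop("winter_severity", axis=1), train_encoded], axis=1
    )

    # 5. Fit a vanilla linear regression model on the preprocessed training data
    model = LinearRegression()
    model.fit(X_train_processed, y_train)

    # 6. Transform X_test: note that this repeats step 4 almost line for line
    X_test_processed = X_test.copy()
    X_test_processed["low_precipitation"] = (
        X_test_processed["annual_precipitation"] < 12
    ).astype(int)
    test_encoded = pd.DataFrame(
        encoder.transform(X_test_processed[["winter_severity"]]),
        index=X_test_processed.index,
        columns=encoder.get_feature_names_out(["winter_severity"]),
    )
    X_test_processed = pd.concat(
        [X_test_processed.drop("winter_severity", axis=1), test_encoded], axis=1
    )

    # 7. Evaluate the final model on the preprocessed testing data
    print(model.score(X_test_processed, y_test))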

The train-test split is one of the most important components of a machine learning workflow. It helps a data scientist understand model performance, particularly in terms of overfitting. A proper train-test split means that we have to perform the preprocessing steps on the training data and testing data separately, so there is no “leakage” of information from the testing set into the training set.

But to a software developer looking at this code, an issue stands out immediately: steps 4 and 6 are virtually identical. Whatever happened to DRY (don’t repeat yourself)?! The solution: pipelines. Pipelines are designed to avoid this problem completely. You declare the preprocessing steps once, then you can apply them as needed to X_train as well as X_test.

First, Write the Code Without a Pipeline

Yes, you read that correctly. Until you are a real expert using pipelines, it’s a good idea to write out the repetitive/redundant version of your code first, then refactor it to use pipelines instead. If you are hoping to write functioning pipeline code, go back and create something that resembles the code snippet above first!

Second, Iteratively Add Preprocessing Steps

The error messages produced by pipelines can be extremely hard to decipher! So if you add more than one step at a time and something breaks, you’ll have a very hard time figuring out what broke. A better plan is to add steps one at a time, and double-check that it still works as you go.

My general strategy is to start with whatever step the other steps depend on, e.g. a SimpleImputer (since other preprocessing steps might fail if there is missing data). This example doesn’t need any imputation, so let’s go ahead and start with the OneHotEncoder.

Let’s zoom in on some specifics here. First, fitting (#3 in the ML process). The old version was:
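
(In sketch form; the “winter_severity” column name and the encoder options are assumptions.)

    from sklearn.preprocessing import OneHotEncoder

    # fit the encoder on just the column(s) to be one-hot encoded
    encoder = OneHotEncoder(categories="auto", handle_unknown="ignore", sparse_output=False)
    encoder.fit(X_train[["winter_severity"]])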

The new version is:
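
(Again a sketch; the step names and the “winter_severity” column are assumptions.)

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    encoder = OneHotEncoder(categories="auto", handle_unknown="ignore", sparse_output=False)

    # the ColumnTransformer applies the encoder to the listed column(s)
    # and passes every other column through untouched
    col_transformer = ColumnTransformer(
        transformers=[("ohe", encoder, ["winter_severity"])],
        remainder="passthrough",
    )

    # the ColumnTransformer becomes the first "step" of the Pipeline
    pipe = Pipeline(steps=[("encode_winter", col_transformer)])
    pipe.fit(X_train)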

We still have the same encoder with the same parameters, but now it’s nested inside of a ColumnTransformer, which is nested inside of a Pipeline. Instead of subsetting X_train (with [[]]) to specify which column(s) to one-hot encode, we pass the column names into the ColumnTransformer. (See this post from my former student Allison Honold for more details on ColumnTransformers.) Then instead of using the encoder directly, we add the ColumnTransformer as the first “step” of the Pipeline.

Second, transforming (#4 and #6 in the ML process). The old version was:
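
(A sketch of the encoding, concatenating, and dropping that had to happen for the training data, and then again for the testing data.)

    # step 4: transform X_train (step 6 repeats these same lines for X_test)
    train_encoded = pd.DataFrame(
        encoder.transform(X_train[["winter_severity"]]),
        index=X_train.index,
        columns=encoder.get_feature_names_out(["winter_severity"]),
    )
    X_train_processed = pd.concat(
        [X_train.drop("winter_severity", axis=1), train_encoded], axis=1
    )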

The new version is:
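
(Another sketch; the hard-coded column list is purely illustrative, since the number of one-hot columns depends on the data. The one-hot columns come first, followed by the passthrough columns.)

    # the pipeline returns a NumPy array, so for now we hard-code column names
    # to get a DataFrame back (this is the "hack" discussed below)
    hard_coded_columns = [
        "winter_severity_1", "winter_severity_2", "winter_severity_3",
        "adult_antelope_population", "annual_precipitation",
    ]

    # step 4
    X_train_processed = pd.DataFrame(
        pipe.transform(X_train), index=X_train.index, columns=hard_coded_columns
    )
    # step 6: no repeated preprocessing logic, just another call to transform
    X_test_processed = pd.DataFrame(
        pipe.transform(X_test), index=X_test.index, columns=hard_coded_columns
    )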

As you can see, we are already getting some benefit from using the pipeline. We no longer have to manually concat the encoded data with the original data, or manually drop the original column.

However, at this point we have a bit of a “hack”: we are hard-coding the column names so that the later code can work. We need the name of the “annual_precipitation” column in order to create the “low_precipitation” column, but the one-hot encoding has removed all of the column names. Let’s continue adding the preprocessing steps to the pipeline, and be sure to do the one-hot encoding after the custom transformation, so we don’t need this “hack” anymore.

Third, Create Custom Transformers As Needed

For the purpose of feature engineering, we often want to use Pandas to do something that is not a common enough task to be included as a scikit-learn preprocessor like OneHotEncoder. To make that work in a pipeline, you need to create a custom transformer class.

Looking at the specifics again, the old version of fitting was…nothing. We weren’t using any information about the training data to perform the transformation. The old version of transforming was:
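
(Something along these lines, once for the training data and once for the testing data; the threshold is illustrative.)

    # engineered feature: flag low-precipitation years using plain pandas
    X_train["low_precipitation"] = (X_train["annual_precipitation"] < 12).astype(int)
    X_test["low_precipitation"] = (X_test["annual_precipitation"] < 12).astype(int)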

For the new version of fitting and transforming, we add a new class, PrecipitationTransformer:
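
(A sketch of such a class; inheriting from BaseEstimator and TransformerMixin gives us get_params and fit_transform for free, and the threshold is again illustrative.)

    from sklearn.base import BaseEstimator, TransformerMixin

    class PrecipitationTransformer(BaseEstimator, TransformerMixin):
        """Adds a binary "low_precipitation" column based on "annual_precipitation"."""

        def fit(self, X, y=None):
            # nothing is learned from the training data
            return self

        def transform(self, X, y=None):
            X = X.copy()
            # the same pandas logic as before, now wrapped so it can live in a pipeline
            X["low_precipitation"] = (X["annual_precipitation"] < 12).astype(int)
            return X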

Then we add it as a single “step” in the pipeline, making sure it comes before the one-hot encoding:
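
(The step names here are assumptions.)

    pipe = Pipeline(steps=[
        ("precip_transform", PrecipitationTransformer()),  # custom transformation first
        ("encode_winter", col_transformer),                # one-hot encoding afterwards
    ])
    pipe.fit(X_train)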

It’s not shorter, but it does avoid repetition!

Fourth, Add in Your Model

Adding the model as a final step is where, I think, the pipeline really shines. You add it in just the same way that you add the preprocessing steps:
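
(A sketch; the step names are assumptions, and the model is the same vanilla linear regression as before.)

    from sklearn.linear_model import LinearRegression

    pipe = Pipeline(steps=[
        ("precip_transform", PrecipitationTransformer()),
        ("encode_winter", col_transformer),
        ("model", LinearRegression()),  # the estimator is always the last step
    ])

    # fit() runs fit_transform on every preprocessing step, then fits the model
    pipe.fit(X_train, y_train)

    # score() transforms X_test with the already-fitted steps and evaluates the model
    print(pipe.score(X_test, y_test))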

Here is the final workflow. We have somewhat reduced the number of lines of code, but more importantly we are no longer repeating anything!
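
(An end-to-end sketch; the column names other than “annual_precipitation” and “low_precipitation”, the threshold, and the step names are assumptions, and sparse_output needs scikit-learn 1.2 or later.)

    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    class PrecipitationTransformer(BaseEstimator, TransformerMixin):
        """Adds a binary "low_precipitation" column (threshold is illustrative)."""

        def fit(self, X, y=None):
            return self

        def transform(self, X, y=None):
            X = X.copy()
            X["low_precipitation"] = (X["annual_precipitation"] < 12).astype(int)
            return X

    # 1-2. Load the data and split it
    df = pd.read_csv("antelope.csv")
    X = df.drop("spring_fawn_count", axis=1)  # assumed target column name
    y = df["spring_fawn_count"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # 3-5. Declare every preprocessing step and the model exactly once
    pipe = Pipeline(steps=[
        ("precip_transform", PrecipitationTransformer()),
        ("encode_winter", ColumnTransformer(
            transformers=[(
                "ohe",
                OneHotEncoder(categories="auto", handle_unknown="ignore", sparse_output=False),
                ["winter_severity"],  # assumed categorical column
            )],
            remainder="passthrough",
        )),
        ("model", LinearRegression()),
    ])
    pipe.fit(X_train, y_train)

    # 6-7. The same fitted preprocessing is applied to X_test automatically
    print(pipe.score(X_test, y_test))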

Check out the blog post on ColumnTransformers mentioned previously, this example from scikit-learn, or this Medium post for more advanced examples.

Thanks for reading, and let me know in the comments if you have any questions!
