Clean Data Science workflow with Sklearn Pipeline

Nitin · Published in Analytics Vidhya · 3 min read · Apr 30, 2020

A Pipeline is a container of steps: it packages an entire workflow, from preprocessing to model fitting, into a single object. The steps are stacked one after another, each block taking the output of the previous block as its input and passing its own output to the next. Simple as that!

Imagine a pipeline with two steps, (i) a Normalizer and (ii) a LinearRegression model. The data is first passed to the Normalizer block, which transforms it and sends it on to the LinearRegression block, which fits the model on the transformed data and can then make predictions. The data source itself remains unchanged when working with Pipelines: all transformations happen in memory, so the original data stays intact.

Two-Step Pipeline

Pipelines are initialized by passing a list of (name, estimator) tuples, e.g. ('Norm', Normalizer()) or ('LR', LinearRegression()), as sketched after the list below. Some of the basic operations a Pipeline step can perform:

1. Transformer

transform(X) applies the transformation to X (fit_transform(X, [y]) fits and transforms in one step)

2. Estimator

fit(X, y) fits the model object on the data

3. Predictions

predict(X_test) predicts the results
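For concreteness, here is a minimal sketch of the two-step Normalizer + LinearRegression pipeline described above; the synthetic data from make_regression is just an assumption for illustration:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

#synthetic regression data, only for illustration
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
#steps are (name, estimator) tuples
pipe = Pipeline([('Norm', Normalizer()), ('LR', LinearRegression())])
pipe.fit(X, y)          #Normalizer fit_transforms X, then LinearRegression fits
preds = pipe.predict(X) #Normalizer transforms X, then LinearRegression predicts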

#importing libraries
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA,KernelPCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score,GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
import pandas as pd
import numpy as np
  1. Transformer, Estimator and Prediction

As an example, I am going to use the Boston housing dataset. In the first step, we transform the data using StandardScaler and then fit it with a RandomForestRegressor. For the final step, we pass the whole pipeline into cross_val_score.

boston = load_boston() #note: load_boston was deprecated and removed in scikit-learn 1.2
df = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
#initializing and defining a Pipeline
pipe = []
pipe.append(('SC', StandardScaler()))
pipe.append(('rfr', RandomForestRegressor(n_estimators=200)))
model = Pipeline(pipe)
#cross val score (default scoring for a regressor is R^2)
cv_results = cross_val_score(model, df, y, cv=5)
msg = "%s: %f (%f)" % ('Pipeline', cv_results.mean(), cv_results.std())
print(msg)

We can add as many steps to the pipeline as we wish. Let's add PCA between StandardScaler and RandomForestRegressor.

pipe = []
pipe.append(('SC', StandardScaler())) #step 0
pipe.append(('pca', PCA(n_components=8))) #step 1
pipe.append(('rfr', RandomForestRegressor())) #step 2
model = Pipeline(pipe)

We can access a single step of the pipeline using model.steps[0], and all the steps at once using model.named_steps.
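A quick sketch of both access patterns, assuming the three-step pipeline defined just above:

print(model.steps[0])           #('SC', StandardScaler())
print(model.named_steps['pca']) #PCA(n_components=8)
print(list(model.named_steps))  #['SC', 'pca', 'rfr']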

2. Grid Search CV with Pipelines

Using GridSearchCV with a pipeline can save you from writing a lot of repetitive code, and it is very useful when you want to tune hyper-parameters for several steps of your project in one search. With the pipeline created above, we can search over the hyper-parameters of two steps at once: (i) PCA and (ii) RandomForestRegressor. Parameters are addressed with the step name as a prefix, e.g. pca__n_components.

#defining the params for hyper-parameters, prefixed with the step name
params = dict(pca__n_components=[2, 5, 10],
              rfr__n_estimators=[100, 200, 300, 400],
              rfr__max_depth=[2, 4, 6, 8, 10])
#Grid Search CV over the whole pipeline
gcv = GridSearchCV(model, param_grid=params, n_jobs=-1).fit(df, y)
gcv.best_params_
{'pca__n_components': 10, 'rfr__max_depth': 8, 'rfr__n_estimators': 200}
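Because GridSearchCV refits on the full data with the best parameters by default (refit=True), you also get a ready-to-use tuned pipeline; a quick sketch:

best_model = gcv.best_estimator_ #the Pipeline refitted with the best params
print(gcv.best_score_)           #mean cross-validated score of the best params
print(best_model.predict(df)[:5])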

3. Feature Union

A FeatureUnion is a horizontal pipeline that concatenates the outputs of several transformers into one. During fitting, each transformer is fit to the data independently; they are applied in parallel, and the feature matrices they output are concatenated side by side into one larger matrix. For example, we can use PCA and kernel PCA to transform the data and get a final output that mixes both decompositions.

pipe = []
pipe.append(('kernel_pca', KernelPCA(n_components=5)))
pipe.append(('pca', PCA(n_components=5)))
model = FeatureUnion(pipe)
model.fit_transform(df, y).shape #fit() alone returns the FeatureUnion itself, not an array
(506, 10)

The result is a combination of 5 components from PCA and 5 components from kernel PCA, concatenated into one matrix. We can try any number of combinations, like [PCA, SelectKBest], and the best part is that the code stays clean and presentable. If you want to take out one part and replace it with another, it is very modular; a sketch of the [PCA, SelectKBest] combination follows below. Check out the scikit-learn documentation for more complex Pipeline examples.
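A minimal sketch of the [PCA, SelectKBest] combination mentioned above, wrapped in a full pipeline with the regressor. It reuses df and y from the Boston snippet; the extra imports and the component/feature counts (5 and 5) are assumptions for illustration:

from sklearn.feature_selection import SelectKBest, f_regression

features = FeatureUnion([('pca', PCA(n_components=5)),
                         ('kbest', SelectKBest(score_func=f_regression, k=5))])
model = Pipeline([('features', features),
                  ('rfr', RandomForestRegressor(n_estimators=200))])
print(cross_val_score(model, df, y, cv=5).mean()) #mean R^2 across the folds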

Advantages of Pipelines:

  • Easy to write. Write it once, use it many times.
  • Easy to swap pieces.
  • Not messy, clean and presentable.
  • Keeps all intermediate steps together.
