# Clean Data Science workflow with Sklearn Pipeline

A pipeline is a container of steps that packages a workflow, from preprocessing to model fitting, into a single object. The steps are chained one after another: each step takes the output of the previous step as its input and passes its own output on to the next step. Simple as that!

Imagine a pipeline with two steps: (i) a Normalizer and (ii) a LinearRegression model. The data is first passed to the Normalizer, which transforms it and sends it on to LinearRegression, which is then fitted on the Normalizer's output. The fitted pipeline can then make predictions, applying the same transformation to new data first. *The data source remains unchanged when working with Pipelines: the transformations are done in memory, so the source stays intact.*
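The two-step pipeline described above could be sketched roughly as follows. This is a minimal illustration, not the article's own code: the synthetic data from `make_regression` and the names `X_demo`, `y_demo` and `two_step` are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Synthetic regression data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=0.1, random_state=0)
X_before = X_demo.copy()

two_step = Pipeline([
    ("Norm", Normalizer()),       # step 0: rescales each sample (row) to unit norm
    ("LR", LinearRegression()),   # step 1: fitted on the normalized data
])

two_step.fit(X_demo, y_demo)
preds = two_step.predict(X_demo[:5])  # new data goes through the Normalizer first

# With the default settings the input array is left untouched
print((X_demo == X_before).all())
```

The check at the end compares the input array with a copy taken before fitting, which is one way to confirm that the transformation did not modify the source data.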

Pipelines are initialized by passing a list of tuples of the form **(name, task())**, for example **('Norm', Normalizer())** or **('LR', LinearRegression())**.

Some of the basic steps that can be performed by Pipelines:

1. Transformer: `transform(X)` applies the transformation to X, using the parameters learned by `fit(X, [y])`.
2. Estimator: `fit(X, y)` fits the model object on the data.
3. Predictor: `predict(X_test)` predicts the results.
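To make the three roles above concrete, here is a small sketch that performs them by hand and then through a Pipeline, and checks that both routes give the same predictions. The synthetic data and the names `scaler`, `reg` and `pipe_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

# By hand: fit the transformer, transform the data, fit the estimator, predict
scaler = StandardScaler().fit(X_train)
reg = LinearRegression().fit(scaler.transform(X_train), y_train)
manual_preds = reg.predict(scaler.transform(X_test))

# The same sequence wrapped in a Pipeline
pipe_demo = Pipeline([("SC", StandardScaler()), ("LR", LinearRegression())])
pipe_preds = pipe_demo.fit(X_train, y_train).predict(X_test)

# Both routes should agree up to floating-point error
print(np.allclose(manual_preds, pipe_preds))
```

Note that the scaler is fitted on the training split only and merely applied to the test split. The Pipeline takes care of this automatically, which is one reason it helps avoid leaking test data into preprocessing.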

Importing the libraries:

    # load_boston was removed in scikit-learn 1.2, so this import needs an older release
    from sklearn.datasets import load_boston
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score, GridSearchCV
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import Pipeline
    from sklearn.pipeline import FeatureUnion
    import pandas as pd
    import numpy as np

**1. Transformation, Estimator and Prediction**

As an example, I am going to use the **Boston Dataset**. In the first step, we transform the data using **StandardScaler** and then fit a **RandomForestRegressor** on the transformed data. For the final step, we pass the whole pipeline to **cross_val_score**.

    # load the data (load_boston is only available in scikit-learn versions before 1.2)
    boston = load_boston()
    df = pd.DataFrame(boston.data, columns=boston.feature_names)
    y = boston.target

    # initializing and defining a Pipeline
    pipe = []
    pipe.append(('SC', StandardScaler()))
    pipe.append(('rfr', RandomForestRegressor(n_estimators=200)))
    model = Pipeline(pipe)

    # cross val score (for a regressor the default score is R^2)
    cv_results = cross_val_score(model, df, y, cv=5)
    msg = "%s: %f (%f)" % ('Pipeline', cv_results.mean(), cv_results.std())
    print(msg)
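By default, `cross_val_score` reports R² for a regressor. If an error metric such as MSE is preferred, the `scoring` argument can be set, as in the sketch below. Synthetic data stands in for the Boston data here, and the names `cv_model`, `r2_scores` and `mse_scores` are assumptions for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=0)

cv_model = Pipeline([
    ("SC", StandardScaler()),
    ("rfr", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# Default scoring for a regressor: R^2, one value per fold
r2_scores = cross_val_score(cv_model, X_demo, y_demo, cv=5)

# scikit-learn scorers follow "higher is better", so MSE is exposed as its negative
neg_mse = cross_val_score(cv_model, X_demo, y_demo, cv=5,
                          scoring="neg_mean_squared_error")
mse_scores = -neg_mse

print("R^2: %f (%f)" % (r2_scores.mean(), r2_scores.std()))
print("MSE: %f (%f)" % (mse_scores.mean(), mse_scores.std()))
```

Because the scaler lives inside the pipeline, it is re-fitted on the training part of every fold, so no information from the held-out fold reaches the preprocessing step.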

We can add as many steps to the pipeline as we wish. Let's add **PCA** to the pipeline between **StandardScaler** and **RandomForestRegressor**.

    pipe = []
    pipe.append(('SC', StandardScaler()))          # step 0
    pipe.append(('pca', PCA(n_components=8)))      # step 1
    pipe.append(('rfr', RandomForestRegressor()))  # step 2
    model = Pipeline(pipe)

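We can access an individual step of the pipeline by position using **model.steps[0]**, which returns its (name, estimator) tuple.

We can access all the steps of the pipeline by name using **model.named_steps**, for example **model.named_steps['pca']**.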
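A short sketch of both access styles, assuming synthetic data and a pipeline named `steps_demo` (both are illustrative, not from the original example). After fitting, the attributes learned by each step can be inspected through these handles.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=120, n_features=10, noise=1.0, random_state=0)

steps_demo = Pipeline([
    ("SC", StandardScaler()),
    ("pca", PCA(n_components=8)),
    ("rfr", RandomForestRegressor(n_estimators=20, random_state=0)),
]).fit(X_demo, y_demo)

# By position: each entry of .steps is a (name, estimator) tuple
name, first_step = steps_demo.steps[0]

# By name: .named_steps behaves like a dictionary keyed by step name
pca_step = steps_demo.named_steps["pca"]

print(name, type(first_step).__name__)
print(pca_step.explained_variance_ratio_.shape)  # one ratio per kept component
```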

**2. Grid Search CV with Pipelines**

Passing a Pipeline to GridSearchCV can save you from writing a lot of boilerplate code, and it is especially useful when you want to tune the hyper-parameters of several steps at once. Each parameter is addressed as `<step name>__<parameter>` (with two underscores). With the pipeline created above, we can search over the hyper-parameters of both (i) PCA and (ii) RandomForestRegressor.

    # defining the params for hyper-parameters
    params = dict(pca__n_components=[2, 5, 10],
                  rfr__n_estimators=[100, 200, 300, 400],
                  rfr__max_depth=[2, 4, 6, 8, 10])

    # Grid Search CV
    gcv = GridSearchCV(model, param_grid=params, n_jobs=-1).fit(df, y)
    gcv.best_params_
    # {'pca__n_components': 10, 'rfr__max_depth': 8, 'rfr__n_estimators': 200}
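If you are unsure which parameter names a pipeline exposes, `get_params()` lists them. The sketch below uses it and then runs a deliberately small grid so that it finishes quickly. The synthetic data, the reduced grid and the names `search_model`, `small_grid` and `search` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=120, n_features=10, noise=1.0, random_state=0)

search_model = Pipeline([
    ("SC", StandardScaler()),
    ("pca", PCA()),
    ("rfr", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# Every tunable parameter appears in get_params() as "<step name>__<parameter>"
tunable = [k for k in search_model.get_params() if k.startswith(("pca__", "rfr__"))]
print(tunable[:5])

# A small grid: 2 x 2 combinations, each evaluated with 3-fold cross-validation
small_grid = {
    "pca__n_components": [2, 5],
    "rfr__max_depth": [2, 4],
}
search = GridSearchCV(search_model, param_grid=small_grid, cv=3).fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds the whole pipeline refitted on all the data with the best parameter combination, so it can be used directly for prediction.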

**3. Feature Union**

A feature union is a horizontal pipeline that concatenates the outputs of several transformers into one. During fitting, each transformer is fit to the data independently. The transformers are applied in parallel, and the feature matrices they output are concatenated side by side into a larger matrix. We can use **PCA** and **kernel PCA** to transform the data and get a final output that combines both decompositions.

    pipe = []
    pipe.append(('kernel_pca', KernelPCA(n_components=5)))
    pipe.append(('pca', PCA(n_components=5)))
    model = FeatureUnion(pipe)

    # fit_transform returns the concatenated feature matrix
    model.fit_transform(df, y).shape
    # (506, 10)

The result is a combination of 5 components from PCA and 5 components from kernel PCA, concatenated into one matrix. We can try any number of combinations, such as [PCA, SelectKBest]. The best part is that the code stays clean and presentable, and it is very modular: if you want to take out one part and replace it with another, you can. Check out this link for a more complex example of Pipelines.
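The [PCA, SelectKBest] combination mentioned above might look like the following sketch, where the union is also used as the first step of a Pipeline. The synthetic data, the choice of `f_regression` as the scoring function, and the names `features` and `union_model` are assumptions for the example.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=100, n_features=8, noise=1.0, random_state=0)

features = FeatureUnion([
    ("pca", PCA(n_components=3)),                          # 3 derived components
    ("kbest", SelectKBest(score_func=f_regression, k=2)),  # 2 original columns
])

# The two outputs are stacked side by side: 3 + 2 = 5 columns
combined = features.fit_transform(X_demo, y_demo)
print(combined.shape)

# A FeatureUnion is itself a transformer, so it can be a step in a Pipeline
union_model = Pipeline([("features", features), ("LR", LinearRegression())])
union_model.fit(X_demo, y_demo)
print(union_model.predict(X_demo[:3]).shape)
```

Passing `y` to `fit_transform` matters here because SelectKBest is a supervised selector, while PCA simply ignores it.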

Advantages of Pipelines:

- Easy to write. Write it once, use it many times.
- Easy to swap pieces.
- Not messy, clean and presentable.
- Keeps all intermediate steps together.
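As an illustration of the "easy to swap pieces" point, a step can be replaced by name with `set_params`. This is a minimal sketch under assumptions: the data is synthetic, and the step name `reg` and the use of `Ridge` as the replacement model are choices made for the example only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)

swap_model = Pipeline([
    ("SC", StandardScaler()),
    ("reg", RandomForestRegressor(n_estimators=10, random_state=0)),
])
swap_model.fit(X_demo, y_demo)

# Replace the final step by name; the rest of the pipeline is unchanged
swap_model.set_params(reg=Ridge(alpha=1.0))
swap_model.fit(X_demo, y_demo)

# Setting a step to "passthrough" skips it without rebuilding the pipeline
swap_model.set_params(SC="passthrough")
swap_model.fit(X_demo, y_demo)

print(type(swap_model.named_steps["reg"]).__name__)
```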