Scikit-learn pipelines and Cross validation

ML Pipelines using scikit-learn and GridSearchCV

Managing ML workflows with Pipelines and using GridSearch Cross validation techniques for parameter tuning

Nikhil Pentapalli
Published in Analytics Vidhya
6 min read · Jun 7, 2020



Machine learning algorithms generally process large amounts of data. A pipeline is a way to chain the required data processing steps together in an organized manner. It also helps in building a clean workflow and reproducible code.

What is an ML Pipeline?

A pipeline is a series of steps in which data is transformed. It originates from the "pipe and filter" design pattern. You may have a class for each filter, and then another class that combines those steps into a complete final pipeline.

Methods of a Scikit-Learn Pipeline

The steps of a pipeline rely on two main methods:

  • “fit” learns from the data and stores the resulting state.
  • “transform” (or “predict” for the final estimator) actually processes the data and generates the output or prediction.

The two can also be chained into a single call:

  • “fit_transform” fits and then transforms the same data in one go, which allows for code optimizations when the two steps run back to back.
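The three methods can be seen on a single transformer. The following is a minimal sketch using StandardScaler on a made-up array (the values are hypothetical and not part of the article's dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values, for illustration only)
X = np.array([[1.0], [2.0], [3.0]])

# fit() learns the state: here the column mean and standard deviation
scaler = StandardScaler().fit(X)

# transform() applies the learned state to the data
X_scaled = scaler.transform(X)

# fit_transform() does both steps in a single call
X_scaled_once = StandardScaler().fit_transform(X)

print("learned mean:", scaler.mean_)
print("same result:", np.allclose(X_scaled, X_scaled_once))
```

A Pipeline exposes the same methods and simply forwards each call through its steps in order.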

First we define the model pipelines, and then we run grid search cross-validation to find the best model for our problem statement.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

Here I am using the BBC News dataset from Kaggle to build the pipelines.

sample data
Value Counts of each category
sns.countplot(x="category", data=news)
Plot of each category

Logistic Regression Classifier

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(news['text'], news.category, test_size=0.2, random_state=2020)

# Chain vectorization, TF-IDF weighting and the classifier
pipe_lr = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('model', LogisticRegression())])
model = pipe_lr.fit(x_train, y_train)
prediction = model.predict(x_test)
print("\naccuracy: {}%".format(round(accuracy_score(y_test, prediction)*100, 2)))
print('\n', confusion_matrix(y_test, prediction))
print('\n', classification_report(y_test, prediction))
complete report of the model

We imported Pipeline from the scikit-learn library. In a pipeline we define the steps that have to be performed, in sequence. Here we first split our data into train and test sets using train_test_split.

Then we defined CountVectorizer, TF-IDF and Logistic Regression, in that order, in our pipeline. This reduces the amount of code, and putting the model in a pipeline makes it easier to compare it with other models and pick the best one.

Similarly, let’s create pipelines for different models with the same input data and select the best model among them.

Creation of different Pipelines:

I constructed four pipelines, one for each of four different models.

  1. Logistic Regression
  2. Decision Tree
  3. Random Forest
  4. Support Vector Machine
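The original code for these four pipelines is not reproduced in the text, so the following is only a sketch of how they might look. It assumes each pipeline reuses the CountVectorizer and TF-IDF steps from the Logistic Regression example and names the final step 'clf', which matches the 'clf__' prefixes in the grid search output further down. The helper make_text_pipeline is a hypothetical name introduced here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_text_pipeline(classifier):
    """Hypothetical helper: vectorizer -> TF-IDF -> classifier."""
    return Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', classifier)])


# One pipeline per model, all sharing the same text preprocessing steps
pipelines = {
    'Logistic Regression': make_text_pipeline(LogisticRegression(random_state=42)),
    'Decision Tree': make_text_pipeline(DecisionTreeClassifier(random_state=42)),
    'Random Forest': make_text_pipeline(RandomForestClassifier(random_state=42)),
    'Support Vector Machine': make_text_pipeline(SVC(random_state=42)),
}

for name, pipe in pipelines.items():
    print(name, '->', [step_name for step_name, _ in pipe.steps])
```

Giving the last step the same name in every pipeline keeps the parameter grids uniform, since every hyperparameter is then addressed as clf__<parameter>.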

You can freely change the architecture of any single pipeline. For example, in Logistic Regression, if you feel the feature values need to be scaled you can add a standard scaler, or any other preprocessing technique.

Let us assume a case where preprocessing is needed and see how to handle it.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

I have created two separate pipelines to handle the different types of data. num_transformer fills in missing numeric values using SimpleImputer (here with the column mean), and StandardScaler then transforms the data so that each feature has a mean of 0 and a standard deviation of 1.

In cat_transformer I have used SimpleImputer to fill missing categories with the constant 'missing' and OneHotEncoder to encode the categories.
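To see what each of the two transformers produces, here is a small sketch on made-up columns. The 'age' and 'city' columns and their values are hypothetical and chosen only to include one missing entry each.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Hypothetical toy columns, each with one missing value
ages = pd.DataFrame({'age': [20.0, np.nan, 40.0]})
cities = pd.DataFrame({'city': pd.Series(['Paris', np.nan, 'Rome'], dtype=object)})

# The missing age is replaced by the mean (30.0) before scaling
ages_scaled = num_transformer.fit_transform(ages)

# The missing city becomes its own 'missing' category, giving three one-hot columns
cities_encoded = cat_transformer.fit_transform(cities).toarray()

print(ages_scaled.ravel())
print(cities_encoded)
```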

Now let us take the numerical and the categorical feature columns and apply the preprocessing pipelines described above. To do that, I take train_data and select the specific columns as shown below.

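# float64 is included because numeric columns containing missing values are stored as floats
num_features = train_data.select_dtypes(include=['int64', 'float64']).columns
cat_features = train_data.select_dtypes(include=['object']).columns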

Once these are ready, we can push everything into a ColumnTransformer, which can be imported from sklearn.compose.

from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer([
('num', num_transformer, num_features),
('cat', cat_transformer, cat_features)])

Unlike the earlier example, all the preprocessing now happens in one step, which can be reused across multiple models.

from sklearn.ensemble import RandomForestClassifier
pipe_rf = Pipeline([('preprocess', preprocessor),
                    ('clf', RandomForestClassifier(random_state=42))])
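Putting the pieces together, the sketch below runs the whole preprocessing-plus-model pipeline end to end. The train_data frame, its 'age' and 'city' columns and the target values are hypothetical stand-ins, and the column lists are written out by hand rather than selected by dtype.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for train_data
train_data = pd.DataFrame({
    'age': [22.0, 35.0, np.nan, 58.0, 41.0, 29.0],
    'city': pd.Series(['Paris', 'Rome', np.nan, 'Rome', 'Paris', 'Rome'], dtype=object),
})
target = [0, 1, 0, 1, 1, 0]

num_features = ['age']
cat_features = ['city']

num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Route each group of columns to its own preprocessing pipeline
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)])

pipe_rf = Pipeline([('preprocess', preprocessor),
                    ('clf', RandomForestClassifier(n_estimators=10, random_state=42))])

# A single fit call imputes, scales, encodes and trains the model
pipe_rf.fit(train_data, target)
predictions = pipe_rf.predict(train_data)
transformed_shape = pipe_rf.named_steps['preprocess'].transform(train_data).shape
print(predictions, transformed_shape)
```

Because the preprocessor is a step of the pipeline, the same imputation and encoding learned on the training data is applied automatically at prediction time.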

Let us now fit the models using GridSearchCV, which helps with model selection by trying many different parameter values for each pipeline and returning the best model along with the parameters it was fit with. Let’s get started by defining some parameters for the grid search.

Logistic Regression uses the l2 penalty by default, but I would like to experiment with the l1 penalty as well. Similarly, for Random Forest I want to try both ‘gini’ and ‘entropy’ as the split criterion, so I have passed both values to clf__criterion.

For the SVM you can also experiment with different kernels of your choice, depending on the specific problem statement and use case.
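The grid search code itself is not reproduced in the text. The sketch below shows one way such a loop could be written. It is not the author's exact code: the toy corpus, the grid values, the 3-fold setting and the text_pipeline helper are all assumptions made here so that the example runs on its own, and only the parameter names follow the output shown below.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the news data (repeated to get enough samples)
sport = ["the team won the football match", "the striker scored a late goal",
         "the coach praised the players", "fans cheered at the stadium",
         "the club signed a new goalkeeper", "the match ended in a draw"]
tech = ["the new phone has a faster processor", "the software update fixes a security bug",
        "engineers released a new programming language", "the laptop ships with more memory",
        "the startup built a search engine", "the app stores data in the cloud"]
texts = sport * 3 + tech * 3
labels = ['sport'] * 18 + ['tech'] * 18
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=2020, stratify=labels)


def text_pipeline(clf):
    """Hypothetical helper: vectorizer -> TF-IDF -> classifier."""
    return Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', clf)])


# Each entry pairs a pipeline with the grid of values to try (assumed values)
searches = {
    'Logistic Regression': (text_pipeline(LogisticRegression(random_state=42)),
                            {'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 0.5],
                             'clf__solver': ['liblinear']}),
    'Random Forest': (text_pipeline(RandomForestClassifier(n_estimators=20, random_state=42)),
                      {'clf__criterion': ['gini', 'entropy'], 'clf__max_depth': [5, 10],
                       'clf__min_samples_split': [2, 10]}),
    'Support Vector Machine': (text_pipeline(SVC(random_state=42)),
                               {'clf__kernel': ['linear', 'rbf'], 'clf__C': [1, 9]}),
}

best_name, best_acc = None, -1.0
for name, (pipe, grid) in searches.items():
    gs = GridSearchCV(pipe, grid, scoring='accuracy', cv=3)
    gs.fit(x_train, y_train)
    test_acc = gs.score(x_test, y_test)  # accuracy of the refit best model on held-out data
    print('\nEstimator: %s' % name)
    print('Best params: %s' % gs.best_params_)
    print('Best training accuracy: %.3f' % gs.best_score_)  # mean cross-validated score
    print('Test set accuracy score for best params: %.3f' % test_acc)
    if test_acc > best_acc:
        best_name, best_acc = name, test_acc

print('\nClassifier with best test set accuracy: %s' % best_name)
```

Note that best_score_ is the mean cross-validated accuracy on the training split, which is what the output labels "best training accuracy".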


Performing model optimizations...

Estimator: Logistic Regression
Best params: {'clf__C': 1.0, 'clf__penalty': 'l2', 'clf__solver': 'liblinear'}
Best training accuracy: 0.969
Test set accuracy score for best params: 0.966

Estimator: Random Forest
Best params: {'clf__criterion': 'gini', 'clf__max_depth': 10, 'clf__min_samples_split': 10}
Best training accuracy: 0.844
Test set accuracy score for best params: 0.836

Estimator: Support Vector Machine
Best params: {'clf__C': 9, 'clf__kernel': 'linear'}
Best training accuracy: 0.978
Test set accuracy score for best params: 0.971

Classifier with best test set accuracy: Support Vector Machine

Saved Support Vector Machine grid search pipeline to file: best_grid_search_pipeline.pkl

Having compared Logistic Regression, Random Forest and SVM, I can see that SVM is the best model, with a test set accuracy of 0.971 (best training accuracy 0.978). We also obtained the best parameters from the grid search cross-validation.

I have used accuracy as the scoring parameter, but the right choice depends on the problem statement. You can instead use recall, precision or f1-score, all of which are derived from the confusion matrix.
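Switching the metric only requires changing the scoring argument of GridSearchCV. The sketch below illustrates this on a synthetic, imbalanced binary dataset. The data, the pipeline and the grid of C values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced binary data (hypothetical, roughly 80% / 20%)
X, y = make_classification(n_samples=200, n_features=8, weights=[0.8, 0.2], random_state=42)

pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
param_grid = {'clf__C': [0.1, 1.0, 10.0]}

# Run the same search once per scoring metric
results = {}
for scoring in ['accuracy', 'precision', 'recall', 'f1']:
    gs = GridSearchCV(pipe, param_grid, scoring=scoring, cv=5)
    gs.fit(X, y)
    results[scoring] = (gs.best_params_, gs.best_score_)
    print('%-9s best params: %s  best score: %.3f' % (scoring, gs.best_params_, gs.best_score_))
```

On imbalanced data the metrics can disagree noticeably, which is why the choice should follow from the problem statement.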

Finally, grid search builds and evaluates a model for every combination of the specified hyperparameters. Another efficient technique for hyperparameter tuning is randomized search, where random combinations of the hyperparameters are tried to find the best solution.
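A randomized search can be sketched with scikit-learn's RandomizedSearchCV. In the example below, the synthetic data, the log-uniform range for C and the budget of 10 iterations are assumptions picked for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (hypothetical) so the example is self-contained
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC())])

# Distributions are sampled from; lists are sampled uniformly
param_distributions = {
    'clf__C': loguniform(1e-2, 1e2),
    'clf__kernel': ['linear', 'rbf'],
}

# n_iter fixes how many random combinations are evaluated
search = RandomizedSearchCV(pipe, param_distributions, n_iter=10,
                            scoring='accuracy', cv=5, random_state=42)
search.fit(X, y)

print('Best params:', search.best_params_)
print('Best cross-validated accuracy: %.3f' % search.best_score_)
```

The cost is controlled by n_iter rather than by the size of the grid, which helps when some hyperparameters are continuous.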

The confusion matrix and the scoring parameters can be understood from the images below.

Confusion Matrix (image by author)
Scoring Parameters (image by author)
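As a worked illustration of how the scoring parameters follow from the confusion matrix, the sketch below computes them by hand on made-up binary labels and checks the results against scikit-learn's own metric functions. The label vectors are hypothetical.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true and predicted labels for a binary problem
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print('TP=%d FP=%d FN=%d TN=%d' % (tp, fp, fn, tn))
print('accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f' % (accuracy, precision, recall, f1))
```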

You can define your own custom transformers, which must implement fit and transform methods. As you may have noticed, I left the column transformer for you to try on your own.
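A minimal sketch of a custom transformer is given below. The ClipOutliers class, its percentile parameters and the toy data are hypothetical and introduced only for illustration. Inheriting from TransformerMixin supplies fit_transform, and BaseEstimator supplies get_params and set_params so the class also works inside a grid search.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ClipOutliers(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clip each column to percentile bounds learned in fit."""

    def __init__(self, lower=5, upper=95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learn the per-column bounds from the training data
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Apply the learned bounds; values outside them are clipped
        return np.clip(X, self.low_, self.high_)


# Toy column with one extreme value (hypothetical)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

clipper = ClipOutliers(lower=0, upper=75).fit(X)
X_clipped = clipper.transform(X)

# The custom step can be chained with built-in transformers
pipe = Pipeline([('clip', ClipOutliers(lower=0, upper=75)), ('scaler', StandardScaler())])
X_out = pipe.fit_transform(X)

print(X_clipped.ravel())
print(X_out.ravel())
```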

Voila, there you go. Now you can try building your own machine learning pipeline and do not forget to try a custom transformer. Also change the parameters in the grid search and run experiments. The best way to learn anything is by experimenting with it.

I hope this article adds to your knowledge. Keep supporting and Happy Learning.


