Scikit-learn Pipelines for Beginners

A useful tool for streamlining the modeling process.

Garrett Keyes
Analytics Vidhya
4 min read · Dec 12, 2019


Source: scikit-learn.org

In its simplest definition, a pipeline in Scikit-learn, as the documentation puts it, "can be used to chain multiple estimators into one." This can be very useful when conducting feature selection, normalization, and classification, for three main reasons: you only have to call fit and predict once to run a whole sequence of estimators; Scikit-learn's hyper-parameter tuner, GridSearchCV, can be applied over all of the estimators at once; and pipelines help prevent leaking statistics from your test data into the trained model.

Source: Toward Data Science

Simply put, pipelines in Scikit-learn can be thought of as a means to automate the prediction process: a given order of operations applies selected procedures to predetermined models. You can even run each model and see which works best. But instead of talking about it, how about we see it in action!

When using pipelines in Scikit-learn, it is important to note that the model should be optimized after finding the best hyper-parameter, or range of hyper-parameters, you wish to use. This is especially the case when using an exhaustive grid search.

Following the guide created by KDnuggets, a software and education website, you start by importing Pipeline from the Scikit-learn library along with any other libraries you need.
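A minimal sketch of those imports might look like this (assuming a scaler, a train/test split, grid search, and the two models used below; xgboost is a separate install, and everything else ships with Scikit-learn):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier  # requires the xgboost package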

After importing Pipeline, the first step is to instantiate a pipeline for each model you plan to run. Pipelines take a list of tuples that can be read as an order of operations. The strings used in each tuple can be anything you want; they are just names to identify the transformer or model you are using. The last item in the list of tuples, however, must be a model you are planning on using. Here I use XGBoost and Random Forest.
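For instance, two pipelines along these lines (the step names 'scaler' and 'clf', and the variable names pipe_rf and pipe_xgb, are just illustrative choices):

# Each tuple is (name, transformer or model); the model must come last.
pipe_rf = Pipeline([('scaler', StandardScaler()),
                    ('clf', RandomForestClassifier(random_state=42))])

pipe_xgb = Pipeline([('scaler', StandardScaler()),
                     ('clf', XGBClassifier(random_state=42))])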

From there, if you plan to use grid search, you set the grid-search parameters for each model you are using. An exhaustive GridSearchCV works by checking every combination of the specified hyper-parameters to see which combination produces the highest-accuracy model. But while grid search is powerful, it can take a long time to run depending on the model used, and the run time can grow exponentially as more models or parameters are included in the pipeline.
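As an illustration, a small grid for each pipeline might look like this; the 'clf__' prefix targets the pipeline step named 'clf' above, and the specific values are just example choices:

# Keys follow the pattern '<step name>__<parameter name>'.
param_grid_rf = {'clf__n_estimators': [100, 200],
                 'clf__max_depth': [4, 6, None]}

param_grid_xgb = {'clf__n_estimators': [100, 200],
                  'clf__max_depth': [3, 6],
                  'clf__learning_rate': [0.05, 0.1]}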

If a pipeline is being used to run regressions on changing sets of data, it's best to narrow the search parameters to what has historically produced the highest accuracy for the model, since every additional hyper-parameter value to be checked adds more run time.

After setting the parameters to be used in the grid search, you instantiate each grid search. It's important to set the estimator parameter to the pipeline you plan to use and param_grid to that specific estimator's parameter grid. Setting cv=2 performs a 2-fold cross-validation for each model.
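A sketch of what that might look like, reusing the hypothetical pipelines and parameter grids from above:

gs_rf = GridSearchCV(estimator=pipe_rf,
                     param_grid=param_grid_rf,
                     scoring='accuracy',
                     cv=2)  # 2-fold cross-validation

gs_xgb = GridSearchCV(estimator=pipe_xgb,
                      param_grid=param_grid_xgb,
                      scoring='accuracy',
                      cv=2)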

Next, create a list of the different grid-search estimators you plan to run. This will be useful later for iteration.
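Continuing the example:

grids = [gs_rf, gs_xgb]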

The KDnuggets guide recommends creating a dictionary that names each model in your pipeline. I chose to follow their lead and include it in my workflow, and it proved useful later, along with the accuracy, best_model, and best_grid_search variables used to track the best result.
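Along those lines (the names mirror the variables mentioned above, and the dictionary keys assume the list order from the previous step):

# Map each grid's position in the list to a readable name.
grid_dict = {0: 'Random Forest', 1: 'XGBoost'}

best_accuracy = 0.0
best_model = ''
best_grid_search = ''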

Finally, create a loop that goes through each grid search in the previously specified list of grids, fits it, and compares the results.
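A minimal sketch of that loop, assuming X_train, X_test, y_train, and y_test already exist (e.g. from train_test_split):

for i, gs in enumerate(grids):
    print('\nEstimator: %s' % grid_dict[i])
    gs.fit(X_train, y_train)  # fits every parameter combination via cross-validation
    print('Best params: %s' % gs.best_params_)
    print('Best training accuracy: %.3f' % gs.best_score_)
    test_acc = gs.score(X_test, y_test)  # accuracy of the refit best estimator
    print('Test set accuracy: %.3f' % test_acc)
    if test_acc > best_accuracy:
        best_accuracy = test_acc
        best_model = grid_dict[i]
        best_grid_search = gs

print('\nModel with best test accuracy: %s' % best_model)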

Creating a pipeline is certainly easier than rebuilding each model every time, but it can still be confusing, so following guides, blogs, and YouTube tutorials can be extremely helpful. Here is a list of the blogs and videos I looked at to help understand the process!
