ML Pipeline

Nallaperumal
Published in Analytics Vidhya
10 min read · Aug 16, 2020

Creating workflows and maintaining reproducibility in machine learning


In this post we will see what a pipeline is, why it is essential, and what versions of pipelines are available.

What is a pipeline and why is it necessary in ML?

For any machine learning model it is necessary to maintain the workflow and the dataset, especially for preprocessing that has to be repeated — say, imputation for missing values, scaling for continuous variables, encoding for categorical variables, or a CountVectorizer for text.


Pipeline — it helps us simplify the process and maintain the order of execution for our machine learning models. A pipeline integrates multiple steps into one, so that the data flows through a sequence of steps. Instead of calling each step separately, the pipeline concatenates all of the steps into one specific system, which enhances readability and reproducibility.

According to sklearn's definition of Pipeline:

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.

To create such pipelines we can use either the Pipeline class or the make_pipeline function, both available in the sklearn library; make_pipeline is a convenience shorthand for Pipeline that names the steps automatically.

Why go for pipelines?

1. When a series of transformations has to be performed in a systematic manner whenever unknown or test data comes in.

2. When a category which is not expected appears in the test set.

3. When there are missing values in the dataset (both test and train set).

If you want to process different data types, keep them synchronized, and preserve the same order every time, a pipeline is preferred.

We call each of the transformations a named step in the pipeline.

Let us look into some minor differences between make_pipeline and Pipeline

Difference between Pipeline and make_pipeline

Structure of Pipeline and make_pipeline: both take their steps in the form of tuples.

Let us see their notation with a random example:

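A minimal sketch of both notations, using hypothetical steps (an imputer, a scaler, and a logistic regression):

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Pipeline: each step is an explicit (name, transformer) tuple
pipe_explicit = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# make_pipeline: step names are generated automatically
# (the lowercased class name of each step)
pipe_auto = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)

print(pipe_auto.steps[0][0])  # -> simpleimputer
```

The only difference is who chooses the step names: you (Pipeline) or sklearn (make_pipeline).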

Now let us jump into more details of pipelines and their transformations.


Alongside Pipeline, sklearn provides a companion construct called the column transformer.

Column transformer structure with ColumnTransformer:

('name_of_the_transformation', SomeTransformer(parameters, if any), column_names)

Just an example to illustrate the structure

In the above definition the name of the transformation is mandatory if we have decided to go for ColumnTransformer (it needs to be explicit), but in the case of make_column_transformer it is not mandatory (the name is generated implicitly).

Column transformer structure with make_column_transformer:

(SomeTransformer(parameters, if any), column_names)

Just an example to illustrate the structure
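The two structures side by side, with hypothetical column names standing in for a real dataset:

```python
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# ColumnTransformer: explicit (name, transformer, columns) tuples
ct_named = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["island", "sex"]),
    ("scale", StandardScaler(), ["bill_length_mm"]),
])

# make_column_transformer: (transformer, columns) tuples, names auto-generated
ct_auto = make_column_transformer(
    (OneHotEncoder(), ["island", "sex"]),
    (StandardScaler(), ["bill_length_mm"]),
)

print(ct_auto.transformers[0][0])  # -> onehotencoder
```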

Column Transformer: this is particularly handy for datasets that contain heterogeneous data types, since we may want to scale the numeric features and one-hot encode the categorical ones. Here you can specify exactly which column needs which transformation, distinguishing the categorical from the continuous columns.

This applies transformers to columns of an array or pandas DataFrame.

Column transformer notation:

Each transformer is specified in the form of a tuple.

Notation:

sklearn.compose.make_column_transformer(*transformers, **kwargs)

The above syntax is taken from sklearn library

General flow for a column Transformer:


Also we can pass a pipeline to a column transformer.

If there are multiple processes involved, we can put a pipeline of transformers inside a column transformer, and then integrate that column transformer into the main model pipeline along with the ML algorithm.

For the above, consider a scenario with a categorical column (nominal type) that has missing values and needs one-hot encoding. Here we first build a small pipeline that imputes and then one-hot encodes, pass that pipeline object into the main column transformer, and then feed the column transformer into the pipeline for the model, i.e.:

Integrating column transformers into a pipeline and back into the main model pipeline

Now a question may come to your mind: why do we need to follow the above scenario?

The problem is that when we do imputation and one-hot encoding as two parallel transformers on the same columns in a column transformer, the output will have multiple copies of those columns (both imputed and non-imputed), and this is not ideal for the final pipeline. So it is better to follow the step mentioned above. We will also see this in our example problem.
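The whole scenario can be sketched on a toy stand-in for the penguins columns (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: a nominal column with a missing value, plus a numeric column
X = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Biscoe", "Torgersen"],
    "sex": ["male", np.nan, "female", "male"],
    "bill_length_mm": [39.1, 40.2, np.nan, 38.5],
})
y = ["Adelie", "Gentoo", "Adelie", "Chinstrap"]

# Inner pipeline: impute the categorical columns, THEN one-hot encode
imp_ohe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(),
)

# The inner pipeline is itself a transformer inside the column transformer
ct = make_column_transformer(
    (imp_ohe, ["island", "sex"]),
    (SimpleImputer(strategy="mean"), ["bill_length_mm"]),
)

# Outer pipeline: column transformer + the model
pipe = make_pipeline(ct, LogisticRegression())
pipe.fit(X, y)
print(pipe.predict(X))
```

Because imputation and encoding are chained inside `imp_ohe`, each nominal column appears exactly once in the output, already imputed and encoded.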

One point to note is that all the steps included in a pipeline except the final one must be transformers.

The final step can be a transformer or a model, but more often it is a model.

Specifications on column transformers:

It is also possible to retrieve feature names from the column transformer. All the transformers are stored in its named_transformers_ dictionary attribute.

We will now explore pipelines and column transformers in detail with the help of the penguins dataset.

The following dataset is referred from :

Allisonhorst-palmerpenguins

This can alternatively be installed in Python 3 via pip (the palmerpenguins package).

https://scikit-lego.readthedocs.io/en/latest/api/datasets.html#

palmerpenguins art — Artwork by @allison_horst

We will use only make_pipeline in this article. Make sure that your sklearn library is up to date — the recommended version is ≥ 0.22.1.

Our target variable here is species which is multi-class.

Let us now do a quick EDA before jumping into logistic regression using a pipeline.

From the above it is evident that there are missing values in our dataset.
Here we define a function for the target (a label encoder).
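A minimal sketch of such a label-encoding function for the target (the species-to-integer mapping here is an assumed convention, not taken from the original code):

```python
import pandas as pd

def encode_target(species: pd.Series) -> pd.Series:
    """Map each penguin species name to an integer code."""
    mapping = {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2}
    return species.map(mapping)

y = encode_target(pd.Series(["Adelie", "Gentoo", "Adelie"]))
print(y.tolist())  # -> [0, 2, 0]
```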

Let us now jump into the column_transformer part

In our example dataset we are going to impute constant values for the nominal variables (island and sex) before passing them on to one-hot encoding, and use a SimpleImputer (mean) for the numerical variables.

In the above scenario, what is inside imp_ohe?

We have built a pipeline of transformers which first imputes missing values and then applies one-hot encoding. Now we take that pipeline and add it to make_column_transformer, applied to the columns where we need to impute missing values and one-hot encode.

In the dataset given above, the column sex has missing values but island does not. It does not matter that the imputing pipeline is applied to both: nothing will happen to the island column since it has no missing values, and our objective here is to one-hot encode both nominal columns.

Here, for the above make_column_transformer, the remainder parameter has the following options: 'drop' (the default — untransformed columns are dropped), 'passthrough' (untransformed columns are kept as-is), or an estimator to be applied to the remaining columns.

Pro tip for OneHotEncoder's handle_unknown parameter: set handle_unknown='ignore' so that a category unseen during training does not raise an error at prediction time.

Even so, it is always better to retrain your model when the data includes a new category.
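A quick sketch of the behaviour, on a toy column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"island": ["Biscoe", "Dream"]})
test = pd.DataFrame({"island": ["Torgersen"]})  # category unseen in training

# handle_unknown="ignore" encodes an unseen category as an all-zero row
# instead of raising an error
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)
print(ohe.transform(test).toarray())  # -> [[0. 0.]]
```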

Another point to note is that you can also select columns via slicing, a regex pattern, or by data type:

For example, a slice can select the numerical columns, and the resulting column selection (col_sel) can be utilized in our column transformer; a regex pattern on the column names works the same way. These options are worth trying out yourself.
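A sketch of those selection options using make_column_selector (available in sklearn ≥ 0.22; the column names below are toy stand-ins):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "island": ["Biscoe", "Dream"],
    "sex": ["male", "female"],
    "bill_length_mm": [39.1, 40.2],
})

# Select columns by data type ...
num_sel = make_column_selector(dtype_include=np.number)
# ... or by a regex pattern on the column names
cat_sel = make_column_selector(pattern="island|sex")

ct = make_column_transformer(
    (StandardScaler(), num_sel),
    (OneHotEncoder(), cat_sel),
)

print(num_sel(X))  # -> ['bill_length_mm']
```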

These steps are followed in most pipelines.

When to go for fit() and fit_transform():

Generally speaking

i) when a pipeline ends in a model, fit() is used to fit the pipeline and predict() is used to predict on unknown data;

ii) when a pipeline ends with a transformer, it is recommended to use fit_transform() on training data and only transform() on test or unknown data.

To conclude:

When a pipeline ends with a model use pipe.fit(); use pipe.fit_transform() when the pipeline ends with a transformer.
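Both cases side by side, on tiny toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_train = [0, 0, 1, 1]
X_test = np.array([[3.0]])

# Pipeline ending in a model: fit() on train, then predict() on new data
model_pipe = make_pipeline(SimpleImputer(), LogisticRegression())
model_pipe.fit(X_train, y_train)
pred = model_pipe.predict(X_test)

# Pipeline ending in a transformer: fit_transform() on train,
# transform() on test (reusing statistics learned from the training set)
prep_pipe = make_pipeline(SimpleImputer(), StandardScaler())
Xt_train = prep_pipe.fit_transform(X_train)
Xt_test = prep_pipe.transform(X_test)
```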

If you want to look into the objects inside pipe, you can just type pipe and see what it consists of.
Let us separate the numerical and categorical columns to determine the important features.

Till now we have seen how to build and work with pipelines and column transformers, but if anybody wants to look into what a model pipeline is made of, a diagrammatic representation really helps the end user understand.


Let’s take a look at what is inside !! ⊙▃⊙

The following module is available in the version 0.23.2 of sklearn

It will show every step that our pipeline goes through, and when we click on SimpleImputer, OneHotEncoder, or LogisticRegression we see the full definition and its parameters.
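The diagram is switched on via set_config (available from sklearn 0.23); a minimal sketch:

```python
from sklearn import get_config, set_config
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Render estimators as an interactive HTML diagram instead of plain text
set_config(display="diagram")

pipe = make_pipeline(SimpleImputer(), LogisticRegression())
pipe  # in a notebook, this now renders as a clickable diagram
```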

We can also see the important variables or features by making use of a library named eli5.

The above code will give the important features for the pipeline that we have built

Grid Search CV and cross validation to evaluate our model:

Like any other ML objects, GridSearchCV and cross-validation can also be performed on a pipeline. Let us see how to proceed:

# For parameters to be used in GridSearchCV:
# Create a dictionary of all the parameter options

# Note that: you can access the parameters of a pipeline's steps by using '__'
Note the key points for parameter tuning before feeding the pipeline into GridSearchCV.
Printing the best parameters from GridSearchCV.
The pipeline can also be used directly in cross_val_score (since our target is multi-class, we have not specified any scoring).
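The `__` naming convention in a parameter grid looks like this (toy data; the step names come from make_pipeline's lowercased class names):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0]])
y = [0, 0, 0, 1, 1, 1]

pipe = make_pipeline(SimpleImputer(), LogisticRegression())

# Step parameters are addressed as "<step name>__<parameter name>"
params = {
    "simpleimputer__strategy": ["mean", "median"],
    "logisticregression__C": [0.1, 1.0],
}
grid = GridSearchCV(pipe, params, cv=2)
grid.fit(X, y)
print(grid.best_params_)
```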

How to export this model or save it using pickle libraries:

Comparing pickle and joblib, I would recommend joblib for scikit-learn objects (it is more efficient for sklearn objects, which often carry large numpy arrays).

Instead of X_test, any new dataset with the same structure as X_train can be passed.

The saved .joblib file can be loaded in any other script to predict the species of a penguin from the other column values, i.e., to predict on an entirely new dataset.
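A sketch of the save-and-reload round trip on toy data (the file path here is arbitrary):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = [0, 0, 1, 1]

pipe = make_pipeline(SimpleImputer(), LogisticRegression())
pipe.fit(X, y)

# Persist the whole fitted pipeline (preprocessing + model) in one file
path = os.path.join(tempfile.gettempdir(), "pipe.joblib")
joblib.dump(pipe, path)

# Later, in another script: load and predict on new data of the same shape
loaded = joblib.load(path)
print(loaded.predict(np.array([[3.0]])))
```

Because the preprocessing travels with the model, the new data needs no manual transformation before predict().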

What we have seen so far is depicted below in an illustrative diagram.

Quick overview of imblearn pipeline:

There is also another library which is available for pipeline (imblearn)

A question arises: we already have the sklearn pipeline, so why go for imblearn?

There is a restriction with the sklearn pipeline: it only allows each row to be transformed into another row (with different or added features). If you want to change the number of rows (over-sample, under-sample, or SMOTE), that is not possible with the sklearn pipeline.

To over-sample, we need to increase the number of rows. Imbalanced-learn generalizes the pipeline but keeps the syntax and function names the same:

from imblearn.pipeline import Pipeline, make_pipeline

The imblearn package contains many different samplers for easy over- or under-sampling of data. These samplers cannot be placed in a standard sklearn pipeline. To allow using a pipeline with these samplers, the imblearn package implements an extended pipeline, very similar to the sklearn one but with the addition of allowing samplers.

So in short — if you want to include samplers in the pipeline then go for imblearn pipeline else stick to sklearn one.

Note that make_pipeline is just a convenient method to create a new pipeline and the difference here is actually with the pipelines themselves.

To Conclude:

a) A pipeline in general can have multiple steps.

b) It cannot have multiple algorithms; only one algorithm (the final estimator) is allowed inside a pipeline.

c) Grid search, cross-validation and hyperparameter tuning of a pipeline can be done, but you cannot include those items inside a pipeline.

You can find the whole code base which is used in this article here — Github repo and the dataset is also available there.


Hope this article helps you to get an overall understanding about Pipeline and its capabilities.

If you like this article, don’t forget to give a clap and do let me know your views on pipeline in the comments.

Happy Learning!!! ◕ ◡ ◕
