How to use pipeline in Python — Basic version

Mehrdad Hassanzadeh
10 min read · May 25, 2022


When you want to build any machine learning model, you go through a series of steps: pre-processing, feature selection, feature engineering, training the model, and finally using the trained model to predict the target value of unseen data.

In the most basic approach, you can manually apply the same transformations to the test set that you applied to the training set, but this quickly leads to messy code that is hard to follow. Pipelines let us apply the transformations fitted on the training set to the test set very easily, and they also help us avoid data leakage. Data leakage happens when a transformation is fitted on the whole dataset before it is split into training and test sets: the final model then indirectly learns information from the test set, which should be prevented because in practice we do not have access to unseen data.

In this article we will walk through the simplest example of pre-processing data with pipelines. First we will implement the process in the basic way, without a pipeline, and then compare it with the version that uses a pipeline.

Dataset

For our example we will work on the Wine Quality data, where the goal is to predict the quality of a wine given different features. The dataset can be found below:

The dataset was designed for multiclass classification, but since we want to keep everything simple we will turn the task into a binary classification problem, as we will see in the following steps.

Note: Throughout the article we show pictures of the code we have implemented. At the end of the article we share the .ipynb file where you can access the whole code and try it on your own.

Useful packages

Below you can see all of the packages used in our code. For any package you don't already have installed, we have provided installation instructions in the comments.

#!pip install pandas 
import pandas as pd

#!pip install -U matplotlib
import matplotlib.pyplot as plt

#!pip install numpy
import numpy as np

#!pip install seaborn
import seaborn as sns

#!pip install -U scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

#!pip install missingno
import missingno as msno

In case you haven’t already installed a package, just uncomment the corresponding line and run the code.

Data loading and transformation

In this section we will read the data and get to know it a little better. We will also apply some transformations to make the data suitable for our example.

Data loading and checking the data
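
Since the original screenshot is not reproduced here, this step might look roughly like the following sketch (the file name WineQT.csv and the variable name df are assumptions):

df = pd.read_csv('WineQT.csv')   # load the wine quality data
print(df.shape)                  # number of records and columns
df.head()                        # a quick look at the first rows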

As we can see above, we have 1143 records with 13 columns. The target variable (the value we want to predict) is quality, which refers to the quality of the wine.

We can see that there is a feature named ‘Id’ that is unique for every record, so we can safely remove it from our data. Then we will look at the data types of the features in our dataset.

Dropping the feature ‘Id’ and checking the data types
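
A minimal sketch of this step (the variable names are assumptions):

df = df.drop(columns=['Id'])   # 'Id' is unique for every record, so it carries no information
df.dtypes                      # check the data type of each remaining feature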

We can see that we only have numeric features in our data.

In the next step we get some information about the distribution of the features of our dataset.

Data distribution
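
In code this is typically a single call, roughly:

df.describe()   # min, max, mean and quartiles of every numeric feature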

What is really important for us here is the range of ‘quality’ (the target variable), which can take the values [3, 4, 5, 6, 7, 8]. This leads to a multiclass classification problem, and as we want to keep everything simple we will convert it into a binary classification. How do we do that?

Let’s see. We will define a new quality based on the original values: wines with a quality of 3, 4 or 5 are labelled as low-quality wines and get 0 as their quality, and the others are labelled as high-quality wines and get 1 as their quality.

From multiclass to binary classification
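
A possible way to implement this binarization (the exact code in the original screenshot may differ):

# quality 3, 4 or 5 -> 0 (low quality), quality 6, 7 or 8 -> 1 (high quality)
df['quality'] = (df['quality'] > 5).astype(int)
df['quality'].value_counts()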

Now we can see that the target variable has only two values, 0 and 1.

Then we will check whether there are any duplicated records in the dataset. Duplicated records cause the model to give more weight to those records during training and skew the performance metrics during testing. The typical approach is to remove them from the data, which we do in the following way:

Handling duplicated records
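
Roughly, the de-duplication step could look like this (a sketch, not the exact original code):

print(df.duplicated().sum())   # number of duplicated records
df = df.drop_duplicates()      # keep only the first occurrence of each record
print(df.shape)                # 1018 records remain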

After removing the duplicated values we are left with 1018 records in our dataset.

As the next step we will check if we have any missing data in the dataset:

Check for missing values
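
A one-line sketch of this check:

df.isnull().sum()   # count of missing values per column — all zeros at this point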

We can see that there are no missing values in the dataset.
Note: Since we also want to show how to handle missing values in the pipeline, we will introduce some missing values ourselves. To be precise, we will introduce 1% missing values in each feature of the dataset (except the target variable quality). You can check the procedure below:

Introducing some missing values to the data
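
One way to inject the missing values (a sketch; the random seed and the use of numpy's default_rng are assumptions):

rng = np.random.default_rng(42)                       # assumed seed, only for reproducibility
for col in df.columns.drop('quality'):
    idx = rng.choice(df.index, size=int(0.01 * len(df)), replace=False)
    df.loc[idx, col] = np.nan                         # blank out roughly 1% of each feature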

We will also visualize the distribution of the missing values in each feature of the data:

Missing values distribution
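
With missingno this is a single call (whether the original used the matrix or the bar plot is an assumption):

msno.matrix(df)   # visualize where the missing values fall in each feature
plt.show()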

Now we can confirm that missing values are present in the dataset.

The next step is to split the data into two sets, a training set and a test set. We will keep 70% of the data for training and 30% for testing. We will do so with the following code:

Data splitting
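
A sketch of the split (the random_state value is an assumption):

X = df.drop(columns=['quality'])
y = df['quality']

# 70% of the records for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)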

Now we are ready to do the pre-processing and model fitting steps once without pipeline and once with using pipeline.

Pre-processing and model fitting — Without pipeline (basic approach)

First we want to highlight again that we want to keep everything simple, because the main idea is to explain how to use pipelines when fitting a model. Here are the steps that we will follow, in order:

Pre-processing

1. We will handle the missing values by replacing each missing value with the most frequent value of its feature.

2. We will scale the data using MinMaxScaler, which makes sure the values of each feature fall in the interval [0, 1]. (We scale the data to give the same importance to each feature in the dataset; we will write another article about why this is an important step!)

Model fitting

3. We will build a model using the Logistic Regression algorithm, which gives us a linear model to separate the data into two classes. (Again, we keep everything simple.)

Note: The pre-processing steps should be applied once to the training set and once to the test set, using the same parameters learnt during training.

Training set pre-processing

In the pre-processing of the training set we will first handle the missing values and afterwards scale the data.

Handling missing values

We will use a SimpleImputer object offered by sklearn to replace the missing values in each feature with the most frequent value of that feature. Here is how we have done it:

Handling missing values — Training set
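
The imputation might look like this (the article refers to the object as imp; the other variable names are assumptions):

imp = SimpleImputer(strategy='most_frequent')          # learns the most frequent value of each feature
X_train_imputed = pd.DataFrame(imp.fit_transform(X_train), columns=X_train.columns)
X_train_imputed.isnull().sum()                         # no missing values left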

Now we don’t have any missing values in the training set.

Note: Keep in mind that the object imp (our imputer) has learnt the most frequent value of each feature of the training data. The same parameters will be applied to the test data.

Data scaling

Now we are ready to scale our Training data. First we will look at how the data is distributed in each feature before scaling:

Training data distribution
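
This is again a single describe() call, roughly:

X_train_imputed.describe()   # note the very different ranges across features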

For instance, the minimum values of ‘fixed acidity’ and ‘citric acid’ are 4.6 and 0, and the maximum values are 15.9 and 1, respectively. So the features are not on the same scale, which would affect the final model. We can scale the values so they are defined on the same interval as follows:

Scaled training data distribution
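
A sketch of the scaling step (the variable names scaler and X_train_scaled are assumptions):

scaler = MinMaxScaler()                                # learns the per-feature minimum and maximum from the training data
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_imputed), columns=X_train_imputed.columns)
X_train_scaled.describe()                              # every feature now lies in [0, 1]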

Now we have scaled the values in each feature individually. You can confirm that in each feature the values are in the interval [0, 1].

Note: The MinMaxScaler() object we have created here learns the minimum and maximum value of each feature for future use. We will apply the same scaling when we scale the test set.

Model fitting

Now all the features have been scaled to be in the [0,1] interval. We are ready to build our model using our training set. We will do that as follows:

Model fitting using the training data
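
In code (a sketch; the variable name model is an assumption):

model = LogisticRegression()        # a simple linear classifier
model.fit(X_train_scaled, y_train)  # learn the model from the pre-processed training data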

Our model is ready. What remains is to apply the same transformations to the test set and use the model on it. We apply the transformations as follows:

Test set pre-processing
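
The key point is that we only call transform here, never fit (a sketch, continuing the names used above):

# reuse the parameters learnt on the training set — no fitting on the test set
X_test_imputed = pd.DataFrame(imp.transform(X_test), columns=X_test.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test_imputed), columns=X_test_imputed.columns)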

We have handled the missing values and also scaled the values in the test set. We are now ready to predict the target variable for the test data. Here you can see how the model performed on the test set:

Model result over test set
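
A sketch of the prediction and the confusion matrix (the confusion_matrix import is not in the package list above and is an assumption):

from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test_scaled)
print(confusion_matrix(y_test, y_pred))   # rows are actual classes, columns are predicted classes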

Above you can find the confusion matrix, which reports how well the predictions agree with the actual target values of the test set. Below you can find further metrics:

Model report over test set
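
The report is produced with classification_report, roughly:

print(classification_report(y_test, y_pred))   # precision, recall, f1-score and accuracy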

We were able to get 71% accuracy on the test set using our simple model.

Pre-processing and model fitting — Using pipeline

So far we have seen how to build a model with the very basic approach. Now it’s time to see how we can apply the same transformations to the data using a pipeline.

Building the pipeline

First we should build the pipeline and specify the steps the data should go through. Here you can find the same transformations and model used in the previous approach:

Building the pipeline
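
A sketch of the pipeline definition (the step names and the variable name pipe are assumptions):

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),   # step 1: fill missing values
    ('scaler', MinMaxScaler()),                              # step 2: scale every feature to [0, 1]
    ('model', LogisticRegression())                          # step 3: fit the classifier
])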

Now we are ready to fit our pipeline on our training set. What happens is that the whole training data goes through these steps one by one: the output of each step is passed to the next one, and the parameters are learnt at each step (e.g. the most frequent value of each feature, the minimum and maximum of each feature). At the end, the model is created. Here you can see how we fitted the pipeline:

Pipeline fitting
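
Fitting the whole chain is now a single call, roughly:

pipe.fit(X_train, y_train)   # each step is fitted in order on the training data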

Now that the parameters have been learnt and the model has been created, we can use the pipeline to make predictions on our test set. That’s how we do it:

Test set prediction
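
Prediction is also a single call, since the fitted pipeline applies every transformation for us (sketch):

y_pred_pipe = pipe.predict(X_test)                # the raw test data flows through the same fitted steps
print(classification_report(y_test, y_pred_pipe))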

We are ready to check how our final model has performed on the test set:

Model report over test set

We can see that we have again reached the same accuracy as before, which confirms that the pipeline reproduces the manual approach correctly.

You can find a Google Colab Notebook containing all of the codes that have been covered at the following link.

Note: Make sure you make a copy of the notebook and run it on your own account. Also, download the data from Kaggle and upload it to the notebook in advance.

We hope we were able to show you how pipelines should be used in Python. You can find another blog post, where we show how to write custom functions to include in a pipeline, at the following link:

Thanks for your attention. Leave us a comment to help us improve the content. See you soon!


Mehrdad Hassanzadeh

I’m currently a Master of Data Science student at Sapienza University, Rome, Italy