How to use pipelines in Python — Basic version
When you want to build a Machine Learning model you go through a number of steps: pre-processing, feature selection, feature engineering, and training the model, and then you use the trained model to predict the target value of unseen data.
In the most basic approach, you can manually apply the same transformations that were applied to the Training set to the Test set, but this leads to messy code that is sometimes hard to follow. With the help of pipelines, we can apply the transformations fitted on the Training set to the Test set quite easily, without data leakage. Data leakage happens when we fit our transformations on the full dataset and only then split it into Training and Test sets: the final model then learns some information from the Test set in advance, which should be prevented because in real life we have no access to the unseen data.
In this article we will show you the simplest example of pre-processing data using pipelines. First we will implement the process in a basic approach without a pipeline, and then compare it to the version that uses one.
Dataset
For our example we will work on the Wine Quality data, where the target is to predict the quality of a wine given different features. The dataset can be found below:
The dataset has been designed for multiclass classification, but as we want to keep everything simple we will turn the problem into a binary classification, as we will see in the following steps.
Note: Here we will show pictures of the code that we have implemented. At the end of the article we will share the .ipynb file where you can access the whole code and try it on your own.
Useful packages
Below you can see all of the packages used in our code. If you don't have a package installed yet, the installation instructions are provided in the comments.
#!pip install pandas
import pandas as pd
#!pip install -U matplotlib
import matplotlib.pyplot as plt
#!pip install numpy
import numpy as np
#!pip install seaborn
import seaborn as sns
#!pip install -U scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
#!pip install missingno
import missingno as msno
In case you haven't installed the packages yet, just uncomment the corresponding lines and run the code.
Data loading and transformation
In this section we will read the data and get to know it a little better. We will also apply some transformations to make the data suitable for our example.
As we can see above, we have 1143 records with 13 columns. The target variable (the value that we want to predict) is quality, which refers to the quality of the wine.
We can see that there is a feature named 'Id' in the data that is unique for each record, so we can safely remove it. Then we shall look at the data types of the features in our dataset.
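A minimal sketch of these first steps. The toy frame below stands in for the real data so the snippet runs on its own; in the real notebook you would load the Kaggle CSV instead (the filename WineQT.csv is an assumption):

```python
import pandas as pd

# In the real notebook the data is loaded with something like:
#   df = pd.read_csv("WineQT.csv")   # filename assumed
# A toy frame with the same kind of columns stands in here:
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.4],
    "quality": [5, 5, 6],
    "Id": [0, 1, 2],
})

print(df.shape)               # (number of records, number of columns)
df = df.drop(columns=["Id"])  # 'Id' is unique per record, so it carries no signal
print(df.dtypes)              # all remaining features are numeric
```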
We can see that we only have numeric features in our data.
In the next step we get some information about the distribution of the features of our dataset.
What is really important for us here is the range of 'quality' (the target variable), which can take the values [3, 4, 5, 6, 7, 8]. This makes the problem a multiclass classification, and as we want to keep everything simple we will convert it into a binary classification. How do we do that?
Let's see. We will define a new quality based on the old values: wines with a quality of 3, 4 or 5 are labelled low-quality wines and get 0 as their quality, and the others are labelled high-quality wines and get 1.
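The relabelling above can be sketched in one line; the toy target column below is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"quality": [3, 4, 5, 6, 7, 8]})  # toy target column

# quality in {3, 4, 5} -> 0 (low quality); quality in {6, 7, 8} -> 1 (high quality)
df["quality"] = (df["quality"] >= 6).astype(int)
print(df["quality"].unique())
```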
Now we can see that the target variable has only two values, 0 and 1.
Then we will check whether there are any duplicated records in the dataset. Duplicated records cause the model to give more weight to those records during training and will skew the performance metrics during testing. The typical way to handle duplicated records is to remove them from the data, which we do as follows:
After removing the duplicates we are left with 1018 records in our dataset.
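A sketch of the duplicate check and removal, on a toy frame that contains one exact duplicate:

```python
import pandas as pd

# Toy frame in which rows 0 and 1 are exact copies
df = pd.DataFrame({"fixed acidity": [7.4, 7.4, 7.8],
                   "quality": [0, 0, 1]})

print(df.duplicated().sum())  # how many rows are exact copies of an earlier row
df = df.drop_duplicates()     # keep only the first occurrence of each record
print(len(df))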
As the next step we will check if we have any missing data in the dataset:
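The missing-value check is a one-liner; the toy frame is only there so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({"fixed acidity": [7.4, 7.8],
                   "quality": [0, 1]})

# Number of missing values per column; all zeros means no missing data
print(df.isnull().sum())
```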
We can see that there are no missing values in the dataset.
Note: As we also want to show you how to handle missing values in a pipeline, we will introduce some missing values into the data ourselves. To be precise, we will introduce 1% missing values in each feature of the dataset (except the target variable quality). You can check the procedure below:
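One way to inject the missing values, sketched on a toy frame; the feature names 'a', 'b', 'c' are placeholders for the real wine features, and the exact sampling method in the notebook may differ:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy frame: 1000 records, three hypothetical feature columns plus the target
df = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])
df["quality"] = rng.integers(0, 2, 1000)

# Blank out 1% of the values in every feature column, leaving the target intact
for col in df.columns.drop("quality"):
    idx = rng.choice(df.index, size=int(0.01 * len(df)), replace=False)
    df.loc[idx, col] = np.nan

print(df.isnull().sum())
```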
We will also visualize the distribution of the missing values across the features:
Now we have made sure that missing values are present in the dataset.
The next step is to split the data into two sets, a Training set and a Test set. We will keep 70% of the data for training and 30% for testing, with the following code:
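The split can be done with sklearn's train_test_split; the toy data and the random_state value are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 2)), columns=["a", "b"])  # toy features
df["quality"] = rng.integers(0, 2, 100)                      # toy binary target

X = df.drop(columns=["quality"])
y = df["quality"]

# 70% of the records for training, 30% for testing;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)
```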
Now we are ready to do the pre-processing and model fitting steps once without pipeline and once with using pipeline.
Pre-processing and model fitting — Without pipeline (basic approach)
First we want to highlight again that we keep everything simple, because the main idea is to explain how to use pipelines when fitting a model. Here are the steps we will follow, in order:
Pre-processing
1. We will handle the missing values by replacing each missing value with the most frequent value of its feature.
2. We will scale the data using MinMaxScaler, which makes sure the values of each feature lie in the interval [0, 1]. (We scale the data to give the same importance to each feature; we will write another article about why this is an important step!)
Model fitting
3. We will build a model using the Logistic Regression algorithm, which gives us a linear model that separates the data into the two classes. (Again, we keep everything simple.)
Note: the pre-processing step should be done once for the Training set and once for the Test set, using the same parameters learnt during training.
Training set pre-processing
In the pre-processing of the Training set we will first handle the missing values and afterwards scale the data.
Handling missing values
We will use a SimpleImputer object offered by sklearn to replace the missing values in each feature with that feature's most common value. Here is how we have done it:
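A self-contained sketch of the imputation step; the toy columns 'a' and 'b' stand in for the wine features:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training features with a couple of missing values
X_train = pd.DataFrame({"a": [1.0, 1.0, np.nan, 2.0],
                        "b": [np.nan, 3.0, 3.0, 4.0]})

# Replace each NaN with the most frequent value of its column
imp = SimpleImputer(strategy="most_frequent")
X_train = pd.DataFrame(imp.fit_transform(X_train), columns=X_train.columns)

print(X_train)
```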
Now we don’t have any missing values in the training set.
Note: keep in mind that the object imp (our imputer) has learnt the most frequent value of each feature of our Training data. The same parameters will be applied to the Test data.
Data scaling
Now we are ready to scale our Training data. First let's look at how the data is distributed in each feature before scaling:
For instance, the minimum values of 'fixed acidity' and 'citric acid' are 4.6 and 0, and the maximum values are 15.9 and 1 respectively. So the features are not on the same scale, which would affect the final model. We can scale the values so that they are defined on the same interval as follows:
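A sketch of the scaling step, using three toy rows whose ranges echo the values quoted above:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy columns on very different scales
X_train = pd.DataFrame({"fixed acidity": [4.6, 10.0, 15.9],
                        "citric acid": [0.0, 0.5, 1.0]})

# Rescale every feature to the [0, 1] interval
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)

print(X_train.min().tolist(), X_train.max().tolist())
```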
Now we have scaled the values in each feature individually. You can confirm that in each feature the values are in the interval [0, 1].
Note: the MinMaxScaler() object we have created here learns the minimum and maximum value of each feature for future use. We will apply the same scaling when we scale the Test set.
Model fitting
Now all the features have been scaled to be in the [0,1] interval. We are ready to build our model using our training set. We will do that as follows:
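The fitting step itself is short; the toy data below (already in [0, 1], with a made-up target rule) only stands in for the prepared training set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.random((70, 2))                 # toy features already in [0, 1]
y_train = (X_train[:, 0] > 0.5).astype(int)   # toy binary target

model = LogisticRegression()
model.fit(X_train, y_train)

print(model.score(X_train, y_train))  # training accuracy
```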
Our model is ready. What remains is to apply the same transformations to the Test set and use the model on it. We apply the transformations as follows:
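The key point, sketched on a single toy feature: the imputer and scaler are fitted on the Training set only, and the Test set is only passed through transform:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

X_train = pd.DataFrame({"a": [1.0, 1.0, 2.0, 4.0]})  # toy training feature
X_test = pd.DataFrame({"a": [np.nan, 3.0]})          # toy test feature

# Fit on the training set only...
imp = SimpleImputer(strategy="most_frequent").fit(X_train)
scaler = MinMaxScaler().fit(imp.transform(X_train))

# ...then reuse the learnt parameters on the test set (transform, never fit again)
X_test_prepared = scaler.transform(imp.transform(X_test))
print(X_test_prepared)
```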
We have handled the missing values and also scaled the values in the Test set. We are ready to predict the target variable for the Test data. Here you can see how the model performed on the Test set:
Above you can find the confusion matrix, which reports how well the predictions agree with the actual target values of the Test set. Below you can find different reports:
We were able to get 71% accuracy with our simple model on the Test set.
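A sketch of the evaluation, on toy data (already in [0, 1], so imputation and scaling are skipped here for brevity). Note that confusion_matrix is not in the import list at the top of the article; add it if you follow along:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

rng = np.random.default_rng(0)
X_train = rng.random((70, 2))                 # toy features, already in [0, 1]
y_train = (X_train[:, 0] > 0.5).astype(int)   # toy binary target
X_test = rng.random((30, 2))
y_test = (X_test[:, 0] > 0.5).astype(int)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))       # counts of right/wrong predictions
print(classification_report(y_test, y_pred))  # precision, recall, f1, accuracy
```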
Pre-processing and model fitting — Using pipeline
So far we have seen how to build a model in a very basic way. Now it's time to see how we can apply the same transformations to the data using a pipeline.
Building the pipeline
First we build the pipeline and declare the steps the data should go through. Here you can find the same transformations and model that were used in the previous approach:
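A sketch of the pipeline definition; the step names ('imputer', 'scaler', 'model') are labels we chose ourselves and may differ from the notebook:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# The same three steps as before, chained into a single object
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])

print(pipe.steps)
```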
Now we are ready to fit our pipeline on our Training set. The whole training data will go through these steps one by one: the output of one step is given to the next, and the parameters are learnt in each step (e.g. the most frequent value, minimum and maximum of each feature). At the end the model is created. Here you can find how we have fitted the pipeline:
Now that the parameters have been learnt and the model has been created, we can use the pipeline to make predictions on our Test set. That's how we do it:
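Fitting and predicting with the pipeline, again sketched on toy data (the injected NaN shows the imputer step doing its job):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.random((70, 2))                 # toy features
y_train = (X_train[:, 0] > 0.5).astype(int)   # toy binary target
X_test = rng.random((30, 2))
X_test[0, 0] = np.nan                         # the imputer step will handle this

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])

pipe.fit(X_train, y_train)     # runs every step, in order, on the training data
y_pred = pipe.predict(X_test)  # pushes the test data through the same fitted steps
print(y_pred[:5])
```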
We are ready to check how our final model has performed on the test set:
We can see that we have reached the same accuracy as before, which confirms that we have done everything correctly.
You can find a Google Colab Notebook containing all of the codes that have been covered at the following link.
Note: make sure you make a copy of the notebook and run it in your own account. Also download the data from Kaggle and upload it in advance.
We hope we were able to show you how pipelines should be used in Python. You can find another post, where we show how to write custom functions to include in a pipeline, at the following link:
Thanks for your attention. Leave us comments to help us improve the content. See you soon!