A Simple Guide to Scikit-learn Pipelines
Learn how to use pipelines in a scikit-learn machine learning workflow
In most machine learning projects, the data you have to work with is unlikely to be in the ideal format for producing the best-performing model. There are often a number of transformation steps, such as encoding categorical variables, feature scaling and normalisation, that need to be performed. Scikit-learn has built-in functions for most of these commonly used transformations in its preprocessing package.
However, in a typical machine learning workflow you will need to apply all these transformations at least twice: once when training the model, and again on any new data you want to predict on. Of course you could write a function to apply the transformations and reuse it, but you would still need to run this first and then call the model separately. Scikit-learn pipelines are a tool to simplify this process. They have several key benefits:
- They make your workflow much easier to read and understand.
- They enforce the implementation and order of steps in your project.
- These in turn make your work much more reproducible.
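To make this concrete, here is a minimal sketch of the idea. The scaler and estimator chosen here are illustrative placeholders, not the steps used later in this post; the point is that a pipeline bundles the transformations and the model into a single object with one fit and one predict.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# A pipeline chains transformation steps with a final estimator
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# pipe.fit(X_train, y_train) fits the scaler, transforms the data and
# fits the classifier; pipe.predict(X_new) reapplies the fitted scaler
# automatically before predicting, so nothing has to be rerun by hand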
In the following post I am going to use a dataset, taken from Analytics Vidhya’s loan prediction practice problem, to describe how pipelines work and how to implement them.
Transformers
First I have imported the train and test files into a Jupyter notebook. I have dropped the ‘Loan_ID’ column as this will not be needed in training or prediction. I have used the pandas dtypes attribute to get a little information about the dataset.
import pandas as pd

# Load the training and test files
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Loan_ID is an identifier rather than a feature, so drop it
train = train.drop('Loan_ID', axis=1)
# Inspect the column data types
train.dtypes
I can see that I have both categorical and numeric variables, so as a minimum I am going to have to apply a one-hot encoding transformation and some sort of scaler. I am going to use a scikit-learn pipeline to perform those transformations, and at the same time apply the fit method.
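As a rough sketch of how those two transformations can be wired together (selecting columns by dtype with make_column_selector is my assumption here, not necessarily how the final pipeline is built):

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Route numeric columns to a scaler and categorical (object) columns
# to a one-hot encoder, picking each group by dtype
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(),
     make_column_selector(dtype_include='number')),
    ('cat', OneHotEncoder(handle_unknown='ignore'),
     make_column_selector(dtype_include='object'))
])

# This preprocessor can then sit as the first step of a full Pipeline,
# so a single call to fit runs every transformation in order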
Before building the pipeline I am splitting the training data into a train and test set so that I can validate the performance of the…
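That split might look something like the sketch below, assuming ‘Loan_Status’ is the name of the target column in this dataset (check your copy of the data):

from sklearn.model_selection import train_test_split

# 'Loan_Status' as the target column is an assumption; rename this to
# match the actual label column in the data
X = train.drop('Loan_Status', axis=1)
y = train['Loan_Status']

# Hold back a portion of the training data to evaluate the fitted
# pipeline on unseen examples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)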