A Simple Guide to Scikit-learn Pipelines

Learn how to use pipelines in a scikit-learn machine learning workflow

Rebecca Vickery
vickdata

--

In most machine learning projects the data that you have to work with is unlikely to be in the ideal format for producing the best performing model. There are quite often a number of transformational steps such as encoding categorical variables, feature scaling and normalisation that need to be performed. Scikit-learn has built in functions for most of these commonly used transformations in it’s preprocessing package.

However, in a typical machine learning workflow you will need to apply all these transformations at least twice. Once when training the model and again on any new data you want to predict on. Of course you could write a function to apply them and reuse that but you would still need to run this first and then call the model separately. Scikit-learn pipelines are a tool to simplify this process. They have several key benefits:

  • They make your workflow much easier to read and understand.
  • They enforce the implementation and order of steps in your project.
  • These in turn make your work much more reproducible.

In the following post I am going to use a data set, taken from Analytics Vidhya’s loan prediction practice problem, to…

--

--