Pipelines in Scikit-Learn

Dr. Roi Yehoshua
Published in AI Made Simple
6 min read · Mar 2, 2023

This article covers pipelines, a very useful feature of the Scikit-Learn library that can streamline your machine learning workflow and make your models easier to maintain and deploy.

A pipeline defines a chain of transformations that are applied to your data set sequentially, where the last step in the chain is your machine learning model (e.g., your classifier or regressor).
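As a minimal sketch of this idea, the snippet below chains a single transformer with a classifier. The synthetic dataset, the step names ("scaler", "clf") and the choice of StandardScaler and LogisticRegression are assumptions made for illustration only, not something prescribed by the article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration (an assumption of this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, estimator) pair. Every step except the last must be
# a transformer; the last step is the model itself.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# The pipeline behaves like a single estimator.
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
accuracy = pipe.score(X_test, y_test)

# Individual steps stay accessible by name.
print(pipe.named_steps["clf"])
print(f"Test accuracy: {accuracy:.3f}")
```

Because the pipeline exposes the usual fit(), predict() and score() methods, it can be passed anywhere Scikit-Learn expects a plain estimator.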

[Figure: Scikit-Learn Pipeline]

Using a pipeline to define your model has several advantages:

  1. It allows you to keep all the definitions and components of your model in one place, which makes it easier to reuse the model or change it in the future.
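  2. You can run grid search and cross-validation over all the steps of the model together, so the preprocessing steps are refit on the training portion of each fold and their hyperparameters can be tuned alongside those of the model.
  3. The pipeline automatically applies the right operations to the training and the test sets. In the training phase, fit() calls the fit_transform() method of every transformer and then fit() on the final estimator, while in the prediction/test phase, predict() calls only the transform() method of every transformer before calling predict() on the final estimator.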
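The last two points can be sketched as follows. The synthetic regression data, the step names and the parameter grid are assumptions chosen for illustration. The first part checks that the pipeline gives the same predictions as calling fit_transform() and transform() by hand; the second part tunes a preprocessing parameter and a model parameter in one grid search.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration (an assumption of this sketch).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])

# Point 3: fit() runs fit_transform() on the scaler and then fit() on the
# model; predict() runs transform() on the scaler and then predict().
pipe.fit(X_train, y_train)
scaler, model = StandardScaler(), Ridge()
model.fit(scaler.fit_transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_test))
same = np.allclose(pipe.predict(X_test), manual_pred)

# Point 2: parameters of any step are addressed as <step name>__<parameter>,
# and the whole pipeline is refit on the training part of every CV fold.
param_grid = {
    "scaler__with_mean": [True, False],
    "model__alpha": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(same)
print(search.best_params_)
print(search.score(X_test, y_test))
```

Because the scaler is fitted inside each fold, statistics from the validation part of a fold never leak into training, which is hard to guarantee when the data is scaled once before cross-validation.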

Building a Simple Pipeline

Let’s build a regression model for the California housing dataset available in Scikit-Learn. The goal in this dataset is to predict the median house value of a given district (house block) in California, based on 8 different features of that district (such as…
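One possible starting point is sketched below. The choice of StandardScaler followed by LinearRegression, the step names and the two helper functions are assumptions of this sketch and may differ from the model built in the rest of the article. The dataset is loaded with fetch_california_housing(), which downloads the data on first use.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline():
    """Hypothetical helper: scale the features, then fit a linear regressor."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("reg", LinearRegression()),
    ])


def evaluate_on_california_housing(random_state=0):
    """Hypothetical helper: fit on a train split, return the R^2 test score."""
    # X holds the 8 features per district, y the median house value.
    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state
    )
    pipe = build_pipeline()
    pipe.fit(X_train, y_train)
    return pipe.score(X_test, y_test)
```

Calling evaluate_on_california_housing() fits the pipeline on the training split and returns its R² score on the held-out split.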

Dr. Roi Yehoshua
Teaching Professor for Data Science and ML at Northeastern University | Top Writer in AI | 200K+ Views on Medium | https://www.linkedin.com/in/roi-yehoshua/