Machine Learning: Models to Production

Part 1 — Build your own Sklearn Pipeline

Ashutosh Kumar
Analytics Vidhya
6 min read · Sep 27, 2019


This is the first part of a multi-part series on how to build machine learning models using Sklearn Pipelines, convert them to packages, and deploy the model in a production environment. There are many ways to do this; the approach presented here is just one of many.

What are Scikit-learn Pipelines?

Pipelines are one way of implementing procedural programming. In the procedural programming paradigm, procedures (functions or subroutines) are carried out as a series of computation steps.

There are many steps in building a machine learning model. The data is almost never clean, and you need to do some preprocessing (such as normalization, transformation, or feature engineering) to ensure that the speed and accuracy of your model are up to the mark. One way to implement this procedurally is to write individual functions for each of the processes and call them in sequence for the training and testing datasets separately. Another way is to leverage the power of a Scikit-learn pipeline, which makes the process easier, reproducible, easy to understand, easy to debug, and enforceable (ensuring no step is missed). It also makes the model easier to deploy into a production environment.

Scikit-learn Pipeline

The Scikit-learn library in Python is a powerful library and one of the most used in machine learning. It provides an efficient implementation of a host of algorithms, ranging from data transformations and preprocessing to the entire suite of machine learning models. It is written in such a way that most of its algorithms share the same interface. This means that if you know the code to implement a Logistic Regression, you can run an SVM or a Decision Tree classifier by just changing the name of the classifier and a few parameters (more or less), and the code will run just fine. Scikit-learn is so well established that new packages in other libraries (like Keras) are designed with scikit-learn's functionality in mind.
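For instance, here is a minimal sketch of that interchangeability (using synthetic data purely for illustration):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, random_state=42)

    # Swapping the estimator is the only change needed; fit/predict stay the same
    for clf in (LogisticRegression(max_iter=1000), SVC(), DecisionTreeClassifier()):
        clf.fit(X, y)
        print(type(clf).__name__, clf.score(X, y))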

Overview of Code Examples:

Part A and Part B of the code examples are just a showcase of how this is done for a simple model, and the code will not run on its own. You will need to download the dataset and import the necessary libraries; the code snippets are expected to serve only as guidelines. Part C of the code is divided into two parts: building a prediction model using functions, and building a prediction model using a Sklearn pipeline with custom classes.

Part A: A very basic implementation of pipeline for feature engineering and prediction

Part B: Implementing multiple algorithms using pre-built pipelines for a quick model building

Part C:

  • Implementing a prediction algorithm using functions
  • Converting functions into classes to follow the OOP paradigm and building custom-made pipelines

Part A: Basic Pipeline Codes

This is just to showcase what a prediction model using pipelines looks like. There are more detailed explanations available on the internet.

Data: Sonar Mines Rocks Dataset [Source: UCI ML Repo (raw) and Kaggle (csv)]

I'm not going into the details of the data: it's a very simple classification dataset without any missing values or mixed data types, and you can run a crude model on it in just a few lines.
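A minimal sketch of such a model (assuming the CSV has been saved locally as sonar.csv, with no header row and the label in the last column):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Sonar data: 60 numeric features, label ('M' or 'R') in the last column
    df = pd.read_csv('sonar.csv', header=None)
    X, y = df.iloc[:, :-1], df.iloc[:, -1]
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=7)

    # Scaling and classification chained into a single estimator
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('clf', LogisticRegression(max_iter=1000))])
    pipe.fit(X_train, Y_train)
    print(pipe.score(X_test, Y_test))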

Another example: chaining the fit and predict methods together (source: Stack Overflow)

Classifier on word vectors: without pipelines
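A toy sketch of what the non-pipeline version might look like (a tiny made-up corpus stands in for real data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    docs = ["good movie", "bad movie", "great film", "terrible film"]
    labels = [1, 0, 1, 0]

    # Without a pipeline, the vectorizer and classifier are fitted as separate steps
    vect = TfidfVectorizer()
    X_vec = vect.fit_transform(docs)
    clf = LogisticRegression()
    clf.fit(X_vec, labels)

    # At prediction time the same transform has to be applied by hand
    print(clf.predict(vect.transform(["good film"])))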

Same code above, using pipelines
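And the equivalent sketch with a pipeline, where fit and predict handle both steps:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    docs = ["good movie", "bad movie", "great film", "terrible film"]
    labels = [1, 0, 1, 0]

    # The vectorizer and classifier are chained; one fit call handles both
    pipe = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LogisticRegression())])
    pipe.fit(docs, labels)
    print(pipe.predict(["good film"]))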

Part B: Joining Multiple Pipes

This is an extension of the codes above, and is sourced from the awesome blog by Jason Brownlee (machinelearningmastery.com; you should have a look at his blogs). This is a way of chaining multiple pipelines with different models together to quickly evaluate multiple algorithms in one shot.

Note: import the necessary libraries before running this on any dataset. Assume X_train, X_test, Y_train and Y_test come from the Sonar Mines Rocks dataset in Part A.
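A sketch in the spirit of that approach (the exact models and parameters in the original may differ):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # One scaler-plus-model pipeline per algorithm, all evaluated the same way
    pipelines = [
        ('ScaledLR', Pipeline([('Scaler', StandardScaler()),
                               ('LR', LogisticRegression(max_iter=1000))])),
        ('ScaledKNN', Pipeline([('Scaler', StandardScaler()),
                                ('KNN', KNeighborsClassifier())])),
        ('ScaledCART', Pipeline([('Scaler', StandardScaler()),
                                 ('CART', DecisionTreeClassifier())])),
        ('ScaledSVM', Pipeline([('Scaler', StandardScaler()),
                                ('SVM', SVC())])),
    ]
    for name, model in pipelines:
        kfold = KFold(n_splits=10, shuffle=True, random_state=7)
        results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
        print('%s: %.3f (%.3f)' % (name, results.mean(), results.std()))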

A great way to reduce your code and ensure that train and test follow the same procedures!

There is one problem with this approach: these are prebuilt functions and modules, and though they provide a level of flexibility in terms of defining the parameters, they do not allow you to modify the way these functions are run, or to do things in a different way.

Most of the time, in almost 99% of data science work, you have to write certain custom functions before your dataset can be fed to these pipelines. These include processes like imputing missing values, label encoding categorical variables, treating date variables correctly (converting dates to months, or taking the difference in days between two date columns), log-transforming (or otherwise transforming) certain features that are not Gaussian, dropping certain features, or any other preprocessing step that needs to run before you can call any model. Creating custom pipelines is key to doing this effectively.

Custom Built Scikit-learn Pipelines

Advantages:

  • Define the preprocessing the way you want, the way it should be done, since every dataset is different
  • Implemented in a robust Object Oriented way, so this approach is very structured
  • Handle exceptions in the data if and when they occur, and take necessary action
  • Ideal for production-grade code, and for converting the model into a package

One key advantage is obtained by breaking the entire code into different modules: one file for config variables, one for building pipelines for each preprocessing step, one for data import/export and saving/loading models, one main file for calling and running a pipeline to train and save the model, and one for running predictions on any data. This modular approach divides the entire code into chunks, and makes maintenance and debugging easy. Furthermore, if you want to add a new feature transformer, or modify something else, you can do it by modifying just one module without going through the entire code.

Types of Scikit-learn objects:

Transformers: classes that have a fit and transform method for transforming data

  • Examples: Scalers, feature selectors or onehot encoders

Predictor: classes that have fit and predict methods, used for prediction

  • Examples: ML algorithms like LogisticRegression, Lasso, SVC, etc.

Pipeline: class that runs transformers and predictors in a sequence

  • All steps should be transformers except for the last one
  • The last step should be a predictor

Part C — Code Example: Converting a function to a Sklearn class

Data: [Kaggle]: House Price Prediction. Visit the Kaggle page for data description and other details

Objective: Given a bunch of numerical, categorical and temporal features, predict the SalePrice of the house.

Challenges: The data has a lot of missing values, as well as different data types (numeric and categorical). Some features are skewed, so they need to be transformed, and categorical variables need to be encoded. More preprocessing could be applied; this is just a demo version of the code.

Example 1:

The function below takes as input a dataframe ('X') and a list of categorical features ('features'), and returns the dataframe with missing values replaced by 'Missing'. This function can be called on any dataframe to replace missing values in categorical variables.
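A minimal sketch of such a function (the name fill_categorical_na is illustrative):

    def fill_categorical_na(X, features):
        """Return a copy of X with NaNs in the given columns replaced by 'Missing'."""
        X = X.copy()
        for feature in features:
            X[feature] = X[feature].fillna('Missing')
        return X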

Scikit-learn Class for converting missing values in categorical data to ‘Missing’

BaseEstimator and TransformerMixin: classes from the sklearn.base module which, when inherited, enable the pipeline functionality (TransformerMixin supplies fit_transform, and BaseEstimator supplies the get_params/set_params plumbing)
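A sketch of what the class version could look like (the class name is illustrative):

    from sklearn.base import BaseEstimator, TransformerMixin

    class CategoricalImputer(BaseEstimator, TransformerMixin):
        """Replace missing values in the given categorical columns with 'Missing'."""

        def __init__(self, features):
            self.features = features

        def fit(self, X, y=None):
            # Nothing to learn here; fit exists so the class can sit inside a Pipeline
            return self

        def transform(self, X):
            X = X.copy()
            for feature in self.features:
                X[feature] = X[feature].fillna('Missing')
            return X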

Example 2:

Encoding categorical variables: a standard label encoder, with the labels assigned in order of the target variable's distribution for each category (for example, ordering categories by their mean target value)

Function:
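A sketch under that interpretation: categories are sorted by the mean of the target and mapped to increasing integers (the names here are illustrative, and y is assumed to be a named pandas Series such as SalePrice):

    import pandas as pd

    def encode_categorical(X, y, features):
        """Replace categories with integers; lower integers for lower mean target."""
        X = X.copy()
        temp = pd.concat([X, y], axis=1)
        for feature in features:
            # Sort categories by mean target value and map them to 0, 1, 2, ...
            ordered = temp.groupby(feature)[y.name].mean().sort_values().index
            X[feature] = X[feature].map({cat: i for i, cat in enumerate(ordered)})
        return X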

Sklearn Class:
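A sketch of the class version; note that the mapping is learned in fit (on training data only) and replayed in transform, something the plain function above cannot guarantee:

    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin

    class CategoricalEncoder(BaseEstimator, TransformerMixin):
        """Learn a target-ordered integer mapping in fit; apply it in transform."""

        def __init__(self, features):
            self.features = features

        def fit(self, X, y):
            temp = pd.concat([X, y], axis=1)
            self.mappings_ = {}
            for feature in self.features:
                ordered = temp.groupby(feature)[y.name].mean().sort_values().index
                self.mappings_[feature] = {cat: i for i, cat in enumerate(ordered)}
            return self

        def transform(self, X):
            X = X.copy()
            for feature in self.features:
                X[feature] = X[feature].map(self.mappings_[feature])
            return X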

Complete Code:

1. Config variables: for defining the variable lists. These can be obtained after running the model a first time, understanding the data and identifying features. Typically this is saved in a separate file (config.py)

2. Implementing functions — Data Processing and Prediction using Functions

3. Implementing Pipelines — Data Processing and Prediction using Pipelines

Config variables: common to both the functional code and the pipeline code
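A sketch of what the config file might contain (the exact feature lists depend on the data exploration; the ones below are an illustrative subset of the Kaggle columns):

    # config.py: variable lists for the house price model (illustrative selection)
    TARGET = 'SalePrice'

    # Categorical features with missing values, to be imputed with 'Missing'
    CATEGORICAL_VARS_WITH_NA = ['MasVnrType', 'BsmtQual', 'FireplaceQu', 'GarageType']

    # Skewed numerical features to be log-transformed
    NUMERICALS_LOG_VARS = ['LotFrontage', 'GrLivArea']

    # Full list of features fed to the model
    FEATURES = CATEGORICAL_VARS_WITH_NA + ['GrLivArea', 'OverallQual', 'YearRemodAdd']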

Implementing the functional code for prediction (written in a somewhat crude form to ensure all steps are visible; method chaining could be used here to chain multiple functions, or they could all be called from a single wrapper function. The example below is just a breakdown of the steps.)
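A sketch of the functional flow, reusing the functions from Examples 1 and 2 and the config above (only a subset of the preprocessing steps is shown):

    import pandas as pd
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split
    import config  # the config.py sketched above

    data = pd.read_csv('train.csv')  # Kaggle house price training data
    X_train, X_test, y_train, y_test = train_test_split(
        data[config.FEATURES], data[config.TARGET], test_size=0.1, random_state=0)

    # Each preprocessing function is called explicitly, step by step
    X_train = fill_categorical_na(X_train, config.CATEGORICAL_VARS_WITH_NA)
    X_train = encode_categorical(X_train, y_train, config.CATEGORICAL_VARS_WITH_NA)

    model = Lasso(alpha=0.005, random_state=0)
    model.fit(X_train, y_train)

    # The same steps must be repeated manually, in the same order, on X_test
    # before calling model.predict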

Pipeline Code: Converting each function to a sklearn class and putting them in a pipeline

Once the classes are defined, the next step is to build the pipeline
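Continuing with the illustrative class and config names from above, assembling and running the pipeline could look like this:

    import pandas as pd
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    import config

    data = pd.read_csv('train.csv')
    X_train, X_test, y_train, y_test = train_test_split(
        data[config.FEATURES], data[config.TARGET], test_size=0.1, random_state=0)

    # Transformers run first, in order; the final step must be the predictor
    price_pipe = Pipeline([
        ('categorical_imputer', CategoricalImputer(config.CATEGORICAL_VARS_WITH_NA)),
        ('categorical_encoder', CategoricalEncoder(config.CATEGORICAL_VARS_WITH_NA)),
        ('model', Lasso(alpha=0.005, random_state=0)),
    ])

    # One fit call trains every transformer and the model; predict replays the
    # same transformations on new data before scoring it
    price_pipe.fit(X_train, y_train)
    predictions = price_pipe.predict(X_test)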

The GitHub code can be accessed here.

Part 2: Building Python Packages from your ML Model
