Machine Learning: Models to Production

Part 2: Build a Python Package from ML Model

Ashutosh Kumar
Analytics Vidhya
6 min read · Jan 29, 2020


This is the second part of a multi-part series on how to build and deploy a machine learning model. It covers building and installing a Python package from your predictive model.

The first part on building pipelines can be read here

The first part covers how to rewrite your model code as a sklearn pipeline so that it is easier to understand, manage, and edit. A model can be deployed without the pipeline structure, but it is good practice to build pipelines and to separate the different parts of the code (config, preprocessing, feature engineering, data, and tests).

This post builds on the pipeline code from the earlier post. If you had difficulty following the previous article, read up on how to build sklearn pipelines, and then look at the GitHub repos for each stage of package building.

Part 1: Organize code in pipelines, Training the model

The directories are restructured as in the image below

This part of the code uses three main files: pipeline.py, preprocessors.py, and train_pipeline.py. In addition, train.csv and test.csv are stored in the folder packages/regression_model/datasets

Every folder must have an __init__.py file (they are not present in the GitHub repo)

The GitHub repo for Part 1 is here

Details of directories:

packages: Root folder containing the package
regression_model: Name of the package

datasets: test.csv and train.csv, the Kaggle housing price prediction datasets downloaded from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

trained_models: The place where the trained models are saved as .pkl files

Files:
pipeline.py: Builds a pipeline with all the operations
preprocessors.py: All the fit and transform functions used in the pipeline
train_pipeline.py: Runs the training and saves the models
requirements.txt: All the necessary packages, with versions, that need to be installed

Prerequisites before running the model and training

Create a new environment

Building a new environment is recommended for various reasons. Read about it here

Add your directory to PYTHONPATH

Here is how to do it on macOS (for other operating systems, the steps are straightforward to look up)
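Directories listed in PYTHONPATH are added to Python's module search path (sys.path) when the interpreter starts. On macOS or Linux the variable is typically set with an export line in the shell profile. As a quick sanity check, the minimal sketch below reports whether a directory is on the search path. The helper name on_python_path and the example directory are assumptions for illustration and are not taken from the repo.

```python
import sys
from pathlib import Path


def on_python_path(directory) -> bool:
    """Return True if `directory` is one of the entries in sys.path."""
    target = Path(directory).resolve()
    # An empty entry in sys.path stands for the current working directory.
    return any(Path(entry or ".").resolve() == target for entry in sys.path)


# Hypothetical location of the package root; adjust it to your checkout.
print(on_python_path("packages/regression_model"))
```

If this prints False after you have set PYTHONPATH, open a new terminal so that the updated environment variable is picked up.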

Installing the requirements: run the following command with the correct path to the requirements.txt file

Running the model (training):

Output: a new file regression_model.pkl is generated in the packages/regression_model/trained_models folder

Part 2: Restructuring the project, making predictions and writing tests

The project needs to be restructured (the reason is explained in the package-building part) so that we have a separate package directory with its own requirements.txt file, as well as a separate test module for testing the models before deployment.

GitHub repo for part 2 is here

Folder Structure

Note the new structure: there is a regression_model folder inside regression_model, which is itself inside packages

The GitHub repo does not include __init__.py files; please add them (blank files, no content) before running

Adding the tests folder is covered right after this block; it requires installing pytest

Major Changes

packages/regression_model/regression_model/config/config.py:

The config file holds all the fixed values: variable names, features, the names of the train and test data files, and the target variable. This cleans up the code and makes it more readable. Also, if something needs to be changed (say, the name of a file, or removing a feature), it can be changed in one place rather than by going through all the code

Using the config files:
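As an illustration only, a config module of this kind and its use might look like the sketch below. All constant names, the directory layout, and the three-feature subset are assumptions made for this example; SalePrice is the target column in the Kaggle dataset, but the actual config.py in the repo may be organized differently.

```python
from pathlib import Path

# Hypothetical contents of config/config.py.
# Paths are relative to the project root and assumed for this sketch.
PACKAGE_ROOT = Path("packages") / "regression_model" / "regression_model"
DATASET_DIR = PACKAGE_ROOT / "datasets"
TRAINED_MODEL_DIR = PACKAGE_ROOT / "trained_models"

TRAINING_DATA_FILE = "train.csv"
TESTING_DATA_FILE = "test.csv"
PIPELINE_SAVE_FILE = "regression_model.pkl"

TARGET = "SalePrice"
# Illustrative subset of the dataset's columns, not the article's feature list.
FEATURES = ["LotArea", "OverallQual", "YearBuilt"]


def training_data_path() -> Path:
    """Build the full path of the training file from the config values."""
    return DATASET_DIR / TRAINING_DATA_FILE


# Other modules read the constants instead of hard-coding strings.
print(training_data_path())
print(TRAINED_MODEL_DIR / PIPELINE_SAVE_FILE)
```

Renaming the training file or dropping a feature then means editing one constant, and every module that imports the config picks up the change.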

packages/regression_model/regression_model/processing/data_management.py
This contains the functions load_dataset, save_pipeline, and load_pipeline. Moving them here cleans up the train_pipeline.py code
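A minimal sketch of the save and load helpers is shown below, under stated assumptions: the signatures are guesses, the standard-library pickle module stands in for whatever serializer the repo actually uses (joblib is a common choice for sklearn models), and a plain dictionary stands in for the fitted pipeline. load_dataset is omitted; it is typically a thin wrapper around pandas.read_csv.

```python
import pickle
import tempfile
from pathlib import Path


def save_pipeline(pipeline, path: Path) -> None:
    """Serialize the fitted pipeline to `path`, creating folders as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(pipeline, handle)


def load_pipeline(path: Path):
    """Load a previously saved pipeline from `path`."""
    with path.open("rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary folder, using a stand-in object
# instead of a real fitted sklearn pipeline.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "trained_models" / "regression_model.pkl"
    stand_in = {"steps": ["imputer", "scaler", "regressor"]}
    save_pipeline(stand_in, model_path)
    restored = load_pipeline(model_path)
```

With helpers like these, train_pipeline.py only has to load the data, fit the pipeline, and call save_pipeline.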

Using data_management.py

Training the model (ensure you have added your directory to the PYTHONPATH environment variable as explained earlier)

Make Predictions

This will not print anything. To check that the modules work correctly, test modules have to be added
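A prediction module usually only defines a function and returns its result, which would explain why running it prints nothing. The sketch below shows that shape under loud assumptions: the function name make_prediction, its keyword arguments, and the StandInPipeline class are all hypothetical. In the real package the function would load the saved .pkl pipeline instead of receiving a stand-in.

```python
class StandInPipeline:
    """Hypothetical stand-in for a fitted pipeline with a predict method."""

    price_per_sqft = 10.0  # made-up constant, for illustration only

    def predict(self, rows):
        return [row["LotArea"] * self.price_per_sqft for row in rows]


def make_prediction(*, input_data, pipeline):
    """Run the pipeline on the input rows and return the predictions."""
    predictions = pipeline.predict(input_data)
    return {"predictions": predictions}


# Nothing is printed unless the caller inspects the returned value.
result = make_prediction(
    input_data=[{"LotArea": 8450}],
    pipeline=StandInPipeline(),
)
```

Returning a dictionary rather than a bare list leaves room to add fields such as a model version later without changing the callers.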

Testing

New directory for tests at packages/regression_model/tests

test_predict.py contains the code for testing the model

Writing tests is optional, but it is always recommended. Tests help ensure that your model does not break after you make any major or minor change.

Read more about tests here

Contents of test_predict.py: it just checks that the first prediction is correct
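A test of this kind could be sketched as follows. pytest collects functions whose names start with test_, so the check lives in one such function. The make_prediction stand-in and the expected value are invented for this example; the real test would import the function from the package and compare against a value obtained from the trained model.

```python
import math


def make_prediction(*, input_data):
    """Hypothetical stand-in for the package's prediction function."""
    # A real implementation would load the saved pipeline and call predict.
    return {"predictions": [row["LotArea"] * 10.0 for row in input_data]}


def test_make_single_prediction():
    # Given: a single input row
    single_row = [{"LotArea": 8450}]

    # When: a prediction is made
    result = make_prediction(input_data=single_row)

    # Then: the first prediction exists, is a float, and has the expected value
    assert result is not None
    first_prediction = result["predictions"][0]
    assert isinstance(first_prediction, float)
    assert math.isclose(first_prediction, 84500.0, rel_tol=1e-6)
```

Comparing with math.isclose rather than == avoids spurious failures from small floating-point differences between library versions.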

Running Tests:

$ pytest packages/regression_model/tests -W ignore::DeprecationWarning

Part 3: Building the package

At this stage, your code is complete and has passed all the tests. The next step is building a package.

GitHub repo for Part 3 is here

These things need to be added to the current directory:

MANIFEST.in: specifies which files to include in the package

setup.py: Details about the package, such as metadata, requirements, and license information
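A common pattern in setup.py is to read the requirements file and pass its entries to install_requires. The sketch below shows that pattern under assumptions: the helper parse_requirements, the version number, the description, and the example requirement lines are hypothetical and are not taken from the repo. The actual call to setuptools is shown in comments so that the snippet can be run on its own.

```python
def parse_requirements(text: str) -> list:
    """Return requirement specifiers, skipping blank lines and comments."""
    requirements = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            requirements.append(line)
    return requirements


# Example requirements.txt contents; the packages and versions are illustrative.
EXAMPLE_REQUIREMENTS = """
# core dependencies
numpy>=1.18
pandas>=1.0

scikit-learn>=0.22
"""

SETUP_KWARGS = {
    "name": "regression_model",
    "version": "0.1.0",  # hypothetical version
    "description": "Regression model packaged from the training pipeline",
    "install_requires": parse_requirements(EXAMPLE_REQUIREMENTS),
    # Include the non-Python files selected by MANIFEST.in.
    "include_package_data": True,
}

# In an actual setup.py, the file would end with something like:
#   from setuptools import find_packages, setup
#   setup(packages=find_packages(exclude=("tests",)), **SETUP_KWARGS)
```

Reading the requirements from a file keeps the dependency list in one place, so the package metadata and the development environment do not drift apart.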

packages/regression_model/regression_model/requirements.txt: This is another requirements.txt file, placed inside the package, and it must be provided. There are two additional packages that need to be installed for packaging, so make sure you run

Run: Command for building source distribution (sdist) and wheel distribution (bdist_wheel)

If all goes well, you’ll have the following new files in your directory

The exact files will depend on your OS. This example was built on macOS 10.15

Your package is now ready to be installed and used — just like a normal Python package

Install Package

Use Package

The next post will cover some of the best practices (I know, there are a lot of them), such as versioning and logging, and how to host this package on the web so that anyone can install it. Future posts will cover deployment as an API on Heroku and AWS


Ashutosh Kumar
Analytics Vidhya

Data Science @ Epsilon; interested in technology, data, algorithms and blockchain. Reach out to me at ashu.iitkgp@gmail.com