Machine Learning: Models to Production

Part 2: Build a Python Package from ML Model

Published in Analytics Vidhya · 6 min read · Jan 29, 2020

This is the second part of a multi-part series on how to build and deploy a machine learning model. It covers building and installing a Python package out of your predictive model.

The first part on building pipelines can be read here

The first part covers how to rewrite your model code as a sklearn pipeline so that it is easier to understand, manage, and edit. A model can be deployed without the pipeline structure, but it is good practice to build pipelines and keep the different parts of the code separate (config, preprocessing, feature engineering, data, and tests).

This post builds on the pipeline code from the earlier article. If you had difficulty following that article, read up on how to build sklearn pipelines first, and then look at the GitHub repos for each stage of package building.

Part 1: Organize code in pipelines, Training the model

The directories are restructured as in the image below

This is just a part of the code which uses three main files: pipeline.py, preprocessors.py, and train_pipeline.py. Apart from this, train.csv and test.csv are stored in the folder /packages/regression_model/datasets

Every folder must have a __init__.py file (they are not present in the GitHub repo)

The GitHub repo for Part 1 is here

Details of directories:

packages: Root folder containing the package
regression_model: Name of the package

datasets: test.csv and train.csv, the Kaggle house price prediction datasets downloaded from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

trained_models: The place for saving the models as .pkl files

Files:
pipeline.py: Builds a pipeline with all the operations
preprocessors.py: All the fit and transform functions used in the pipeline
train_pipeline.py: Runs the training and saves the model
requirements.txt: All the necessary packages, with versions, that need to be installed

Prerequisites before running the model and training

Create a new environment

Building a new environment is recommended for various reasons. Read about it here
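An environment is usually created from the shell (for example with python -m venv), but the same step can be sketched with Python's standard venv module. The directory name ml_package_env is only an assumption for illustration, and the sketch builds the environment in a temporary folder so that it leaves nothing behind.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(env_dir: Path) -> Path:
    """Create an isolated virtual environment at env_dir and return its path."""
    # with_pip=False keeps the sketch fast; pass with_pip=True to bootstrap pip
    venv.create(env_dir, with_pip=False)
    return env_dir


# Hypothetical location inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    env_path = create_environment(Path(tmp) / "ml_package_env")
    # Every environment created by venv contains a pyvenv.cfg file
    config_exists = (env_path / "pyvenv.cfg").exists()
    print("environment created:", config_exists)
```

For real work you would create the environment in a permanent location and activate it before installing anything.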

Add your directory to PYTHONPATH

Here is how to do it on macOS (for other operating systems, a quick search will turn up the equivalent steps):


1. Open Terminal.app
2. Open the file ~/.bash_profile in your text editor, e.g. atom ~/.bash_profile
3. Add the following line to the end: export PYTHONPATH="{{Full path to packages/regression_model}}"
4. Close the terminal
5. Open a new terminal and test with $ echo $PYTHONPATH
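Besides echoing the variable, you can confirm from inside Python that the directory really ended up on the import path. This is a minimal sketch; the path in package_dir is a hypothetical placeholder, so substitute the full path you exported.

```python
import os
import sys
from pathlib import Path


def is_on_python_path(directory: str) -> bool:
    """Return True if directory is one of the entries Python searches for imports."""
    target = Path(directory).expanduser().resolve()
    # An empty sys.path entry stands for the current working directory
    return any(Path(entry or ".").resolve() == target for entry in sys.path)


print("PYTHONPATH =", os.environ.get("PYTHONPATH", "<not set>"))

# Hypothetical location; replace it with your own full path
package_dir = "~/projects/packages/regression_model"
print("on sys.path:", is_on_python_path(package_dir))
```

If the second line prints False in a fresh terminal, the export line was most likely not picked up by your shell.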

Installing packages: run the following command, pointing it at the correct location of the requirements.txt file

$ pip install -r requirements.txt

Running the model (training):

$ python packages/regression_model/train_pipeline.py

Output: a new file regression_model.pkl is generated in the packages/regression_model/trained_models folder

Part 2: Restructuring the project, making predictions and writing tests

The project needs to be restructured (will be explained when building package) so that we have a separate package directory with its own requirements.txt file, as well as a separate test module for testing the models before deployment

GitHub repo for part 2 is here

Folder Structure

Note the new structure — there is a regression_model folder inside regression_model inside packages

The GitHub repo does not include the __init__.py files. Please add them (blank files, no content) before running.
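Adding the blank files by hand is easy to get wrong when there are several folders, so here is a small sketch that does it with pathlib. The helper name add_init_files and the folder list in the demonstration are assumptions based on the directories mentioned in this post, not code from the repo.

```python
import tempfile
from pathlib import Path


def add_init_files(package_root: Path) -> list:
    """Create a blank __init__.py in package_root and every subfolder that lacks one."""
    created = []
    folders = [package_root] + [p for p in package_root.rglob("*") if p.is_dir()]
    for folder in folders:
        # Skip hidden folders such as .git and caches such as __pycache__
        parts = folder.relative_to(package_root).parts
        if any(part.startswith((".", "__")) for part in parts):
            continue
        init_file = folder / "__init__.py"
        if not init_file.exists():
            init_file.touch()
            created.append(init_file)
    return created


# Demonstration on a throwaway copy of an assumed layout
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "regression_model"
    for sub in ("config", "processing", "datasets", "trained_models"):
        (root / sub).mkdir(parents=True)
    created = add_init_files(root)
    created_names = sorted(p.parent.name for p in created)
    print(created_names)
```

To use it on the real project, call add_init_files with the path to your package folder instead of the temporary one.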

Adding the tests folder is covered just after this block; you need to install pytest for it.

Major Changes

packages/regression_model/regression_model/config/config.py:

A config file with all the fixed variable names: the features, the names of the train and test data files, and the target variable. This cleans up the code and makes it more readable. Also, if something needs to be changed (say, renaming a file or removing a feature), it can be done in one place rather than by going through the whole code.

Using the config files:

from regression_model.config import config
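To give an idea of what such a config file might hold, here is a minimal sketch. All constant names are assumptions, and FEATURES is only an illustrative subset of the columns in the Kaggle data, not the feature list used in the repo.

```python
from pathlib import Path

# Assumed layout: config.py lives in regression_model/config/
try:
    PACKAGE_ROOT = Path(__file__).resolve().parent.parent
except NameError:  # __file__ is undefined in an interactive session
    PACKAGE_ROOT = Path.cwd()

DATASET_DIR = PACKAGE_ROOT / "datasets"
TRAINED_MODEL_DIR = PACKAGE_ROOT / "trained_models"

# File names mentioned in this post
TRAINING_DATA_FILE = "train.csv"
TESTING_DATA_FILE = "test.csv"
PIPELINE_SAVE_FILE = "regression_model.pkl"

# Target column of the Kaggle house prices data
TARGET = "SalePrice"

# Illustrative subset of columns, not the author's feature list
FEATURES = ["MSSubClass", "LotArea", "OverallQual", "YearBuilt"]
```

Other modules would then refer to config.TARGET or config.FEATURES instead of repeating the strings, which is what makes a change in one place possible.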

packages/regression_model/regression_model/processing/data_management.py:
This contains the functions load_dataset, save_pipeline, and load_pipeline. Moving them here cleans up the train_pipeline.py code.

Using data_management.py

from regression_model.processing.data_management import load_dataset, save_pipeline
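The three helpers could look roughly like the sketch below. The signatures are assumptions, and to keep the example dependency-free it uses the standard csv and pickle modules; the real module most likely reads the data with pandas and may persist the pipeline with joblib.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def load_dataset(file_path: Path) -> list:
    """Read a CSV file into a list of row dictionaries (stand-in for pandas.read_csv)."""
    with open(file_path, newline="") as handle:
        return list(csv.DictReader(handle))


def save_pipeline(pipeline_to_persist, file_path: Path) -> None:
    """Serialize a fitted pipeline object to a .pkl file."""
    file_path.parent.mkdir(parents=True, exist_ok=True)
    with open(file_path, "wb") as handle:
        pickle.dump(pipeline_to_persist, handle)


def load_pipeline(file_path: Path):
    """Load a previously saved pipeline object from a .pkl file."""
    with open(file_path, "rb") as handle:
        return pickle.load(handle)


# Round trip with a plain dict standing in for a fitted sklearn pipeline
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "trained_models" / "regression_model.pkl"
    save_pipeline({"steps": ["imputer", "regressor"]}, model_path)
    restored = load_pipeline(model_path)
    print(restored)
```

Only load .pkl files that you created yourself, since unpickling runs arbitrary code from the file.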

Training the model (make sure you have added the package directory to PYTHONPATH, as explained earlier)

$ python packages/regression_model/regression_model/train_pipeline.py

Make Predictions

$ python packages/regression_model/regression_model/predict.py

This will not print anything. To check that the modules work as expected, test modules have to be added.

Testing

New Directory for Test at packages/regression_model/tests

test_predict.py contains the code for testing the model

requirements.txt: add the following lines
# testing
pytest>=4.6.6,<5.0.0

Writing tests is optional, but it is always recommended. Tests make sure that your model does not silently break after you make a major or minor change.
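A test in the style that pytest collects might look like the sketch below. The function make_prediction, its return shape, and the input values are all hypothetical; a stand-in is defined here so that the example runs on its own, whereas the real test would import the function from the package.

```python
def make_prediction(input_data: list) -> dict:
    """Stand-in for the package's prediction function (hypothetical name and return shape)."""
    # A real implementation would load the saved pipeline and call its predict method;
    # the constant below is arbitrary and only keeps the sketch self-contained
    return {"predictions": [112000.0 for _ in input_data]}


def test_make_single_prediction():
    # Given: one row of input features (made-up values)
    single_row = [{"LotArea": 8450, "OverallQual": 7, "YearBuilt": 2003}]

    # When
    result = make_prediction(single_row)

    # Then: exactly one positive float comes back
    predictions = result["predictions"]
    assert len(predictions) == 1
    assert isinstance(predictions[0], float)
    assert predictions[0] > 0


# pytest discovers and runs test_* functions; calling one directly also works
test_make_single_prediction()
```

Running pytest on the tests folder would pick up any function whose name starts with test_ in files named test_*.py.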

Part 3: Building the package

At this stage, your code is complete and has passed all the tests. The next step is building a package.

GitHub repo for Part 3 is here

These things need to be added to the current directory:

MANIFEST.in: specifies which files to include in the package

setup.py: Details about the package, such as meta-data, requirements, and license information

packages/regression_model/regression_model/requirements.txt: This is another requirements.txt file, inside the package, and it needs to be provided. There are two additional packages that need to be installed for packaging, so make sure you run

$ pip install -r packages/regression_model/regression_model/requirements.txt
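For orientation, a setup.py for this kind of project could be sketched as follows. Every metadata value is a hypothetical placeholder, the helper list_requirements is an assumption, and the sketch assumes a requirements.txt sitting next to setup.py; check the repo for the author's actual file.

```python
from pathlib import Path


def list_requirements(requirements_file: Path) -> list:
    """Return the requirement lines of a file, skipping blanks and comments."""
    lines = [line.strip() for line in requirements_file.read_text().splitlines()]
    return [line for line in lines if line and not line.startswith("#")]


# Hypothetical metadata; replace every value with your own
METADATA = {
    "name": "regression_model",
    "version": "0.1.0",
    "description": "Regression model for house price prediction.",
    "license": "MIT",
    "include_package_data": True,  # also ship the data files matched by MANIFEST.in
}


def main() -> None:
    # Imported here so the metadata above can be inspected without setuptools
    from setuptools import find_packages, setup

    here = Path(__file__).resolve().parent
    setup(
        packages=find_packages(exclude=("tests",)),
        install_requires=list_requirements(here / "requirements.txt"),
        **METADATA,
    )


if __name__ == "__main__":
    import sys

    # Only invoke setuptools when a command such as sdist or bdist_wheel is given
    if len(sys.argv) > 1:
        main()
```

Reading install_requires from requirements.txt keeps the dependency list in one place, at the cost of pinning the package to whatever versions that file specifies.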

Run: Command for building source distribution (sdist) and wheel distribution (bdist_wheel)

$ python packages/regression_model/setup.py sdist bdist_wheel

If all goes well, you'll have the following new files in your directory

The exact file names depend on your OS. This example was built on macOS 10.15.

Your package is now ready to be installed and used — just like a normal Python package

Install Package

$ pip install -e packages/regression_model/

Use Package
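A quick way to confirm that the installation worked is to check that the package can be found and imported from any directory. The sketch below only relies on the module path regression_model.predict, which follows from the predict.py file used earlier; it lists whatever that module exposes instead of assuming specific function names.

```python
import importlib
import importlib.util


def package_is_installed(package_name: str) -> bool:
    """Return True if package_name can be found on the current import path."""
    return importlib.util.find_spec(package_name) is not None


if package_is_installed("regression_model"):
    # The module path follows the predict.py file described earlier in this post
    predict_module = importlib.import_module("regression_model.predict")
    public_names = [name for name in dir(predict_module) if not name.startswith("_")]
    print("regression_model.predict exposes:", public_names)
else:
    print("regression_model is not installed in this environment")
```

If the else branch runs, check that the pip install command was executed inside the environment you are currently using.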

The next post will cover some of the best practices (I know, there are a lot of them) — versioning & logging, and how to host this package on the web from where anyone can install this. Future posts will cover the deployment as an API — on Heroku and AWS

Read more on virtual environments and testing on these portals:

Python Virtual Environment | Introduction - GeeksforGeeks (www.geeksforgeeks.org)

Getting Started With Testing in Python - Real Python (realpython.com)