Machine Learning: Models to Production
Part 2: Build a Python Package from ML Model
This is the second part of a multi-part series on how to build and deploy a machine learning model. It covers building and installing a Python package from your predictive model.
The first part on building pipelines can be read here
The first part covers how to rewrite your model code as a sklearn pipeline so that it is easier to understand, manage, and edit. A model can be deployed without the pipeline structure, but it is good practice to build pipelines and keep the different parts of the code separate (config, preprocessing, feature engineering, data, and tests).
This post builds on the pipeline code from the earlier post. If you had difficulty following the previous article, read up on how to build sklearn pipelines first, and then look at the GitHub repos for each stage of package building.
Part 1: Organize code in pipelines, Training the model
The directories are restructured as in the image below
This is just a part of the code, which uses three main files: pipeline.py, preprocessors.py, and train_pipeline.py. Apart from these, train.csv and test.csv are stored in the folder packages/regression_model/datasets
Every folder must have an __init__.py file (these are not present in the GitHub repo)
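Since the repo ships without the __init__.py files, they can be created by hand or with a few lines of Python. The snippet below is a minimal sketch and not part of the repo; the helper name add_init_files is made up for illustration, and it assumes you want an empty __init__.py in the package root and in every subfolder except hidden and cache folders.

```python
from pathlib import Path
from typing import List


def add_init_files(package_root: Path) -> List[Path]:
    """Create an empty __init__.py in package_root and each subfolder lacking one.

    Hypothetical helper, not part of the repo. Returns the files it created.
    """
    created = []
    folders = [package_root] + [p for p in package_root.rglob("*") if p.is_dir()]
    for folder in folders:
        relative_parts = folder.relative_to(package_root).parts
        # Skip hidden and cache folders such as .git or __pycache__.
        if any(part.startswith((".", "__")) for part in relative_parts):
            continue
        init_file = folder / "__init__.py"
        if not init_file.exists():
            init_file.touch()  # blank file, no content
            created.append(init_file)
    return sorted(created)
```

Calling add_init_files(Path("packages/regression_model")) from the project root would fill in the missing files; running it a second time creates nothing.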
The GitHub repo for Part 1 is here
Details of directories:
packages: Root folder containing the package
regression_model: Name of the package
datasets: test.csv and train.csv, the Kaggle datasets on housing price prediction downloaded from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
trained_models: The place for saving the models as .pkl files
Files:
pipeline.py: Builds a pipeline with all the operations
preprocessors.py: All the fit and transform functions used in the pipeline
train_pipeline.py: Runs the training and saves the model
requirements.txt: All the packages that need to be installed, with their versions
Prerequisites before running the model and training
Create a new environment
Building a new environment is recommended for various reasons. Read about it here
Add your directory to PYTHONPATH
Here is how to do it on macOS (the steps for other operating systems are easy to find online)
1. Open Terminal.app
2. Open the file ~/.bash_profile in your text editor — e.g. atom ~/.bash_profile
3. Add the following line to the end: export PYTHONPATH="{{Full Path to packages/regression_model}}"
4. Close the terminal
5. Open a new terminal and test: $ echo $PYTHONPATH
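If you would rather verify the setting from Python than from the shell, something like the following works. It is only a sketch: the function names are made up for illustration, and it simply checks whether a given directory appears among the entries of a PYTHONPATH string.

```python
import os
from pathlib import Path
from typing import List


def pythonpath_entries(raw: str) -> List[Path]:
    """Split a PYTHONPATH string into resolved directory paths."""
    return [Path(p).expanduser().resolve() for p in raw.split(os.pathsep) if p]


def is_on_pythonpath(directory: str, raw: str) -> bool:
    """Return True if directory is one of the entries in the PYTHONPATH string."""
    return Path(directory).expanduser().resolve() in pythonpath_entries(raw)


# Read the variable from the current environment (empty string if it is unset).
current_value = os.environ.get("PYTHONPATH", "")
print("PYTHONPATH entries:", [str(p) for p in pythonpath_entries(current_value)])
```

If is_on_pythonpath returns False for your packages/regression_model folder in a fresh terminal, the export line was probably not picked up.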
Installing packages: run the following command, pointing it at the correct location of the requirements.txt file
$ pip install -r requirements.txt
Running the model (training):
$ python packages/regression_model/train_pipeline.py
Output: a new file regression_model.pkl is generated in the packages/regression_model/trained_models folder
Part 2: Restructuring the project, making predictions and writing tests
The project needs to be restructured (the reasons are explained in the section on building the package) so that we have a separate package directory with its own requirements.txt file, as well as a separate test module for testing the model before deployment
GitHub repo for part 2 is here
Folder Structure
Note the new structure: there is a regression_model folder inside regression_model inside packages
The GitHub repo does not include the __init__.py files, so please add them (blank files, no content) before running
Adding the tests folder is covered just after this block; you need to install pytest for it
Major Changes
packages/regression_model/regression_model/config/config.py:
A config file with all the fixed values: variable names, features, the names of the train and test data files, and the target variable. This cleans up the code and makes it more readable. It also means that if something needs to change (say the name of a file, or removing a feature), the change is made in one place rather than throughout the code
Using the config files:
from regression_model.config import config
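To give an idea of what such a config module holds, here is a rough sketch. It is not the repo's actual file: the constant names and the choice of features are assumptions made for illustration, and the location of the datasets and trained_models folders inside the inner package is also assumed. The file names and the SalePrice target come from the Kaggle dataset used in this series.

```python
from pathlib import Path

# Assumed location of the inner package, relative to the project root.
# A real config.py would more likely derive this from Path(__file__).
PACKAGE_ROOT = Path("packages/regression_model/regression_model")
DATASET_DIR = PACKAGE_ROOT / "datasets"
TRAINED_MODEL_DIR = PACKAGE_ROOT / "trained_models"

# Data and model file names.
TRAINING_DATA_FILE = "train.csv"
TESTING_DATA_FILE = "test.csv"
PIPELINE_SAVE_FILE = "regression_model.pkl"

# Target column of the Kaggle housing dataset.
TARGET = "SalePrice"

# Illustrative subset of the dataset's columns; the real list may differ.
FEATURES = ["LotArea", "OverallQual", "YearBuilt", "GrLivArea"]
```

With a module like this in place, dropping a feature means editing the FEATURES list once, and the rest of the code picks up the change through config.FEATURES.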
packages/regression_model/regression_model/processing/data_management.py
This contains the functions load_dataset, save_pipeline and load_pipeline. Moving them here cleans up the train_pipeline.py code
Using data_management.py
from regression_model.processing.data_management import load_dataset, save_pipeline
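The three helpers are small. The sketch below shows one possible shape for them; it is not the repo's code. To keep it runnable without extra installs it uses the standard library's csv and pickle modules, whereas a real project of this kind would typically read the data with pandas and persist the pipeline with joblib. The function signatures (explicit path arguments) are assumptions as well.

```python
import csv
import pickle
from pathlib import Path
from typing import Any, Dict, List


def load_dataset(file_path: Path) -> List[Dict[str, str]]:
    """Read a CSV file into a list of row dictionaries, one per record."""
    with open(file_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def save_pipeline(pipeline_to_persist: Any, save_path: Path) -> None:
    """Serialize the fitted pipeline to save_path, creating the folder if needed."""
    save_path.parent.mkdir(parents=True, exist_ok=True)
    with open(save_path, "wb") as handle:
        pickle.dump(pipeline_to_persist, handle)


def load_pipeline(save_path: Path) -> Any:
    """Load a previously saved pipeline from disk."""
    with open(save_path, "rb") as handle:
        return pickle.load(handle)
```

Keeping the loading and saving logic in one module means train_pipeline.py is left with only the steps that matter: load the data, fit the pipeline, save it.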
Training the model (ensure you have set the PYTHONPATH environment variable as explained earlier)
$ python packages/regression_model/regression_model/train_pipeline.py
Make Predictions
$ python packages/regression_model/regression_model/predict.py
This will not print anything. To check that the modules are working correctly, test modules have to be added
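The reason nothing is printed is that a prediction module normally only defines a function for other code to call. A minimal sketch of that idea follows. The name make_prediction, its signature, and the returned dictionary are assumptions for illustration, and MeanModel is a stand-in for the fitted sklearn pipeline that would be loaded from the .pkl file.

```python
from typing import Any, Dict, List, Sequence


def make_prediction(pipeline: Any, input_rows: Sequence[Any]) -> Dict[str, List[float]]:
    """Run a fitted pipeline on the input rows and wrap the result in a dict."""
    predictions = pipeline.predict(input_rows)
    return {"predictions": [float(value) for value in predictions]}


class MeanModel:
    """Stand-in for a fitted pipeline: predicts the same value for every row."""

    def __init__(self, value: float) -> None:
        self.value = value

    def predict(self, rows: Sequence[Any]) -> List[float]:
        return [self.value for _ in rows]


# Nothing is shown unless the caller prints the result explicitly.
result = make_prediction(MeanModel(180000.0), [{"LotArea": 8450}, {"LotArea": 9600}])
print(result["predictions"][0])  # 180000.0
```

Returning a dictionary rather than a bare array leaves room to add more keys later without changing the callers.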
Testing
New Directory for Test at packages/regression_model/tests
test_predict.py contains the code for testing the model
requirements.txt: add
# testing
pytest>=4.6.6,<5.0.0
Writing tests is optional, but it is always recommended. Tests ensure that your model does not silently break after you make a major or minor change.
Read more about tests here
Contents of test_predict.py: Just check the first prediction is correct
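A test of this kind could look roughly like the sketch below. It is not the repo's test file: make_prediction here is a stub standing in for the package's prediction function (a hypothetical name), and both the input row and the expected value 100000.0 are placeholders. In the real test, the input would be the first row of test.csv and the expected value would be whatever your trained model returns for it.

```python
import math


def make_prediction(input_rows):
    """Stub standing in for the package's prediction function (hypothetical)."""
    return {"predictions": [100000.0 for _ in input_rows]}


def test_make_single_prediction():
    # Placeholder input: a single row with two of the dataset's columns.
    single_row = [{"LotArea": 11622, "OverallQual": 5}]

    result = make_prediction(single_row)

    # Check only the first prediction.
    assert result is not None
    first_prediction = result["predictions"][0]
    assert isinstance(first_prediction, float)
    # Allow a small tolerance instead of demanding an exact match.
    assert math.isclose(first_prediction, 100000.0, rel_tol=0.01)
```

pytest collects any function whose name starts with test_ in files named test_*.py, so no extra registration is needed.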
Running Tests:
$ pytest packages/regression_model/tests -W ignore::DeprecationWarning
Part 3: Building the package
At this stage, your code is complete and has passed all the tests. The next step is building a package.
GitHub repo for Part 3 is here
These things need to be added to the current directory:
MANIFEST.in: specifies which files to include in the package
setup.py: package metadata, requirements, license information and other details about the model
packages/regression_model/regression_model/requirements.txt: This is another requirements.txt file, inside the package, and it must be provided. Two additional packages need to be installed for packaging, so make sure you run
$ pip install -r packages/regression_model/regression_model/requirements.txt
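One common pattern is for setup.py to read its install requirements from the requirements.txt file instead of repeating them. The sketch below shows that pattern only; it is not the repo's setup.py. The helper names, the version number, and the description are placeholders.

```python
from pathlib import Path
from typing import Any, Dict, List


def list_requirements(requirements_file: Path) -> List[str]:
    """Return the requirement specifiers, skipping blank lines and comments."""
    lines = requirements_file.read_text(encoding="utf-8").splitlines()
    stripped = [line.strip() for line in lines]
    return [line for line in stripped if line and not line.startswith("#")]


def build_setup_kwargs(requirements_file: Path) -> Dict[str, Any]:
    """Assemble keyword arguments for setuptools.setup() (placeholder metadata)."""
    return {
        "name": "regression_model",
        "version": "0.1.0",  # placeholder version
        "description": "Regression model for house price prediction",  # placeholder
        "install_requires": list_requirements(requirements_file),
        # Ask setuptools to include the data files matched by MANIFEST.in.
        "include_package_data": True,
    }
```

In an actual setup.py you would import setup and find_packages from setuptools and finish with a call such as setup(packages=find_packages(), **build_setup_kwargs(path_to_requirements)).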
Run: Command for building source distribution (sdist) and wheel distribution (bdist_wheel)
$ python packages/regression_model/setup.py sdist bdist_wheel
If all goes well, you’ll have the following new files in your directory
The exact file names depend on your OS. This example was built on macOS 10.15
Your package is now ready to be installed and used, just like any other Python package
Install Package
$ pip install -e packages/regression_model/
Use Package
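Before using the package, it is worth confirming that the install worked in the environment you are running. A small sketch using only the standard library follows; the helper name is_installed is made up for illustration.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("regression_model"):
    import regression_model

    print("regression_model is importable from", regression_model.__file__)
else:
    print("regression_model is not installed in this environment")
```

Once the check passes, the package's modules can be imported from any directory, without relying on PYTHONPATH.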
The next post will cover some best practices (I know, there are a lot of them), including versioning and logging, and how to host this package on the web so that anyone can install it. Future posts will cover deployment as an API on Heroku and AWS
Read more on virtual environments and testing on these portals: