Packaging in Python

by Eric Pan

Published in

The Opex Analytics Blog

5 min read · Jul 5, 2019

Imagine you have five machine learning projects going on, all of which involve extracting data from a database and creating features from this raw data. Let’s say that this code is common across your slate of projects, and as things stand, you have to update every copy of your data extraction/feature creation code each time you tweak one of them. In this case, you might want to wrap up all the relevant code into a package, instead of including it in the scripts for each project. In this post, we’ll cover the basics and benefits of packaging your Python code.

(In this post, the code snippets will reference my prior post on building safe, easy-to-use database connections in Python — you don’t need to read it to understand what we’ll talk about here, but it couldn’t hurt!)

Packaging allows you to manage your projects independently and conveniently. It’s especially valuable in larger projects with distinct process stages: ETL, cleansing, feature engineering, modeling, output, etc. Once the shared extraction and feature creation code is bundled up into a package, you just have to import it into each project instead of copying and pasting it everywhere and maintaining each copy separately.

As a practical example, let’s say that your customer decides to make a change to their database, and the ETL classes you’ve created (customer_clean and customer_features, we’ll say) have to change accordingly. Packaging can reduce the headache that this entails: instead of changing the code in five different projects (i.e. committing, pushing, and deploying the change for each), you just have to change the customer_clean class in your one and only data ETL package.

So let’s package it up! With our two hypothetical ETL classes (customer_clean and customer_features) included, the codebase we want to include in our package should resemble the following, which we’ll call functions.py:

< functions.py >
import cx_Oracle
import logging
import os

class oracle_connection(object):
    ...

def db_connector(func):
    ...

class customer_clean(data):
    ...

class customer_features(data):
    ...

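Let’s start the process by ensuring you have setuptools installed, along with wheel, which older setuptools releases need for the bdist_wheel command used later:
pip install setuptools wheel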
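
To confirm the install from Python itself, you can ask the standard library which version is present. This is only a minimal sketch: installed_version is a helper name made up for illustration, not part of pip or setuptools.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# installed_version is a hypothetical helper, not a pip or setuptools API.
print("setuptools:", installed_version("setuptools") or "not installed")
```

If this prints "not installed", run the pip command again inside the environment you intend to build from.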

Next, check that your directory matches this structure:

/client_package_dir
    /client_package
        __init__.py
        functions.py
    LICENSE     # see below
    README.md
    setup.py    # see below
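
If you’d rather check the layout programmatically, something like the sketch below would do. The EXPECTED list simply mirrors the tree above, and missing_entries is a hypothetical helper name; the demo runs in a throwaway temporary directory.

```python
import tempfile
from pathlib import Path

# Files the layout above expects, relative to client_package_dir.
EXPECTED = [
    "client_package/__init__.py",
    "client_package/functions.py",
    "LICENSE",
    "README.md",
    "setup.py",
]


def missing_entries(project_root):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


# Demo in a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "client_package").mkdir()
    (root / "client_package" / "__init__.py").touch()
    (root / "setup.py").touch()
    print(missing_entries(root))
    # ['client_package/functions.py', 'LICENSE', 'README.md']
```

An empty list means the directory matches the structure.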

Below is an example version of setup.py. Each entry’s meaning should be pretty straightforward; some of them are optional (e.g. the README-based long_description, classifiers), but I recommend you fill them out anyway. Of course, you must also give your package a name. Here, we’ll keep it simple and refer to our code as client_package. (Note that we are using the cx_Oracle package for database connections, so don’t forget to include it in install_requires!)

import setuptools

# if you have a README.md file
with open("README.md", "r") as fh:
    long_description = fh.read()

setuptools.setup(
    name="client_package",
    version="0.0.1",
    author="Eric Pan",
    author_email="eric.pan@opexanalytics.com",
    description="Supporting packages for project",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/",
    packages=setuptools.find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    install_requires=[
        "cx_Oracle",
    ],
)

You can find a license here. For most projects, an MIT license will do. It’s not technically necessary to include a license, but it’s often recommended.

The __init__.py file is for telling Python that this directory is a package — you can leave it empty.
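
The __init__.py file also matters to setup.py: setuptools.find_packages() only picks up directories that contain one. The sketch below is a rough standard-library approximation of that discovery rule, not setuptools’ actual implementation; discover_packages and the notes directory are names invented for the demo.

```python
import os
import tempfile
from pathlib import Path


def discover_packages(root):
    """Approximate find_packages(): dotted names of directories holding __init__.py."""
    root = Path(root)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        current = Path(dirpath)
        if current == root:
            continue  # the project root itself is not a package
        if "__init__.py" not in filenames:
            dirnames[:] = []  # do not descend into non-package directories
            continue
        found.append(".".join(current.relative_to(root).parts))
    return sorted(found)


# Demo: one real package and one plain directory without __init__.py.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "client_package").mkdir()
    (base / "client_package" / "__init__.py").touch()
    (base / "client_package" / "functions.py").touch()
    (base / "notes").mkdir()
    (base / "notes" / "scratch.py").touch()
    print(discover_packages(base))  # ['client_package']
```

The notes directory is skipped because it has no __init__.py, which is the usual reason a module goes missing from a built package.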

After it’s created, there are two ways to use our package:

1. Install and use it locally
2. Build a wheel (and potentially publish it for the world to use)

The first option is generally used for development. With a local (editable) installation, import client_package resolves to the source files in your project directory, so changes to the code take effect without reinstalling.

The second option is for a tested, stable version of a package. It’ll install the project and make it accessible to the entire Python environment. When building a wheel, we generate a .whl file and install it. After doing so, a copy of the package will be located in the environment’s site-packages directory (<Python path>/Lib/site-packages on Windows).
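
If you’re not sure where that install location is for your environment, the standard library can tell you. A minimal sketch, where site_packages_dir is just an illustrative helper name:

```python
import sysconfig
from pathlib import Path


def site_packages_dir():
    """Return the directory where pure-Python packages get installed."""
    return Path(sysconfig.get_paths()["purelib"])


# The exact path depends on your OS and on the active environment.
print(site_packages_dir())
```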

Let’s take a look at the local approach first. In many projects, you’ll see people import directly from scripts in the project directory. A common practice is to run the following command in the topmost such directory (the one containing setup.py):

pip install -e .

This tells pip to install your project as a package, but instead of copying it into <Python path>/Lib, it links the environment to your project directory. You’ll just need a setup.py (like the one above) alongside any code you wish to package up. You can also install a package whose setup.py lives in a subdirectory by passing that path instead:

pip install -e ./example_sub_dir

Compared with adding your project directory to your PYTHONPATH (check this for more about adding to the path), this local approach is safer (it touches only the active Python environment) and more convenient. It’s also easy to use: on a new machine, just git clone your project, create a new environment, and then run the above command to install the package along with everything listed in install_requires.
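
One way to confirm that a local install points where you expect is to ask Python where it would load the package from. This is a small sketch using only the standard library; module_origin is a hypothetical helper name, and it is meant for top-level names such as client_package.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module or package would load from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# After an editable install, the origin should sit inside your project
# directory rather than inside the environment's install location.
print(module_origin("client_package"))  # None if it is not installed
print(module_origin("json"))            # a standard-library package, for comparison
```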

The second approach is standard for creating publicly available packages, but it might suit your needs for local use too. If you want to build your project into a wheel (and even publish it), you can run the following command in the client_package_dir path:

python setup.py bdist_wheel

And doing so will give you a structure like this:

/client_package_dir
    /build
      ...
    /dist
        client_package-0.0.1-py3-none-any.whl
    /client_package
        __init__.py
        functions.py
    /client_package.egg-info
      ...
    LICENSE
    README.md
    setup.py
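
The name of the file in dist isn’t arbitrary: wheel filenames follow the pattern distribution-version-python tag-ABI tag-platform tag, with an optional build tag after the version. The sketch below splits a filename into those parts; WheelName and parse_wheel_filename are names made up for illustration, and this is a simplified parser rather than a full validator.

```python
from typing import NamedTuple


class WheelName(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        del parts[2]  # drop the optional build tag
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    return WheelName(*parts)


info = parse_wheel_filename("client_package-0.0.1-py3-none-any.whl")
print(info.python_tag, info.abi_tag, info.platform_tag)  # py3 none any
```

Here py3-none-any means the wheel works on any Python 3 interpreter, needs no particular ABI, and runs on any platform, which is what you’d expect for a pure-Python package.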

The client_package-0.0.1-py3-none-any.whl file is the wheel file you want. You can simply use pip install dist/client_package-0.0.1-py3-none-any.whl to install it right into a Python environment, neatly and robustly. Usage is as simple as importing and calling the classes and methods within:

import pandas as pd
from client_package.functions import oracle_connection

# sql is assumed to hold your query string
with oracle_connection() as conn:
    data = pd.read_sql(sql, conn.connector)
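
The real oracle_connection class comes from the earlier post and isn’t reproduced here. To show the shape of what the snippet above relies on (a context manager exposing a connector attribute), here is a stand-in built on an in-memory SQLite database. Everything in it, including the sqlite_connection name and the customers table, is an assumption made for illustration, not the author’s implementation.

```python
import sqlite3


class sqlite_connection:
    """Hypothetical stand-in for oracle_connection, backed by in-memory SQLite."""

    def __init__(self, database=":memory:"):
        self.database = database
        self.connector = None

    def __enter__(self):
        # Open the connection when the with-block starts.
        self.connector = sqlite3.connect(self.database)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always close the connection, even if the block raised.
        self.connector.close()
        self.connector = None
        return False  # let any exception propagate


with sqlite_connection() as conn:
    conn.connector.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.connector.execute("INSERT INTO customers VALUES (1, 'Ada')")
    rows = conn.connector.execute("SELECT id, name FROM customers").fetchall()

print(rows)  # [(1, 'Ada')]
```

Once a class like this lives in a package, every project gets the same open-and-close behavior from a single import.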

If you’re interested in publishing your package (either to PyPI or to your company’s internal package index), I’d recommend using twine (find details here).

Packaging has many benefits, from easing the burden of code maintenance across multiple projects, to increasing adoption both internally and externally. Happy packaging!

