Guide to writing Python packages for Data Scientists

Gaurang Mehra
6 min read · Aug 11, 2023


Write a Python package for common data science tasks.

Most code in the data science and machine learning world tends to be linear and live in Jupyter notebooks. While notebooks are a great way to get started, they are not great for writing modular code. There are many tasks in your data science workflow that you perform again and again. You should write functions for these tasks and group functions that perform similar tasks into a package. That way you can reuse the package from one project to the next. In this article we will walk through a package that contains functions to impute missing values and to standardize all column names to lower case. These tasks are repeatable and amenable to being set up as functions in a preprocessing package.

The first step is to set up your package directory. There are two ways to do this:

  • Set up your own bare-bones package directory. This needs just two things: an __init__.py file and the script with your functions. This package directory is a local module that you can copy and paste into every project, then import and use.
  • Use cookiecutter to create a project template complete with setup tooling. You can install the package at an environment level and then import it like numpy or pandas. Using the cookiecutter package together with a template automatically creates the package skeleton with important files like a README and a setup file. You can fill out these templates to the level of detail you require.

Part 1: Writing a bare-bones package

To write a bare-bones package you need to create a folder with your package name. In that folder you need an __init__.py file. This file tells Python that the directory is a package; it's generally left blank. You will also need to create a file with your functions.

Fig 1.1: Package directory

The __init__.py file and the file with the preprocessing code, called pre1, are stored in the package folder called preprocessing.
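If you want to scaffold this layout from the command line, here is a minimal sketch using the Windows shell seen elsewhere in this article (type nul just creates an empty file; on macOS/Linux use touch instead; the file name pre1.py is assumed from the pre1 module described above):

mkdir preprocessing
type nul > preprocessing\__init__.py
type nul > preprocessing\pre1.py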

The pre1 file contains functions to standardize the column headers of a dataframe and to impute missing values. The functions are general and can take any dataframe as input. See the code below and some notes:

  • We need both numpy and pandas, so we have imported them at the top of the pre1 file.
  • We have written our functions below that. A good practice is to give each function a docstring using triple quotes. A standard docstring contains:
    - A short function description
    - Parameters: what parameters the function takes
    - Returns: what type of value the function returns
import numpy as np
import pandas as pd


def low_case_columns(df):
    """Function to standardize column headers to lower case

    Parameters
    ----------
    df : pandas dataframe
        dataframe for which you need lower case column headers

    Returns
    -------
    df : pandas dataframe
        dataframe with column headers transformed to lower case
    """
    df.columns = df.columns.str.lower()
    return df


def impute_numeric_values(df, method=np.mean):
    """Takes a dataframe and imputes missing values for all the numeric columns

    Parameters
    ----------
    df : pandas dataframe
        dataframe with missing values
    method : func, optional
        function used to impute missing values, by default np.mean

    Returns
    -------
    df : pandas dataframe
        dataframe with missing numeric values imputed
    """
    num_cols = df.select_dtypes(include=['int', 'float']).columns
    for col in num_cols:
        # Use the supplied aggregation function (the mean by default)
        df[col] = df[col].fillna(value=method(df[col]))
    return df


def impute_object_columns(df):
    """Takes a dataframe and imputes missing values in object columns with the mode

    Parameters
    ----------
    df : pandas dataframe
        dataframe with missing values

    Returns
    -------
    df : pandas dataframe
        dataframe with missing object values imputed
    """
    obj_cols = df.select_dtypes(include='object').columns
    for col in obj_cols:
        # idxmax() on value_counts() returns the most frequent value (the mode)
        df[col] = df[col].fillna(value=df[col].value_counts().idxmax())
    return df

We can now use this package. The only caveat is that your script has to be in the same directory as the package folder; if not, you will get an ImportError.

Fig 1.2: Creating a test Jupyter notebook in the package directory
Fig 1.3: Importing the package and using it in the test notebook

We import the pre1 module from preprocessing using dot notation, as shown in line 1 above. We can then access the functions in the pre1 module. Since these are general functions that work on any dataframe, we use them on a toy dataframe defined in the second line of code. Here we see that low_case_columns and impute_numeric_values work as intended.
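The toy dataframe itself only appears in fig 1.3, so here is a minimal sketch of what the test notebook code could look like; the column names and values are invented for illustration:

import numpy as np
import pandas as pd
from preprocessing import pre1

# Toy dataframe: mixed-case headers and a missing numeric value
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'],
                   'Score': [90.0, np.nan, 75.0]})

df = pre1.low_case_columns(df)       # headers become 'name' and 'score'
df = pre1.impute_numeric_values(df)  # NaN in 'score' imputed with the mean (82.5)
print(df)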

In this case we created a package with some general-use functions and used it in a local folder. In Part 2 we will see how to install a package in our environment.

Part 2: Writing a package at an environment level

In this case we use cookiecutter together with a link to a package template. Cookiecutter then asks a set of questions and creates a package template folder. In this folder you will have a set of template files and a code folder.

  1. Install cookiecutter using pip
pip install cookiecutter

2. Use cookiecutter together with a link to a package template. In this case we use the Oldani template and answer the questions.
- It is best practice to use pytest
- Always use a version number
- Provide a project name and a slug

(plotly) C:\Users\gaura>cookiecutter https://github.com/oldani/cookiecutter-simple-pypackage
full_name [Ordanis Sanchez]: Gaurang Mehra
email [ordanisanchez@gmail.com]: gaurangmehra@gmail.com
github_username [oldani]: gmehra123
project_name [Python Boilerplate]: preprocess
project_slug [preprocess]: preprocess
project_short_description:
pypi_username [gmehra123]:
version [0.1.0]: 0.1.0
use_pytest [n]: y
use_pypi_deployment_with_travis [y]: n
add_pyup_badge [n]: n
Select command_line_interface:
1 - Click
2 - No command-line interface
Choose from 1, 2 [1]: 1
create_author_file [y]: y
Select open_source_license:
1 - MIT license
2 - BSD license
3 - ISC license
4 - Apache Software License 2.0
5 - GNU General Public License v3
6 - Not open source
Choose from 1, 2, 3, 4, 5, 6 [1]:5

3. Now we navigate to the preprocess folder. This folder has a README file, which is a template you can fill out. It also has a license file with the license information you provided in the previous step.

Fig 1.4: Template package directory

4. Now we go into the preprocess folder and navigate to the file labelled preprocess. This is where our function code will live.

Fig 1.5: Copy and paste the functions into the template preprocess file

5. Copy and paste the script functions into this template file.

6. Now navigate to the package folder and install it using pip. Use the -e option to install the package in editable mode, so edits to the code are picked up without reinstalling.

(plotly) C:\Users\gaura>cd preprocess

(plotly) C:\Users\gaura\preprocess>pip install -e .

Now the package is installed at an environment level. Check your environment's package list; you should see the package there (see fig 1.6 below). You can also create a test notebook outside the folder, import the package, and test it (see fig 1.7 below and the sketch that follows).

Fig 1.6: Checking the package list
Fig 1.7: Environment-level install test
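As a rough sketch of that check: pip show confirms the install, and the import below assumes the functions were pasted into the inner preprocess/preprocess.py module file that the template created.

(plotly) C:\Users\gaura>pip show preprocess

import pandas as pd
from preprocess import preprocess  # inner module; importable from any folder now

df = pd.DataFrame({'City': ['NY', 'NY', None, 'LA']})
print(preprocess.impute_object_columns(df))  # missing city filled with the mode, 'NY'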
