A Practical Guide to Writing Production-Quality Data Science Code

How to go from Notebooks to production-ready code easily

Campbell Pryor
Contino Engineering
12 min read · Nov 30, 2022


The real value of data science is when the outputs from your Machine Learning (ML) models are used by others, which depends on them going into production. This is evidenced by the rise of the ML engineer, a role dedicated to ensuring ML models work well in production.

Software engineers have developed practices to ensure code is production-ready by focusing on production up-front. Test Driven Development, for example, involves writing tests that specify how your code should function before writing any of the functional code.

However, these approaches do not generalise to Data Science.

ML projects typically start out with so many uncertainties that prioritising production-quality code up-front can be impractical. We first need to know: what data issues exist? What patterns does the data exhibit? What performance is possible? Without answers to these questions, we do not know whether we should put anything into production at all.

For this reason, it is difficult to convince data scientists to shift away from the common practice of organically developing Jupyter Notebooks as they explore the data and potential models. The downside of this is that Notebooks can lead to code that is potentially broken, difficult to read, difficult to update and difficult to maintain.

Fortunately, it turns out we can quite easily rework the code from a Notebook into code that is production-ready; code that will not elicit disapproving groans from the engineers responsible for putting it into production.

In this blog, we’ll walk through a practical set of steps to turn code from a Notebook into production-ready code. By “practical” I mean the minimum effort required to avoid maximum headaches in the future.

Turning a Notebook into Production-Ready Code

We will use this notebook from AWS. It is well written but is designed to educate, not productionise. It is worth having a quick skim of the Notebook to give context to the rest of this blog.

The resulting code is available here.

Side note: This blog does not cover any of the supporting systems that will ideally be set-up to support ML models in production, such as code repositories, data pipelines, model monitoring, feature stores, scheduling, DevOps, scalable infrastructure etc. These systems fall more into the ML Engineering and MLOps space, which you can read more about here, here and here.

Install packages into a virtual environment

Notebook code to install required packages

The first cell in the Notebook installs some Python packages from pip. However, every package installed this way will update the packages available for other projects on your system. This makes it almost impossible to track which packages (and which versions) are needed by each project.

Virtual environments allow you to manage packages for each of your projects separately. This allows you to avoid clashes between projects and makes it much clearer which packages are needed by your current project.

I like to use pipenv to manage my virtual environments. It automatically creates a virtual environment for each new project and tracks every package version that you install.

This is as simple as running the following commands in bash:

pip install --user pipenv
cd your/project/folder
pipenv install --python 3.9 sagemaker pandas numpy

This will create a virtual environment that is linked to your project folder, with Python 3.9 installed in the virtual environment, along with sagemaker, pandas and numpy.

Going forward, you just call pipenv install [package_name] instead of pip install [package_name] when you want to install other Python packages.
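To actually run your code inside that environment, prefix commands with pipenv run, or activate the environment with pipenv shell. As a quick sketch (train.py here is a hypothetical script, not one from the notebook):

# Run a script inside the project's virtual environment
pipenv run python train.py

# Or activate the environment for the current shell session
pipenv shell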

If you want more info on pipenv and virtual environments, a good guide is available here.

Parameterise hardcoded variables

Example from the notebook where the arguments to download_file() are hardcoded
Example from the notebook where model instance settings and hyperparameters are hardcoded

In the notebook, we see various instances of parameters being hardcoded, meaning that they are assigned a fixed value in the source code. This makes those parameters hard to update. For example, if users want to run experiments with different hyperparameters, they need to change the source code (and potentially redeploy the entire code base) for each experiment.

The simplest solution is to define these parameters in a config file and point your ML pipelines to this config file when they are run. When you run a new experiment just point to the new config file. This makes it easy to track what the parameters were set to in any given experiment (by saving the config file) and allows you to update the parameters without having to edit and redeploy the source code.

Typically these config files are defined in json or yaml. I prefer yaml because it is easier to read.

Example yaml config file. It is easy to read and we can quickly see all the parameter settings for the project
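A hypothetical config along those lines might look as follows. The values are made up; the structure simply matches the keys referenced in the code below:

role: arn:aws:iam::123456789012:role/my-sagemaker-role  # made-up example role

SETTINGS:
  instance_count: 1
  instance_type: ml.m5.xlarge

DEV:
  OUTPUT:
    output_folder: s3://my-dev-bucket/model-output/

PROD:
  OUTPUT:
    output_folder: s3://my-prod-bucket/model-output/

XGB_HYPERPARAMETERS:
  max_depth: 5
  eta: 0.2
  objective: binary:logistic
  num_round: 100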

You can then import this file as a dictionary and reference it where desired. For example, defining the model might look like:

def define_model(params, env="DEV"):
xgb = sagemaker.estimator.Estimator(
container=helpers.get_training_container(),
role=params["role"],
instance_count=params["SETTINGS"]["instance_count"],
instance_type=params["SETTINGS"]["instance_type"],
output_path=params[env]['OUTPUT']['output_folder'],
sagemaker_session=sagemaker.Session(),
)

# Note: the ** is a Python trick that unpacks the dictionary and passes
# each key:value pair as named arguments. This just helps make the code
# a bit easier to write and update because you can add new hyperparams
# to the config file without having to add them to the source code.
xgb.set_hyperparameters(**params["XGB_HYPERPARAMETERS"])

# Load the config file
with open("config/config.yaml", "r") as stream:
params = yaml.safe_load(stream)

# Define the model
xgb = define_model(params, env="DEV")

Move code into modular functions

Example of code that is duplicated in the notebook. This code is hard to read and if we update it, we need to update it twice.

In the notebook, all of the code is defined in a single file. If we instead move code out into functions we gain a number of benefits:

  • Reduced duplication of code: Allows us to update a function once and then all calls to that function get updated.
  • Easier to read: Instead of reading the above block of code that uses a boto3 session to access an object in a bucket and upload a file to it, we could just read upload_file(source_location, target_location).
  • Enables unit testing: We’ll talk about this below.

I like to have three levels of functions:

  • Helpers = functions that are completely independent from the rest of your code. These are the building blocks that perform one small, specific task (e.g. uploading a file)
  • Steps = Combine various helper functions together to perform an easily explainable task (e.g. clean the data)
  • Pipelines = Combine various steps to perform a job from start to finish (e.g. a training pipeline might clean the data, train a model, deploy that model, generate predictions on test data and then evaluate the predictions).

I find this approach to be an intuitive breakdown of how I approach ML code. First I think about the overall outcomes I need to generate (e.g. a trained and evaluated model), which are achieved with the pipeline. Then I break that down into the steps to achieve that outcome. For each step, I plan out the specific tasks, which I then implement in helper functions.

Conceptual example of a pipeline, a step and a helper function
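In code, that breakdown might look something like this minimal, hypothetical sketch (the function names are illustrative, not the actual functions from the accompanying repo):

# Helpers: small, specific building blocks
def remove_duplicates(df):
    return df.drop_duplicates()

def fill_missing_values(df):
    return df.fillna(0)

# Step: combines helpers into one easily explainable task
def clean_data(df):
    df = remove_duplicates(df)
    df = fill_missing_values(df)
    return df

# Pipeline: combines steps to perform a job from start to finish
def train_and_evaluate(config_path):
    params = load_settings(config_path)              # step shown later in this post
    train, validation, test = prepare_data(params)   # hypothetical step
    model = train_model(train, validation, params)   # hypothetical step
    return evaluate_model(model, test, params)       # hypothetical step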

By separating out the core functionality into separate helper functions, it becomes much easier to understand what the steps and pipelines are doing, while also reducing code duplication (we can call a helper twice instead of copying and pasting code). As you will see in the next section, it also makes it much easier to implement tests to check that our code is functioning correctly.

Side note: To make sure we can import these functions, we need to add an empty file called __init__.py to each folder in our project that contains Python files. This allows the folders to be recognised as packages, so we can import the functions just like other packages, e.g. from steps import clean_data. You can optionally also package your code and share it so that others can pipenv install it (see here).

Add Unit Tests

Unit testing allows us to specify how individual pieces of our code should work and then run automated tests to make sure they are working as expected. This means that we can be confident that our functions are working as expected now and will continue to work as expected even after updates to the code are made.

The reason we wanted to separate our core functionality out into separate, isolated functions is exactly this unit testing. We want to know that each individual piece of our code works. The more focused each of our functions is, the easier it is to test. If our code all sits in a single script or notebook, it becomes very hard to write tests or to decipher where errors occurred.

There are two common ways of implementing these tests in Python: the unittest package and the pytest package. Pytest tends to be easier to pick up and more readable, so let’s use it.

Let’s take the train_test_split function as an example to show what unit testing looks like.

import pandas as pd
import numpy as np

from src.units.helpers import train_test_split


def test_train_test_split():
    df = pd.DataFrame(
        {
            "col1": np.arange(12),
            "col2": np.arange(12),
        }
    )
    train_data, validation_data, test_data = train_test_split(
        df, train_frac=0.5, validation_frac=0.33
    )
    assert len(train_data) == 6
    assert len(validation_data) == 4
    assert len(test_data) == 2
    assert set(df.columns) == set(train_data.columns)

What we have done here is create a test by writing a function that has “test_” at the start of its name and then asserts some requirements at the end of it. Pytest will automatically find any functions starting with “test_” as long as they are in a .py file that also starts with “test_”.

In this test function, we define a small dataframe that we can pass to our function, train_test_split. After calling the function, we assert the results we expect to get (assert raises an error if the statement that follows it is not True).

In this case, we expect to get half of the rows in train_data, one third of the rows in validation_data and the remaining rows in test_data. And we expect the columns to remain the same.

While there are other things you can do with Pytest, unit tests can be this simple. A great overview of the other things you can do with Pytest is available here, but reading that article will make you understand why I felt the need to offer a more practical/simple approach in this article.

Now let’s see if the function we adapted from the notebook passes this test. We do this by running pytest in the terminal.

Results of running pytest

The results show that instead of validation_data having four rows, it only has three. Upon exploration, we find this is because we set validation_frac to 0.33, which, when multiplied by 12, gives 3.96. That value is being rounded down to three instead of rounded to the nearest integer, four, as desired. After updating the function to round to the nearest integer instead of rounding down, we run the test again and find that it passes.
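For reference, a minimal sketch of what the corrected helper might look like (the actual implementation in the accompanying repo may differ):

import pandas as pd


def train_test_split(df, train_frac, validation_frac, seed=42):
    """Shuffle a dataframe and split it into train, validation and test sets.

    Split sizes are rounded to the nearest integer rather than truncated,
    so the splits match the requested fractions as closely as possible.
    """
    shuffled = df.sample(frac=1, random_state=seed)
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train_data = shuffled.iloc[:n_train]
    validation_data = shuffled.iloc[n_train : n_train + n_validation]
    test_data = shuffled.iloc[n_train + n_validation :]
    return train_data, validation_data, test_data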

Some other functions are not as nicely isolated as the train_test_split function. For example, to download files a call to AWS S3 is made:

import boto3


def download_file(source_location, target_location, source_bucket=None):
    s3 = boto3.client("s3")
    s3.download_file(source_bucket, source_location, target_location)
    return None

None of our unit tests should rely on other functions or external systems (e.g. the S3 API), because that makes it possible for the tests to fail even when our code is correct. For example, if the file on S3 is deleted or the API is down, our test will fail even though our code is fine.

The solution to this is to “mock” any calls to external systems, meaning that we pretend to interact with these systems without actually doing so. However, mocks are probably more effort to learn and implement than the rest of this blog combined, so we will choose to test these functions with integration tests instead.

If you are keen to learn about mocks, a good introduction is available here.
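If you do want a taste, a minimal sketch using Python’s built-in unittest.mock might look something like this (assuming download_file lives in src/units/helpers.py; the file paths and bucket name are made up):

from unittest.mock import patch

from src.units.helpers import download_file


def test_download_file():
    # Replace boto3.client with a mock so no real call to S3 is made
    with patch("src.units.helpers.boto3.client") as mock_client:
        download_file("data/raw.csv", "/tmp/raw.csv", source_bucket="my-bucket")
        # Check that the S3 client was asked to download the expected object
        mock_client.return_value.download_file.assert_called_once_with(
            "my-bucket", "data/raw.csv", "/tmp/raw.csv"
        )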

Side note: You can also automatically run your unit tests before every Git commit by setting up Git pre-commit hooks, as outlined here.

Add Integration Tests

Unit testing is valuable for speeding up development but it does not tell you whether the code works overall. Integration testing involves checking that your various pieces of code work together and work with other systems (e.g. external APIs, databases etc.).

This is crucial to ensure the proper functioning of your application but these tests take longer than unit tests so we can’t run them as often. Where we might run unit tests every time we commit code to a repository or submit a Pull Request, we may only run integration tests before a new build is released, with additional manual runs of the tests as desired.

An integration test can be as simple as calling a function and making sure that it produces the outputs we expect.

from src.pipelines.train_and_evaluate import train_and_evaluate


def test_train_and_evaluate():
    accuracy = train_and_evaluate(
        config_path="tests/assets/config.yaml"
    )
    assert accuracy >= 0 and accuracy <= 1

This is a short example where we just call our entire train and evaluate pipeline and make sure it runs from start to finish. I’ve added one assertion here that the accuracy output is between 0 and 1. You would want to add additional assertions that any other outputs/side effects your function should produce are also being produced as expected.

We can run just the unit tests or just the integration tests by pointing pytest at the relevant folder. For example, to run only the integration tests, we call pytest tests/integration_tests.
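Assuming the unit tests live in a tests/unit_tests folder alongside tests/integration_tests, that looks like:

# Run only the fast unit tests
pytest tests/unit_tests

# Run only the slower integration tests
pytest tests/integration_tests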

Add documentation

Notebooks can be great for explaining what you are doing throughout the code. As we move away from Notebooks for production, we want to keep this documentation going.

Example README file

A README.md file is the standard way of adding high-level documentation to a project. Typically it will contain an overview of the project, the steps needed to run or deploy code in the project and any other relevant notes or links.

Hopefully your functions at this point are simple enough that they are understandable without documentation. However, it can make life easier if you document them too with docstrings. A docstring is just a comment at the start of a file or function. Various packages (e.g. Sphinx) can extract info from your docstrings to automatically generate prettier documentation. Below is an example of a docstring (and type hinting) following Google’s docstring style guide:

def load_settings(param_path: str, env: str = "DEV") -> dict:
    """Load settings from a config file and add some extra settings in.

    Args:
        param_path: Path to a file containing parameters
        env: One of "DEV" or "PROD", representing whether to load settings
            for the development or production environment

    Returns:
        A dictionary containing the relevant settings
    """
    settings = helpers.load_params(param_path, env=env)
    settings["bucket"] = helpers.get_default_bucket()
    settings["role"] = helpers.get_user_role()
    return settings

Summing Up

It is easy to feel overwhelmed when reading about good software engineering practices. This blog post has aimed to draw out the most important lessons from the software engineering world to make writing production-ready code approachable and quick.

By following these steps we can extract the code from an exploratory Notebook into code that is ready to provide ongoing value.

Your code will be easier for engineers to put into production. Once in production, it will be much easier to review, test and deploy updates to your code or model. This aligns with the DevOps concept of CI/CD, which lets you add functionality or fix problems within a day rather than within weeks.

The overall result of these changes is that your model is more likely to get into production and to remain operational once there. All the effort you put into developing the ML model feels much more justified when you can see it being used by others and providing them value.

If you are interested in learning about some of the other aspects of keeping models working well in production, these fall under the banner of MLOps. You can read more about who is responsible for MLOps tasks here, and you can see how to implement some MLOps concepts in GCP here.
