Bridging the Gap: Converting Data Science Notebooks into Production ML Code

Başak Tuğçe Eskili
Published in Marvelous MLOps
Aug 23, 2023 · 5 min read

If you are a data scientist or data analyst who has built a promising ML solution that now needs to be prepared for production, and you wonder how to do it yourself, this article is for you. We will cover the minimum basics required to hand over your code to an ML/MLOps engineer. Of course, actual deployment involves many other tools and steps; we introduced these in our previous article, The Minimum Set of Must-Haves for MLOps.

Turning code implemented in a notebook into production-ready code involves several steps to ensure the code’s maintainability, scalability, and efficiency. In this article, we’ll walk through the process using examples, demonstrating how to refactor, modularize, configure, test, log, and package your code for a seamless transition from experimentation to deployment.

Step 1: Refactor code into functions

Refactoring code into functions is a foundational practice in software development. Once you encapsulate a piece of code within a function, you can reuse it without rewriting or copying it. When something needs to change, you change a single function instead of hunting for the same code throughout the implementation. It also makes your code easier to understand. Functions facilitate modular design, which lets different developers work on different functions without causing merge conflicts. Also, import only the functions you need from libraries instead of entire libraries.

Notebook code:

import pandas as pd
data = pd.read_csv('data.csv')
filtered_data = data[data['column'] > 100]
print(filtered_data)

Refactored code:

from pandas import read_csv

def filter_data(filename, column, threshold):
    data = read_csv(filename)
    return data[data[column] > threshold]

filtered_data = filter_data('data.csv', 'column', 100)
print(filtered_data)

Step 2: Move to Classes

Another common practice in software development is adopting object-oriented programming principles. Structuring code using classes can make the codebase more organized, readable, and modular. With classes, you can bundle data (class attributes) and methods (class functions) into a single unit.

Refactored code:

from pandas import read_csv

class DataPreprocessor:
    def __init__(self, filename):
        self.data = read_csv(filename)

    def filter_data(self, column, threshold):
        return self.data[self.data[column] > threshold]

preprocessor = DataPreprocessor('data.csv')
filtered_data = preprocessor.filter_data('column', 100)
print(filtered_data)

Step 3: Separate tasks

Separate data loading, processing, and output (like printing or saving results). This makes the code easier to test and modify.

from pandas import read_csv

class DataLoader:
    def __init__(self, filename):
        self.data = read_csv(filename)


class DataPreprocessor:
    def __init__(self, data):
        self.data = data

    def filter_data(self, column, threshold):
        return self.data[self.data[column] > threshold]


def display_data(data):
    print(data)


loader = DataLoader('data.csv')
preprocessor = DataPreprocessor(loader.data)
filtered_data = preprocessor.filter_data('column', 100)
display_data(filtered_data)

Step 4: Isolate configuration

Create a configuration file or module to store hyperparameters and settings. Keeping configuration settings separate from the main code allows you to easily adjust and fine-tune parameters without modifying the main code. It is also easier to create test scenarios with different configuration settings.

Configuration file:

# config.yaml
learning_rate: 0.001
num_epochs: 10
batch_size: 32
hidden_units: 128
training_data_path: "data/train.csv"
validation_data_path: "data/validation.csv"

# main.py
import yaml

def main():
    with open("config.yaml", "r") as f:
        config = yaml.safe_load(f)  # safe_load is preferred; plain yaml.load requires an explicit Loader

Configuration module:

# config.py
class ModelConfig:
    learning_rate = 0.001
    num_epochs = 10
    batch_size = 32
    hidden_units = 128


class DataConfig:
    training_data_path = "data/train.csv"
    validation_data_path = "data/validation.csv"

# main.py
from config import ModelConfig, DataConfig

def main():
    model_config = ModelConfig()
    data_config = DataConfig()

    # Use configuration settings in the rest of the code
    print("Learning Rate:", model_config.learning_rate)
    print("Training Data Path:", data_config.training_data_path)

if __name__ == "__main__":
    main()

Step 5: Add unit tests

Unit tests (and/or integration tests) are essential to verify that the code behaves as expected; they catch bugs early and guard against regressions.

While refactoring notebook code, write a unit test for each function you create. Writing tests as you create each function is simpler than attempting to write them all in one go.

# tests/test_data_preprocessor.py
import pandas as pd

# Assumes DataPreprocessor lives in its own module (see Step 6)
from my_python_package.data_preprocessor import DataPreprocessor

def test_filter_data():
    test_data = pd.DataFrame({'column': [50, 150, 250]})
    preprocessor = DataPreprocessor(test_data)
    result = preprocessor.filter_data('column', 100)
    expected_result = test_data[test_data['column'] > 100]
    assert result.equals(expected_result)
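
To execute the whole test suite, you can use a test runner such as pytest, which discovers and runs these tests automatically (a minimal sketch, assuming the tests live in the tests/ folder shown in Step 6):

# Install pytest and run all tests under the tests/ directory
pip install pytest
pytest tests/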

Step 6: Move code to separate modules

Create a new Python file for each set of classes. This will make your codebase more modular and reusable. It will also keep your repository nice, tidy, and organized.

/my_project
├── data
│   └── data.csv
├── main.py
├── my_python_package
│   ├── data_loader.py
│   └── data_preprocessor.py
└── tests
    └── test_data_preprocessor.py
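
With this layout, main.py only wires the pieces together by importing from the package modules (a sketch assuming the class and file names used above):

# main.py
from my_python_package.data_loader import DataLoader
from my_python_package.data_preprocessor import DataPreprocessor

loader = DataLoader('data/data.csv')
preprocessor = DataPreprocessor(loader.data)
filtered_data = preprocessor.filter_data('column', 100)
print(filtered_data)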

Step 7: Log

Logging provides a detailed record of what happened during the execution of your program. When errors or unexpected behavior occur, these logs can be useful for diagnosing and debugging issues.

# logger.py
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# main.py
from logger import logger

def main():
    logger.info("Starting the application…")
    # …
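
Beyond console output, you can point the same logger at a file and add timestamps, which is what you typically want in production (a minimal sketch; the filename and format string are illustrative):

# logger.py
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    filename="app.log",  # illustrative: write logs to a file instead of stdout
)
logger = logging.getLogger(__name__)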

Step 8: Document

Document your code. Add comments to complex lines, docstrings to each function, and external documentation to explain the purpose and usage of the code. This makes it easier for collaborators to understand and work on the same codebase.

from pandas import read_csv

def filter_data(filename, column, threshold):
    """Filters the dataset based on a given threshold.

    Args:
        filename (str): path to the dataset
        column (str): column to filter on
        threshold (int): threshold to filter by

    Returns:
        pd.DataFrame: filtered dataframe
    """
    data = read_csv(filename)
    return data[data[column] > threshold]

filtered_data = filter_data('data.csv', 'column', 100)
print(filtered_data)

Step 9: Package and use dependency management

Use dependency management tools to ensure a consistent and replicable environment, so your code behaves the same way across different setups: development, acceptance, and production. You can pin the exact version of each library or package you want to use. This also avoids dependency hell, the situation where different parts of an application rely on incompatible versions of the same library. Tools like pipenv, conda, or poetry help manage dependencies, and creating a requirements.txt or similar file ensures that the production environment matches the development environment.
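
For example, a requirements.txt with pinned versions guarantees that every environment installs the same dependencies (the versions below are illustrative):

# requirements.txt (versions are illustrative)
pandas==2.0.3
PyYAML==6.0.1

Anyone can then recreate the environment with pip install -r requirements.txt.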

Apart from dependency management tools, a fairly failsafe and accessible way to manage your code is to package it. The repository is a wrapper around your entire solution, while the Python package wraps all the Python code that should run in production. Ideally, you could also build images to ensure reproducible runs (provided you have reproducible data inputs).

/my_project
├── data
│   └── data.csv
├── main.py
├── my_python_package
│   ├── __init__.py
│   ├── data_loader.py
│   └── data_preprocessor.py
└── tests
    └── test_data_preprocessor.py
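
A minimal packaging configuration might look like this (a sketch using setuptools; the name, version, and dependency pins are illustrative):

# pyproject.toml
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "my_python_package"
version = "0.1.0"
dependencies = [
    "pandas>=2.0",
    "pyyaml>=6.0",
]

With this in place, pip install . builds and installs the package, so the production environment runs exactly the code you packaged.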

Step 10: Hand over your code through version control

Please do not e-mail a zip file. Version control systems offer many ways to collaborate with your MLOps engineer, making cooperation much easier!
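
A minimal handover via git might look like this (a sketch; the remote URL is a placeholder for your team's repository):

# Initialize the repository, commit the refactored code, and push it
git init
git add .
git commit -m "Refactor notebook into production-ready package"
git remote add origin <your-remote-url>
git push -u origin main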

While notebooks are great for developing data and ML solutions, the true value of these solutions is harvested when they run in production. As technology advances, going from experimentation and development to production will get easier. Until then, you can use these steps and best practices to make this shift smoother.
