Welcome to the Big Leagues

Andrew Fowler
GAMMA — Part of BCG X

--

Ready to expand your data science from a math problem to a full-blown software engineering project? Here are three steps to “productionizing” your code.

You’ve finally finished tweaking and tuning your model! You beat all reasonable (and unreasonable) baselines by an enormous margin. You cross-validated on every fold, slice, and holdout — every internal, external and eternal dataset you could get your hands on. The analysts at your company have determined that this model will create so much value that upper management can buy a new building and name it after you. It’s beautiful. Now it’s time to productionize.

“Productionization” is a bit of a buzzword whose definition can change depending on who you ask. I think of productionization as the inflection point where a data science project changes from a math problem to a software engineering project. This comes with a handful of concrete requirements:

  • The code must run on systems other than just your laptop.
  • Model parameters must be changeable without touching the source code.
  • The code must be checked for errors every time it changes.

While it isn’t really fair to expect most data scientists to be able to convert a valuable model into a full-fledged software product, a clever data scientist can prepare a model pipeline for production by making sure it has these three things:

  • Command line executability
  • Externally configurable code
  • A configuration that runs in under 2 minutes

For the sake of this article, we’ll call the following snippet of code our “pipeline”:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Read in training data
training_data = pd.read_csv('./data/training_data.csv')
X = training_data[['foo', 'bar', 'baz']]
y = training_data['y']

# Read in scoring data
scoring_data = pd.read_csv('./data/scoring_data.csv')
X_pred = scoring_data[['foo', 'bar', 'baz']]

# Fit your model
regr = RandomForestRegressor(n_estimators=100)
regr.fit(X, y)

# Make your predictions
y_hat = regr.predict(X_pred)

# Output to a csv
np.savetxt('./data/output.csv', y_hat, delimiter=',')

We’ll assume that this code currently runs in a Jupyter notebook called model.ipynb. Now onwards to our pre-production checklist!

1. Command Line Executability

Why do we need this?

This is specifically for folks who do most of their development work from notebooks. Working in a notebook is good because it enables the fast-feedback loop of micro-iterations during the model-building process. On the other hand, it tethers your model to the software you use to create it. Command line executability streamlines the process of adapting your model to run on other systems. I recommend making your code command-line executable for a few reasons:

  1. You don’t have to go through the overhead of setting up a notebook server just to run your code somewhere other than where you wrote it.
  2. Scheduling tasks is straightforward for anything that can be run from the command line, and much less so for notebooks (see the cron sketch below).
  3. Ensuring that the code runs in the correct environment is much more straightforward.
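
To illustrate point 2: once model.py exists (we create it just below), scheduling it is a one-line crontab entry. This is only a sketch — the schedule, interpreter path, and script path here are hypothetical:

# Hypothetical crontab entry: run the pipeline every day at 2 a.m.
# (adjust the interpreter and script paths for your own system)
0 2 * * * /usr/bin/python /home/me/project/model.py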

How is it done?

For a model like our “pipeline,” this step is trivially easy. All we have to do is copy our source code from our notebook model.ipynb into a python script called model.py. To run the code, we simply type python model.py into our command prompt. That said, we’re going to make one minor adjustment so the code can still be imported by other python scripts (the new lines are the run() wrapper and the __main__ guard):

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def run():
    # Read in training data
    training_data = pd.read_csv('./data/training_data.csv')
    X = training_data[['foo', 'bar', 'baz']]
    y = training_data['y']

    # Read in scoring data
    scoring_data = pd.read_csv('./data/scoring_data.csv')
    X_pred = scoring_data[['foo', 'bar', 'baz']]

    # Fit your model
    regr = RandomForestRegressor(n_estimators=100)
    regr.fit(X, y)

    # Make your predictions
    y_hat = regr.predict(X_pred)

    # Output to a csv
    np.savetxt('./data/output.csv', y_hat, delimiter=',')

if __name__ == "__main__":
    run()

By adding if __name__ == "__main__": at the end of our script, the code beneath it runs only when the script is executed directly, not when it is imported. That means we can also treat our model script as a module inside another script: import it and call model.run().
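
For example, here’s a minimal sketch of that import in action (batch_job.py is a hypothetical file name):

# batch_job.py — a hypothetical script that reuses our pipeline
import model

# Runs the full pipeline in-process, without spawning a new interpreter
model.run()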

2. Externally Configurable Code

Why do we need this?

In a business environment, speed of iteration is crucial. As the complexity of your pipeline grows, you run the risk that any time you change the code something will break, causing unforeseen delays. For this reason, it’s a good idea to be able to change some of the values in your modeling pipeline without changing the code itself.

How is it done?

We’re going to use an external config file for this task. There are several ways to do this, but a good standard option is the YAML format. YAML files can be read in python with the pyyaml package. From the code above, there are four values we might want to change: the data location, the columns for our X variables, the column for our y variable, and the number of estimators in our random forest model. We’ll call this YAML file config.yml. Here’s what it looks like for our toy code:

Config file:

DATA_PATH: './data'
X_VAR: ['foo', 'bar', 'baz']
Y_VAR: ['y']
N_ESTIMATORS: 100

Pipeline code (the config load and the config lookups are the new parts):

import pandas as pd
import numpy as np
import yaml
from sklearn.ensemble import RandomForestRegressor

def run(config_path='./config.yml'):
    # Read in yaml file
    with open(config_path, 'r') as ymlfile:
        config = yaml.safe_load(ymlfile)

    # Pull out paths up here
    training_path = config['DATA_PATH'] + '/training_data.csv'
    scoring_path = config['DATA_PATH'] + '/scoring_data.csv'
    output_path = config['DATA_PATH'] + '/output.csv'

    # Read in training data
    training_data = pd.read_csv(training_path)
    X = training_data[config['X_VAR']]
    y = training_data[config['Y_VAR']]

    # Read in scoring data
    scoring_data = pd.read_csv(scoring_path)
    X_pred = scoring_data[config['X_VAR']]

    # Fit your model
    regr = RandomForestRegressor(
        n_estimators=config['N_ESTIMATORS']
    )
    regr.fit(X, y)

    # Make your predictions
    y_hat = regr.predict(X_pred)

    # Output to a csv
    np.savetxt(output_path, y_hat, delimiter=',')

if __name__ == "__main__":
    run()

Now any changes we make to the config file will be picked up by our model code the next time it runs. But wait! What if we want to run a different config? Great question. Using the click package, we can make the path to the config settable from the command line:

import pandas as pd
import numpy as np
import yaml
from sklearn.ensemble import RandomForestRegressor
import click

@click.command()
@click.argument("config_path", default="./config.yml")
def run(config_path):
    ...  # pipeline body unchanged from above

Once our script is updated, we run it like this:

python model.py ./config.yml

Boom! Your pipeline is now runnable and configurable via the command line. Side note: that @ symbol before the click functions denotes a python function decorator. They’re super useful for making your code easier to read!
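
If you haven’t met decorators before, here’s a minimal sketch of one. The timed decorator below is hypothetical and not part of our pipeline, but it shows the general idea behind what click’s decorators are doing to run:

import functools
import time

def timed(func):
    # A decorator takes a function and returns a wrapped version of it
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} finished in {time.time() - start:.2f}s")
        return result
    return wrapper

@timed  # equivalent to: run = timed(run)
def run():
    time.sleep(1)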

3. A Configuration That Runs in Under 2 Minutes

Why do we need this?

As model complexity grows, so too does runtime. In my experience, if it takes more than a couple of minutes to run a pipeline end to end, it is unlikely that the code will be tested regularly. This is not good because:

  • Bugs and errors will be discovered only at runtime.
  • Changes will take hours to validate.
  • Unexpected delays will cause project managers to spontaneously combust as milestones are missed.

What we want is a dataset and settings that allow us to quickly do an end-to-end run of the pipeline. This dataset and group of settings will be our test case. Bear in mind that the faster a test is, the more frequently it will be run. The more frequently a test is run, the easier it will be for you, the data scientist, to sleep at night, knowing that your code still runs whenever you make a change.

How is it done?

Let me be clear here: when I say “runs in under two minutes,” I don’t want you to trim every last ounce of fat from your pipeline, or bump up your cluster size so you rip through your full 250 TB dataset while sucking up enough electricity to get a call from your local utility company. Let me further say that the two-minute target proposed here is aspirational: I tend to lose my train of thought if my code takes longer than two minutes to run, and I assume other people do as well.

What I’m talking about is sampling your data so it’s mostly representative of your full dataset, and choosing hyperparameters for your models that allow them to run quickly (even if they do so inaccurately). With that in mind, here are some tips and tricks on how to sample your dataset (a short sketch of the first two follows the list):

  1. For a truly random sample, both pandas and spark have a “sample” method to grab random rows of data.
  2. For data where multiple rows belong to the same entity (such as different days at the same store), np.random.choice is a useful trick for picking a few entities at random and keeping all of their rows.
  3. If there are known edge cases in your dataset and you want to make sure that your pipeline can handle them, make sure your sampled data contain those cases.
  4. If possible, make sure your sampled dataset is < 2MB for easy storage in your git repo. This will make your pipeline testable wherever it’s cloned without having to configure a database connection.
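
Here’s a minimal sketch of tips 1 and 2 in pandas; the store_id grouping column is hypothetical:

import numpy as np
import pandas as pd

full_data = pd.read_csv('./data/training_data.csv')

# Tip 1: a truly random sample of ~1% of rows
row_sample = full_data.sample(frac=0.01, random_state=42)

# Tip 2: pick five entities at random and keep all of their rows
# ('store_id' is a hypothetical grouping column)
stores = np.random.choice(full_data['store_id'].unique(), size=5, replace=False)
group_sample = full_data[full_data['store_id'].isin(stores)]

# Store the sample where our test config expects it
row_sample.to_csv('./test_data/training_data.csv', index=False)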

So let’s say that we sampled our training and scoring data and stored them in a new data folder called test_data. We also want to run our random forest regressor with fewer estimators so it finishes faster. We’ll call the new config that points to this data test_config.yml and store it in our working directory. The contents of the file will look like this:

DATA_PATH: './test_data'
X_VAR: ['foo', 'bar', 'baz']
Y_VAR: ['y']
N_ESTIMATORS: 1

Now when we want to run the model from the command line, we’ll input the following command:

python model.py ./test_config.yml

And there you have it! This brings your model much closer to production. With these three steps as a foundation, you’ll have many more options for future deployment. You’ll be able to use advanced orchestration tools like Airflow, test parts of the pipeline with pytest, and even run simple performance benchmarks on your sample data. Next time you have a model that’s ready for the big leagues, these three steps will bring you right up to the doorstep of an enterprise solution for your code.
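
If you want a taste of that testing, here’s a minimal sketch of a pytest check that runs the whole pipeline on the test config via click’s built-in test runner (it assumes the model.py and test_config.yml we built above):

# test_model.py — run with: pytest test_model.py
from click.testing import CliRunner

from model import run

def test_pipeline_runs_end_to_end():
    # Invoke the click command exactly as the command line would
    runner = CliRunner()
    result = runner.invoke(run, ["./test_config.yml"])
    assert result.exit_code == 0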
