Driving experiments with make

Lars Yencken
Lifesum: Healthy living, simplified.
Feb 16, 2016

In our drive to understand the world better, we rely more and more on data to answer our questions. For this to work, we need to be productive and our analysis needs to be repeatable so that we and others can trust it.

This post explains a convention that we’re trying at Lifesum for new experiments, to make any analysis we do on small datasets easier to share and easier to come back to months later.

Folder structure

We have an analytics git repo, with a one-folder-per-experiment structure that looks something like this:

my-experiment/
    README.md
    Makefile
    requirements.txt
    src/
    input/
    output/

The README.md is the first thing I add, along with a statement of the experiment’s goal. Later, once there’s more structure, I come back and describe the experiment’s dependencies and how to re-run it. Usually that’s as simple as typing make.

Running the experiment will mean that missing input files get automatically downloaded into input/, and then any derivative data will end up in output/.
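
As a rough sketch of how the pieces fit together (the file names here are hypothetical, and the individual rules are explained in the sections below), the Makefile maps those folders onto targets:

# `all` names the experiment’s final outputs; `make` with no target builds the first rule.
all: output/results.csv

# Derived data in output/ is built from fetched data in input/ by scripts in src/.
output/results.csv: input/dataset.csv src/analyse.py
	python src/analyse.py input/dataset.csv output/results.csv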

Using Make

A few years ago, Mike Bostock wrote an excellent article, “Why Use Make”, describing the advantages Make brings to data workflows:

  • Smart dependencies: it automatically downloads missing files and rebuilds targets when their dependencies change
  • Documentation: it serves as a machine-readable way of documenting your workflow
  • Already installed: it’s available by default nearly everywhere

If you’re not familiar with Make, here’s a quick example. You might create a Makefile with a rule like:

output.csv: input.csv myscript.py
	python myscript.py input.csv output.csv

Then when you invoke make with a target, it will execute your rule to try to create it:

$ make output.csv
python myscript.py input.csv output.csv

Cleverly, it only re-runs the rule if the target is out of date:

$ make output.csv
make: 'output.csv' is up to date.

When your steps are expensive, this saves a lot of time. You can chain a large series of rules together, and Make will happily run only the steps needed to build your target.
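
For example, here’s a sketch of a small two-step chain (the file and script names are hypothetical):

# Hypothetical chain: clean the raw data, then build a report from it.
output/clean.csv: input/raw.csv src/clean.py
	python src/clean.py input/raw.csv output/clean.csv

output/report.csv: output/clean.csv src/report.py
	python src/report.py output/clean.csv output/report.csv

Asking for make output/report.csv builds output/clean.csv first if it’s missing or stale; if only src/report.py has changed, the cleaning step is skipped entirely.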

Sandboxing dependencies

For Python analyses, we can go an extra step and sandbox any Python dependencies:

env:
	pyvenv env
	env/bin/pip install -r requirements.txt

This lets us run make env and have a virtual environment with everything we need, ready for use. If we need a specific Python version, a .python-version file lets pyenv users switch to that version automatically for this experiment. That way you can default to Python 3, but occasionally drop back to 2.7 when a package requires it.

If you do this, make sure the wheel package is installed, so that pip can build and cache binary wheels of common data analysis packages. To save even more time, consider running devpi on your development machine as a local package cache.
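
For instance, the env rule might grow an extra step (a sketch; whether you need it depends on your pip version):

# Install wheel first, so pip can build and cache binary wheels
# for the heavier data analysis packages in requirements.txt.
env:
	pyvenv env
	env/bin/pip install wheel
	env/bin/pip install -r requirements.txt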

Working with notebooks

IPython/Jupyter notebooks have been a huge boon to data analysis in Python. They make exploring and working with data easier by bringing rich HTML output of pandas tables and easy graphing with matplotlib. I used to use a notebook to explore, then extract any code that cleaned or transformed data into its own regular Python script.

Now I often skip that rewrite by using the runipy package, which lets me run an IPython notebook non-interactively, as if it were an ordinary script.
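
In the Makefile, that might look something like this (a sketch; the notebook name is hypothetical, and it assumes runipy is listed in requirements.txt):

# Execute the notebook non-interactively, saving the executed copy into output/.
output/analysis.ipynb: src/analysis.ipynb input/dataset.csv
	env/bin/runipy src/analysis.ipynb output/analysis.ipynb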

Making inputs repeatable too

In the past, I often had the problem that an experiment needed a specific dataset in the input/ directory to run. I was pretty good at documenting this, but getting hold of the data was often a hassle, and I might even forget how it was originally generated.

Now, instead I store datasets in Amazon S3, and pull them down on the command-line as needed with a rule like:

input/user-foods.csv:
	aws s3 cp s3://lifesum-analytics/.../user-foods.csv $@

This takes away another obstacle to having an experiment work the first time you try to run it.

Handling credentials with keyring

If one of your steps involves querying a database, credentials can be a real hassle. You don’t want to commit database passwords into your analytics repo with your code. But keeping them around in environment variables feels a little icky too, and asking for them every time slows you down. Here, the keyring module shines.

In Python, the code to connect to your database becomes something like:

import MySQLdb
import keyring

conn = MySQLdb.connect(
    host='db.example.com',
    db='dbname',
    user='read-only',
    passwd=keyring.get_password('mycompany-dbname', 'read-only'),
)

On OS X, this causes keyring to check the login keychain for that password. You can then instruct the user to set the password on the command-line:

$ keyring set mycompany-dbname read-only
Password for 'read-only' in 'mycompany-dbname':

This way you end up with a more secure way to store sensitive credentials, without overly complicating your experiment scripts.

Things that don’t work well

This approach works really well for experiments on small data, but it still has a few drawbacks:

  • One target per command: Make expects each rule to produce a single target, and isn’t clever about handling commands that generate several useful outputs for you. Drake is better in this case.
  • Ugly syntax: Make’s not particularly pretty, and things get harder to read the more sophisticated your usage gets.
  • Boilerplate: at some point you may be generating a lot of similar targets and find yourself writing boilerplate commands, opening the way for typos and copy-paste bugs.
  • Single repo: with multiple experiments on the go, having just one repo to house them all eventually leads to git status messages that make you pull your hair out. House-cleaning each experiment as you go becomes important.

Despite these occasional pain points, using this convention for experiments and driving them with Make is a very productive way of working. It lets you document your work and make it repeatable at the same time. This makes life easier, both for others and for future you.

Do you have a similar set of experiments that you manage with a different approach? We’d love to hear about it. Write to us, or blog a response!

Originally published at lifesum.github.io on January 14, 2016.
