Make My Day…ta Science Easier

David Stevens
Feb 21, 2018 · 5 min read


Okay, sit down. I know it was an incredible pun, but a standing ovation? Come on.

I bet you never gnu Makefiles were made for you.

TL;DR

Makefiles are easy. Makefiles will change your life. Every data scientist should be using Makefiles. You need Makefiles.

GNU make is a tool that should be in the toolbox of more data scientists. It uses something called Makefiles, which map naturally to the kind of pipelining work data scientists do every day and let us (and others) reproduce that work painlessly.

So we’re just gonna pretend gnus aren’t super weird looking? Cool.

The Only Things You Really Need to Know

  • a Makefile is just 👏 a 👏 list 👏 of 👏 rules 👏
  • The rules are composed like this:
target: prereqs
	recipe
  • You have to indent with tabs. You’re welcome.
  • You execute the rules by calling make <target name> from the command line.
I’m literally hungry after reading the word recipe, what’s wrong with me… I saw brownies flash before my eyes. Maybe I should go make some haha. Ha.
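To make that concrete, here's a tiny made-up rule (the script and file names are purely illustrative). Typing make plots would run the tab-indented recipe line whenever data/results.csv is newer than the target:

plots: data/results.csv
	python src/visualize.py data/results.csv figures/plot.png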

Why do we care?

This makes your code task-oriented — anyone can enter your repo and see the series of steps necessary to perform a given task. Or better yet, they can just run the make commands and not even worry about the inner workings. You don’t have to wait for Pythonella, who just went on vacation last week and was the only one who knew how to update the weights for a model used in a report for the quarterly business review tomorrow.

But maybe this task-oriented speech is too abstract. I didn’t see the use in a fancy shmancy Makefile (*nose up, pinkie out*) because all of the examples referred to compiling C code and my densely-filled concrete block of a mind couldn’t make the jump to see how this would benefit me, Data Scientist extraordinaire™️. If you want your eyes to glaze over, I suggest checking out the Wikipedia page for Makefiles here.

For the purposes of illustration, we’ll take a machine learning project as our example. They’re notoriously hard to organize because of all the loose models and parameters and hyperparameters and data and plots strewn about. I’ll show you how makefiles can be used to make this process easy (breezy, beautiful, CoverGirl).

Applications

Time for a bunch of kind of exaggerated (but totally fair) comparisons.

I personally like to read Makefiles like a caveman in my head to keep things lively. make clean. make data. make fire. grunt. Try it. Live a little.

Setting up your project

Let’s start with a warmup. You’ve just cloned a repo with more dependencies than you’ve ever seen in your entire lengthy life. First things first: time to set up the environment. In these examples, I’ll show you the old way (without a Makefile) and the new way (with a Makefile configured).

Old:
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
New:
make requirements
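Behind the scenes, the requirements rule might look something like this (a sketch; the exact commands depend on your project):

# note: each recipe line runs in its own shell, so activate and install in one line
requirements:
	virtualenv venv
	. venv/bin/activate && pip install -r requirements.txt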

Clearing caches

Spring cleaning has never been easier. You know when you think your code changes aren’t working but you really just forgot to clear the cache and now you feel dumb af? That self-loathing could have been avoided had you just run make clean.

Old:
Hm... where was that cache you needed to clear? *greps for cache and name of project, maybe finds it*
rm -rf path/to/cache/you/maybe/found
New:
make clean
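The rule behind it could be as simple as this (assuming a Python project; swap in whichever caches you actually accumulate):

clean:
	find . -type d -name "__pycache__" -exec rm -rf {} +
	rm -rf .pytest_cache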

Syncing data between your computer and remote storage

Admit it — you’ve always wanted a refresh button for your data. I mean, it wasn’t #1 on your wish list but it was up there.

Old:
scp user@ip-address:/path/to/data/on/remote/server /path/to/data/on/local/machine
Ugh where's the data kept again? Was it on ec2-user@555.110.12 or kill-me-now@help?
New:
make sync_data_to_s3

Behind the scenes you would have the following rule in your Makefile:

BUCKET = <your-S3-bucket>
sync_data_to_s3:
	aws s3 sync data/ s3://$(BUCKET)/data/

Note the classy use of variables in Makefiles.
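One nice consequence: variable assignments on the command line override the ones defined in the Makefile, so you could push to a different bucket without editing anything:

make sync_data_to_s3 BUCKET=my-other-bucket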

Linting — keeping your code spotless

I know we all loveeee setting up git pre-commit hooks but I’m here to take that joy away from you.

Old:
pip install pre-commit
pre-commit install
- Set up pesky hidden config file
- Select your linters and find commit hashes from sketchy mirrors on Github
- Wonder why it's still not working and gaze out the window while remembering your childhood aspirations
New:
make lint
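The rule itself might be nothing more than this (flake8 is just an assumption; point it at whichever linter and directory you use):

lint:
	flake8 src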

Creating (reproducible) datasets

It’s in applications like these that the Makefile truly shines. Note that you can even pass in arguments from the command line! Creating your dataset is usually a highly variable, multi-step process involving SQL queries, maybe some preprocessing in Python, and then pushing to AWS S3 or some other type of storage. Makefiles can simplify pipelines like this into a single command.

Old:
Where's Pythonella? I need the query she was using to pull data for this model. Oh well I'll just write it myself... *writes query in agony*... *suffers*
Run:
python preprocess.py /path/to/questionable/data
aws s3 cp /path/to/still/questionable/data /path/in/s3 --recursive
What sampling rate were we using? Are the class weights going to be all messed up now? (Yeah. Probably.)
New:
make sample RATE=.10

This is clean, sets defaults, and bundles everything into a single step.
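For illustration, a hypothetical sample rule with a default rate might look like this (the script and file names are made up):

# ?= sets a default that `make sample RATE=.10` overrides from the command line
RATE ?= .05

sample:
	python src/make_sample.py --rate $(RATE) --output data/sample.csv
	aws s3 cp data/sample.csv s3://$(BUCKET)/data/sample.csv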

Executing your test suite

And finally, because we always have unit testing suites with 100% coverage, we can set up:

Old:
pytest
New:
make test

You’re probably thinking “um this one’s actually longer…” And to that I’d say “oh.”
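For the record, the rule itself is just as short. One detail worth knowing: targets like test, clean, and lint don't actually produce files with those names, so it's good practice to declare them .PHONY so make always runs them:

.PHONY: clean lint test
test:
	pytest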

Wrapping it all up

Now you’ve just joined a new machine learning team that has built a deep convolutional neural network to perform object recognition for a certain tech company’s P̶r̶i̶m̶e̶ delivery drones. They say, “Welcome to the team!! Please retrain the model for a client demonstration tomorrow morning. It’s like really important. Thanx.”

Lucky for you, there’s a Makefile and you just read this article. You crack your knuckles and knock out:

make clean
make requirements
make data
make train
make evaluate
make presentation

And we all lived happily ever after.
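And because rules can list other targets as prerequisites, you could even let make do the chaining for you. A hypothetical setup like the one below would let a single make evaluate pull everything it depends on through the pipeline:

data: requirements
	python src/make_dataset.py

train: data
	python src/train_model.py

evaluate: train
	python src/evaluate_model.py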

Makefiles help the data scientist’s workflow immensely. We work with a lot of different tools, mixing and matching ad nauseam, and sometimes just trying to remember the basic steps for a tool we haven’t used in a while can really slow us down. Makefiles help document and streamline the steps that need to be taken. They alleviate the human-knowledge-transfer bottleneck that can hold up projects when key developers are on vacation or no longer with the company. Finally, they help ensure reproducibility, keeping data science planted firmly in the realm of science.

The trifecta of a well-maintained data science project is git, a test suite, and make.

So how do you use Makefiles as a data scientist? What actions have you found particularly helpful? Is there anything you think I missed out on? Let me know in the comments below.

And if you found this useful or want to see more data science content like this, please recommend, like, or share.

Resources

Makefile example from cookiecutter-data-science [must click]
GNU Make manual [should click]


David Stevens

Machine Learning Engineer @ Peloton. ex-Uber Data Scientist.