Makefiles are easy. Makefiles will change your life. Every data scientist should be using Makefiles. You need Makefiles.
GNU make is a tool that belongs in the toolbox of more data scientists. It is driven by files called Makefiles, which map naturally to the kind of pipelining work data scientists do every day and let us, and others, reproduce that work painlessly.
The Only Things You Really Need to Know
- a Makefile is just 👏 a 👏 list 👏 of 👏 rules 👏
- The rules are composed of a target, the dependencies that target needs, and the commands that build it.
- You have to indent with tabs. You’re welcome.
- You execute the rules by calling make <target name> from the command line.
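To make that concrete, here's a toy end-to-end run: we write a one-rule Makefile from the shell and execute it. The target and file names are made up for illustration:

```shell
# A rule is "target: dependencies" on one line,
# followed by tab-indented commands that build the target.
printf 'hello.txt:\n\techo "hello from make" > hello.txt\n' > Makefile

# Execute the rule by naming its target.
make hello.txt
```

Run `make hello.txt` a second time and make does nothing, because the target file already exists and none of its dependencies are newer.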
Why do we care?
This makes your code task-oriented: anyone can enter your repo and see the series of steps necessary to perform a given task. Or better yet, they can just run the make commands and not even worry about the inner workings. You don't have to wait for Pythonella, who just went on vacation last week and was the only one who knew how to update the weights for a model used in a report for the quarterly business review tomorrow.
But maybe this task-oriented speech is too abstract. I didn't see the use in a fancy shmancy Makefile (*nose up, pinkie out*) because all of the examples referred to compiling C code, and my densely-filled concrete block of a mind couldn't make the jump to see how this would benefit me, Data Scientist extraordinaire™️. If you want your eyes to glaze over, I suggest checking out the Wikipedia page for Make.
For the purposes of illustration, we’ll take a machine learning project as our example. They’re notoriously hard to organize because of all the loose models and parameters and hyperparameters and data and plots strewn about. I’ll show you how makefiles can be used to make this process easy (breezy, beautiful, CoverGirl).
I personally like to read Makefiles like a caveman in my head to keep things lively. make clean. make data. make fire. grunt. Try it. Live a little.
Setting up your project
Let's start with a warmup. You've just cloned a repo with more dependencies than you've ever seen in your entire lengthy life. First things first: time to set up the environment. In these examples, I'll show you the old way (without a Makefile) and the new way (with a Makefile configured).
Old:
pip install -r requirements.txt
New:
make requirements
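Behind the scenes, the rule might look like this (a minimal sketch; the target name is an assumption, not from the original):

```makefile
# Install pinned dependencies; "requirements" is a task name, not a file.
requirements:
	pip install -r requirements.txt
```

In a real Makefile you'd also declare task-style targets like this as .PHONY, so make doesn't confuse them with files of the same name.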
Spring cleaning has never been easier. You know when you think your code changes aren't working, but you really just forgot to clear the cache and now you feel dumb af? That self-loathing could have been avoided had you just run make clean.
Old:
Hm... where was that cache you needed to clear? *greps for cache and name of project, maybe finds it*
rm -rf path/to/cache/you/maybe/found
New:
make clean
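A sketch of the rule behind it (the cache locations here are placeholders; yours will differ):

```makefile
# Delete compiled bytecode and project caches in one go.
clean:
	find . -type d -name "__pycache__" -exec rm -rf {} +
	rm -rf .cache
```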
Syncing data between your computer and remote storage
Admit it — you’ve always wanted a refresh button for your data. I mean, it wasn’t #1 on your wish list but it was up there.
Old:
Ugh, where's the data kept again? Was it on email@example.com or kill-me-now@help?
scp user@ip-address:/path/to/data/on/remote/server /path/to/data/on/local/machine
New:
make sync_data
Behind the scenes, you would have the following rule in your Makefile:
BUCKET = <your-S3-bucket>
sync_data:
	aws s3 sync data/ s3://$(BUCKET)/data/
Note the classy use of variables in Makefiles.
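One nicety worth knowing: assigning with `?=` makes a variable a default that can be overridden from the command line (the bucket name below is a placeholder):

```makefile
# ?= only assigns if BUCKET wasn't already set,
# e.g. by running: make sync_data BUCKET=prod-bucket
BUCKET ?= my-default-bucket

sync_data:
	aws s3 sync data/ s3://$(BUCKET)/data/
```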
Linting — keeping your code spotless
I know we all loveeee setting up git pre-commit hooks but I’m here to take that joy away from you.
Old:
- pip install pre-commit
- Set up pesky hidden config file
- Select your linters and find commit hashes from sketchy mirrors on GitHub
- Wonder why it's still not working and gaze out the window while remembering your childhood aspirations
New:
make lint
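Under the hood, `make lint` might be as simple as this (the linter and source directory here are assumptions; swap in whatever your team uses):

```makefile
# Run the linter over the source tree; fails the build on style violations.
lint:
	flake8 src
```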
Creating (reproducible) datasets
It's in applications like these that the Makefile truly shines. Note that you can even pass in arguments from the command line! Creating your dataset is usually a highly variable, multi-step process involving SQL queries, maybe some preprocessing in Python, and then pushing to AWS S3 or some other type of storage. Makefiles can simplify pipelines like this into a single command.
Old:
Where's Pythonella? I need the query she was using to pull data for this model. Oh well, I'll just write it myself... *writes query in agony*... *suffers*
Run:
python preprocess.py /path/to/questionable/data
aws s3 cp /path/to/still/questionable/data /path/in/s3 --recursive
What sampling rate were we using? Are the class weights going to be all messed up now? (Yeah. Probably.)
New:
make sample RATE=.10
This is clean, sets defaults, and bundles everything into a single step.
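The rule behind that command could look like this (a sketch; the script name and the default rate are assumptions):

```makefile
# RATE defaults to .05 but can be overridden: make sample RATE=.10
RATE ?= .05

sample:
	python src/sample.py --rate $(RATE)
```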
Executing your test suite
And finally, because we always have unit testing suites with 100% coverage, we can set up:
Old:
pytest
New:
make test
You’re probably thinking “um this one’s actually longer…” And to that I’d say “oh.”
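For completeness, a sketch of the rule (assuming pytest; the coverage flag requires pytest-cov and is optional):

```makefile
# Run the test suite with a coverage report.
test:
	pytest --cov=src tests/
```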
Wrapping it all up
Now you’ve just joined a new machine learning team that has built a deep convolutional neural network to perform object recognition for a certain tech company’s P̶r̶i̶m̶e̶ delivery drones. They say, “Welcome to the team!! Please retrain the model for a client demonstration tomorrow morning. It’s like really important. Thanx.”
Lucky for you, there’s a Makefile and you just read this article. You crack your knuckles and knock out:
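The exact command isn't shown here, but in a repo like that it might be nothing more than this (target and script names are hypothetical):

```makefile
# Rebuild the dataset, then retrain; "make train" runs both in order.
train: data
	python src/train.py

data:
	python src/make_dataset.py
```

One `make train`, and you're the hero of tomorrow's demo.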
And we all lived happily ever after.
Makefiles help the data scientist's workflow immensely. We work with a lot of different tools, mixing and matching ad nauseam, and sometimes just trying to remember basic steps on a tool we haven't used in a while can really slow us down. Makefiles help document and streamline the steps that need to be taken. They alleviate the human-knowledge-transfer bottleneck that can hold up a project because key developers are on vacation or no longer with the company. Finally, they help ensure reproducibility, keeping data science planted firmly in the realm of science.
The trifecta of a well-maintained data science project is git, a test suite, and make.
So how do you use Makefiles as a data scientist? What actions have you found particularly helpful? Is there anything you think I missed out on? Let me know in the comments below.
And if you found this useful or want to see more data science content like this, please recommend, like, or share.