Organizing Your Next Data Science Project to Minimize Headaches

Call it the data scientist’s curse, but every practitioner has had a project that became unmanageable at some point because of poor organizational choices early on. We’ve all been at our desks at 2 a.m. changing values and re-running our scripts for the 80th time in an hour, asking ourselves where it all went wrong. It’s really easy to dig yourself a hole you can’t climb out of because of bad choices during the planning phase of your project.

At ODSC West 2018, Caitlin Malone will present on how to take your stream-of-consciousness, notebook-based data science project and turn it into something future-proof by restructuring it as software. For now, let’s talk a bit about how we can set up our preliminary analyses so that we don’t step on our own toes in the long run.

It’s All About the Files

Start by setting up your workspace in a way that makes sense for what you’re trying to accomplish, and really think about it. Maybe your top-level directory will include just your Jupyter notebooks, while a separate folder is reserved for your data and another for data dictionaries and notes.

That’s just one way of slicing it. What I do is create a data directory with input, intermediary, and output subdirectories. The intermediary directory is a place to dump files whenever I need to write out data partway through a script. This way, if I write a file out halfway through the project, I don’t lose track of where it went or confuse it with anything I’m reading in or writing out as a final output.

In my input folder, I’ll create separate directories for each of the datasets I include, just in case there are multiple files in a collection. This can get messy if your datasets use long names, but it’s one of the easiest ways to keep everything separated out nicely.
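Since the layout above is just a convention, here’s a minimal sketch of how you might script it once with Python’s pathlib; the dataset names here are purely hypothetical placeholders:

import os
from pathlib import Path

# Hypothetical dataset names; swap in your own collections.
DATASETS = ["census_2020", "weather_station_logs"]

def make_project_skeleton(root="."):
    """Create the data/input, data/intermediary, and data/output layout."""
    data_dir = Path(root) / "data"
    for sub in ("input", "intermediary", "output"):
        (data_dir / sub).mkdir(parents=True, exist_ok=True)
    # One folder per dataset inside input/, so multi-file collections stay together.
    for name in DATASETS:
        (data_dir / "input" / name).mkdir(parents=True, exist_ok=True)

if __name__ == "__main__":
    make_project_skeleton()

Running something like this at the start of a project means every analysis script can rely on the same paths existing.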

Be a Better Steward of Your Environment

One of the simplest upgrades you can make is keeping configuration and secrets in environment variables instead of in your code. For example, if I have an API key that I’d rather others didn’t have access to, this is one of the best ways to make sure they never see it. I can set the key as a variable in my user-specific environment, meaning no one else will be able to read or see it, and it never ends up hard-coded in a script. It’s also a much more elegant solution than storing the key in a text file that your script reads; I’ve seen that in practice, and it is not recommended. Using environment variables will make your life much easier in the long run.
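As a minimal sketch, assuming you’ve exported a variable called MY_API_KEY in your shell profile (the name is just an illustrative placeholder), the script reads the secret at runtime and never contains it:

import os

# MY_API_KEY is a hypothetical variable name; export it in your shell first, e.g.:
#   export MY_API_KEY="..."
api_key = os.environ.get("MY_API_KEY")
if api_key is None:
    raise RuntimeError("MY_API_KEY is not set; export it in your environment first.")

# ...pass api_key to your API client configuration here...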

Be Efficiency-Minded

It’s tempting to write analysis code that just barely gets the job done, no matter how long it takes to run or how much memory it chews through. Don’t do that. Make your work as time-efficient and space-efficient as you reasonably can. When dealing with large datasets, this will save you from wasting tons of downtime twiddling your thumbs while a script runs. And if a script takes that long, it’s often a sign that the code is a rat’s nest you’ll be lost in when you revisit it months later.
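As one concrete, space-minded pattern, assuming a pandas workflow and a hypothetical file called big_input.csv, you can stream a large file in chunks instead of loading it all at once:

import pandas as pd

# Hypothetical file and column names; adjust to your own dataset.
totals = {}
for chunk in pd.read_csv("data/input/big_input.csv", chunksize=100_000):
    # Only 100,000 rows are ever held in memory at a time.
    counts = chunk["category"].value_counts()
    for key, value in counts.items():
        totals[key] = totals.get(key, 0) + value

print(totals)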

One last tip: break large computations into separate, self-contained steps within the script. If I have a long pipeline, I’ll write the intermediary results of expensive computations out to disk, then check in my script whether that file already exists so I don’t need to recompute it. The Python pseudocode might look something like this:

import os

def phase_one():
    # Skip this phase if its output is already on disk.
    if os.path.isfile('phase_one_output.file'):
        return
    # Code for phase one goes here

def phase_two():
    if os.path.isfile('phase_two_output.file'):
        return
    # Code for phase two goes here

def phase_three():
    if os.path.isfile('phase_three_output.file'):
        return
    # Code for phase three goes here

def main():
    phase_one()
    phase_two()
    phase_three()

if __name__ == '__main__':
    main()
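Fleshing one phase out a little, and assuming pandas plus a hypothetical checkpoint path in the intermediary directory, the same check lets a phase either load its cached result or compute it and write it out for next time:

import os
import pandas as pd

# Hypothetical checkpoint path inside the intermediary directory.
PHASE_ONE_OUTPUT = "data/intermediary/phase_one_output.pkl"

def phase_one(raw):
    # Reuse the cached result if this phase has already run.
    if os.path.isfile(PHASE_ONE_OUTPUT):
        return pd.read_pickle(PHASE_ONE_OUTPUT)
    # Otherwise do the expensive work, then checkpoint it to disk.
    cleaned = raw.dropna()
    cleaned.to_pickle(PHASE_ONE_OUTPUT)
    return cleaned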

These are just a few quick tips for organizing your next data science project so you can sidestep a lot of the problems that sloppy setup creates. Listen to Malone’s talk at ODSC West 2018 for a more in-depth explainer on how to turn your data science projects into useful software for the long run.

Original story here.

