Data Directory in Jupyter Notebooks

Nilo Araujo
Nov 2, 2020 · 3 min read

Managing access to data files in interactive computing


Summary

Almost every notebook contains a command that loads data from a file. Dealing with file paths in notebooks, however, is kinda troublesome: moving notebooks around becomes a problem, and the notebook now has to know where things live in the project. Here, we discuss a couple of approaches to handle this problem.

Introduction

Starting a notebook is always easy: you just write a couple of cells. However, as the project grows (and in industry they always do), you will need to organize your folders. You will need a folder for the data, and another folder for notebooks. As the EDA progresses, you will need more folders representing different subsections of the main analysis. On top of that, your project should be reproducible, so that your peers can download the code, run it, and have it work as intended, hopefully yielding the same results you had :)

So, if your notebook loads data through a relative path, moving it from one folder to another will require a change in the code. This is undesirable; we would like it to work regardless of its location. You could solve this by using absolute paths, but this is even worse: you will deal with paths lengthier than they need to be, and your code will probably break if you try to run it on another machine.

Let’s say you have your working directory on , from which you run or . The data directory is located at , and your notebooks are in

We propose two ways to solve this problem:

  1. Using an environment variable inside a notebook
  2. Using a data module

Environment Variable

With this approach, we tell the system where the data directory is through an environment variable. Before starting a Jupyter server, we set the variable by doing

If you are on the folder, you can do:

to achieve the same effect. Now, this variable is accessible to all child processes you start from your bash terminal. In your notebooks, you can now do:
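A minimal sketch of what such a cell might look like, assuming the variable is called DATA_DIR (a hypothetical name; use whatever you exported). The helper dataset_path is ours for illustration, not part of any library:

```python
import os
from pathlib import Path


def dataset_path(filename: str, env_var: str = "DATA_DIR") -> Path:
    """Build the full path to a dataset from the folder stored in env_var."""
    data_dir = os.environ.get(env_var)
    if data_dir is None:
        # Fail early with a readable message instead of a confusing file error.
        raise RuntimeError(f"{env_var} is not set; export it before starting Jupyter")
    return Path(data_dir) / filename


# Typical use in a notebook cell (the file name is hypothetical):
# df = pd.read_csv(dataset_path("train.csv"))
```

Raising early when the variable is missing makes it obvious that the problem is the environment, not the dataset.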

Now the only thing your notebook needs to know is the file name of the dataset. Sounds fair, right?

Another thing you can try to do is changing the working directory of the notebook itself by doing this:
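One way this might look, assuming the target folder is again read from a hypothetical DATA_DIR variable (the helper name is ours as well):

```python
import os
from pathlib import Path


def enter_directory_from_env(env_var: str = "DATA_DIR") -> Path:
    """Make the folder stored in env_var the notebook's working directory."""
    # os.environ[...] raises KeyError if the variable was never set.
    target = Path(os.environ[env_var]).resolve()
    os.chdir(target)
    return target


# After calling enter_directory_from_env(), bare file names are resolved
# against that folder rather than against the notebook's own location.
```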

This works, but I prefer the former: the latter makes the notebook run in a directory it does not live in, which feels slightly shady :)

Finally, it might be a bit boring to set the environment variable every time you start a Jupyter server. You can automate this process using python-dotenv. It will search for a .env file, first in the local directory, and then in all its parents. Once it finds one, it will load the variables defined there. Check the project documentation if you like the idea!
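The search-then-load behavior described above can be approximated with the standard library. This is only a rough sketch to show the idea, not python-dotenv's actual implementation: the function names are hypothetical, the file is assumed to be named .env, and the parser handles nothing beyond plain KEY=VALUE lines.

```python
import os
from pathlib import Path
from typing import Optional


def find_env_file(start: Path, filename: str = ".env") -> Optional[Path]:
    """Return the first `filename` found in `start` or in any of its parents."""
    start = start.resolve()
    for folder in [start, *start.parents]:
        candidate = folder / filename
        if candidate.is_file():
            return candidate
    return None


def load_env_file(path: Path) -> None:
    """Copy simple KEY=VALUE lines into os.environ (no quoting or expansion)."""
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        # Variables that are already set in the environment are left alone.
        os.environ.setdefault(key.strip(), value.strip())
```

In a real project you would let python-dotenv do this for you; the sketch only shows why a notebook buried a few folders deep can still pick up a file kept at the project root.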

Data Module

We used an environment variable to hold information about the project configuration, and exposed this to the notebook. But what about moving this responsibility somewhere else? We can create a module whose responsibility is to know the data directory, and where the datasets are. I prefer this approach, as it will make datasets explicit symbols inside the code.

We will need a folder to represent, well, the project's package. Inside it, we will create a module:
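A minimal sketch of such a module, assuming a hypothetical layout in which the data folder sits next to the package folder at the project root. The function name and the dataset name in the comments are illustrative, not taken from the original project:

```python
from pathlib import Path


def data_dir_for(module_file: str) -> Path:
    """Return the data folder that sits next to the package holding module_file.

    Assumed (hypothetical) layout:
        <project>/data/               datasets live here
        <project>/<package>/data.py   the data module itself
    """
    package_dir = Path(module_file).resolve().parent
    return package_dir.parent / "data"


# Inside the real module these would be module-level constants:
# DATA_DIR = data_dir_for(__file__)
# TRAIN = DATA_DIR / "train.csv"   # hypothetical dataset name
```

Passing the file path in as an argument keeps the helper easy to try out on its own; in the module you would simply hand it __file__.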

We use the __file__ dunder attribute, which holds the current file's path, and the Path class from the standard library's pathlib to navigate through the directory. We make this package installable by adding a setup file for it:

We are almost there! Now, we install the package we just created in development mode, so that changes to the package won’t require a reinstall:

This should install the package, which can be accessed from the notebook:

This way, any notebook in any environment and location will access the data using the same method. If the data location changes, there’s just one location we need to change: the module.



Nilo Araujo

Written by

Data scientist from Brazil, Masters in UFC - the university ;) https://www.linkedin.com/in/nilo-araujo-8a13bb114/

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

