Data Directory in Jupyter Notebooks

Nilo Araujo
Published in Analytics Vidhya
3 min read · Nov 2, 2020

Managing access to data files in interactive computing

Photo by Hudson Hintze on Unsplash

Summary

Almost every notebook contains a pd.read_csv(file_path) or a similar command to load data. Dealing with file paths in notebooks, however, can be troublesome: moving notebooks around becomes a problem, and the notebook has to know where the project lives. Here, we discuss a couple of approaches to handle this problem.

Introduction

Starting a notebook is always easy: you just write a couple of cells, which often contain little more than a df.head(). However, as the project grows (and in industry it always does), you will need to organize your folders. You will need a folder for the data and another for notebooks. As the EDA progresses, you will need more folders representing different subsections of the main analysis. On top of that, your project should be reproducible, so that your peers can download the code, run the script, and have it work as intended, hopefully yielding the same results you had :)

So, if you have a read_csv(relative_path_to_data) in your notebook, moving it from one folder to another will require a change in the code. This is undesirable; we would like the notebook to work regardless of its location. You could solve this by using read_csv(absolute_path_to_data), but this is even worse: you will deal with paths lengthier than they need to be, and your code will probably break if you try to run it on another machine.

Let’s say your working directory is /system_name/project, from which you run jupyter lab or jupyter notebook. The data directory is located at /system_name/project/data, and your notebooks are in /system_name/project/notebooks.

We propose two ways to solve this problem:

  1. Using an environment variable inside a notebook
  2. Using a data module

Environment Variable

With this approach, we inform the system of the data directory's location through an environment variable. Before starting a Jupyter server, we set the variable by doing

export DATA_DIR=/system_name/project/data

If you are on the /system_name/project folder, you can do:

export DATA_DIR=$(pwd)/data

to achieve the same effect. This variable is now accessible to all child processes started from your bash terminal. In your notebooks, you can now do:
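Something along these lines should work (the load_dataset helper and the my_dataset.csv file name here are just illustrative):

```python
import os
from pathlib import Path

import pandas as pd

def load_dataset(file_name: str) -> pd.DataFrame:
    """Load a CSV from the directory pointed to by the DATA_DIR variable."""
    data_dir = Path(os.environ["DATA_DIR"])
    return pd.read_csv(data_dir / file_name)

# In a notebook cell:
# df = load_dataset("my_dataset.csv")
```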

Now the only thing your notebook needs to know is the file_name of the dataset. Sounds fair, right?

Another thing you can try is changing the working directory of the notebook itself:
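A sketch, reusing the same DATA_DIR variable (the fallback to the current directory is just a defensive choice on my part):

```python
import os

# Make the notebook run from the data directory, so plain relative
# file names like "my_dataset.csv" resolve inside it.
os.chdir(os.environ.get("DATA_DIR", "."))
```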

This works, but I prefer the former: the latter makes the notebook operate in a directory it doesn't actually live in, which feels slightly shady :)

Finally, it might get tedious to set the environment variable every time you start a Jupyter server. You can automate this process using python-dotenv. It searches for a .env file, first in the local directory, and then in all its parents; once it finds one, it loads the variables defined there. Check the project documentation if you like the idea!

Data Module

We used an environment variable to hold information about the project configuration and exposed this to the notebook. But what about moving this responsibility somewhere else? We can create a module whose responsibility is to know the data directory, and where the datasets are. I prefer this approach, as it makes datasets explicit symbols inside the code.

We will need a project_package folder to represent, well, the project's package. Inside it, we will create a project_data.py module:
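A minimal sketch of that module (the directory layout follows the /system_name/project structure above; the load_dataset helper name is my own):

```python
# project_package/project_data.py
from pathlib import Path

import pandas as pd

# This file lives at /system_name/project/project_package/project_data.py,
# so the data directory is one level up from the package folder.
DATA_DIR = Path(__file__).resolve().parents[1] / "data"

def load_dataset(file_name: str) -> pd.DataFrame:
    """Load a CSV stored in the project's data directory."""
    return pd.read_csv(DATA_DIR / file_name)
```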

We use the __file__ dunder attribute, which holds the current file's path, and the built-in pathlib Path class to navigate through the directory tree. We make this package installable by creating a setup.py inside the project folder:
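A minimal setup.py could look like this (the name and version are illustrative):

```python
# setup.py — placed at /system_name/project
from setuptools import find_packages, setup

setup(
    name="project_package",
    version="0.1.0",
    packages=find_packages(),
)
```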

We are almost there! Now, we install the package we just created in development mode, so that changes to the package won’t require a reinstall:

python -m pip install -e .

This should install the project_package package, which can be accessed from the notebook:
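For example, assuming project_data exposes the data directory as a DATA_DIR path (one reasonable design), a notebook cell could look like:

```python
import pandas as pd
from project_package import project_data

# The notebook no longer cares where it lives or where the data lives;
# "my_dataset.csv" is a placeholder for your actual file.
df = pd.read_csv(project_data.DATA_DIR / "my_dataset.csv")
```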

This way, any notebook in any environment and location will access the data using the same method. If the data location changes, there’s just one location we need to change: the project_data.py module.
