Managing access to data files in interactive computing
Almost every notebook contains a
pd.read_csv(file_path) or a similar command to load data. Dealing with file paths in notebooks, however, is troublesome: moving notebooks around becomes a problem, and every notebook has to know where the project keeps its files. Here, we discuss a couple of approaches to handle this problem.
Starting a notebook is always easy: you just write a couple of cells, which often end with a
df.head(). However, as the project grows (and in industry it always does), you will need to organize your folders: a folder for the data, another for the notebooks and, as the EDA progresses, more folders representing different subsections of the main analysis. On top of that, your project should be reproducible, so that your peers can download the code, run the scripts, and have everything work as intended, hopefully yielding the same results you had :)
So, if you have a
read_csv(relative_path_to_data) in your notebook, moving it from one folder to another will require a change in the code. This is undesirable: we would like the notebook to work regardless of its location. You could try
read_csv(absolute_path_to_data) instead, but that is even worse: you will deal with paths lengthier than they need to be, and your code will probably break if you try to run it on another machine.
Let’s say you have your working directory at
/system_name/project, from which you run
jupyter lab or
jupyter notebook. The data directory is located at
/system_name/project/data, and your notebooks are in their own subfolder.
We propose two ways to solve this problem:
- Using an environment variable inside a notebook
- Using a data module
With the first approach, we inform the system of the data directory's location through an environment variable. Before starting a jupyter server, we set the variable:
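For instance (the variable name DATA_DIR is our choice; any name works, as long as the notebooks agree on it):

```shell
# point the variable at the data directory from the example layout above
export DATA_DIR="/system_name/project/data"
```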
If you are in the
/system_name/project folder, you can do:
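For instance (again assuming the variable is named DATA_DIR, which is our choice):

```shell
# derive the absolute path from the current directory
export DATA_DIR="$PWD/data"
```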
to achieve the same effect. This variable is now accessible to all child processes started from your bash terminal. In your notebooks, you can then do:
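As a sketch, reading the variable back in the notebook might look like this (the variable name DATA_DIR and the dataset file name are assumptions):

```python
import os
from pathlib import Path

# "DATA_DIR" is the variable name we assume was exported before starting
# Jupyter; fall back to the current directory if it is not set
data_dir = Path(os.environ.get("DATA_DIR", "."))

# the notebook only needs the file name, not where the data lives
file_path = data_dir / "my_dataset.csv"  # hypothetical dataset name
# df = pd.read_csv(file_path)
```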
Now the only thing your notebook needs to know is the
file_name of the dataset. Sounds fair, right?
Another thing you can try is changing the working directory of the notebook itself:
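A sketch, assuming the notebook sits one level below the project root:

```python
import os

# move from the notebooks folder up to the project root, so that
# relative paths like "data/my_dataset.csv" resolve again
os.chdir("..")
```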
This works, but I prefer the former: the latter makes the notebook operate from a directory it doesn't actually live in, which feels slightly shady :)
Finally, it might be a bit tedious to set the environment variable every time you start a jupyter server. You can automate this process using python-dotenv. It searches for a
.env file, first in the current directory and then in all of its parents; once it finds one, it loads the variables defined there. Check the project documentation if you like the idea!
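For example, a .env file at the project root could hold the variable (DATA_DIR is our hypothetical name):

```text
DATA_DIR=/system_name/project/data
```

Calling load_dotenv() at the top of a notebook then makes the variable available through os.environ.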
We used an environment variable to hold information about the project configuration and exposed it to the notebook. But what about moving this responsibility somewhere else? We can create a module whose job is to know the data directory and where the datasets live. I prefer this approach, as it makes the datasets explicit symbols inside the code.
We will need a
project_package folder to represent, well, the project's package. Inside it, we will create a small module that holds the data paths.
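A minimal sketch of that module (the file name paths.py and the exported names are our assumptions):

```python
# project_package/paths.py  (hypothetical file name)
from pathlib import Path

# __file__ holds this file's path; two parents up is the project root
PROJECT_ROOT = Path(__file__).resolve().parents[1]
DATA_DIR = PROJECT_ROOT / "data"
```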
We use the
__file__ dunder attribute, which holds the path of the current file, and the standard library's
pathlib.Path class to navigate the directory tree. We make this package installable by creating a
setup.py at the root of the project.
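A minimal setup.py sketch (the package name and version here are placeholders):

```python
from setuptools import find_packages, setup

setup(
    name="project_package",
    version="0.1.0",
    packages=find_packages(),
)
```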
We are almost there! Now, we install the package we just created in development mode, so that changes to the package won’t require a reinstall:
python -m pip install -e .
This should install the
project_package package, which can be accessed from the notebook:
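For instance (the paths module and the DATA_DIR symbol are assumptions about how the package is laid out):

```python
import pandas as pd
from project_package.paths import DATA_DIR  # hypothetical module and symbol

df = pd.read_csv(DATA_DIR / "my_dataset.csv")  # hypothetical file name
```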
This way, any notebook, in any environment and location, will access the data through the same method. If the data location changes, there's just one place we need to update: the data module.