Tidy up your Jupyter notebooks with scripts
Over the past few years, we have seen data scientists transition from scripts to notebooks. Jupyter notebooks are quickly becoming the preferred data science IDE. These notebooks are perfect for writing short code blocks to interact with data, but what happens when your project grows?
At Data Science Experience, we see notebooks as the primary way data scientists want to code, but not all code should stay in the notebook. Helper functions, classes, messy visualization code: all the necessary bits that do not belong in a notebook used for a presentation to communicate results. Let’s start cleaning up our notebooks.
First, I will describe how to take an existing .py script or package and use it in IBM Data Science Experience (DSX). Then, I’ll show my approach for setting up projects to facilitate clean notebooks.
Importing existing Python scripts in DSX
DSX offers a collaborative enterprise data science environment in the cloud, but many times it’s necessary to migrate existing scripts for use in DSX projects. Here are options for using a locally developed script in DSX:
- Copy/paste the code from the local file into a notebook cell. At the top of this cell add %%writefile <your_file_name>.py. This will save the code as a Python file in your GPFS working directory (GPFS is the file system that comes with the DSX Spark Service). Any notebooks using the same Spark Service instance will be able to access this file for importing.
- Load the Python script into Object Storage. You can use Insert to Code to get the file contents as a string, then write that string to a file in GPFS, which can then be imported the same way.
I recommend option 1 because it allows you to continue to tweak code and update the script written to GPFS from this notebook. I will go into this in more detail in the section below on setting up your project.
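The round trip behind option 1 can be sketched in plain Python. In a notebook, the first cell would simply start with %%writefile text_utils.py; here the write is explicit so the example is self-contained (text_utils and clean_text are hypothetical names, and the file lands in the current working directory rather than GPFS):

```python
from pathlib import Path
import importlib

# In a notebook, a cell beginning with "%%writefile text_utils.py" saves the
# cell body to disk. Here we write the same helper module with plain Python.
Path("text_utils.py").write_text(
    "def clean_text(s):\n"
    "    '''Lowercase and strip a string.'''\n"
    "    return s.strip().lower()\n"
)

# Any notebook sharing the same working directory can now import the helper.
text_utils = importlib.import_module("text_utils")
print(text_utils.clean_text("  Hello World  "))  # hello world
```

Because the script is regenerated every time the cell runs, you can keep tweaking the helper code in this notebook and every other notebook picks up the new version on its next import.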
Importing existing packages in DSX
The methods above work great if you just have a single script that you need to import (or execute) from inside a notebook. If you have a full Python package, the following options are available for importing in DSX:
- Python: package up your code (here is an example of a simple package I wrote)
- R: package up your code (great post from Hilary Parker); check out a simple R package example here
1. Put it in a repo and install. This works with a public or private GitHub repository (I’m sure other hosts do as well, but I have only used GitHub). Pip installing from a public repo looks like this:
!pip install git+https://github.com/gfilla/dsxtools.git
If you need to install from a private GitHub repository, it looks like this:
!pip install git+https://<user_name>:<personal_access_token>@github.<your_company>.com/<your_org>/<your_repo>.git --ignore-installed
You get your personal_access_token from Settings > Personal Access Tokens > Generate new token. You need to grant repo access to this token.
For R, use this syntax for installing a package from GitHub (install_github comes from the devtools package):
install_github('<username>/<repo>') #installs the package
library('<repo_name>') #loads the package for use
2. Zip it up and load from Object Storage. This is similar to option 2 in the previous section: this time we zip up the directory containing the Python package and load it into an Object Storage container. Here you can use this code to get/save the zip.
Once you have installed/saved the package in GPFS, you are good to start importing inside your notebook!
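The zip route from option 2 can be sketched end to end. In DSX you would read the zip’s bytes from an Object Storage container and write them to GPFS; here a tiny package (mypkg and greet are hypothetical names) is built and zipped locally so the extract-and-import steps are runnable as-is:

```python
import importlib
import sys
import zipfile
from pathlib import Path

# Stand-in for a zip downloaded from Object Storage: build a tiny package
# and zip it so the example is self-contained.
pkg = Path("mypkg")
pkg.mkdir(exist_ok=True)
(pkg / "__init__.py").write_text("def greet(name):\n    return 'hello ' + name\n")
with zipfile.ZipFile("mypkg.zip", "w") as zf:
    for f in pkg.rglob("*"):
        zf.write(f)

# Extract the zip next to the notebook (in DSX, into your GPFS working dir).
target = Path("unzipped")
with zipfile.ZipFile("mypkg.zip") as zf:
    zf.extractall(target)

# Put the extraction directory on sys.path so the package is importable.
sys.path.insert(0, str(target))
mypkg = importlib.import_module("mypkg")
print(mypkg.greet("DSX"))  # hello DSX
```

The sys.path step is the key detail: Python only searches the directories on that list, so the folder that *contains* the package directory is what you append, not the package directory itself.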
Bring this all together in a DSX Project
At this point, you should feel confident in importing existing Python code for use in a DSX project. Let’s build on this to review one method for building out a larger project.
When I start a new project and have a clear vision of my goals, I will start with a “Class” notebook. This is the notebook I work in mostly during the early stages of the project. It will be the messiest of all my notebooks through most of the project lifecycle, but at the end it will be the cleanest, containing only the code for the classes the project uses. Each cell in this notebook contains a class, so we can easily write each of these cells to a Python script using the %%writefile method described above. Other notebooks in the project then import these classes to access their methods, keeping their code much cleaner overall.
You may be asking yourself why you should use this method instead of an IDE intended for writing larger Python programs. That is a fair question, and some projects definitely require that approach. I prefer staying inside notebooks for class development for the same reason I use them for data analysis: I can quickly tweak a class and keep experimental code in subsequent cells while fixing bugs (this is where it can get messy).
To complete the example, I’ll show how one of these classes is used in my other notebooks. At this point, if you are new to Python and have not used classes, I recommend checking out the documentation to see how they can be incorporated into your code.
After executing the cell where I write the Python class to GPFS, I can simply import it using the syntax
from <python file name> import <class name>
cnnParser is the name of my class, and I instantiate an instance in the cnn variable. A very nice benefit of using classes and hanging methods on them is that Jupyter shortcuts are available to view all methods/attributes of the class (Shift + Tab gives the view in the screenshot). If you didn’t know about this shortcut, check out this post.
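A minimal sketch of the whole pattern, with a hypothetical ArticleParser class standing in for the author’s cnnParser. In the Class notebook, the cell would start with %%writefile article_parser.py; the explicit write below keeps the example runnable anywhere:

```python
from pathlib import Path

# Contents of one cell of the "Class" notebook; in DSX the cell would begin
# with "%%writefile article_parser.py" instead of this explicit write.
Path("article_parser.py").write_text(
    "class ArticleParser:\n"
    "    '''Toy parser: splits an article into words.'''\n"
    "    def __init__(self, text):\n"
    "        self.text = text\n"
    "    def words(self):\n"
    "        return self.text.split()\n"
    "    def word_count(self):\n"
    "        return len(self.words())\n"
)

# In any other notebook of the project, the class is one import away.
from article_parser import ArticleParser

parser = ArticleParser("tidy notebooks with scripts")
print(parser.word_count())  # 4
```

With the instance in hand, typing parser. and pressing Tab (or Shift + Tab on a call) surfaces words and word_count, which is the discoverability benefit described above.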
That should be enough to get started using scripts, packages, and notebooks together in a complementary way. If you know any tips/tricks I missed please let me know! Happy coding :-)