How to use Papermill in Google Colab
Happy holidays and welcome to my first story on Medium! This story will be a simple tutorial on how to programmatically modify the source code of a Colab Notebook to make it compatible with Papermill. If you are already familiar with Colab and Papermill, feel free to skip down to the notebook section.
Note to reader: this story assumes you already have familiarity with IPython or Jupyter Notebooks.
Netflix and Papermill
Just earlier this month, Netflix open-sourced their new data science framework, Metaflow. This framework is built around the notebook environment instead of traditional IDEs such as PyCharm or VSCode. A year earlier, foreshadowing Metaflow, several members of the Netflix team wrote a story describing how Netflix was making a substantial effort to move towards notebooks as an integrated development platform. Notably, they described Papermill as the “biggest game-changer”. Here is an excerpt from that story:
Papermill enables a paradigm change in how you work with notebook documents. Since Papermill doesn’t modify the source notebook, we get a functional property added to our definition of work — something which is normally missing in the notebook space. Our inputs, a notebook JSON document and our input parameters, are treated as immutable records for execution that produce an immutable output document. That single output document provides the executed code, the outputs and logs from each code cell, and a repeatable template which can be easily rerun at any point in the future.
Naturally, Papermill has steadily become a tool for many data scientists, both amateur and professional, looking to scale their current notebooks.
Google Colab
Google Colaboratory is an earlier but equally notable addition to the data science toolkit, especially for amateurs. In October of 2017, Google publicly released their internal notebook platform for data science, specifically oriented towards machine learning. Not only did the platform include Google Drive and GitHub integrations, but it also offered a cloud environment built for data science with free GPU instances. This was a major game changer for anyone looking to get into data science but lacking the necessary income or resources to support it. I’ve personally been using Colab for some time now, as it has relieved me of much of the package management overhead required in small data science projects. Unfortunately, when embarking on my journey to learn Papermill, I discovered that Colab did not natively support Papermill. After an evening of reading and debugging, I was able to figure out the compatibility issues and address them with only a few lines of code.
How Papermill Normally Works
Using Papermill in a standard Jupyter notebook is a seamless experience.
Per Figure 1, it takes only a few clicks to tag a cell as a parameter cell. The tag tells Papermill which cell holds the default values, so another notebook can override those values when it runs this one through Papermill. Unfortunately, in Google Colab, this tagging function is unavailable. Furthermore, the only code I could find even implicitly addressing this issue was this article, and it does not work inside the Colab environment. This motivated me to create my own patch.
Colab Papermill Patch
If you haven’t set up Google Colab already, navigate to their site and take some time to walk through their welcome notebook. Afterwards, download the notebooks necessary for this tutorial from my GitHub repo and extract them. Then, inside Colab, click File > Upload notebook… and upload the extracted notebooks.
Note to reader: alternatively, you could use the GitHub integration, but you will need to click ‘Copy to Drive’ and rename the copies (removing ‘Copy of’ from the title).
Now you will have both notebooks stored in your Google Drive under the path /My Drive/Colab Notebooks. At this point, you are more than welcome to run all the cells in the Colab-Papermill-Driver notebook, work out for yourself what changed in Sample.ipynb, and skip the rest of this section. If you are interested in learning more, please read on!
Bring the tab containing Colab-Papermill-Driver.ipynb into focus and run the cells in the Install and Import section. In the next section, Connect, two connections are created to your Google Drive. This seems redundant, but the upDrive object and the downDrive object each provide functions we will need in order to parameterize the Colab notebook.
In the Navigate section, we change the current directory to the folder where Colab-Papermill-Driver.ipynb and Sample.ipynb should be located. The assert statement verifies you are in the correct folder.
The next section, Select and Modify, is where we begin making changes.
- The first cell loads the target notebook (Sample.ipynb) into a dict variable named j. A notebook is just a JSON file that follows the notebook format specification, so once loaded it can be read and modified like any other dictionary.
- The second cell adds the parameters tag to the first cell in j.
- The third cell specifies the kernel language as Python within j. This is crucial since Papermill will be unable to parse a notebook if it lacks a language key and value.
- The final cell navigates back to the main directory in the Colab server and dumps j to a new notebook with the same name as target. This notebook will be the one used by Papermill in the next section.
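The steps in the list above can be sketched roughly as follows. This is a minimal illustration rather than the exact code in the driver notebook: the helper names parameterize_notebook and rewrite_notebook are hypothetical, and the small dictionary stands in for the real contents of Sample.ipynb. It assumes Papermill reads the language from the notebook's kernelspec metadata, which is where the missing key is added.

```python
import json


def parameterize_notebook(nb, language="python"):
    """Tag the first cell as 'parameters' and record the kernel language.

    `nb` is a notebook already loaded as a dict (hypothetical helper).
    """
    # Add the 'parameters' tag to the first cell's metadata.
    first_cell = nb["cells"][0]
    tags = first_cell.setdefault("metadata", {}).setdefault("tags", [])
    if "parameters" not in tags:
        tags.append("parameters")

    # Colab notebooks typically omit the language key in the kernelspec.
    kernelspec = nb.setdefault("metadata", {}).setdefault("kernelspec", {})
    kernelspec["language"] = language
    return nb


def rewrite_notebook(src_path, dst_path):
    """Load a notebook from disk, patch it, and dump it to a new file."""
    with open(src_path) as f:
        nb = json.load(f)
    with open(dst_path, "w") as f:
        json.dump(parameterize_notebook(nb), f)


# A tiny stand-in for the JSON inside Sample.ipynb (assumed shape).
j = {
    "cells": [
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["variable_name = 'old_value'"],
            "outputs": [],
            "execution_count": None,
        }
    ],
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "nbformat": 4,
    "nbformat_minor": 0,
}

j = parameterize_notebook(j)
print(json.dumps(j["cells"][0]["metadata"]))
print(j["metadata"]["kernelspec"]["language"])
```

Using setdefault keeps the patch safe to run more than once, since an existing tag list or kernelspec is reused instead of replaced.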
Next, we upload the parameterized file to Google Drive using PyDrive.
- The first cell creates a PyDrive GoogleDriveFile object with a title and mimeType. The mimeType is important for web servers because it specifies how the file should be handled by the server. If you do not specify the mimeType, Google Drive will treat it as an “application/octet-stream” mimeType which is not associated with the Google Colab service. This will not prevent you from using Google Colab to interact with the file but it will change the file appearance in your folder.
- It is important that after running the upload cell, you verify the file upload has been completed successfully by checking the Files tab in the left pane. It is a shame that the upload takes so long, but it is required if you want the correct mimeType.
- After verifying the upload we remove all copies of Sample.ipynb and move the uploaded Sample.ipynb to the correct folder.
- In the first cell, we upload a blank notebook to be overwritten by Papermill.
- As before, we need to verify the upload is complete before moving forward. We then move the uploaded notebook into the correct folder.
- Then we define the parameters we want to pass to the Sample.ipynb notebook we just created. The parameters are defined in a dictionary of the form ‘variable_name’ : ‘new_value’. Now, when we call pm.execute_notebook(), the values in the first cell of Sample.ipynb are overridden by the values defined in our parameters dictionary.
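Looking back at the upload steps in this list, the PyDrive part might look roughly like the sketch below. It is not the driver notebook's exact code: upload_notebook is a hypothetical helper, drive is assumed to be an already authenticated PyDrive GoogleDrive instance, and the Colab mimeType string is my assumption of the value Google Drive uses for Colab notebooks.

```python
# Assumed mimeType that Google Drive associates with Colab notebooks.
COLAB_MIME_TYPE = "application/vnd.google.colaboratory"


def upload_notebook(drive, local_path, title):
    """Upload a local .ipynb file to Drive as a Colab notebook.

    `drive` is assumed to be an authenticated pydrive.drive.GoogleDrive.
    Returns the GoogleDriveFile so the caller can inspect or move it.
    """
    # Setting mimeType explicitly avoids the application/octet-stream default.
    drive_file = drive.CreateFile({"title": title, "mimeType": COLAB_MIME_TYPE})
    drive_file.SetContentFile(local_path)  # attach the local file's contents
    drive_file.Upload()                    # perform the actual upload
    return drive_file


# Example call inside Colab, once `drive` exists (not run here):
# uploaded = upload_notebook(drive, "Sample.ipynb", "Sample.ipynb")
```

Passing the drive object in as an argument keeps the helper independent of how the connection was created.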
The output notebook is saved to a separate file named Sample_Output.ipynb. If you navigate to that file, you will see it has an additional # Parameters cell that redefines the variable in the cell above.
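As a rough sketch of the execution step, assuming Papermill is installed and both notebooks sit in the current directory, the call might look like this. The helpers build_run and run are hypothetical names used only for illustration, and variable_name and new_value are the placeholders from the text, not real parameter names.

```python
def build_run(input_path="Sample.ipynb",
              output_path="Sample_Output.ipynb",
              **parameters):
    """Collect the arguments for pm.execute_notebook() in one dict."""
    return {
        "input_path": input_path,
        "output_path": output_path,
        "parameters": parameters,  # e.g. {'variable_name': 'new_value'}
    }


def run(config):
    """Execute the notebook with Papermill (imported lazily)."""
    import papermill as pm  # requires `pip install papermill`
    return pm.execute_notebook(**config)


# Placeholder parameter; replace with the variables in your first cell.
config = build_run(variable_name="new_value")
print(config)

# Uncomment inside Colab once Sample.ipynb has been parameterized:
# run(config)
```

Papermill leaves Sample.ipynb untouched and writes the injected parameters only to the output notebook.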
Summary
I hope this story was clear and you have not only added a new notebook to your toolkit, but also gained a deeper understanding of notebook development. I wouldn’t call this patch perfected by any means, and I encourage anyone reading this to try to create a package that lets the target notebook modify its own JSON source code. To me, Papermill is really just the start of what I like to refer to as the Notebook Oriented Programming movement. It became apparent to most developers very quickly that notebooks on their own aren’t particularly scalable. Most notably, they lack the abstraction capabilities of traditional object-oriented programming (inheritance, polymorphism, etc.). The Netflix team seems to be doing a fantastic job of exploring this scalability, and I’m excited to see how Metaflow evolves over time. Again, I hope you had a good read, and happy holidays!