Creating Pull Requests with Jupyter Notebooks
Have you ever wondered how you would give feedback to a person who forgot to label the x-axis of a chart?
When I started to work as a back-end developer, it was the wild west. I could deploy without anyone looking at my code, and I would only find out that something had gone wrong when a client called to tell me something had crashed. One day, through a conversation at an event, I discovered the existence of Pull Requests (PRs)! With them, other developers could see my code and give me feedback, such as how I could improve it and what the good practices were for a given implementation. There are numerous advantages to working with Pull Requests, but the one I found most amazing was the sharing of knowledge they enabled. For more details on creating Pull Requests using Bitbucket, check out this link.
When I became a data scientist, everything changed. I started working with Jupyter Notebooks, which were wonderful for my research and for visualizing charts and the results of my code, yet terrible for making PRs. GitHub renders notebooks, but the rendered view can’t be commented on. In a PR diff, what appears is the raw JSON of the notebook, which isn’t easy to read.
To solve this problem, we scoured some communities, especially that of a project I really admire: “Serenata de Amor” (Love Serenade). There, we found a relatively efficient method: generate a .py file alongside the .ipynb. To do this automatically each time you save your notebook, add this code to the file ~/.jupyter/jupyter_notebook_config.py:
import io
import os
# On newer Jupyter installs this helper may live in jupyter_server.utils instead.
from notebook.utils import to_api_path

_script_exporter = None

def script_post_save(model, os_path, contents_manager, **kwargs):
    """Convert notebooks to Python script after save with nbconvert.

    Replaces `ipython notebook --script`.
    """
    from nbconvert.exporters.script import ScriptExporter

    # Only notebooks are converted; other file types are left alone.
    if model['type'] != 'notebook':
        return

    # Create the exporter once and reuse it on later saves.
    global _script_exporter
    if _script_exporter is None:
        _script_exporter = ScriptExporter(parent=contents_manager)
    log = contents_manager.log

    # save .py file
    base, ext = os.path.splitext(os_path)
    script, resources = _script_exporter.from_filename(os_path)
    script_fname = base + resources.get('output_extension', '.txt')
    log.info("Saving script /%s", to_api_path(script_fname, contents_manager.root_dir))
    with io.open(script_fname, 'w', encoding='utf-8') as f:
        f.write(script)

# `c` is the config object Jupyter provides when it loads this file.
c.FileContentsManager.post_save_hook = script_post_save
This automatically doubles the number of files created! However, it allows reviewers to view the rendered notebook and leave their comments on the .py file, next to the cell they think should be changed.
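To make the idea concrete, here is a minimal sketch of what "a .py next to the .ipynb" buys you. It is not nbconvert's exporter: it is a standard-library approximation that reads the notebook JSON and keeps only the code cells, each under a marker comment a reviewer can anchor a comment to. The function name `notebook_to_script`, the `# Cell N` marker format, and the sample notebook are all hypothetical.

```python
import json


def notebook_to_script(notebook_json: str) -> str:
    """Return the code cells of a notebook as one Python script (illustrative only)."""
    nb = json.loads(notebook_json)
    chunks = []
    for index, cell in enumerate(nb.get("cells", []), start=1):
        # Markdown and raw cells are skipped; only code is reviewed in the .py file.
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        # Hypothetical marker so a reviewer can tell which cell a line came from.
        chunks.append(f"# Cell {index}\n{source.rstrip()}\n")
    return "\n".join(chunks)


# Hypothetical two-cell notebook, serialized the way it would appear in a PR diff.
example_notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Sales chart"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [],
            "source": ["import math\n", "total = math.floor(2.5)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

script = notebook_to_script(example_notebook)
print(script)
```

The printed script contains only the code from the second cell, without the metadata, outputs, and execution counts that make the raw JSON diff noisy.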
It’s worth mentioning that there are alternatives to this approach for reviewing notebooks in a Pull Request, especially if you are working with open source code. One of them is ReviewNB, which allows comments to be left directly on the notebook’s cells, but this solution is no longer free for private repositories and only works with GitHub (unfortunate for GitLab and Bitbucket users). You can also perform tests with notebooks using nbviewer.
Another good practice we follow is using a slightly modified version of Cookiecutter Data Science to organize our projects. We follow its rule that “Notebooks are for exploration and communication”: data extraction code, feature engineering, and model tuning are kept somewhere else, while the notebooks mainly serve for EDA (Exploratory Data Analysis) and evaluations. This greatly facilitates the versioning and execution of that code.
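As a rough illustration of that separation, the sketch below creates a project skeleton in which notebooks live apart from extraction, feature, and modeling code. The folder names are an assumption loosely based on the publicly documented Cookiecutter Data Science template; our modified version is not described here, and `create_project` and the project name `credit-research` are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed layout, loosely following the public Cookiecutter Data Science template.
LAYOUT = [
    "data/raw",         # original, immutable data
    "data/interim",     # intermediate, transformed data
    "data/processed",   # final data sets used for modeling
    "notebooks",        # exploration and communication only
    "src/data",         # data extraction code
    "src/features",     # feature engineering code
    "src/models",       # training and tuning code
    "models",           # serialized models
    "reports/figures",  # generated charts
]


def create_project(root: Path) -> list:
    """Create the folder skeleton under `root` and return the created paths."""
    created = []
    for relative in LAYOUT:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Build the skeleton in a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp) / "credit-research"
    folders = create_project(project_root)
    names = sorted(p.relative_to(project_root).as_posix() for p in folders)

print(names)
```

With a layout like this, code under `src` can be reviewed in ordinary PRs, and the notebooks only import from it.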
Cookiecutter’s folder structure means that anyone on the team can work on and contribute to the research of all of Creditas’ data scientists.
Interested in working with us? We’re always looking for people passionate about technology to join our crew! You can check out our openings here.