Version Control with Jupyter Notebook: A how-to article on version controlling Jupyter Notebooks using Git, including recommended workflows.

TechLatest.Net
6 min read · Jul 17, 2023

Introduction

Jupyter Notebooks are powerful tools for data analysis, experimentation, and collaboration. However, when working on complex projects, managing different versions of your Jupyter Notebooks can become challenging. This is where version control systems like Git come into play. In this article, we will explore how to effectively version control Jupyter Notebooks using Git, along with recommended workflows to streamline your development process.

Note

If you are looking to quickly set up and explore AI/ML & Python Jupyter Notebook Kit, Techlatest.net provides an out-of-the-box setup for AI/ML & Python Jupyter Notebook Kit on AWS, Azure and GCP. Please follow the below links for the step-by-step guide to set up the AI/ML & Python Jupyter Notebook Kit on your choice of cloud platform.

For AI/ML KIT: AWS, GCP & Azure.

Why choose the Techlatest.net VM, AI/ML Kit & Python Jupyter Notebook?

  • In-browser editing of code
  • Ability to run and execute code in various programming languages
  • Supports rich media outputs like images, videos, charts, etc.
  • Supports connecting to external data sources
  • Supports collaborative editing by multiple users
  • Simple interface to create and manage notebooks
  • Ability to save and share notebooks

Version Controlling Jupyter Notebooks with Git

Jupyter Notebooks are a great tool for data science and machine learning. But as your notebooks evolve, keeping track of changes can become difficult.

Version controlling your Jupyter Notebooks with Git allows you to:

  • Revert to previous versions
  • Track changes over time
  • Collaborate with others
  • Restore accidentally deleted cells

In this article, we will cover:

  • Why version control Jupyter Notebooks
  • Recommended Git workflows for notebooks
  • Setting up a Git repository
  • Installing and setting up Jupytext
  • Keeping notebook outputs out of Git
  • Cloning a repository and recreating notebooks

Why version control notebooks?

There are a few main reasons to version control your Jupyter Notebooks:

1. Reproducibility — Being able to revert to any point in a notebook’s history allows you to reproduce past analyses.

2. Collaboration — Multiple people can work on the same notebook and merge changes.

3. Backup — Git serves as an additional backup of your notebook in case of crashes or hardware failure.

4. Auditing — Git allows you to see who made what changes and when.

5. Rolling back — You can easily roll back to a previous version if needed.

Recommended workflows

The two main Git workflows for Jupyter Notebooks are:

1. Commit the .ipynb file: Simple, but it can become slow for large notebooks with large outputs.

2. Ignore outputs and commit a paired text file: Pair each notebook with a text file (a .Rmd file created by Jupytext in this article) and commit only that file, which contains the code and Markdown. The .ipynb file, which also stores the outputs, is ignored. This keeps repositories small and fast.
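As a rough illustration of what ignoring outputs means at the file level, the sketch below clears outputs and execution counts from a notebook's JSON using only Python's standard library. It assumes the nbformat 4 layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`). The function name `strip_outputs` and the sample notebook are hypothetical and are not part of Git or Jupytext.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 style notebook dict with outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results, plots, tracebacks
            cell["execution_count"] = None  # drop the In[n] counter
    return cleaned


# Hypothetical two-cell notebook used only for demonstration.
sample = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["42\n"]}
            ],
            "source": ["print(6 * 7)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

stripped = strip_outputs(sample)
print(len(json.dumps(sample)), "->", len(json.dumps(stripped)), "characters")
```

The code and Markdown survive untouched; only the parts that change on every run are removed, which is the same effect the paired-text-file workflow aims for.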

Setting up a repository

To version control a notebook:

1. Initialize a Git repository in the notebook’s directory.

2. Add a .gitignore file to exclude files that contain outputs (such as .ipynb files when you pair notebooks with Jupytext) and large files.

3. Make your first commit — this will act as a “baseline” version.

4. Continue working and committing changes as you make them.
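The steps above could be scripted roughly as follows. This is a minimal sketch, assuming Git is installed and on your PATH. The helper name `init_notebook_repo`, the ignore patterns, and the commit message are all hypothetical choices made for illustration. By default the function only writes the .gitignore file and returns the Git commands it would run; pass `dry_run=False` to execute them.

```python
import subprocess
import tempfile
from pathlib import Path


def init_notebook_repo(directory, ignore_patterns, dry_run=True):
    """Write .gitignore and return (optionally run) the Git commands for a first commit."""
    repo = Path(directory)
    repo.mkdir(parents=True, exist_ok=True)
    (repo / ".gitignore").write_text("\n".join(ignore_patterns) + "\n")

    commands = [
        ["git", "init"],
        ["git", "add", "."],
        ["git", "commit", "-m", "Baseline version of the notebook"],
    ]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if Git reports a failure.
            subprocess.run(command, cwd=repo, check=True)
    return commands


# Dry run in a throwaway directory; the patterns are example values only.
with tempfile.TemporaryDirectory() as tmp:
    planned = init_notebook_repo(tmp, [".ipynb_checkpoints/", "*.csv"])
    gitignore_text = (Path(tmp) / ".gitignore").read_text()

for command in planned:
    print(" ".join(command))
```

Later commits follow the same pattern: `git add` the changed files, then `git commit` with a message describing the change.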

Install and setup Jupytext

You need to do this on every system that will use the Git repo with notebooks. That means all your teammates must have Jupytext configured.

We use the Jupytext library (https://github.com/mwouts/jupytext).

!sudo pip install jupytext --upgrade
  • Next, generate a Jupyter config file, if you don’t have one yet, with jupyter notebook --generate-config
  • Edit .jupyter/jupyter_notebook_config.py and append the following:
c.NotebookApp.contents_manager_class="jupytext.TextFileContentsManager"
c.ContentsManager.default_jupytext_formats = ".ipynb,.Rmd"
  • Then restart Jupyter, i.e. stop the running server and run the following in a terminal:
jupyter notebook

Note: The .jupyter directory is usually located in your home directory.

  • Open an existing notebook or create a new one.
  • Disable Jupyter’s autosave to allow round-trip editing: add the following to the top cell and run it.
%autosave 0
  • You can edit the .Rmd file in Jupyter as well as in text editors, and use it to review changes in version control.

Possible Git Workflows (2 Ways)

1. Saving only the Rmd file

We will remove the .ipynb files from version control and make a small change to the .Rmd file.

.ipynb files store all cell outputs in their JSON source, so even a very small change to the code can produce a huge diff in source control. They also embed images and plots as encoded strings, which makes them heavy to store. For these reasons, you can check in only the .Rmd file.
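To make the size difference concrete, here is a small sketch that compares how much text changes when a one-line edit is re-run and saved as notebook JSON versus as plain code. It assumes the nbformat 4 layout; the helper names `changed_characters` and `make_notebook`, the `plot(x)` cell, and the fake image payloads are hypothetical stand-ins, not real notebook data.

```python
import difflib
import json


def changed_characters(before: str, after: str) -> int:
    """Total length of the lines that differ between two texts."""
    old, new = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            total += sum(len(line) for line in old[i1:i2])
            total += sum(len(line) for line in new[j1:j2])
    return total


def make_notebook(source: str, run_number: int, fake_png: str) -> str:
    """Serialize a one-cell notebook roughly as it might sit on disk."""
    notebook = {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": run_number,
                "metadata": {},
                "outputs": [
                    {
                        "output_type": "display_data",
                        "data": {"image/png": fake_png},
                        "metadata": {},
                    }
                ],
                "source": [source],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }
    return json.dumps(notebook, indent=1)


# Made-up payloads standing in for base64-encoded plots.
old_code, new_code = "plot(x)", "plot(x * 2)"
old_notebook = make_notebook(old_code, 1, "QUJD" * 500)
new_notebook = make_notebook(new_code, 2, "REVG" * 500)

text_change = changed_characters(old_code, new_code)
notebook_change = changed_characters(old_notebook, new_notebook)
print("code-only diff:", text_change, "characters")
print("notebook diff: ", notebook_change, "characters")
```

The code edit itself is a handful of characters, while the notebook diff also carries the new execution count and the entire re-encoded image.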

To do that, add the following on a new line in your .gitignore file.

*.ipynb

Note: All your teammates need to do this too, so that they don’t commit .ipynb files to Git.
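Since every teammate has to add the same entry, a small helper can make the change repeatable. The sketch below appends a pattern to .gitignore only if it is not already listed, and makes sure it lands on its own line. The function name `ensure_ignored` is hypothetical, and the demo runs against a temporary file rather than a real repository.

```python
import tempfile
from pathlib import Path


def ensure_ignored(gitignore_path, pattern: str) -> bool:
    """Append pattern on its own line unless already present; return True if the file changed."""
    path = Path(gitignore_path)
    existing = path.read_text() if path.exists() else ""
    if pattern in (line.strip() for line in existing.splitlines()):
        return False
    # Make sure the new entry starts on a fresh line.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    path.write_text(existing + separator + pattern + "\n")
    return True


# Demo on a throwaway file whose last line has no trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / ".gitignore"
    target.write_text(".ipynb_checkpoints/")
    first = ensure_ignored(target, "*.ipynb")
    second = ensure_ignored(target, "*.ipynb")  # already present, so no change
    final_text = target.read_text()

print(final_text)
```

Running it twice is harmless, so it can sit in a project setup script without duplicating the entry.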

If you have already checked in .ipynb files, remove them from source control after checking in the .Rmd files. To remove a file from the Git repo while keeping it in your local directory, run:

git rm --cached file1.ipynb
git commit -m "remove file1.ipynb"

Next, we make a small change to the .Rmd file.

  • Open the .Rmd file in vi
  • On the line that begins with jupytext_formats: ... , change it to:
jupytext_formats: ipynb,Rmd:rmarkdown
  • Save the file and exit vi

Note: This change to the .Rmd file is needed only once, when you create the file and push it to the Git remote for the first time.
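If you would rather not edit the file by hand in vi, the same one-line change could be scripted. This is only a sketch: it assumes the .Rmd file has a header containing a single line that starts (after optional indentation) with `jupytext_formats:`. The function name `set_jupytext_formats`, the sample header, and its starting value are hypothetical; check your own file's header before relying on it.

```python
import re


def set_jupytext_formats(rmd_text: str, formats: str = "ipynb,Rmd:rmarkdown") -> str:
    """Rewrite the first jupytext_formats line, keeping its indentation."""
    pattern = re.compile(r"^([ \t]*)jupytext_formats:.*$", flags=re.MULTILINE)
    return pattern.sub(
        lambda match: f"{match.group(1)}jupytext_formats: {formats}",
        rmd_text,
        count=1,
    )


# Hypothetical .Rmd content; a real header may contain other keys.
sample_rmd = """---
jupyter:
  jupytext_formats: ipynb,Rmd
  kernelspec:
    name: python3
---

# Analysis
"""

updated = set_jupytext_formats(sample_rmd)
print(updated)
```

To apply it to a real file, read the file's text, pass it through the function, and write the result back before committing.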

Cloning the repo and creating a notebook

Once the .ipynb notebooks have been removed from the repo, anyone who clones it needs to recreate them. Let's see how.

  • Open the .Rmd file in Jupyter from its file browser.
  • You can use the .Rmd file directly, but it will not persist output between sessions, so we will create a Jupyter notebook from it.
  • Click File->Save (Cmd/Ctrl+S).
  • Close the .Rmd file (File->Close and Halt)
  • Now open the .ipynb file in Jupyter.
  • Start editing and saving. Your .Rmd file will keep updating itself.

2. Saving both the Rmd and ipynb files

Don’t add .ipynb to your .gitignore.

Make a small change to the .Rmd file.

  • Open the .Rmd file in vi
  • On the line that begins with jupytext_formats: ... , change it to:
jupytext_formats: ipynb,Rmd:rmarkdown
  • Save the file and exit vi

Note: This change to the .Rmd file is needed only once, when you create the file and push it to the Git remote for the first time.

In this workflow you save both files, so nothing extra is needed after cloning: the .ipynb file is already available, so just start using it.

Remember to put %autosave 0 in the first cell of your notebook and always run it. Because autosave is disabled, remember to save your notebook frequently.

Conclusion

Version controlling Jupyter Notebooks using Git is a valuable practice for managing changes, collaborating with others, and maintaining a history of your work. By following the recommended workflows outlined in this article, you can effectively incorporate version control into your Jupyter Notebook projects.

Setting up a Git repository, tracking changes to your notebooks, and ignoring output cells are crucial steps in ensuring a clean and efficient version control process. By leveraging tools like Jupytext, you can enhance the version control experience by working with both the `.ipynb` and `.Rmd` files.

Whether you choose to save only the `.Rmd` file or both the `.Rmd` and `.ipynb` files, it’s essential to communicate this approach with your team and ensure everyone follows the same workflow to maintain consistency and avoid conflicts.

Remember to save your notebooks frequently, utilize meaningful commit messages, and regularly push your changes to the remote repository. By doing so, you can ensure a reliable version history and facilitate seamless collaboration.

Version control with Git brings numerous benefits, including reproducibility, collaboration, auditing, and the ability to roll back changes. By implementing these practices and incorporating version control into your Jupyter Notebook workflows, you can enhance your productivity, foster collaboration, and maintain a well-documented history of your data analysis and machine learning projects.
