Python workflow for interactive projects (Colab + GitHub + Drive). Kaggle use case.

Leverage free cloud resources for your Jupyter Notebooks

Aleix López Pascual
Analytics Vidhya
8 min read · Oct 11, 2020


Jupyter Notebooks have become one of the most used tools for Python development in Data Science [1]. They are highly preferred by many data scientists due to their user-friendly interface and interactive computational environment.

Image from JetBrains

Project Jupyter has been offering 100% open-source software since 2014 in order to make Jupyter Notebooks accessible to everyone [2]. However, these notebooks have limitations:

  • Jupyter Notebooks run on your local machine, so the computational power available to you depends entirely on your computer’s CPU/GPU/RAM specs. While most laptops are more than enough for the basic tasks you encounter when starting out in data science, you can quickly hit barriers once you start doing machine learning tasks (especially deep learning) on your local machine.
  • Jupyter Notebooks do not come with version control per se. It is true that you can upgrade your Jupyter Notebooks to JupyterLab, and from there install the @jupyterlab/github and @jupyterlab/git extensions. I have used that setup before, but today I will show you a better option.

Google Drive + Google Colab + GitHub

In order to overcome the aforementioned limitations, we will use the following combination:

  • Google Colab: Colaboratory is a free Jupyter notebook environment that runs in the cloud and stores its notebooks on Google Drive. That is, we will use Google Colab to develop our Jupyter Notebooks without depending on the computational power of our laptops. It can also be used as a shell to run bash and git commands.
  • Google Drive: When we use Google Colab, the code is executed in a virtual machine private to our Google account. These virtual machines are deleted when idle for a while, and they have a maximum lifetime of 12 hours. It is therefore not ideal to store our work there, since it will eventually be lost. The solution proposed here is to store your output (weights, CSVs, etc.) in a cloud storage service. Google Drive is the cloud storage provided by Google: it offers 15 GB of free storage and is the default storage system when using Colab.
  • GitHub: A code hosting platform for version control and collaboration. It is good practice to use version control and a branching strategy even when working with Jupyter Notebooks.
The three parts of our infrastructure

Other cloud-based notebook services

Google Colab is not the only free cloud service for Jupyter Notebooks. These are the most relevant to me:

  • Google Colaboratory
  • Kaggle Kernels
  • Microsoft Azure Notebooks
  • Datalore
  • Binder
  • CoCalc

All of them are completely free, they do not require you to install anything on your local machine, and they give you access to a Jupyter-like environment.

I personally like Colab because of the following:

  • GPUs: often Nvidia K80s, T4s, P4s and P100s; their availability varies over time [3]
  • TPU: v2–8 [4]
  • Idle cut-off: 90 minutes
  • Maximum session lifetime: 12 hours

On the other hand, we also have Kaggle Kernels [5]:

  • GPU: Nvidia Tesla P100
  • TPU: v3–8
  • Idle cut-off: 60 minutes
  • Maximum session lifetime: 9 hours using GPUs, 3 hours using TPUs

As I am an active participant in Kaggle competitions, I am always combining both. This can be managed easily using feature toggles. More on that later.

Setting up the workflow

In this section, I am going to guide you through the different steps you must complete in order to configure the workflow. First things first: I assume that you already have Google and GitHub accounts and are familiar with them; otherwise you will not be able to complete the process. If you are ready, just create a Google Colaboratory notebook using your Google account. You will copy the following snippets of code as cells, step by step.

1. Feature toggles and path
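A minimal sketch of this configuration cell follows; the toggle structure matches what is described below, but the notebook name and paths are illustrative assumptions you should adapt to your own project:

```python
# Feature toggles: activate or deactivate depending on the use case.
COLAB = True    # running on Google Colab
KAGGLE = True   # using the Kaggle API / competition data
# Set both to False to run the notebook locally in your IDE.

# Constants: replace with your own repository and notebook names.
GIT_REPOSITORY = "osic-pulmonary-fibrosis-progression"
FILE_NAME = "pipeline.ipynb"  # hypothetical; must match your notebook

if COLAB:
    PROJECT_PATH = "/content/" + GIT_REPOSITORY
    # Uncomment to clone the repository on Google Drive instead:
    # PROJECT_PATH = "/content/drive/My Drive/" + GIT_REPOSITORY
```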

The above snippet defines the feature toggles that we will activate or deactivate depending on our use case. It has several of them, since I have tried to make it as general as possible.

As you can observe, I will be using my private repository osic-pulmonary-fibrosis-progression as an example. You should put the name of your own public or private repository of interest there. The same goes for FILE_NAME.

Notice that there is the possibility to deactivate both the COLAB and KAGGLE toggles. This should be done when we want to run the notebook locally using our IDE or editor of preference (e.g. PyCharm), which is beneficial in certain situations. I will comment more on that at the end of the article.

2. Linking personal Google Drive storage with Google Colab

Here we mount Google Drive to Colab. Mounting is the process by which the OS makes files and directories of a storage service (Google Drive) available for the users via the computer’s file system. In order to do so, we will be required to authenticate our account. Notice that we can mount a Google Drive account different from the Google account from which we are running Google Colab.
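A minimal cell for this uses Colab’s built-in helper:

```python
from google.colab import drive

# Mount Google Drive at /content/drive; an authentication prompt
# appears in the cell output.
drive.mount("/content/drive")
```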

If you see “Mounted at /content/drive”, it means that Google Drive was mounted successfully. After that, you should be able to find a new directory called drive in the Colab file explorer.

3. Clone GitHub repository to Colab Runtime system

Here we clone our GitHub repository to the personal Virtual Machine initiated when we started the Colab session. Remember that it has a maximum lifetime of 12 hours.

Notice we will need to pass some credentials in case the GitHub repository to clone is private. In order to keep the credentials safe from the public (we do not want everyone to have access to our private repositories), we will store them in a file (git.json) in our private Google Drive account. In this way, the credentials will only be accessible during the Colab session if you have access to the Google Drive account.
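A sketch of this clone cell, assuming the repository lives under your own account and that git.json stores the credentials under the username and token keys read below:

```python
import json

# Load the GitHub credentials kept in the private Drive folder.
with open("/content/drive/My Drive/Git/git.json") as f:
    creds = json.load(f)

GIT_USERNAME = creds["username"]
GIT_TOKEN = creds["token"]

# Clone the (possibly private) repository into the Colab VM.
!git clone https://{GIT_USERNAME}:{GIT_TOKEN}@github.com/{GIT_USERNAME}/{GIT_REPOSITORY}.git
```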

How to generate our credentials file

Go to your Google Drive account. Create a directory called Git. Inside this directory create a file called git.json holding your GitHub username and access token. Assuming the key names used by the clone cell above, it needs to look like this:
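```json
{
  "username": "your-github-username",
  "token": "your-github-access-token"
}
```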

In case you do not have a GitHub access token, generate it as follows:

Go to your user profile at the top right corner → click on Settings → choose Developer settings → Personal access tokens → Generate new token.

Paste the resulting credentials into the git.json file.

Once you have the credentials ready, you should be able to clone the repository successfully. If so, you will find a new directory with the name of the repository in the Colab file explorer once you execute the cell.

We can also clone our repository on Google Drive. If you prefer to do so, just uncomment the line of code mentioned in the feature toggles snippet. Following this second option, you can save your modifications to Google Drive synchronously. However, I prefer to push my changes to GitHub and clone the repository in each session, which becomes more practical if you are also running your notebooks locally using your IDE.

In case you want to run the notebook locally, you need to copy the git.json file we generated into your home directory.

4. Kaggle API Setup

(Not necessary if you do not plan to use Kaggle)

Similar to what we did in the previous step, here we pass the credentials of Kaggle in order to use its API.
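A sketch of this cell; it copies the token from the Drive folder we create below into the location the Kaggle client expects:

```python
# Copy the Kaggle API token from Drive and restrict its permissions,
# as the Kaggle client requires.
!mkdir -p ~/.kaggle
!cp "/content/drive/My Drive/Kaggle/kaggle.json" ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
```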

Go to your Google Drive account. Create a directory called Kaggle. Inside this directory create a file called kaggle.json. The file needs to look like this:
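```json
{
  "username": "your-kaggle-username",
  "key": "your-kaggle-api-key"
}
```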

In case you do not have a Kaggle API token, generate it as follows [6]:

Sign in to your Kaggle account. From the site header, click on your user profile picture, then on “My Account” from the dropdown menu. This will take you to your account settings at https://www.kaggle.com/account. Scroll down to the section of the page labelled API. To create a new token, click on the “Create New API Token” button. This will download a fresh authentication token onto your machine.

Paste the resulting credentials into the kaggle.json file.

In case you want to run the notebook locally, you need to copy the kaggle.json file we generated into the .kaggle folder of your home directory (~/.kaggle/kaggle.json).

5. Download competition data using the Kaggle API

(Not necessary if you do not plan to use Kaggle)

Here we create a new directory called input inside the VM and download the data of the requested competition. Notice the data is stored in the Colab runtime file system, so it does not occupy space in your Drive.

As you can observe, here I am reusing the GIT_REPOSITORY name to download the data. This is because my repository has the same name as the competition: “osic-pulmonary-fibrosis-progression”. If this is not your case, just replace the lines with the original Kaggle API commands from the competition webpage.
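A sketch of this download cell, assuming the input directory lives inside the cloned repository and the competition download arrives as a single zip named after the competition:

```python
# Create the input directory inside the cloned repository and
# download the competition data into it.
!mkdir -p {GIT_REPOSITORY}/input
!kaggle competitions download -c {GIT_REPOSITORY} -p {GIT_REPOSITORY}/input
# Unzip the downloaded archive in place.
!unzip -q {GIT_REPOSITORY}/input/{GIT_REPOSITORY}.zip -d {GIT_REPOSITORY}/input
```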

If successful, we should see a new directory called input inside the cloned repository, and all the data stored in it.

Keep in mind that you do not want to push your input folder into GitHub, so you should have a .gitignore file with input/ in it.

6. Save changes to GitHub

Remember that we can run all git commands from here. Therefore, we can automate the process of saving to GitHub.
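A sketch of such a cell; the author identity and commit message are placeholders to adapt:

```python
# Configure the commit author (required on a fresh VM).
!git -C {GIT_REPOSITORY} config user.email "you@example.com"
!git -C {GIT_REPOSITORY} config user.name "Your Name"

# Stage, commit and push the notebook back to GitHub.
!git -C {GIT_REPOSITORY} add {FILE_NAME}
!git -C {GIT_REPOSITORY} commit -m "Update notebook from Colab"
!git -C {GIT_REPOSITORY} push
```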

Notice that FILE_NAME should coincide with the name of the notebook. Even though this is useful, I usually do not use it unless I need to leave my whole pipeline running and want to save the changes afterwards. The alternative is to save directly using the tools from Colab: go to File, then Save a copy in GitHub. This will commit and push your changes to GitHub.

The workflow

Now that we have everything set up, let me briefly describe what my workflow usually looks like:

  • Create a repository.
  • Start a new notebook from Colab.
  • Copy-paste all the cells mentioned above.
  • Configure the feature toggles and constants.
  • Work on the notebook.
  • Save changes to GitHub continuously.

After that,

  • Once I start a session again, I open the notebook from GitHub using the Colab toolbar.
  • If I need to debug, I clone the repo locally, pull and open the notebook with my preferred IDE (PyCharm).

References

[1] “Python Developers Survey 2019 Results”. JetBrains. https://www.jetbrains.com/lp/python-developers-survey-2019/.

[2] Project Jupyter. https://jupyter.org/about.

[3] Google. https://research.google.com/colaboratory/faq.html#gpu-availability.

[4] Google. https://cloud.google.com/tpu/docs/tpus.

[5] Kaggle. https://www.kaggle.com/docs/notebooks.

[6] Kaggle. https://www.kaggle.com/docs/api.


Aleix López Pascual
Senior Data Scientist @ Glovo | Competitions Expert @ Kaggle | Writer @ Medium | MSc in High Energy Physics, Astrophysics and Cosmology