Data Science Shortcuts

Part II: Quickly Open a GitHub Repository Notebook in Google Colab

…And add repo data, all in the cloud

Jamel Dargan
The Startup


In the previous article, we discussed how a simple URL edit enables us to view and run Jupyter notebook code from a GitHub repository on a smartphone. But what do we do when we need a dataset from the repo to run the code?

Let’s look at a quick example of how a line or two of Python code makes it possible to add a dataset to our cloud notebook.

In this article we will demonstrate the following:

  • How to customize a link to a GitHub repository’s Jupyter notebook file so that it opens interactively in Google Colaboratory
  • How to read a repository’s dataset into the notebook, without having to download the data locally

The repository

This time our example notebook will come from Develop-Packt’s “Analyzing-the-Heart-Disease-Dataset” repository, licensed under the MIT License.

Author’s screen capture from GitHub of Develop-Packt’s Analyzing-the-Heart-Disease-Dataset, under the MIT License.
View of the GitHub repository. All screen captures by the author.

The repository comprises several activity and exercise notebooks in multiple folders. It also contains a single dataset in the “Dataset” folder.

Screen capture shows a cursor pointing out a link to the data file, from within the repository’s “Dataset” subdirectory.
CSV file link, from within the repo’s “Dataset” subdirectory.

Opening the “Dataset” folder reveals a hyperlink to the file “heart.csv”. We can select the link to preview the file on GitHub.

Previewing the first nine rows, including column names, of the dataset on GitHub.
Previewing the dataset on GitHub.

The preview shows the number of rows (column names included) in the data file. In many cases, preparing to import data into a Colab notebook may require you to copy the “Raw” link to the dataset.

Screen capture showing an open context menu highlighting the raw data file’s “Copy link address” option.
Open context menu highlighting the “Copy link address” option.

Right-clicking the “Raw” button opens a context menu from which you can choose “Copy link address”. The copied address is an absolute link to the raw data file.
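If you prefer not to use the context menu, the raw address can also be derived from the file's page URL. The following is a minimal sketch, not an official GitHub tool: the helper name `github_blob_to_raw` is hypothetical, and it assumes the usual page layout of /<owner>/<repo>/blob/<branch>/<file> (or /raw/ in place of /blob/).

```python
from urllib.parse import urlsplit


def github_blob_to_raw(file_url):
    """Convert a GitHub file-page URL to its raw.githubusercontent.com form.

    Assumes the path looks like /<owner>/<repo>/blob/<branch>/<path to file>
    (or the same layout with "raw" in place of "blob").
    """
    parts = urlsplit(file_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL")

    segments = parts.path.strip("/").split("/")
    if len(segments) < 5 or segments[2] not in ("blob", "raw"):
        raise ValueError("expected /<owner>/<repo>/blob/<branch>/<file>")

    # Drop the "blob" (or "raw") segment and switch hosts.
    owner, repo, _, *branch_and_path = segments
    return "https://raw.githubusercontent.com/" + "/".join(
        [owner, repo, *branch_and_path]
    )


# The page URL below is an assumption, inferred from the repo and file names.
page = (
    "https://github.com/Develop-Packt/"
    "Analyzing-the-Heart-Disease-Dataset/blob/master/Dataset/heart.csv"
)
print(github_blob_to_raw(page))
```

The printed address should match the raw link used in the repository's notebooks.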

Previewing the notebook

We will not actually need to copy the link address in this particular case because the link is already included in the repository notebooks. We can navigate back to the repo home page, open the Activity01 folder, and follow the “Activity01.ipynb” notebook link.

Previewing GitHub’s rendered version of the notebook, with imports and output from the Pandas describe() method.
Notebook output from the Pandas describe() method on GitHub.

A static preview of the notebook, rendered by GitHub, will load in your browser. We can see (in code cell 2) that the data is read into a Pandas dataframe using the absolute, raw link we alluded to earlier.


In some cases, a relative link might be used to reference a dataset residing in the same repository as the notebook. Since notebooks for this repository are each contained in their own subdirectory, a relative link might be structured as follows:

df = pd.read_csv("../Dataset/heart.csv")

The absolute link is arguably more intuitive. It also is stable, in the sense that it does not need to change if it is called from a different directory or location. We will use the absolute link to import the data into Colab.
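To see why a relative link is less portable, here is a minimal sketch using only the standard library. The directory names are assumptions for illustration: the clone location under /home/user is hypothetical, and /content is assumed to be the working directory of a Colab session.

```python
import posixpath

RELATIVE_LINK = "../Dataset/heart.csv"


def resolve(working_dir, relative_link=RELATIVE_LINK):
    """Show where a relative link points when used from `working_dir`."""
    return posixpath.normpath(posixpath.join(working_dir, relative_link))


# Hypothetical local clone, with the notebook run from its own folder.
local = resolve("/home/user/Analyzing-the-Heart-Disease-Dataset/Activity01")

# Colab session, assuming the default working directory /content.
colab = resolve("/content")

print(local)
print(colab)
```

In the local clone the link lands on the repo's Dataset folder. In the Colab case it resolves to a path outside the session's working directory, where no Dataset folder exists, because opening the notebook this way does not clone the repository.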

Interacting in Colab

As in the previous article, we open the notebook in Colab as follows:

  • In the address bar, we delete all characters of the web page URL through “github.com”.
  • We replace the deleted characters with the string “colab.research.google.com/github”.

The new URL should read as shown below:

Screen capture of the customized web address.
Highlight of the customized web address.

Submitting the created link http://colab.research.google.com/github/Develop-Packt/Analyzing-the-Heart-Disease-Dataset/blob/master/Activity01/Activity01.ipynb opens the notebook inside of Google Colab, where we can interact with the code.
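The same address-bar edit can be scripted. This is a small sketch under stated assumptions: the function name `github_to_colab` is hypothetical, and the output uses https rather than the http shown in the link above.

```python
from urllib.parse import urlsplit


def github_to_colab(notebook_url):
    """Turn a GitHub notebook page URL into a link that opens it in Colab."""
    parts = urlsplit(notebook_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL")
    if not parts.path.endswith(".ipynb"):
        raise ValueError("expected a link to a .ipynb notebook file")

    # Keep the repository path as-is and swap the host for Colab's
    # GitHub entry point.
    return "https://colab.research.google.com/github" + parts.path


notebook = (
    "https://github.com/Develop-Packt/"
    "Analyzing-the-Heart-Disease-Dataset/blob/master/Activity01/Activity01.ipynb"
)
print(github_to_colab(notebook))
```

Everything after the host name is left untouched, which is why the manual edit only needs to change the start of the URL.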

Screen capture of the “Activity01” notebook open in Colab, with library and data import code-cells visible.
The “Activity01” notebook open in Colab.

This method opens the notebook interactively. It does not clone the entire repository. We can see from the screen capture above that we almost immediately require a dataset from the repository.

Bring in the data

As we mentioned, this particular notebook shows us exactly how we can use the raw data file’s absolute link to import the data into Pandas. This is accomplished here in a single line of code:

df = pd.read_csv('https://raw.githubusercontent.com/Develop-Packt/Analyzing-the-Heart-Disease-Dataset/master/Dataset/heart.csv')

Alternatively, we can first assign the URL to a variable and then pass that variable to Pandas in place of the web address, as follows:

# assign the raw data link to the variable `url`
url = 'https://raw.githubusercontent.com/Develop-Packt/Analyzing-the-Heart-Disease-Dataset/master/Dataset/heart.csv'
# use the variable to read the data into Pandas
df = pd.read_csv(url)

For a one-off notebook that only we will run, either form works. Still, assigning the URL to a variable is good practice, since the address then needs to be edited in only one place.

Having successfully imported our data into (cloud) memory, we can run the rest of the notebook’s code and generate its visualizations.
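Before running the remaining cells, it can help to confirm that the import produced what we expect. The sketch below is one possible check, not part of the original notebook: the helper `load_and_check` is hypothetical, and the only column it looks for by default is “chol”, which the notebook plots later.

```python
import pandas as pd

# Raw link used in the repository's notebooks.
URL = (
    "https://raw.githubusercontent.com/Develop-Packt/"
    "Analyzing-the-Heart-Disease-Dataset/master/Dataset/heart.csv"
)


def load_and_check(source, required_columns=("chol",)):
    """Read a CSV into a dataframe and fail early if it looks wrong.

    `source` may be a URL, a local path, or a file-like object,
    all of which pd.read_csv accepts.
    """
    df = pd.read_csv(source)
    if df.empty:
        raise ValueError("the CSV was read, but it contains no rows")

    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return df


# In Colab (network access needed):
# df = load_and_check(URL)
# df.head()
```

Raising an error at load time keeps a bad link or a renamed column from surfacing later as a confusing plotting failure.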

A horizontal box plot representing the distribution of the dataset’s cholesterol data. A ‘future warning’ is highlighted.
Horizontal box plot of cholesterol data distribution.

Incidentally…

The Seaborn plots in this notebook generate a ‘future warning’, as highlighted in the image above. The warning tells us that “data” will be the “only valid positional argument” in later versions of the library. We can update the code to test out this change.

We will change the following line:

chol = sns.boxplot(df['chol'])

…Updating it to the following:

chol = sns.boxplot(data = df['chol'])

Our change results in the following plot:

Updated code (“df[‘chol’]” to “data = df[‘chol’]”) removes the warning, but reorients the plot from horizontal to vertical.
The updated code removes the ‘future warning’.

The updated code removes the warning, but it also reorients our plot from horizontal to vertical. We can correct this by adding the argument orient = ‘horizontal’ to the same line of code:

Horizontal plot without the warning, its orientation restored by adding “orient = ‘horizontal’” to our code.
Updated code reorients the plot.

Summary

This article picked up where the previous article left off. We demonstrated how to open an interactive notebook in Google Colaboratory from a modified GitHub link. We then showed how we can read a dataset from the same repository into cloud memory using Pandas. As a bonus, we looked at how we can prepare for a future change in code syntax for the Seaborn library.

Cloud-based platforms (Colab, Binder, Deepnote…) are making it almost as easy to view and explore interactive notebooks and machine learning code as it is to create a meme.

And you even ‘CAN HAS ON MOBILEZ’.
