Loading datasets into Google Colab

Icaro
7 min read · Aug 27, 2023


After a long hiatus due to too much work, I am back to writing on Medium and you can expect a bunch of new articles about a variety of software engineering topics in the next few months.

The area today is Data Science, one of my favorites, and we will be covering a basic but essential step when working with datasets: how to load these datasets into Google Colab notebooks.

Background

Google Colab is a hosted Jupyter Notebook service that allows us to start working with datasets right away. I have been using it for years for various data science classes and projects, and it has been very easy to work with.

Loading a dataset

As data scientists, one of our first steps once we have a dataset is to load it into a notebook so we can explore it, get familiar with its contents, decide which features we want to keep and which ones to drop, split it into training and test portions, and finally feed it to the ML model we have created.

In this article, we will go over two ways to load a dataset into Google Colab and the pros and cons of each. The two methods are:

  • Download the dataset directly into your Google Colab notebook
  • Download the dataset to Google Drive

For this exercise, we will be using a Kaggle dataset that lists all the English Premiership games played since the league was created (founded in 1992, started playing in 1993), so we have data from 1993. In future articles we will use this dataset to predict who will win the 2023–2024 Premiership (Arsenal!!! #coyg), who will be in the Champions League and Europa League, and which clubs will be relegated to the Championship. For now, we will just load the dataset into Google Colab.

Method I: Download the dataset to Google Colab

This method calls for downloading the dataset directly from the Kaggle site to Google Colab. It is very convenient, but there are some limitations that we will go over later. To accomplish this, we perform the following steps:

Get an API key from Kaggle

I assume you have an account on Kaggle; if not, go create one, it’s free :-) Once you have an account, go to your profile and you should see a section for the API, as shown below.

API Section of your Kaggle Profile

Click on create a new token and your token, in the form of a file called kaggle.json, will be downloaded to your default downloads directory as shown below. Mine shows 3 versions because I downloaded it 3 times, but once is enough.

Downloading Kaggle’s API key to your computer using Chrome

Now that you have the API key file, you need to upload it to your Google Colab notebook using the Files/Folders icon shown below.

File/Folders Icon in a Google Colab Notebook

Click on that icon and you will see a new section appear in your notebook. Click on the upload button shown below and notice the message that appears when you hover over it: ‘upload to session storage’. We’ll talk about this in the pros and cons section of this article.

Upload a file icon in a Google Colab Notebook

Once you click on the upload icon you will be presented with the standard file-upload experience; find the kaggle.json you just downloaded and upload it to your Colab notebook. Once it has been uploaded, you should see it appear in the list of files as shown below.

Kaggle API key in a Google Colab notebook
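As an aside, if you prefer doing this from a code cell instead of the sidebar, Colab's files module opens the same upload dialog; here is a minimal sketch:

from google.colab import files
uploaded = files.upload()  # opens a file picker; the chosen kaggle.json lands in /content

Either route puts the file in the same session storage, so the rest of the steps are identical.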

Cool, now we are ready to start downloading the dataset to Colab. We start by:

  • installing the kaggle library
  • making a directory where the API key file is going to live
  • copying the API key file to that directory
  • changing the permissions of the API key file, as shown below.

You only need to do this once per notebook session, by the way. Your code cell should look like the one below.

! pip install -q kaggle  #install the kaggle library
! mkdir ~/.kaggle #create a directory where the API Key is going to live
! cp kaggle.json ~/.kaggle/. #copy the API key to the directory
! chmod 600 ~/.kaggle/kaggle.json #change the permissions of the API key file

Now we can actually download the dataset with a command like the one shown below

! kaggle datasets download irkaal/english-premier-league-results

Kaggle hosts both datasets and competitions, and the download commands follow a very similar pattern:

! kaggle datasets download owner-name/dataset-name
! kaggle competitions download -c competition-name

The notebook will let you know the file has been downloaded, as shown in this image:

Downloading a Kaggle dataset into a Google Colab Notebook

Since we downloaded a zip file, we have to unzip it:

Unzipping the dataset after downloading it
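If you are following along without the screenshot, the cell looks something like this, assuming Kaggle kept the default archive name derived from the dataset slug:

! unzip english-premier-league-results.zip  # extracts results.csv into /content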

We can do a quick ls command to make sure the file is there and we can see it; it’s results.csv, shown below.

Listing the just-downloaded dataset
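That check is just:

! ls  # Colab's working directory is /content; results.csv should be listed here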

and now we are ready to explore the dataset. Our objective in this article is not to do a full exploration of the dataset, so we will just issue a couple of simple commands to make sure the dataset is there and we can manipulate it.

We start by loading the dataset into a pandas dataframe so we can easily manipulate it. First, we import the pandas library:

import pandas as pd

We then define the path to the dataset and read it into a pandas dataframe as shown below.

my_dataset_file_path = "/content/results.csv"
my_df = pd.read_csv(my_dataset_file_path, encoding='latin-1') #load the file into a pandas dataframe

Now we can print the size of our data frame as shown below.

Printing the size of a Pandas dataframe
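If you are reading without the screenshot, the cell is simply:

print(my_df.shape)  # prints (number_of_rows, number_of_columns)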

Finally we print the first few lines of the dataset

my_df.head()   #print the first few rows of the dataset

and we should see something like the image below

The first 5 lines of the English Premier results kaggle dataset

Notice how there are some columns with NaN values. We will deal with those in another article.

OK, so that was the first way to get a dataset into your Colab notebook. Now let’s look at the other way.

Method II: Download the dataset to Google Drive

In this method we download the dataset to Google Drive and then mount Google Drive in our Colab notebook. Let’s get started.

First, you have to download the file to your laptop and then upload it to Google Drive. After you do that, we can import the drive module into our notebook:

from google.colab import drive

Then we mount it, and we should get a confirmation that Drive has been mounted, as shown below:

Mounting Drive in a Colab Notebook
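For reference, the cell behind that screenshot is the standard mount call; the first time you run it, Colab will ask you to authorize access to your Drive:

drive.mount('/content/drive')  # your Drive files appear under /content/drive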

Now that Drive is mounted, we define the path to the file, just like we did with the other method.

drive_file_path = "/content/drive/MyDrive/ColabNotebooks/datasets/results.csv" #we define the file path for the dataset

and we load it into a pandas dataframe, just like we did in the other method:

my_dataframe = pd.read_csv(drive_file_path, encoding='latin-1') #load the dataset into a pandas dataframe

and now we can explore it using whatever commands are necessary. Here we just run the same commands we used in the previous method:

Listing the first few lines of a Pandas data frame
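For completeness, those cells are the same checks we ran in Method I, just against the new dataframe:

print(my_dataframe.shape)  # should match the shape we saw in Method I
my_dataframe.head()        # shows the first few rows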

As you can see, the results are the same as with the previous method. Now you have your dataset in a pandas dataframe, and you can start analyzing the data, deciding which features to keep, which ML model you are going to use, and so on. We will go over that in a future article.

You can see the notebook on GitHub here.

Pros and cons of each method

Well, both methods are pretty easy to implement. There are a couple of differences between the two:

  1. API key: the first method relies on an API key and, as you know, API tokens have a limited lifetime and/or can be expired from your Kaggle profile, so you will need to create a new one and download it again.
  2. Files uploaded to a Colab notebook are only available for the duration of the notebook session: this means you have to upload the kaggle.json file every time you start a notebook session. Not a big hassle, but something to know.

All things being equal, I prefer uploading the dataset to Google Drive and then mounting Drive in the notebook. However, each case is different, so look at your environment and situation and make the decision that is best for your project.

Conclusion

In this article we looked at a couple of ways of loading a dataset into a Google Colab notebook. Either way, downloading directly from Kaggle or going through Google Drive, is perfectly fine and is just one small step in your journey of creating an awesome ML model.

Future articles will continue to dig into the English Premiership dataset, and we will create an ML model to predict who will win the 2023–2024 English Premiership season. Come on you Gunners!!!
