Using Kaggle API for Google Colaboratory

Timm Derrickson
3 min read · Apr 1, 2018

If you are working on machine learning and you don’t have the resources to set up a GPU or pay for AWS (Amazon Web Services), then you are probably using Colaboratory. If you are not using Colaboratory, then I highly recommend you check it out! Colaboratory is essentially a way to use a Jupyter Notebook through your Google Drive account and run the notebook on a GPU. Also, it is completely free!

Here is a great article on getting started with the GPU on Colaboratory: https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d

Now that you have Colaboratory up and running on the GPU, you are going to want to start building models that will change the world with machine learning!

If you aren’t ready for that yet, then you might want to do some practicing. One of the best ways to practice machine learning on real datasets is to enter Kaggle competitions. Kaggle is a site that hosts machine learning competitions to practice building models, or to compete for cash prizes!

One problem you might run into is how to get the competition data from Kaggle onto Colaboratory and vice versa. Kaggle has built an API to fix that very problem.

Here is a very simple example of how to use Colaboratory together with the Kaggle API to get the data for the Titanic: Machine Learning from Disaster competition.

First, install the Kaggle library.

!pip install kaggle

This will allow you to interact with the Kaggle API.

To use the Kaggle API, you have to create a Kaggle account. Once you have logged in, you will have to go to the ‘My Account’ section on your profile. Then you will have to click on ‘Create New API Token’ to use the Kaggle API. The ‘Create New API Token’ button will trigger a download of a file called ‘kaggle.json’. This file has the credentials of your API token for your account.
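Before uploading the token, it can help to check that the file you downloaded looks right. Here is a minimal sketch, assuming kaggle.json is a small JSON object with a "username" field and a "key" field (that is what my token file contains; treat the field names as an assumption). The helper name load_kaggle_token is hypothetical and not part of the Kaggle library.

```python
import json

# Assumption: these are the two fields stored in kaggle.json.
REQUIRED_KEYS = {"username", "key"}


def load_kaggle_token(path):
    """Read a kaggle.json file and check that the expected fields are present."""
    with open(path, encoding="utf-8") as fh:
        token = json.load(fh)
    missing = REQUIRED_KEYS - token.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing: {sorted(missing)}")
    return token
```

Calling load_kaggle_token("kaggle.json") on your own machine returns the parsed token, or raises a ValueError if a field is missing. Be careful not to print or share the key itself.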

Put this json file in your Google Drive so that Colaboratory can find your credentials. Use the code below in Colaboratory to get access to the Kaggle API.

When you run it, you will be prompted to follow a link, grant access to your Google Drive, and enter a verification code. Follow the link, grant access, copy the verification code, paste it into the text box just after “Enter Verification Code:” and hit Enter.

import io
import os

from google.colab import auth
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

# Authenticate this Colab session so it can read your Google Drive.
auth.authenticate_user()
drive_service = build('drive', 'v3')

# Look up kaggle.json in Drive by name.
results = drive_service.files().list(
    q="name = 'kaggle.json'", fields="files(id)").execute()
kaggle_api_key = results.get('files', [])

# Download the file to the path used for the Kaggle credentials.
filename = "/content/.kaggle/kaggle.json"
os.makedirs(os.path.dirname(filename), exist_ok=True)
request = drive_service.files().get_media(fileId=kaggle_api_key[0]['id'])
with io.FileIO(filename, 'wb') as fh:
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while not done:
        status, done = downloader.next_chunk()
        print("Download %d%%." % int(status.progress() * 100))

# Restrict the file to owner read/write. The mode must be octal (0o600).
os.chmod(filename, 0o600)
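If the Kaggle client cannot find the credentials at that path (for example, because the notebook's home directory is not /content), one option is to point it at the folder explicitly. The sketch below assumes the client honours the KAGGLE_CONFIG_DIR environment variable, as its documentation describes. The helper name use_kaggle_config_dir is hypothetical.

```python
import os
from pathlib import Path


def use_kaggle_config_dir(config_dir):
    """Point the Kaggle client at config_dir and lock down kaggle.json if present."""
    config_dir = Path(config_dir)
    key_file = config_dir / "kaggle.json"
    if key_file.exists():
        # Owner read/write only; the mode is written as an octal literal.
        os.chmod(key_file, 0o600)
    # Assumption: the Kaggle client reads this variable to locate kaggle.json.
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return key_file
```

In this walkthrough the call would be use_kaggle_config_dir("/content/.kaggle"), run before any of the !kaggle commands.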

Now you have access to the Kaggle API and the competition data!
To see the competitions, use this line. It will return a list of the competitions on Kaggle.

!kaggle competitions list

To see the files in a specific competition, in this case the Titanic competition, use the following line. This will return a list of the files for the Titanic competition.

!kaggle competitions files -c titanic

To get the list of files for another competition, just replace the word titanic with the name of the competition you want from the competitions list.

To download the datasets from a competition, use the following code. This will download all of the files from the Titanic competition to the directory “/content/kaggle/”.

!kaggle competitions download -c titanic -p /content/kaggle
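Depending on the version of the Kaggle client, the files may arrive as one or more .zip archives rather than plain CSVs (this is an assumption; check what actually lands in the folder). A minimal sketch for unpacking whatever archives are there, using only the standard library, might look like this. The helper name extract_archives is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archives(folder):
    """Unzip every .zip file in folder into that same folder.

    Returns the names of all extracted members.
    """
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
    return extracted
```

In this walkthrough the call would be extract_archives("/content/kaggle"). If the folder holds no archives, it simply returns an empty list.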

The following code will allow you to pull the training data from the Titanic competition into a pandas DataFrame so that you can start to work on building a model!

import pandas as pd
data = pd.read_csv('/content/kaggle/train.csv', header=0, sep=',', quotechar='"')
data.head()

Here are the docs for the Kaggle API, with a full list of commands for digging deeper.

One last note: the compression='zip' parameter of the pandas.read_csv method is extremely useful with this API, because it lets you read a zipped CSV file directly. Below is an example.

data2 = pd.read_csv('/content/kaggle/train_sample.csv.zip', compression='zip', header=0, sep=',', quotechar='"')
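One thing to watch for: as far as I know, pandas expects a zip archive read this way to contain exactly one data file. Here is a small sketch, using only the standard library, for checking an archive before handing it to read_csv. The helper names zip_members and is_single_file_zip are hypothetical.

```python
import zipfile


def zip_members(path):
    """Return the names of the files (not folders) inside a zip archive."""
    with zipfile.ZipFile(path) as zf:
        return [name for name in zf.namelist() if not name.endswith("/")]


def is_single_file_zip(path):
    """True if the archive holds exactly one file, as compression='zip' expects."""
    return len(zip_members(path)) == 1
```

If is_single_file_zip returns False, extract the archive first and read the CSV you want by its own path.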
