Using the Kaggle API

Manuel Valenzuela
Published in MCD-UNISON
7 min read · Sep 20, 2021

Learn the basics of Kaggle's command-line tool, implemented in Python 3, and use it from a Jupyter notebook to search for and download datasets.

“Kaggle logo” by Databuff is licensed under CC BY-SA 3.0

What is Kaggle?

Kaggle is an online community of data scientists and machine learning practitioners. Most people think of it as a machine learning competition platform, since that was its first product back in 2010, but it is now much more than that. Kaggle is an ecosystem for doing and sharing data science that has evolved immensely over the past few years.

To give you a quick overview, here are some of the things you will find on Kaggle:

Code (Kernels).

Kaggle Kernels is a cloud computational environment that enables reproducible and collaborative analysis. A kernel is essentially a Jupyter notebook, a script (in Python or R), or an R Markdown script running in a Kaggle Docker container that has almost everything installed for you. These kernels are entirely free to run. Exploring and reading other Kagglers' code is a great way to both learn new techniques and stay involved in the community.

Competitions.

Kaggle competitions are machine learning tasks set by Kaggle or by other organizations such as Google or the WHO. If you compete successfully, you can win real money prizes. Competitions vary in problem type and complexity, and you can take part in one even if you're a beginner.

Kaggle Learn courses.

Free micro-courses taught in Jupyter Notebooks to help you improve your current skills.

Datasets.

Probably the most important part of Kaggle. According to the Kaggle website, there are over 50,000 public datasets, and new ones are uploaded every day. Each dataset is a small community where one can discuss the data.

Kaggle API.

To use Kaggle's public API to interact with Kaggle resources, we must first authenticate using an API token.

Authentication.

From the site header, click on your user profile picture, then go to the ‘Account’ tab of your user profile. This will take you to your account settings. Scroll down to the section of the page labelled API. To create a new token, click on the “Create New API Token” button. This will trigger the download of kaggle.json, a file containing your API credentials.

If you are using the Kaggle CLI tool, place this file in the location ~/.kaggle/kaggle.json on Linux, OSX, and other UNIX-based operating systems, and at C:\Users\<Windows-username>\.kaggle\kaggle.json on Windows. If you are using the Kaggle API directly, where you keep the token doesn’t matter, so long as you are able to provide your credentials at runtime.
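
The placement step above can also be scripted from the notebook. The following is a minimal sketch, not part of the Kaggle tooling: the helper name install_token is hypothetical, and the location of the downloaded file is whatever you pass in. It also restricts the copied file to its owner, a reasonable precaution on UNIX-like systems since the file holds credentials.

```python
import os
import shutil
import stat
from pathlib import Path


def install_token(downloaded: Path, config_dir: Path) -> Path:
    """Copy a downloaded kaggle.json into config_dir and return the new path.

    Hypothetical helper for illustration; not provided by the kaggle package.
    """
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "kaggle.json"
    shutil.copyfile(downloaded, target)
    # Owner read/write only. On Windows, chmod only toggles the read-only flag.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

For example, if the browser saved the token to a Downloads folder (an assumption about your setup), install_token(Path.home() / "Downloads" / "kaggle.json", Path.home() / ".kaggle") would put it where the CLI looks for it.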

With the Kaggle API, all interactions with Kaggle can be done through a command-line tool (CLI) implemented in Python.

Installation.

First, we need to ensure we have Python 3 and the package manager pip installed. Since we are going to work in a Jupyter notebook rather than directly from a command line, we prefix each shell command with an exclamation mark (!) to tell Jupyter to run it in the shell. Execute the following command to install the Kaggle API.

!pip install kaggle

If everything was done correctly, we are now ready to use the Kaggle CLI tool. Let's try the !kaggle --version command to verify that everything works.
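
The same check can be made from Python without calling the shell. This is a small sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_version is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Calling installed_version("kaggle") returns a version string once the pip install above has succeeded, and None otherwise.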

Interacting.

The CLI tool allows us to interact with three types of Kaggle products:

  • Competitions
  • Kernels
  • Datasets

Let's use !kaggle --help to see the commands that the CLI tool supports:

As we can see, we can use the !kaggle competitions, !kaggle datasets, !kaggle kernels and !kaggle config commands to interact with Kaggle. Each of them has its own arguments to specify exactly what we want to do.

kaggle competitions {list, files, download, submit, submissions, leaderboard}
kaggle datasets {list, files, download, create, version, init, metadata, status}
kaggle kernels {list, init, push, pull, output, status}
kaggle config {view, set, unset}

In this article, we will focus exclusively on the interaction with datasets.

Datasets.

The Kaggle API and CLI tool provide easy ways to interact with Datasets on Kaggle. The commands available can make searching for and downloading Kaggle Datasets a seamless part of your data science workflow.

We now know that to interact with datasets we use the !kaggle datasets command. With its CLI arguments, we can search, download and create datasets, among other options:

usage: kaggle datasets [-h]
                       {list,files,download,create,version,init,metadata,status} ...

optional arguments:
  -h, --help            show this help message and exit

commands:
  {list,files,download,create,version,init,metadata,status}
    list                List available datasets
    files               List dataset files
    download            Download dataset files
    create              Create a new dataset
    version             Create a new dataset version
    init                Initialize metadata file for dataset creation
    metadata            Download metadata about a dataset
    status              Get the creation status for a dataset

Find Data.

To search for datasets, we can use the kaggle datasets command with the list argument. For a complete list of options, let's try the !kaggle datasets list -h command:

usage: kaggle datasets list [-h] [--sort-by SORT_BY] [--size SIZE]
                            [--file-type FILE_TYPE] [--license LICENSE_NAME]
                            [--tags TAG_IDS] [-s SEARCH] [-m] [--user USER]
                            [-p PAGE] [-v]

optional arguments:
  -h, --help            show this help message and exit
  --sort-by SORT_BY     Sort list results. Default is 'hottest'. Valid options
                        are 'hottest', 'votes', 'updated', and 'active'
  --size SIZE           Search for datasets of a specific size. Default is
                        'all'. Valid options are 'all', 'small', 'medium', and
                        'large'
  --file-type FILE_TYPE
                        Search for datasets with a specific file type. Default
                        is 'all'. Valid options are 'all', 'csv', 'sqlite',
                        'json', and 'bigQuery'. Please note that bigQuery
                        datasets cannot be downloaded
  --license LICENSE_NAME
                        Search for datasets with a specific license. Default
                        is 'all'. Valid options are 'all', 'cc', 'gpl', 'odb',
                        and 'other'
  --tags TAG_IDS        Search for datasets that have specific tags. Tag list
                        should be comma separated
  -s SEARCH, --search SEARCH
                        Term(s) to search for
  -m, --mine            Display only my items
  --user USER           Find public datasets owned by a specific user or
                        organization
  -p PAGE, --page PAGE  Page number for results paging. Page size is 20 by
                        default
  -v, --csv             Print results in CSV format (if not set print in table
                        format)

Let's search, for example, ‘COVID’ datasets:

!kaggle datasets list -s covid

Now let's search for COVID datasets and sort the result list by votes:

!kaggle datasets list -s covid --sort-by votes
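
If you want the search results as Python data rather than printed text, the search can be assembled and parsed in code. The following is a rough sketch under stated assumptions: the helper names build_list_command and parse_csv_listing are hypothetical, the sort options are copied from the help output above, and the parser simply trusts whatever header row the CLI prints when given --csv.

```python
import csv
import io
from typing import Dict, List, Optional

# Values copied from the `kaggle datasets list -h` output shown above.
SORT_OPTIONS = {"hottest", "votes", "updated", "active"}


def build_list_command(search: str, sort_by: Optional[str] = None,
                       as_csv: bool = False) -> List[str]:
    """Build the argument list for `kaggle datasets list` (hypothetical helper)."""
    command = ["kaggle", "datasets", "list", "-s", search]
    if sort_by is not None:
        if sort_by not in SORT_OPTIONS:
            raise ValueError(f"unsupported sort option: {sort_by!r}")
        command += ["--sort-by", sort_by]
    if as_csv:
        command.append("--csv")
    return command


def parse_csv_listing(text: str) -> List[Dict[str, str]]:
    """Turn CSV text (header row first) into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))
```

To run it, pass the list to subprocess.run(build_list_command("covid", sort_by="votes", as_csv=True), capture_output=True, text=True, check=True) and feed the resulting stdout to parse_csv_listing. The column names depend on the CLI version, and any warning printed before the CSV header would need to be stripped first.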

We can see the files in a specific dataset with !kaggle datasets files <owner>/<dataset-name>. For example, let's see the files in the “imdevskp/corona-virus-report” dataset:

!kaggle datasets files imdevskp/corona-virus-report

Download the Data.

Once you have found the right dataset with the search arguments, the CLI can download it from Kaggle to your local machine. Let's see what the parameters for the download command are:

usage: kaggle datasets download [-h] [-f FILE_NAME] [-p PATH] [-w] [--unzip]
                                [-o] [-q]
                                [dataset]

optional arguments:
  -h, --help            show this help message and exit
  dataset               Dataset URL suffix in format <owner>/<dataset-name>
                        (use "kaggle datasets list" to show options)
  -f FILE_NAME, --file FILE_NAME
                        File name, all files downloaded if not provided. (use
                        "kaggle datasets files -d <dataset>" to show options)
  -p PATH, --path PATH  Folder where file(s) will be downloaded, defaults to
                        current working directory
  -w, --wp              Download files to current working path
  --unzip               Unzip the downloaded file. Will delete the zip file
                        when completed.
  -o, --force           Skip check whether local version of file is up to
                        date, force file download
  -q, --quiet           Suppress printing information about the
                        upload/download progress

At this point we have seen how to search for datasets and how to list the files they contain. Now let's download the “imdevskp/corona-virus-report” COVID dataset:

!kaggle datasets download imdevskp/corona-virus-report

If we just want to download a particular file from the dataset, we use the following command:

!kaggle datasets download imdevskp/corona-virus-report -f covid_19_clean_complete.csv
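
When downloads are repeated for several datasets, it can help to assemble the command in code. The sketch below is illustrative only: build_download_command is a hypothetical helper, and it covers just the -f, -p and --unzip options listed in the help output above.

```python
from typing import List, Optional


def build_download_command(dataset: str, file_name: Optional[str] = None,
                           path: Optional[str] = None,
                           unzip: bool = False) -> List[str]:
    """Build the argument list for `kaggle datasets download` (hypothetical helper)."""
    # The help text asks for the form <owner>/<dataset-name>.
    if dataset.count("/") != 1 or not all(dataset.split("/")):
        raise ValueError("dataset must look like <owner>/<dataset-name>")
    command = ["kaggle", "datasets", "download", dataset]
    if file_name:
        command += ["-f", file_name]
    if path:
        command += ["-p", path]
    if unzip:
        command.append("--unzip")
    return command
```

The returned list can be handed to subprocess.run(..., check=True), which raises an error if the CLI exits with a non-zero status, for instance when the credentials are missing.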

We have just downloaded a single file from the dataset, but before that we downloaded the complete dataset, which arrives as a .zip archive. Looking in the working directory, we see that the file is named corona-virus-report.zip. Let's decompress it so we can access all the information inside (the --unzip option of the download command does this automatically).
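
One way to decompress the archive without leaving the notebook is Python's standard zipfile module. This is a minimal sketch: extract_archive is a hypothetical helper, and the destination folder is whatever you choose.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_archive(archive: Path, destination: Path) -> List[str]:
    """Extract every member of a zip archive and return the member names."""
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(destination)
        return zf.namelist()
```

For the archive above, extract_archive(Path("corona-virus-report.zip"), Path("corona-virus-report")) would unpack the files into a folder next to the notebook; the folder name here is only an example.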

And that's it! Now we have the datasets in our working directory, ready to work with them.

Conclusion.

These are some of the basic commands needed to get started with searching and downloading datasets on Kaggle. As you can see, none of this is complicated, and it can save you a lot of time, especially if you download data on a daily basis.

I hope this article has been helpful and that you enjoyed reading it.

You can find out more details from the Official Documentation on GitHub.

Manuel Valenzuela
MCD-UNISON

Academic Technician of the Computer Science program — University of Sonora.