How to use Kaggle API to download datasets in R

Luis Durazo
Published in MCD-UNISON
2 min read · Nov 29, 2020

Kaggle has its own officially supported Python library for consuming its API. This short tutorial covers a small part of what that API can do, in a quick and easy example for those who want to automate dataset downloads in their own R projects, especially datasets that come bundled in a zip file. For datasets that offer a direct download, please refer to the unofficial R API package.

TL;DR — Here's the code:

install.packages(c("devtools"))
devtools::install_github("ldurazo/kaggler")
library(readr)
library(kaggler)
kgl_auth(creds_file = 'kaggle.json')
response <- kgl_datasets_download_all(owner_dataset = "andrewmvd/violence-against-women-and-girls")

dir.create("data", showWarnings = FALSE)  # make sure the target folder exists
download.file(response[["url"]], "data/temp.zip", mode="wb")
unzip_result <- unzip("data/temp.zip", exdir = "data/", overwrite = TRUE)
violence_data <- read_csv("data/makeovermonday-2020w10/violence_data.csv")

Step-by-step guide:

Install devtools from CRAN:

install.packages(c("devtools"))

Then, use devtools to install my fork of the unofficial R API library:

devtools::install_github("ldurazo/kaggler")

The main difference between the original repository and my fork is that I added a function that doesn't require you to specify a filename, so if the Kaggle resource you're trying to access contains multiple files, you get them all in a single zip file.

NOTE: I do not guarantee that I will maintain this fork in the future, and likely will not. As of today, I have requested to merge my fork into the main project, so please check whether this function is already supported in the original repository.
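
One way to check whether the kaggler build you have installed already provides the function used later in this guide is to look it up in the package namespace. This is only a minimal sketch in base R, and the helper name `has_export` is made up for illustration:

```r
# Hypothetical helper: TRUE if package `pkg` is installed and exports `fn`.
has_export <- function(pkg, fn) {
  requireNamespace(pkg, quietly = TRUE) && fn %in% getNamespaceExports(pkg)
}

# FALSE means kaggler is not installed, or the installed version
# does not export this function.
has_export("kaggler", "kgl_datasets_download_all")
```

If this returns FALSE after installing the original repository, the fork is probably still needed.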

Next, we load the libraries and authenticate with the Kaggle API:

library(readr)
library(kaggler)
kgl_auth(creds_file = 'kaggle.json')

The kaggle.json file can be generated by following the instructions in the official Kaggle docs (tl;dr: go to Account > Create API Token). It is a very simple JSON file containing your username and your API key. The kaggler library takes care of the rest of the authentication.
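
Since a missing or incomplete token file is an easy mistake to make, it may help to check the file before authenticating. The sketch below assumes the jsonlite package is installed and that the token file holds the `username` and `key` fields; the helper name `check_kaggle_creds` is hypothetical and not part of kaggler:

```r
library(jsonlite)  # assumption: jsonlite is installed

# Hypothetical helper: stop early if the token file is missing or incomplete.
check_kaggle_creds <- function(path = "kaggle.json") {
  if (!file.exists(path)) {
    stop("Credentials file not found: ", path)
  }
  creds <- fromJSON(path)
  missing_fields <- setdiff(c("username", "key"), names(creds))
  if (length(missing_fields) > 0) {
    stop("Missing field(s) in ", path, ": ",
         paste(missing_fields, collapse = ", "))
  }
  invisible(TRUE)
}
```

Calling `check_kaggle_creds("kaggle.json")` right before `kgl_auth()` gives a readable error message instead of a failed request later on.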

Next, to retrieve all files within a single resource, use the function from my kaggler fork:

response <- kgl_datasets_download_all(owner_dataset = "andrewmvd/violence-against-women-and-girls")

This returns a response object containing a URL that points to the zip file on Google Cloud Storage. The following lines download and extract it:

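dir.create("data", showWarnings = FALSE)  # make sure the target folder exists
download.file(response[["url"]], "data/temp.zip", mode = "wb")
unzip_result <- unzip("data/temp.zip", exdir = "data/", overwrite = TRUE)
violence_data <- read_csv("data/makeovermonday-2020w10/violence_data.csv")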
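
Because `unzip()` returns the paths of the files it extracted, `unzip_result` can be used to locate the CSV instead of typing its folder by hand. A minimal sketch follows; the helper name `csv_paths` and the readme path in the example are made up for illustration:

```r
# Hypothetical helper: keep only the .csv paths from a vector of file paths,
# such as the value returned by unzip().
csv_paths <- function(paths) {
  paths[grepl("\\.csv$", paths, ignore.case = TRUE)]
}

# Example with stand-in paths in place of a real `unzip_result`
extracted <- c("data/readme.txt",
               "data/makeovermonday-2020w10/violence_data.csv")
csv_paths(extracted)
```

With the real download, `read_csv(csv_paths(unzip_result)[1])` would load the first CSV found in the archive.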

You can also use a tempfile()-based temporary file if you prefer. I simply like quick access to the contents of the downloaded .zip, as they often contain data descriptors and other utility documents for the dataset you're downloading.
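
For those who prefer the temporary-file route, a variant could look like the sketch below. It uses only base R (`tempfile()`, `download.file()`, `unzip()`); the helper name `download_and_unzip` is hypothetical:

```r
# Hypothetical helper: download a zip to a temporary file, extract it into
# `exdir` (a fresh temporary folder by default), and return the extracted paths.
download_and_unzip <- function(url, exdir = tempfile("kaggle_")) {
  zip_path <- tempfile(fileext = ".zip")
  on.exit(unlink(zip_path), add = TRUE)  # remove the temporary zip afterwards
  download.file(url, zip_path, mode = "wb")
  unzip(zip_path, exdir = exdir, overwrite = TRUE)
}
```

It would be called as `files <- download_and_unzip(response[["url"]])`. Keep in mind that files under R's temporary directory are removed when the session ends.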

Thank you for reading.
