AZURE MINI TUTORIALS

How to Bring a Kaggle Dataset Into an Azure ML Workspace in the Azure Portal

It is easier than it looks

Dr. Varshita Sher
The Startup


Kaggle is a goldmine of amazing datasets for machine learning projects. Let’s see how we can load one of them into our ML workspace in the Azure portal.

Dataset

As part of this tutorial, we will be loading the Human Faces dataset available on Kaggle. I used this dataset to train GANs from scratch on custom image data.

Procuring Kaggle API key

Get your Kaggle username and API key. To create a key:

  • Go to your Kaggle account → Settings → Account → Create a new API token.
  • A kaggle.json file will be downloaded, containing your username and API key. Keep this somewhere safe!

Steps for connecting Kaggle data in Azure

Go to Azure Portal.

Click on Create a Resource → Search for Machine Learning.

Click Create and follow the setup steps until you reach your workspace page.

Click Notebooks in the sidebar, then create a new notebook.

Create a Compute resource (if one does not exist already).

Next, open the terminal.

In the terminal window, run pip install kaggle.
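If you want to confirm from a notebook cell that the install put the kaggle command on your PATH, here is a minimal sketch. The helper name cli_available is my own and not part of any library.

```python
import shutil


def cli_available(name):
    """Return True if an executable called `name` can be found on the PATH."""
    return shutil.which(name) is not None


# Hypothetical helper; prints whether the `kaggle` command is reachable
print("kaggle CLI found:", cli_available("kaggle"))
```

If this prints False right after installing, the install location is probably not on your PATH.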

Set your Kaggle username and API key (from the kaggle.json file) as environment variables in the terminal:

export KAGGLE_USERNAME=varshita
export KAGGLE_KEY=xxxxxxxxxxxxxx
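If you would rather not copy the values by hand, the same two variables can be set from Python. This is only a sketch: it assumes you have uploaded kaggle.json somewhere your notebook can read it, and the function name load_kaggle_credentials is my own invention.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(json_path):
    """Read a kaggle.json file and export its values as environment variables."""
    creds = json.loads(Path(json_path).read_text())
    # kaggle.json holds two fields: "username" and "key"
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]


# Example (the path is an assumption; point it at wherever you saved the file):
# load_kaggle_credentials("kaggle.json")
```

Keep in mind that variables set this way only apply to the Python process running the notebook (and anything it launches), not to a separate terminal session.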

Finally, you are ready to download your Kaggle dataset via the command line in the terminal. The API command to do so is available on the Kaggle dataset page itself. Click on the three dots next to New Notebook and select ‘copy API command’.

Next, paste it (Ctrl+V) into the terminal window:

kaggle datasets download -d ashwingupta3012/human-faces

Note: This might take a while, as the file is approximately 2 GB in size.
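Since the download is fairly large, it may be worth checking that your compute has enough free disk space first. A minimal sketch follows; the helper name has_free_space and the 5 GB threshold are my own assumptions, not anything prescribed by Kaggle or Azure.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# The zip is roughly 2 GB and the extracted images need room as well,
# so 5 GB is an arbitrary, assumed safety margin.
print(has_free_space(".", 5))
```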

Voilà! Your dataset will be downloaded (as a zip file) into the current working directory of your Azure workspace.

Alternatively, you can use the optional -p argument to specify the folder where the files should be downloaded (for more info, see the Kaggle API documentation). For example:

kaggle datasets download -p images/train ashwingupta3012/human-faces
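If you prefer to trigger the download from a notebook cell rather than the terminal, the two commands above can be wrapped in a little Python. This is a sketch under a few assumptions: the kaggle CLI is installed, your credentials are visible to the notebook process, and the function names are my own.

```python
import subprocess


def build_download_command(dataset, target_dir=None):
    """Build the argument list for `kaggle datasets download`."""
    cmd = ["kaggle", "datasets", "download", "-d", dataset]
    if target_dir is not None:
        # -p sets the folder the zip file is written to
        cmd += ["-p", target_dir]
    return cmd


def download_dataset(dataset, target_dir=None):
    """Run the kaggle CLI; raises CalledProcessError if the download fails."""
    subprocess.run(build_download_command(dataset, target_dir), check=True)


# Show the command without running it
print(" ".join(build_download_command("ashwingupta3012/human-faces", "images/train")))
```

Calling download_dataset("ashwingupta3012/human-faces", "images/train") would then start the actual download.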

Next, we must unzip the files to retrieve the actual images. I found the code snippet in Raghav Bang’s article.

I have just added some comments to make it more intuitive for readers. The following code goes in your Azure Notebook (or Jupyter Notebook editor, if you are using that).

import zipfile

# name of the zip file you want to unzip
# (prefix the folder, e.g. 'images/train/human-faces.zip', if you downloaded with -p)
local_zip = 'human-faces.zip'

# open the archive in read mode ('r'); the with-block closes the file automatically
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    # extract all contents of the zip file into the current working directory
    zip_ref.extractall()
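If you would like to peek inside the archive before extracting around 2 GB of images, here is a small sketch that checks the file is a valid zip and lists its first few entries. The helper name summarize_zip is hypothetical.

```python
import zipfile


def summarize_zip(zip_path, preview=5):
    """Return (number of entries, first few entry names) for a zip archive."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip file")
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    return len(names), names[:preview]


# Example:
# total, first_names = summarize_zip("human-faces.zip")
```

A ValueError here usually means the download was interrupted or the path is wrong.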

And there you have it! All your images will be unzipped into a new folder, Humans, in your current working directory on Azure.
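As a quick sanity check, you can count what actually landed in the extracted folder. The sketch below groups files by extension; the function name count_files_by_extension is my own, and I am assuming the folder is called Humans as described above.

```python
from collections import Counter
from pathlib import Path


def count_files_by_extension(folder):
    """Count files under `folder` (recursively), grouped by lowercase extension."""
    return Counter(p.suffix.lower() for p in Path(folder).rglob("*") if p.is_file())


# Example:
# print(count_files_by_extension("Humans"))
```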

