Easy Guide to Downloading a Subset of a Dataset from Kaggle using Pandas

supersjgk
6 min read · Jan 16, 2024


Use Pandas to download a customized subset of a dataset to your local machine without the Kaggle API.

In this guide, we’ll walk through downloading a subset of a tabular dataset from Kaggle to your local machine. There are many reasons you might want to do that. Maybe you want to work locally, but the dataset is too large to download and your Wi-Fi speed is great /s. Or maybe you’re only interested in some of the rows or columns. This story shows you how to do that using Pandas. We’ll also look at how to use queries to select the rows that satisfy given conditions.

Kaggle — datasets page

Prerequisites

  • Kaggle account

NOTE

We’ll be using Pandas to explore various ways of subsetting a dataset. If you already know how to create a subset as a DataFrame, skip ahead to the “Saving the subset” section. We’ll only be dealing with CSV files in this story.

Getting Started: Selecting a dataset

  • Log into your Kaggle account and go to the Datasets page. Type a keyword for a dataset that suits your needs in the search bar, then click on the dataset you want.
  • In this story, we’ll be using the Credit Card Fraud Detection dataset for demonstration. The method used below works for any tabular dataset that the notebook can load, whatever its size.
Search your dataset
  • Click on New Notebook at the top of the page. Don’t worry, it’s very easy. Just follow along.
Open a new notebook
  • When you click on New Notebook, a notebook opens with your chosen dataset already loaded in the input directory, which you can see on the right side of the notebook.
Dataset loaded in the notebook
  • You’ll also see a few lines of starter code in the first cell. Of these, we only need the one shown below. Pandas is already installed in Kaggle notebooks, so the import works as is.
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

Run the code block by selecting the cell where this line is written. A small triangle will show up at the left of code cell. Click it. You can also run it by pressing Ctrl + Enter, or using the Run option in the top bar.

Pandas

Pandas is a popular Python package for data analysis. We’ll be using it for data processing.

It can read many file formats, such as CSV (the most common format for tabular data), JSON, plain text, and Excel. The IO tools section of the Pandas documentation lists all supported formats. Pandas loads data into a DataFrame object.

Since our data is a CSV file (you can check this by looking at the extension of the data file in the input directory of your Kaggle notebook), we’ll load it using the code shown below.

data = pd.read_csv("path/to/your/dataset")

Make sure to replace the path with your actual path. You can find it by hovering your cursor over the data file in the input directory. A copy icon will show up. Click it and the path will be copied, so you can simply paste it into the code. Make sure it’s wrapped in quotes so that it’s a string.

Copy file path
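If you’d rather find the path from code than from the sidebar, a small helper like the one below can list the CSV files for you. This is only a sketch: `list_csv_files` is a hypothetical helper name, and it assumes the attached dataset lives somewhere under /kaggle/input, as it does in a standard Kaggle notebook.

```python
import os

def list_csv_files(root="/kaggle/input"):
    """Return the sorted paths of all .csv files found under `root`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".csv"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)

# In a Kaggle notebook the attached dataset should show up here.
# Elsewhere the folder may not exist, and the list is simply empty.
print(list_csv_files())
```

Any path printed here can be pasted into pd.read_csv as a string.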

Now, you can use the following command to check that everything is correct up to this point.

data.head() # Running this line will show the first 5 rows of the data
Viewing the data

Size of the dataset

You can check the number of rows and number of columns in the dataset using the following command:

data.shape # returns (number of rows, number of columns)
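Here is a small self-contained sketch of the load, head, and shape steps together. The CSV text is made up for illustration (only the column names are borrowed from the fraud dataset), and `io.StringIO` stands in for a real file path. The last part shows the `usecols` and `nrows` parameters of `read_csv`, which can restrict what gets loaded in the first place if you already know which columns and how many rows you need.

```python
import io
import pandas as pd

# Made-up stand-in for a CSV file; real code would pass the copied path.
csv_text = """Time,Amount,Class
0,149.62,0
1,2.69,0
2,378.66,0
3,123.50,0
4,69.99,1
5,3.67,0
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.head())   # first 5 rows by default
print(data.shape)    # (number of rows, number of columns)

# Optionally load only some columns and only the first few rows.
small = pd.read_csv(io.StringIO(csv_text), usecols=["Amount", "Class"], nrows=3)
print(small.shape)
```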

Selecting a subset of the dataset

Suppose your data has 1000000 rows but you only want, let’s say, 1000 rows. You can do this using any of the following lines.

  • Selecting 1000 rows and all columns from a starting index
subset = data[idx : idx + 1000] # selects 1000 rows starting from 'idx' index
  • Selecting 1000 rows and only some columns.
# Select specific columns. Pass the indices of columns as a list
subset = data.iloc[idx : idx+1000, [0,1,4,9]]

# Select a contiguous range of columns.
subset = data.iloc[idx : idx+1000, 5:23] # selects the 6th to 23rd columns (positions 5 to 22; the end, 23, is excluded)

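# Select a subset of columns by name.
# Unlike iloc, loc includes the end label, so idx+999 gives 1000 rows
# (assuming the default integer index).
subset = data.loc[idx : idx+999, ["ColumnX", "ColumnY"]]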
  • Selecting random rows
subset = data.sample(1000) # selects a random subset
  • Select rows based on the values of certain columns.
# select all rows with value of columnX > 45. You may use other operations
subset = data[data["columnX"] > 45]

# select only 1000 rows from the beginning with a condition on a column
subset = data[data["columnX"] > 45].head(1000)

# select rows between a range of indices with a column condition
subset = data[data["columnX"] > 45].iloc[startidx: endidx]

# select 1000 random rows - at least 1000 rows must satisfy the condition
subset = data[data["columnX"] > 45].sample(1000)
  • Using query to select a subset based on conditions across multiple columns. Refer to columns by name inside the query string. The syntax will feel familiar if you know SQL’s WHERE clause.
'''
Selects 1000 random rows that satisfy the condition. Pass the column
names inside the query string.
Again, at least 1000 rows must satisfy the condition.
'''
subset = data.query("Column2 > Column3").sample(1000)

'''
If the number of rows satisfying the condition is less than your desired
count, you can select all the rows using:
'''
subset = data.query("Column1 == Column2")

# multiple column conditions
subset = data.query("Column1 > 40 and Column2 == 30")

subset = data.query("not (ColumnA < 190 or ColumnB == 0.5)")
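To see how these selection methods behave, here is a small sketch on a toy DataFrame. The column names and values are made up for illustration. Note how iloc excludes the end of a slice while loc includes it.

```python
import pandas as pd

# Toy frame with a default 0..9 index; column names are illustrative only.
data = pd.DataFrame({
    "ColumnX": range(10),        # 0, 1, ..., 9
    "ColumnY": range(10, 20),    # 10, 11, ..., 19
    "ColumnZ": [5] * 10,
})

idx, n = 2, 3

# iloc slices by position and excludes the end, so this gives n rows.
by_position = data.iloc[idx : idx + n, [0, 2]]

# loc slices by label and includes the end, hence the "- 1" to get n rows.
by_label = data.loc[idx : idx + n - 1, ["ColumnX", "ColumnY"]]

# Boolean mask: keeps rows where the condition is True.
by_mask = data[data["ColumnX"] > 6]

# query: the same idea, written as a string over column names.
by_query = data.query("ColumnX > ColumnZ and ColumnY < 19")

print(by_position.shape, by_label.shape)
print(by_mask.index.tolist(), by_query.index.tolist())
```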

Any of the above methods can be combined with any other, as long as enough rows remain after filtering.
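One way to avoid running out of rows is to cap the sample size at the number of rows that actually match. The helper below is a sketch under that assumption. `take_sample` is a hypothetical name, not a Pandas function, and the toy data is made up.

```python
import pandas as pd

def take_sample(frame, condition, n, seed=0):
    """Filter `frame` with a query string, then sample at most `n` rows."""
    matched = frame.query(condition)
    # sample() raises a ValueError when n exceeds the available rows
    # (without replacement), so cap n at the number of matches.
    return matched.sample(n=min(n, len(matched)), random_state=seed)

data = pd.DataFrame({"Column1": range(100), "Column2": [30] * 100})

# Only 59 rows have Column1 > 40, so asking for 1000 returns all 59.
subset = take_sample(data, "Column1 > 40 and Column2 == 30", n=1000)
print(len(subset))
```

Passing a fixed random_state also makes the random subset reproducible between runs.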

Saving the subset

Now that you have selected your customized data, it’s time to save it.

subset.to_csv("filename.csv") # replace with your filename

# if you don't want to save the index
subset.to_csv("filename.csv", index=False)

Now, you’ll see a data file with the filename you chose appear in the Output (/kaggle/working) directory on the right side of the page, like this:

Generated Subset data file
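If you’re unsure whether to keep the index, this sketch shows the difference by writing a tiny made-up subset both ways and reading it back. A temporary folder stands in for /kaggle/working so the example runs anywhere.

```python
import os
import tempfile
import pandas as pd

subset = pd.DataFrame({"Amount": [1.5, 2.5, 3.5], "Class": [0, 1, 0]})

# A temporary folder stands in for /kaggle/working.
with tempfile.TemporaryDirectory() as out_dir:
    with_index = os.path.join(out_dir, "with_index.csv")
    without_index = os.path.join(out_dir, "without_index.csv")

    subset.to_csv(with_index)                  # index saved as an extra, unnamed column
    subset.to_csv(without_index, index=False)  # only the data columns are saved

    cols_with = pd.read_csv(with_index).columns.tolist()
    cols_without = pd.read_csv(without_index).columns.tolist()
    round_trip = pd.read_csv(without_index)

print(cols_with)     # three columns: the saved index plus Amount and Class
print(cols_without)  # two columns: Amount and Class
```

With index=False, the file read back has the same columns as the subset you saved.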

Downloading the subset to your local machine

  • Download directly

If the file is small enough to download directly, hover your cursor over the filename and three dots will appear at its right end. Click them and you’ll see a Download option. Click it and you’re done.

  • Download a zip file

If you want to zip and compress the file and then download it, you can do this using the following code.

from shutil import make_archive
make_archive(base_name='filename', format='zip', root_dir='/kaggle/working', base_dir='filename.csv')
'''
replace filename with the name you want to give to your zip file
(the .zip extension is added automatically)
root_dir is the folder that contains your subset file, which is the
/kaggle/working directory in a Kaggle notebook
replace filename.csv with the name of your subset file
You can find both by hovering over the filename, a copy option will appear,
click on it and the full path (folder + filename) will be copied.
'''

Once you run the code, a compressed zip file will appear in your /kaggle/working directory. You can download it using the above steps.
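An alternative is Python’s built-in zipfile module, which lets you name the exact file to compress. The sketch below is not the method used above, just another option. `zip_single_file` is a hypothetical helper name, and the demo runs in a temporary folder that stands in for /kaggle/working.

```python
import os
import tempfile
import zipfile

def zip_single_file(file_path, zip_path):
    """Compress one file into a zip archive and return the archive path."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname stores only the file name, not the full folder structure
        zf.write(file_path, arcname=os.path.basename(file_path))
    return zip_path

# Demo in a temporary folder standing in for /kaggle/working.
with tempfile.TemporaryDirectory() as work_dir:
    csv_path = os.path.join(work_dir, "filename.csv")
    with open(csv_path, "w") as f:
        f.write("Amount,Class\n1.5,0\n")

    archive = zip_single_file(csv_path, os.path.join(work_dir, "filename.zip"))
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)  # the archive contains just the one CSV file
```

In a Kaggle notebook you would call zip_single_file with the path of your subset file and a zip path inside /kaggle/working.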

And that’s it. That’s how you download a subset of a large dataset from Kaggle to your local machine.

Shut down the notebook

Make sure to shut down the notebook and stop the session. You can do this by pressing the icon at the top bar of the page as shown in the image below. Or you can go to Run -> Stop Session.

Shut-down the notebook

If you’ve reached this far, thank you for reading my story. Feel free to leave a comment if you run into any problems.
