Loading Kaggle dataset to AWS S3 using Boto3

Sruthy Antony
6 min read · Sep 21, 2020


This activity intends to download a dataset from the Kaggle site using the available Kaggle API and load the data into an AWS S3 bucket with the help of Python Boto3. Boto3 is the Python SDK for AWS and facilitates working with the various services AWS provides. Instead of manually creating the storage buckets and objects in S3, we can create them using Boto3. All the errors and blockers encountered during this integration are included as an aid for future reference.

The pre-conditions to be satisfied before proceeding are given below.

  • An active account in Kaggle (https://www.kaggle.com/)
  • An active account in AWS (https://aws.amazon.com/)
  • Python should be installed on the machine

The entire activity of fetching data from Kaggle to S3 has been broken down into the following steps:

  • Step 1: Installation of Kaggle CLI
  • Step 2: Installation of AWS CLI
  • Step 3: Installation of Boto3
  • Step 4: AWS S3 bucket creation using Python Boto3
  • Step 5: Kaggle Dataset download
  • Step 6: File uploads to AWS S3 using Boto3

We can now delve into the detailed code snippets used for each of the steps mentioned above.

Step 1: Installation of Kaggle CLI

As a first step, we need to install the Kaggle CLI. This enables interaction with Kaggle datasets, such as listing datasets, downloading data, etc. We can install the Kaggle CLI using the command below.

pip install kaggle

Kaggle API Tokens

As the next sub-step, we need to generate an API token from the Kaggle site and move the downloaded token to the location ~/.kaggle/kaggle.json.

Kaggle → Account → API → Create New API Token

Kaggle API token generation

So now the ~/.kaggle folder holds the downloaded JSON API token. Following this, we can download datasets from the Kaggle site (Step 5).
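On Linux or macOS, a minimal way to put the token in place (assuming it was downloaded to ~/Downloads) is:

mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

The chmod step restricts the file permissions; the Kaggle CLI warns if the token is readable by other users on the system.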

Step 2: Installation of AWS CLI

Similar to the Kaggle CLI, installing the AWS CLI facilitates interaction with AWS. The CLI can be installed as:

pip install awscli

Configuring the AWS credentials file with the IAM user details is important, as it governs security and access to the AWS services. For this purpose, a new IAM user, say testboto3, needs to be created under AWS IAM, and its credentials must be copied to a new file called ‘credentials’ created under the ~/.aws path (~/.aws/credentials). The created IAM user must be given the necessary permissions to access the S3 service.

Along with this, the desired region also needs to be set in the AWS ‘config’ file. This is done as:

[default]
region = <desired region>

So once all the configuration settings are done, your ~/.aws folder will have two files, namely config and credentials, as shown below.

The contents of the files will be as shown below:

AWS config file
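For reference, the two files typically have the following layout (the values here are placeholders, not real credentials):

~/.aws/credentials

[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>

~/.aws/config

[default]
region = us-east-1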

Step 3: Installation of Boto3

Before starting to use the Boto3 SDK, it must be installed on the machine. This is done as:

pip install boto3

Now the SDK is available and you can proceed further.

Step 4: AWS S3 bucket creation using Python Boto3

Boto3 is a well-known Python SDK for AWS. It allows the creation of, and other operations on, AWS resources such as S3 buckets using Python scripts instead of manual operations.

To start with the S3 bucket creation, we first need to import boto3.

import boto3

Boto3 interacts with the AWS APIs to perform the necessary operations. Either of boto3’s client() or resource() can be used to interact with the AWS APIs; the usage of the client is given as a sample in this blog post.

s3resource = boto3.client('s3', region_name='us-east-1')

Here we have specified the service to connect to (S3) and the region in which the S3 bucket is to be created. Instead of hard-coding the region value, we can also derive it from the region tied to the session details, as shown below.
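As a sketch, the region can be picked up from the default session (this assumes a default region has already been set in ~/.aws/config, as in Step 2):

import boto3

# Derive the region from the default session instead of hard-coding it
session = boto3.session.Session()
region = session.region_name  # the region configured in ~/.aws/config
s3resource = boto3.client('s3', region_name=region)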

Following this step is the command for the S3 bucket creation.

s3resource.create_bucket(Bucket='filetransfers3tords')

Pass the desired name of the bucket to be created in the specified region.
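One point worth noting: for any region other than us-east-1, S3 expects the region to be supplied explicitly as a location constraint when creating the bucket. A minimal sketch (the bucket name below is a hypothetical example and must be globally unique):

import boto3

region = 'eu-west-1'  # example of a region other than us-east-1
s3client = boto3.client('s3', region_name=region)
s3client.create_bucket(
    Bucket='my-example-kaggle-bucket',  # hypothetical name; bucket names must be globally unique
    CreateBucketConfiguration={'LocationConstraint': region},
)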

On the successful creation of the bucket, we receive a response like the one below.

S3 bucket creation responses

Step 5: Kaggle Dataset download

To view the available datasets on the Kaggle site, we can use the command

kaggle datasets list
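Since the full listing is long, we can also search for datasets by keyword, for example (the keyword here is just an illustration):

kaggle datasets list -s iris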

To download a dataset,

kaggle datasets download -d <owner/dataset-name> -p <destination> --unzip

-d specifies the dataset to download (in owner/dataset-name form), -p specifies the desired target location, and --unzip unzips the zipped data while downloading it.
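As a concrete illustration (the dataset slug here is only an example; substitute the dataset you need):

kaggle datasets download -d uciml/iris -p ./data --unzip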


Step 6: File uploads to AWS S3 using Boto3

Now we are all set to upload the files to the newly created AWS S3 bucket. This can be done easily using the Boto3 SDK, as given below.

s3resource.upload_file(KaggleDatasetname, bucket_name, S3storedkaggledatasetname)

We need to specify the required parameters in order to upload the file. These include:

  • KaggleDatasetname: the name (local path) of the downloaded Kaggle dataset file
  • bucket_name: the name of the bucket which we have created
  • S3storedkaggledatasetname: the desired name (object key) for the dataset loaded into the S3 bucket
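Putting it together, a minimal sketch of the upload call, where the bucket name is the one created earlier and the local path and object key are hypothetical placeholders:

import boto3

s3resource = boto3.client('s3', region_name='us-east-1')

# upload_file(local_file, bucket, object_key)
s3resource.upload_file(
    './data/iris.csv',        # hypothetical local file downloaded from Kaggle
    'filetransfers3tords',    # the bucket created in Step 4
    'kaggle/iris.csv',        # desired object key in S3
)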

This copy operation takes a few minutes, and once it is done, we can refresh the AWS S3 user interface to view the newly created bucket and the data within it.
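Alternatively, instead of refreshing the console, the upload can be verified programmatically (a sketch using the same client and bucket as above):

# List the objects in the bucket to confirm the upload
response = s3resource.list_objects_v2(Bucket='filetransfers3tords')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])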

We can see that the bucket has been created in AWS S3.

Bucket created in AWS S3

And the files are loaded into the bucket.

Files loaded in the S3 bucket

Common Errors Encountered!

Here I am listing the common errors I faced during this integration. This will be helpful for those taking their first steps in these kinds of hands-on activities.

  1. Error in the bucket name

Resolution:

The bucket name:

  • must be between 3 and 63 characters long
  • must not contain uppercase characters or underscores (_)
  • must not end with a dash
  • must not contain two consecutive periods
  • must not have a dash immediately adjacent to a period

References:

  • Python, Boto3, and AWS S3: Demystified (realpython.com)
  • Part 1: How to copy Kaggle data to Amazon S3 (confusedcoders.com)
