Use Python to Upload CSV and Parquet Files to Amazon S3

Amiri McCain
7 min read · Jan 23, 2024

We will be working with the database from Open Food Facts

The Solution

To skip the scenic route and go straight to the code solution, scroll down 👇🏽 or cmd+f 🍎 (or ctrl+f 🪟) to “Solution Code” or “Solution Code Breakdown.” Otherwise, enjoy the stroll 🚶🚶‍♀🚶‍♂️️

Abstract

This article is the second in a series working with the data from Open Food Facts. The first article, “Use Python to Read and Download a Large CSV from a URL,” discussed how to download Open Food Facts’ large (~10GB) tab-delimited CSV file from a URL into a Pandas DataFrame, and then how to save that large DataFrame as “n” smaller CSV files and Parquet files.

In this article, we will upload our CSV and Parquet files to Amazon S3 in the cloud. Amazon provides a clean and easy-to-use SDK for uploading and downloading large files.

Photo by nrd on Unsplash

Procedure at a High Level

  1. Import the needed packages and sort the directory listing of files by the number in each filename.
  2. Pass this sorted list to a single for loop, loop through each file in numerical order, and call the upload_with_default_configuration() function from the AWS SDK.
  3. That’s it! The AWS SDK does all of the real heavy lifting. I’ve included some simple error handling in the loop, and I measure the upload time in seconds for each file. When a file upload completes, I print the name of that file and how long it took, in seconds, to upload to S3. Finally, I add up all of the individual upload times to get a rough total, in minutes, of how long everything took.

The procedure is the same for both CSV and Parquet; I use two separate loops to accomplish this, one for CSV and one for Parquet.

This article assumes that you have already set up your AWS credentials configuration file. If not, you will need to pass your access key and secret access key as parameters where you call the AWS SDK function (hardcoding the keys is not recommended). For help setting up your AWS credentials, head on over to the article “Install AWS CLI and configure “credentials” and “config” files on a Mac.”
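If you want to confirm that boto3 is picking up those credentials before kicking off the uploads, a quick sanity check like the one below can help. It assumes boto3 is installed and that your account is allowed to list buckets.

import boto3

# boto3 reads ~/.aws/credentials and ~/.aws/config automatically,
# so no keys are hardcoded here
session = boto3.Session()  # uses the "default" profile unless told otherwise
credentials = session.get_credentials()

if credentials is None:
    print("No AWS credentials found -- set up ~/.aws/credentials first.")
else:
    print(f"Region from config: {session.region_name}")
    s3 = session.client("s3")
    # listing buckets confirms the keys actually work (requires s3:ListAllMyBuckets)
    for bucket in s3.list_buckets()["Buckets"]:
        print(bucket["Name"])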

Solution Code 🧭
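Here is the whole solution in one place, a consolidated sketch of the code that is broken down step by step below. It assumes the upload_with_default_configuration() helper from the AWS SDK for Python (Boto3) S3 code examples has been copied into, or is imported by, your notebook.

import os
import re
import time

# upload_with_default_configuration() comes from the AWS SDK for Python (Boto3)
# S3 file-transfer example code; it is assumed to be defined or imported here

csv_dir = 'chunk/csv/'
bucket = 'amiriscratch'
object_prefix = 'openfoodfacts/chunk/csv/'

# 1. collect the .csv files and sort them by the number in the filename
#    (expects names like "somefilename_7.csv")
paths_sorted = [each for each in os.listdir(csv_dir) if each.endswith('.csv')]
paths_sorted.sort(key=lambda x: int(re.search(r'_(.*)\.csv', os.path.basename(x)).group(1)))

# 2. upload each file in numerical order, timing each upload
time_list = []
for file in paths_sorted:
    try:
        start = time.time()
        upload_with_default_configuration(os.path.join(csv_dir, file),
                                          bucket_name=bucket,
                                          object_key=object_prefix + file,
                                          file_size_mb=50)
        stop = time.time()
        print(f"Uploaded file '{file}' to S3 in {round(stop - start, 1)} seconds")
        time_list.append(stop - start)
    except Exception as e:
        print(f"Error processing {file}: {str(e)}")

print(f"Completed upload of {len(time_list)} files in {round(sum(time_list) / 60, 1)} minutes")

The Parquet loop is identical, with “parquet” substituted for “csv” in the paths and prefixes.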

Solution Code Breakdown

I’m not going to cover the code inside the AWS SDK function itself, but it is worth noting that the docstring for the upload_with_default_configuration() function clearly states that it will “Upload a file from a local folder to an Amazon S3 bucket.” Note that it says “file,” singular, not “files,” plural. Thus, we have to wrap this function in a loop to upload multiple files.

Also, I will only cover the code for the CSV files. The loop for the Parquet files is identical except that it processes Parquet files rather than CSV files.

  1. Import the needed packages and sort the directory listing of files by the number in each filename:
import os
import time
import re

# get list of .csv files, only
paths_sorted = [each for each in os.listdir('chunk/csv/') if each.endswith('.csv')]
# sort files by the number in the filename (e.g. "somefilename_7.csv" -> 7)
paths_sorted.sort(key=lambda x: int(re.search(r'_(.*)\.csv', os.path.basename(x)).group(1)))

I use a list comprehension along with os.listdir() and endswith() to create a list of the existing files that have the .csv extension. To perform the actual sort, I use the sort(key=) method together with the search function in Python’s regular expression (regex) library (re.search(pattern, string, flags=0)). For the re.search(r'_(.*)\.csv', ...) portion: it looks at the base filename and finds the underscore; then . matches any single character and * matches zero or more of them, so (.*) captures everything between the underscore and “.csv” (the \. escapes the dot so it matches a literal period). This flexibility allows the number to be 0 or 1 or 10 or 1000 or 10000, etc., and finally group(1) grabs just the captured number portion. I then use the int() function to convert this result to an integer. In this way, I’m able to successfully sort a list of filenames based on the number in the filename, so I must ensure, for my particular regex, that the filename format is always “somefilename_x.csv” where x is some integer value.
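For example, with a few hypothetical filenames, the numeric sort puts 10 after 2, which a plain string sort would not:

import re

filenames = ['food_10.csv', 'food_2.csv', 'food_1.csv']
# group(1) pulls out everything between the underscore and ".csv",
# which we convert to an int so the sort is numeric rather than alphabetical
filenames.sort(key=lambda name: int(re.search(r'_(.*)\.csv', name).group(1)))
print(filenames)  # ['food_1.csv', 'food_2.csv', 'food_10.csv']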

2. Pass this sorted list to a for loop, loop through each file, and call the upload_with_default_configuration() function from the AWS SDK:

time_list = []
for file in paths_sorted:
    file_path = os.path.join('chunk', 'csv', file)

    try:
        # upload file to s3 and time the transfer
        start = time.time()
        upload_with_default_configuration(file_path,
                                          bucket_name='amiriscratch',
                                          object_key='openfoodfacts/chunk/csv/' + file,
                                          file_size_mb=50)
        stop = time.time()
        print(f"Uploaded file '{file}' to S3 in {round(stop - start, 1)} seconds")

        time_list.append(stop - start)

    except Exception as e:
        print(f"Error processing {file}: {str(e)}")

print(f"Completed upload of {len(time_list)} files in {round(sum(time_list) / 60, 1)} minutes")

Upload Successful

Notice from the “Last modified” column datetime stamp in the S3 images below that the files were in fact uploaded in numerical order, rather than randomly as they were before I implemented the sorting code.

Also take note of how efficient the Parquet format is. The Parquet files are nearly 10x smaller than the CSV files and, as you might expect, they also uploaded nearly 10x faster.
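If you want to verify that size difference locally before uploading, a quick check like the one below works; the chunk/csv/ and chunk/parquet/ directory names are assumptions based on the layout used in this series:

import os

def dir_size_mb(path, ext):
    # sum the sizes of all files in `path` that end with `ext`, in megabytes
    return sum(os.path.getsize(os.path.join(path, f))
               for f in os.listdir(path) if f.endswith(ext)) / 1e6

print(f"CSV total:     {dir_size_mb('chunk/csv/', '.csv'):,.0f} MB")
print(f"Parquet total: {dir_size_mb('chunk/parquet/', '.parquet'):,.0f} MB")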

CSV Upload to S3:

Parquet Upload to S3:

Optimization Ideas

This code is not optimized; it is just enough to get us up and running. What are some optimizations we would make before using something like this in production? The short answer is that there is a lot to tidy up, but here’s a short list:

  • Use Python scripts (.py files) instead of a Jupyter Notebook.
  • Optimize the file paths in my S3 bucket so that there are directories for “raw” data, “prepped” data, and clean “curated” data.
  • Turn the AWS SDK Python code into a module and add it to my custom library of reusable code.
  • Turn the path sorting code into a function and add it to my custom library of reusable code.
  • Remove print statements, especially the one after each successful file upload. It would be a good idea, however, to log this information to a log file instead.
  • Turn the for-loop code into a function and add it to my custom library of reusable code. There are a number of ways to do this, but the optimal way, in my mind, would be to make the function accept a list of strings. To take this to the next level, you could create a Python class, say AWS_File_Upload. What if you want to pass just one file to that class, or rather to a method of that class? You would still have to pass it as a list, just a list with one string. However, if you made the method polymorphic, you could pass in either a string or a list of strings and it would work just the same, without the user having to do anything special and without causing errors (see the sketch after this list). This seemingly small change can have a huge positive impact on your coding productivity and code reuse.
  • Years ago, a coding instructor (shoutout to Nancy Henson at www.creativityinbusiness.com!) taught me that it typically takes 4x as long to create reusable code as it does to create something quick and dirty for one-time use. If you plan on using your code more than once, my experience is that it is worth the time and effort to tidy your code up so that it is ready for reuse.
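As a rough illustration of that polymorphic idea, here is a minimal sketch; the class name AWS_File_Upload and its upload() method are hypothetical, and the actual S3 call is left as a comment:

from typing import List, Union

class AWS_File_Upload:
    """Hypothetical wrapper around the S3 upload loop."""

    def __init__(self, bucket_name: str, prefix: str):
        self.bucket_name = bucket_name
        self.prefix = prefix

    def upload(self, files: Union[str, List[str]]) -> None:
        # accept either a single filename or a list of filenames
        if isinstance(files, str):
            files = [files]
        for file in files:
            # ... call upload_with_default_configuration() here,
            #     exactly as in the loop shown earlier ...
            print(f"Would upload {file} to s3://{self.bucket_name}/{self.prefix}{file}")

# both of these calls work the same way:
uploader = AWS_File_Upload('amiriscratch', 'openfoodfacts/chunk/csv/')
uploader.upload('food_1.csv')
uploader.upload(['food_1.csv', 'food_2.csv'])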

It is good practice to write clean code, even in the proof-of-concept (POC) stage; however, you do not want to spend too much time prematurely optimizing code at this point.

Questions

  • Is there a way to upload an entire directory of files to S3 with one AWS SDK function call, so that I do not have to write Python code to loop through each file? I’m guessing that putting my for loop inside of a function (and then a library) is the answer to that question, but please do share if there is a better way, especially if it’s a solution that AWS provides!
  • Is there a way to use any of AWS’s tools to accelerate the uploading of files?

Suggestions?

Any suggestions on how I can refactor this code to make it better or more efficient? What other optimization ideas do you have to prepare this small piece of code for production?

What’s Next?

I will continue to build on this project. Next, in an upcoming article, I will be pulling the CSV and Parquet files from S3 into Snowflake ❄️.

Photo by Marc Newberry on Unsplash

Before you go!

Did you see what happens when you click and hold the clap 👏🏽 button? You could clap up to 5️⃣0️⃣ times, that is, if you enjoyed the article. Try it out! 😀

Author: Amiri McCain, Data Engineer
