AWS S3 MultiPart Upload with Python and Boto3

Niyazi Erdoğan
6 min read · Sep 21, 2018


Hi,

In this blog post, I’ll show you how you can do multi-part uploads to S3 for files of basically any size. We’ll also make use of callbacks in Python to keep track of the progress while our files are being uploaded to S3, and of threading in Python to speed up the process. And I’ll explain everything you need to do to get your environment set up and the implementation up and running!

This is part of my course on S3 Solutions at Udemy, in case you’re interested in how to implement solutions with S3 using Python and Boto3.

First things first, you need to have your environment ready to work with Python and Boto3. If you haven’t set things up yet, please check out my blog post here and get ready for the implementation.

I assume you already checked out my Setting Up Your Environment for Python and Boto3 post, so I’ll jump right into the Python code.

The first thing we need to do is import boto3:

import boto3

Next, we create our S3 resource with boto3 to interact with S3:

s3 = boto3.resource('s3')
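
As a quick aside (an alternative, not something we’ll use below): the resource’s s3.meta.client, which we’ll call later, is the same low-level client you’d get from boto3.client('s3'), and its upload_file method accepts the same Config and Callback arguments. A minimal sketch:

# Alternative: create the low-level client directly instead of going
# through the resource's s3.meta.client
s3_client = boto3.client('s3')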

Ok, we’re ready to develop, let’s begin!

Let’s start by defining ourselves a method in Python for the operation:

def multi_part_upload_with_s3():

There are basically three things we need to implement: the TransferConfig, where we configure our multi-part upload and make use of threading in Python to speed up the process dramatically; the upload call itself; and a callback class to track progress. So let’s start with TransferConfig and import it:

from boto3.s3.transfer import TransferConfig

Now we need to make use of it in our multi_part_upload_with_s3 method:

config = TransferConfig(multipart_threshold=1024 * 25, max_concurrency=10,
                        multipart_chunksize=1024 * 25, use_threads=True)

Here’s a base configuration with TransferConfig. Let’s break down each element and explain them all:

multipart_threshold: The transfer size threshold for which multi-part uploads, downloads, and copies will automatically be triggered.

max_concurrency: The maximum number of threads that will be making requests to perform a transfer. If use_threads is set to False, the value provided is ignored as the transfer will only ever use the main thread.

multipart_chunksize: The partition size of each part for a multi-part transfer.

use_threads: If True, threads will be used when performing S3 transfers. If False, no threads will be used in performing transfers: all logic will be run in the main thread.

This is how I configured my TransferConfig, but you can definitely play around with it and tweak the thresholds, chunk sizes and so on. But let’s continue now.
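
One thing worth noting: the sizes in TransferConfig are in bytes, so 1024 * 25 is only 25 KB, while S3 multi-part uploads have a 5 MB minimum part size for all but the last part. If you’d rather size parts in megabytes, a sketch like the following (with values you’d tune yourself) might look like:

MB = 1024 ** 2
# Trigger multi-part for files over 25 MB and upload them in 25 MB parts
config = TransferConfig(multipart_threshold=25 * MB,
                        max_concurrency=10,
                        multipart_chunksize=25 * MB,
                        use_threads=True)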

Now we need to find the right file candidate to test out how our multi-part upload performs. So let’s use a rather large file (in my case, a PDF document around 100 MB).

First, let’s import os library in Python:

import os

Now let’s point to largefile.pdf, which sits in our project’s directory. The call to os.path.dirname(__file__) gives us the path to the directory containing the current script, and we append the file name to it:

file_path = os.path.dirname(__file__) + '/largefile.pdf'

Now that we have our file in place, let’s give it a key for S3 so we can follow the S3 key-value methodology and place our file under a folder-like prefix called multipart_files with the key largefile.pdf:

key_path = 'multipart_files/largefile.pdf'
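
As a small aside, the string concatenation above works, but os.path.join is a bit more robust (for example when os.path.dirname(__file__) turns out to be empty). A minimal sketch, assuming the same largefile.pdf location:

# Build the local path without worrying about separators
file_path = os.path.join(os.path.dirname(__file__), 'largefile.pdf')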

Now, let’s proceed with the upload process and call our client to do so:

s3.meta.client.upload_file(file_path, BUCKET_NAME, key_path,
                           ExtraArgs={'ACL': 'public-read', 'ContentType': 'application/pdf'},
                           Config=config,
                           Callback=ProgressPercentage(file_path))

Here I’d like to draw your attention to the last argument of this method call: Callback. If you’re familiar with a functional programming language, and especially with JavaScript, then you’re probably well aware of what callbacks are and what they’re for.

What a Callback basically does is invoke the callable you pass in, a function, a method or, in our case, an instance of a class, ProgressPercentage, every time some data is transferred. This way we’ll be able to keep track of our multi-part upload progress: the current percentage, the total and remaining size and so on. But how is this going to work? Where does ProgressPercentage come from? Nowhere, we need to implement it ourselves, so let’s do that now.
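
Just to make the mechanism concrete before we build the class: boto3 accepts any callable for Callback and calls it periodically with the number of bytes transferred since the last call. So even a plain function would work; a minimal sketch (the function name here is just an example):

def print_bytes_transferred(bytes_amount):
    # boto3 calls this repeatedly with the bytes moved since the last call
    print("Transferred another", bytes_amount, "bytes")

# This would work too, but we'll build a richer ProgressPercentage class instead:
# Callback=print_bytes_transferred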

Either create a new .py file for it or add it to your existing one; it doesn’t really matter where we declare the class, it’s all up to you. So let’s begin:

class ProgressPercentage(object):

In this class, we’re receiving only a single parameter, the name of the file whose upload progress we want to keep track of. Let’s continue with our implementation and add an __init__ method to our class so we can set up the instance variables we’ll need (note that the Lock below also requires an import threading at the top of your file):

def __init__(self, filename):
    self._filename = filename
    self._size = float(os.path.getsize(filename))
    self._seen_so_far = 0
    self._lock = threading.Lock()

Here we are preparing the instance variables we will need while tracking our upload progress. filename and size are self-explanatory, so let’s explain the other ones:

seen_so_far: the number of bytes already uploaded at any given time. For starters, it’s just 0.

lock: as you can guess, a threading lock. Since several worker threads can report progress at the same time, we use it to protect seen_so_far so concurrent updates don’t step on each other.

Now comes the most important part of ProgressPercentage: the callback method itself, __call__, so let’s define it:

def __call__(self, bytes_amount):

bytes_amount will, of course, be the number of bytes transferred to S3 since the last call. What we need is a way to compute the current progress and print it out so we always know where we are. Let’s start by acquiring the thread lock and move on:

with self._lock:

After acquiring the lock, let’s first update seen_so_far, which accumulates bytes_amount across calls:

self._seen_so_far += bytes_amount

Next, we need to know the percentage of progress so we can track it easily:

percentage = (self._seen_so_far / self._size) * 100

We’re simply dividing the number of bytes already uploaded by the total size and multiplying by 100 to get the percentage. Now, for all this to be actually useful, we need to print it out. So let’s do that now. I’m making use of the Python sys library to print everything out, so I’ll import it; if you prefer something else, you can definitely use that instead:

import sys

Now let’s use it to print things out:

sys.stdout.write("\r%s  %s / %s  (%.2f%%)" % (
    self._filename, self._seen_so_far, self._size,
    percentage))

As you can clearly see, we’re simply printing out filename, seen_so_far, size and percentage in a nicely formatted way.

One last thing before we finish and test things out: flush stdout so the buffered output is actually written to the console right away (the \r at the start of the format string makes each update overwrite the previous line):

sys.stdout.flush()

Now we’re ready to test things out. Here’s a complete look at our implementation in case you want to see the big picture:

import threading
import boto3
import os
import sys
from boto3.s3.transfer import TransferConfig

BUCKET_NAME = "YOUR_BUCKET_NAME"

s3 = boto3.resource('s3')


def multi_part_upload_with_s3():
    # Multipart upload
    config = TransferConfig(multipart_threshold=1024 * 25, max_concurrency=10,
                            multipart_chunksize=1024 * 25, use_threads=True)
    file_path = os.path.dirname(__file__) + '/largefile.pdf'
    key_path = 'multipart_files/largefile.pdf'
    s3.meta.client.upload_file(file_path, BUCKET_NAME, key_path,
                               ExtraArgs={'ACL': 'public-read', 'ContentType': 'application/pdf'},
                               Config=config,
                               Callback=ProgressPercentage(file_path))


class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # To simplify we'll assume this is hooked up
        # to a single filename.
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(
                "\r%s  %s / %s  (%.2f%%)" % (
                    self._filename, self._seen_so_far, self._size,
                    percentage))
            sys.stdout.flush()

Let’s now add a main method to call our multi_part_upload_with_s3:

if __name__ == '__main__':
    multi_part_upload_with_s3()

Let’s hit run and see our multi-part upload in action:

Multipart upload progress in action

As you can see, we have a nice progress indicator and two size descriptors: the first for the bytes already uploaded and the second for the whole file size.

So this is basically how you implement multi-part upload on S3. There are definitely several ways to implement it; however, I believe this one is cleaner and sleeker.

Make sure to subscribe to my blog or reach me at niyazierdogan@windowslive.com for more great posts and surprises on my Udemy courses.

Have a great day!

