AWS S3 Multipart Upload/Download using Boto3 (Python SDK)

ankhipaul
Published in Analytics Vidhya
4 min read · Jul 3, 2020


We all work with huge data sets on a daily basis, and part of our job is to transfer that data with low latency :). Amazon Simple Storage Service (S3) can store objects of up to 5 TB, yet a single PUT operation can upload an object of at most 5 GB. For objects larger than 100 MB, Amazon suggests that customers consider using the Multipart Upload capability.

The AWS SDKs, the AWS CLI and the S3 REST API can all be used for Multipart Upload/Download. For the CLI, read this blog post, which is truly well explained.

We will be using the Python SDK for this guide. Before we start, you need to have your environment ready to work with Python and Boto3. If you haven’t set things up yet, please check out my previous blog post here.

First, we need to import boto3, which is the Python SDK for AWS. Then create an S3 resource with boto3 to interact with S3:

import boto3

s3_resource = boto3.resource('s3')

When uploading, downloading, or copying a file or S3 object, the AWS SDK for Python automatically manages retries as well as multipart and non-multipart transfers. For fine-grained control, the default settings can be configured to meet your requirements. A TransferConfig object holds these settings and is passed to a transfer method (upload_file, download_file) in the Config= parameter.

from boto3.s3.transfer import TransferConfig

# Sizes are given in bytes: 1024 * 1024 * 25 is 25 MB
config = TransferConfig(multipart_threshold=1024 * 1024 * 25,
                        max_concurrency=10,
                        multipart_chunksize=1024 * 1024 * 25,
                        use_threads=True)

Here’s an explanation of each element of TransferConfig:

multipart_threshold: Multipart uploads/downloads only happen if the size of a transfer exceeds this threshold, given in bytes. I have used 25 MB as an example.

max_concurrency: The maximum number of concurrent S3 API transfer operations (basically threads). Raise or lower it to increase or decrease bandwidth usage. The default is 10. If use_threads is set to False, the value provided is ignored.

multipart_chunksize: The size, in bytes, of each part of a multipart transfer. I have used 25 MB as an example.

use_threads: If True, parallel threads will be used when performing S3 transfers. If False, no threads will be used and the transfer runs sequentially.
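To make the interplay between multipart_threshold and multipart_chunksize concrete, here is a minimal sketch in plain Python (no AWS calls). The helper name plan_transfer is hypothetical, and it assumes a transfer switches to multipart once the file size reaches the threshold, so treat it as an illustration rather than Boto3's exact logic.

```python
import math

MB = 1024 * 1024  # sizes are expressed in bytes


def plan_transfer(file_size, multipart_threshold=25 * MB, multipart_chunksize=25 * MB):
    """Return (uses_multipart, number_of_parts) for a file of file_size bytes.

    Assumption: the transfer goes multipart once file_size reaches the threshold.
    """
    if file_size < multipart_threshold:
        return False, 1  # small file: one plain request
    # Every part is chunksize bytes, except possibly a smaller last part
    return True, math.ceil(file_size / multipart_chunksize)


print(plan_transfer(10 * MB))  # (False, 1)
print(plan_transfer(60 * MB))  # (True, 3)
```

With the 25 MB settings used above, a 60 MB file would be split into two full parts and one 10 MB part.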

After configuring TransferConfig, let’s call the S3 resource to upload a file:

import os

bucket_name = 'first-aws-bucket-1'

def multipart_upload_boto3():
    # Source file, located next to this script
    file_path = os.path.join(os.path.dirname(__file__), 'multipart_upload_example.pdf')
    key = 'multipart-test/multipart_upload_example.pdf'

    # ProgressPercentage is the callback class from the Boto3 documentation
    s3_resource.Object(bucket_name, key).upload_file(
        file_path,
        ExtraArgs={'ContentType': 'application/pdf'},
        Config=config,
        Callback=ProgressPercentage(file_path),
    )

- file_path: location of the source file that we want to upload to the S3 bucket.
- bucket_name: name of the destination S3 bucket.
- key: name of the key (S3 location) under which the file will be stored.
- ExtraArgs: extra arguments for the upload, passed as a Python dict. The valid upload arguments are listed in the Boto3 documentation (S3Transfer.ALLOWED_UPLOAD_ARGS).
- Config: the TransferConfig object created above.
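Since ExtraArgs is an ordinary dict, it can be built programmatically. The sketch below guesses the ContentType from the file name with the standard mimetypes module; build_extra_args is a hypothetical helper name, and the fallback to application/octet-stream is my own assumption about a sensible default.

```python
import mimetypes


def build_extra_args(file_path):
    """Build an ExtraArgs dict with a ContentType guessed from the file name."""
    content_type, _encoding = mimetypes.guess_type(file_path)
    # Fall back to a generic binary type when the extension is unknown
    return {'ContentType': content_type or 'application/octet-stream'}


print(build_extra_args('multipart_upload_example.pdf'))
# {'ContentType': 'application/pdf'}
```

The resulting dict can be passed as ExtraArgs=build_extra_args(file_path) in the upload_file call.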

Similarly, to download a file:

def multipart_download_boto3():
    # Destination path for the downloaded file, next to this script
    file_path = os.path.join(os.path.dirname(__file__), 'multipart_download_example.pdf')
    # Directory of this script, passed to the progress callback
    file_path1 = os.path.dirname(__file__)
    key = 'multipart-test/multipart_download_example.pdf'

    s3_resource.Object(bucket_name, key).download_file(
        file_path,
        Config=config,
        Callback=ProgressPercentage(file_path1),
    )

- bucket_name: name of the S3 bucket to download the file from.
- key: name of the key (S3 location) of the file to download (source).
- file_path: local path where the downloaded file will be written (destination).
- ExtraArgs: optional extra arguments, passed as a Python dict (not used in this example). The valid download arguments are listed in the Boto3 documentation (S3Transfer.ALLOWED_DOWNLOAD_ARGS).
- Config: the TransferConfig object created above.

Please note that I have used a progress callback so that I can track the transfer progress. Both the upload_file and download_file methods take an optional Callback parameter. The ProgressPercentage class is explained in the Boto3 documentation.
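For readers who do not have the documentation at hand, here is a minimal sketch of such a callback. It is loosely modeled on the ProgressPercentage class from the Boto3 documentation but is not a copy of it: the name TransferProgress and the explicit total_bytes argument are my own assumptions, chosen so that the same class can serve uploads and downloads.

```python
import sys
import threading


class TransferProgress:
    """Thread-safe progress callback, called with the number of bytes just transferred."""

    def __init__(self, name, total_bytes):
        self._name = name
        self._total = float(total_bytes)
        self.seen = 0
        # Callbacks may arrive from several worker threads at once
        self._lock = threading.Lock()

    @property
    def percentage(self):
        # Treat an empty object as already complete to avoid dividing by zero
        return (self.seen / self._total) * 100 if self._total else 100.0

    def __call__(self, bytes_amount):
        with self._lock:
            self.seen += bytes_amount
            sys.stdout.write('\r%s  %d / %d bytes  (%.2f%%)' % (
                self._name, self.seen, self._total, self.percentage))
            sys.stdout.flush()


progress = TransferProgress('example.pdf', total_bytes=200)
progress(50)
progress(50)
```

For an upload, total_bytes could be os.path.getsize(file_path); for a download, where the local file does not exist yet, it could come from s3_resource.Object(bucket_name, key).content_length.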

Interesting facts about Multipart Upload (which I learnt while practising):

  1. To check the integrity of the file, you can calculate its expected ETag before you upload and compare it with the ETag that S3 reports afterwards. Say you want to upload a 12 MB file and your part size is 5 MB. Calculate the three MD5 checksums corresponding to each part, i.e. the checksum of the first 5 MB, the second 5 MB, and the last 2 MB. Then take the MD5 checksum of their concatenation. Since MD5 checksums are usually shown as hex strings, make sure you take the MD5 of the concatenated raw binary digests, not of the ASCII or UTF-8 encoded hex strings. When that’s done, append a hyphen and the number of parts to get the ETag of the final object in S3.
  2. For a traditional PUT request, the ETag of the object is the MD5 checksum of the file. For multipart uploads, however, the ETag is calculated with the different algorithm described above.
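The ETag calculation described above can be sketched with the standard hashlib module. The function name multipart_etag is hypothetical, and the sketch assumes that data fitting into a single part is sent as a plain PUT and that the object is not encrypted with SSE-KMS or SSE-C, where ETags are not MD5-based.

```python
import hashlib


def multipart_etag(data, part_size):
    """Compute the expected ETag of `data` uploaded in parts of `part_size` bytes."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        # Assumption: a single part is sent as a plain PUT, so the ETag is the MD5
        return hashlib.md5(data).hexdigest()
    # Concatenate the raw binary digests of the parts, not their hex strings
    combined = b''.join(hashlib.md5(part).digest() for part in parts)
    return '%s-%d' % (hashlib.md5(combined).hexdigest(), len(parts))


# Small stand-in for the 12 MB file with 5 MB parts: 12 bytes, 5-byte parts
print(multipart_etag(b'a' * 12, part_size=5))
```

For a real file, the value can be compared with s3_resource.Object(bucket_name, key).e_tag after stripping the surrounding double quotes, provided part_size matches the multipart_chunksize used for the upload.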

Keep exploring and tuning the configuration of TransferConfig. Happy Learning!

For the complete code, visit GitHub.
