EXPEDIA GROUP TECHNOLOGY — SOFTWARE

How to Upload Large Files to AWS S3

Using Amazon’s CLI to reliably upload up to 5 terabytes

Harish Kotha
Expedia Group Technology

--

A person pushes a cardboard box labeled “5GB files” into a giant computer monitor.
Image by the author

In a single operation, you can upload up to 5GB into an AWS S3 object. An S3 object itself can range from 0 bytes to 5 terabytes, so if you want to upload a file larger than 5GB you need to either use multipart upload or split the file into logical chunks of up to 5GB each and upload them manually as regular uploads. I will explore both options.
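For comparison, a regular single-call upload (for objects up to 5GB) uses the put-object command. This is just a sketch; small.csv.gz is a hypothetical file, and the bucket and key follow the same placeholder names used throughout this post:

$ aws s3api put-object \
--bucket bucket1 \
--key temp/user1/small.csv.gz \
--body small.csv.gz \
--profile dev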

Multipart upload

Performing a multipart upload involves splitting the file into smaller files, uploading the parts using the CLI, and verifying them. The file manipulations below are demonstrated on a UNIX-like system.

  1. Before you upload a file using the multipart upload process, calculate its base64 MD5 checksum value:

$ openssl md5 -binary test.csv.gz | base64
a3VKS0RazAmJUCO8ST90pQ==

2. Split the file into smaller files using the split command:

Syntax
split [-b byte_count[k|m]] [-l line_count] [file [name]]

-b Create smaller files byte_count bytes in length
('k' = kilobyte pieces, 'm' = megabyte pieces)
-l Create smaller files line_count lines in length

Splitting the file into 4GB blocks:

$ split -b 4096m test.csv.gz test.csv.gz.part-
$ ls -l test*
-rw-r--r--@ 1 user1 staff 7827069512 Aug 26 16:20 test.csv.gz
-rw-r--r-- 1 user1 staff 4294967296 Aug 26 16:36 test.csv.gz.part-aa
-rw-r--r-- 1 user1 staff 3532102216 Aug 26 16:36 test.csv.gz.part-ab

3. Now, initiate the multipart upload using the create-multipart-upload command. If the checksum that Amazon S3 calculates during the upload doesn’t match the value you provided, Amazon S3 won’t store the object; instead, you receive an error message in response. This step generates an upload ID, which is used to upload each part of the file in the next steps:

$ aws s3api create-multipart-upload \
--bucket bucket1 \
--key temp/user1/test.csv.gz \
--metadata md5=a3VKS0RazAmJUCO8ST90pQ== \
--profile dev
{
    "AbortDate": "2020-09-03T00:00:00+00:00",
    "AbortRuleId": "deleteAfter7Days",
    "Bucket": "bucket1",
    "Key": "temp/user1/test.csv.gz",
    "UploadId": "qk9UO8...HXc4ce.Vb"
}

Explanation of the options:

--bucket bucket name

--key object name (can include the path of the object if you want to upload to any specific path)

--metadata Base64 MD5 value generated in step 1

--profile CLI credentials profile name, if you have multiple profiles
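If you lose track of the upload ID, you can list all in-progress multipart uploads on a bucket. A quick sketch, reusing the same bucket and profile:

$ aws s3api list-multipart-uploads \
--bucket bucket1 \
--profile dev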

4. Next, upload the first smaller file from step 2 using the upload-part command. This step generates an ETag, which is used in later steps:

$ aws s3api upload-part \
--bucket bucket1 \
--key temp/user1/test.csv.gz \
--part-number 1 \
--body test.csv.gz.part-aa \
--upload-id qk9UO8...HXc4ce.Vb \
--profile dev
{
    "ETag": "\"55acfb877ace294f978c5182cfe357a7\""
}

Where:

--part-number file part number

--body file name of the part being uploaded

--upload-id upload ID generated in step 3
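As a sanity check, the ETag returned for a part is normally the hex MD5 digest of that part’s data (this does not hold for uploads encrypted with SSE-KMS or SSE-C), so you can compare it against a local digest; the value should match the ETag returned above:

$ openssl md5 test.csv.gz.part-aa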

5. Upload the second and final part using the same upload-part command with --part-number 2 and the second part’s filename:

$ aws s3api upload-part \
--bucket bucket1 \
--key temp/user1/test.csv.gz \
--part-number 2 \
--body test.csv.gz.part-ab \
--upload-id qk9UO8...HXc4ce.Vb \
--profile dev
{
    "ETag": "\"931ec3e8903cb7d43f97f175cf75b53f\""
}
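If the file splits into many parts, repeating the command by hand gets tedious. Here is a minimal bash sketch that uploads every part in order, assuming the part files sort correctly by name and reusing the upload ID from step 3:

n=1
for part in test.csv.gz.part-*; do
  aws s3api upload-part \
    --bucket bucket1 \
    --key temp/user1/test.csv.gz \
    --part-number "$n" \
    --body "$part" \
    --upload-id qk9UO8...HXc4ce.Vb \
    --profile dev
  n=$((n+1))
done

Each call prints that part’s ETag, which you still need to record for step 7.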

6. To make sure all the parts have been uploaded successfully, you can use the list-parts command, which lists all the parts that have been uploaded so far:

$ aws s3api list-parts \
--bucket bucket1 \
--key temp/user1/test.csv.gz \
--upload-id qk9UO8...HXc4ce.Vb \
--profile dev
{
    "Parts": [
        {
            "PartNumber": 1,
            "LastModified": "2020-08-26T22:02:06+00:00",
            "ETag": "\"55acfb877ace294f978c5182cfe357a7\"",
            "Size": 4294967296
        },
        {
            "PartNumber": 2,
            "LastModified": "2020-08-26T22:23:13+00:00",
            "ETag": "\"931ec3e8903cb7d43f97f175cf75b53f\"",
            "Size": 3532102216
        }
    ],
    "Initiator": {
        "ID": "arn:aws:sts::575835809734:assumed-role/dev/user1",
        "DisplayName": "dev/user1"
    },
    "Owner": {
        "DisplayName": "aws-account-00183",
        "ID": "6fe75e...e04936"
    },
    "StorageClass": "STANDARD"
}

7. Next, create a JSON file containing the ETags of all the parts:

$ cat partfiles.json
{
    "Parts": [
        {
            "PartNumber": 1,
            "ETag": "55acfb877ace294f978c5182cfe357a7"
        },
        {
            "PartNumber": 2,
            "ETag": "931ec3e8903cb7d43f97f175cf75b53f"
        }
    ]
}
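Rather than assembling this file by hand, you can generate it from list-parts with the CLI’s --query option. A sketch reusing the same upload; note that the ETags come back with their surrounding quotes, which complete-multipart-upload also accepts:

$ aws s3api list-parts \
--bucket bucket1 \
--key temp/user1/test.csv.gz \
--upload-id qk9UO8...HXc4ce.Vb \
--profile dev \
--query '{Parts: Parts[].{PartNumber: PartNumber, ETag: ETag}}' \
--output json > partfiles.json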

8. Finally, finish the upload process using the complete-multipart-upload command as below:

$ aws s3api complete-multipart-upload \
--multipart-upload file://partfiles.json \
--bucket bucket1 \
--key temp/user1/test.csv.gz \
--upload-id qk9UO8...HXc4ce.Vb \
--profile dev
{
    "Expiration": "expiry-date=\"Fri, 27 Aug 2021 00:00:00 GMT\", rule-id=\"deleteafter365days\"",
    "VersionId": "TsD.L4ywE3OXRoGUFBenX7YgmuR54tY5",
    "Location": "https://bucket1.s3.us-east-1.amazonaws.com/temp%2Fuser1%2Ftest.csv.gz",
    "Bucket": "bucket1",
    "Key": "temp/user1/test.csv.gz",
    "ETag": "\"af58d6683d424931c3fd1e3b6c13f99e-2\""
}

9. Now our file is stored in S3 as a single object.

Screenshot of the AWS console showing a single S3 object of 7.3GB
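You can also confirm the object (and the MD5 metadata attached in step 3) from the command line with head-object, using the same bucket and key as above:

$ aws s3api head-object \
--bucket bucket1 \
--key temp/user1/test.csv.gz \
--profile dev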

The following table provides multipart upload core specifications. For more information, see Multipart upload overview.

Screenshot of table of capacities (upload size, part count, etc.) from the AWS web site at the link provided above

Finally, multipart upload is a useful way to store the file as a single object in S3 instead of uploading it as multiple separate objects (each less than 5GB).
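One housekeeping note: if an upload fails partway through, or you simply want to start over, the parts already uploaded keep consuming storage until the multipart upload is aborted (or a lifecycle rule such as the AbortRuleId shown in step 3 cleans it up). A sketch using the same names:

$ aws s3api abort-multipart-upload \
--bucket bucket1 \
--key temp/user1/test.csv.gz \
--upload-id qk9UO8...HXc4ce.Vb \
--profile dev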


Split and upload

The multipart upload process requires special permissions, which can be time-consuming to obtain in many organizations. Alternatively, you can split the file manually and do a regular upload of each part.

Here are the steps:

  1. Uncompress the file if it is compressed.
  2. Split the file based on the number of lines in each file. If it is a CSV file, you can use parallel --header to copy the header into each split file. Here I am splitting after every 2 million records:
$ cat test.csv \
| parallel --header : --pipe -N2000000 'cat >file_{#}.csv'

3. Compress each split file again using the gzip <filename> command and upload each one manually as a regular upload, as sketched below.
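A minimal sketch of steps 2 and 3 together, assuming the split produced files named file_1.csv, file_2.csv, and so on, and reusing the hypothetical bucket and path from earlier:

for f in file_*.csv; do
  gzip "$f"
  aws s3 cp "$f.gz" s3://bucket1/temp/user1/ --profile dev
done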

http://lifeatexpediagroup.com
