Implementing AWS S3 Multipart Uploads
This post is co-authored with Todd Morse.
At the Chan Zuckerberg Initiative (CZI), one of the technology tools we’ve built to support open science is Chan Zuckerberg ID (CZ ID), a free, cloud-based metagenomics platform for researchers to rapidly identify new and emerging infectious diseases. To perform analysis with CZ ID, users begin by uploading files containing genomic sequencing data that can be as large as several gigabytes. If a user’s file upload fails midway through, they have to restart the process from the very beginning. That can make it difficult to ever successfully upload a large file to CZ ID over an unreliable internet connection, which is a particular concern for global users. To make these interruptions less costly, we recently implemented resumable AWS S3 multipart uploads in CZ ID, with the goal of increasing the reliability and success rate of large file uploads.
The basis for our upload solution was the `Upload` library from @aws-sdk/lib-storage in the AWS JS SDK. The `Upload` library “allows for easy and efficient uploading of buffers, blobs, or streams, using a configurable amount of concurrency to perform multipart uploads where possible.” In other words, it’s a complete out-of-the-box solution for performing multipart uploads of large files. However, one critical feature that we required was the ability to resume failed uploads, which the `Upload` library did not support. Below, we’ve captured our approach to adding resumable uploads, along with solutions to a few challenges we encountered along the way.
Configuring CORS to Receive ETags for Completing Multipart Upload
As described in the AWS documentation, S3 returns an `ETag` header for each part of a file upload. After all parts of a file are uploaded, the `ETag` value for each part must be sent back to S3 to complete the multipart upload. A typical web application that accesses data in S3 will require CORS configuration for the appropriate domains. By default, the S3 CORS configuration does not expose the `ETag` header, which means the web application can’t read it for each uploaded part and therefore can’t complete the multipart upload.
To fix this, the S3 CORS configuration needs to be updated to expose the headers required to complete multipart uploads. See below for a sample `ExposeHeaders` section of an S3 JSON CORS configuration:
"ExposeHeaders": ["ETag"]
For more information, see the AWS documentation for an example CORS configuration.
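For context, here is a minimal sketch of how a complete CORS rule might be assembled, written in TypeScript so it can be serialized to the JSON that S3 expects. The origin `https://app.example.org` and the `CorsRule` interface are assumptions made for illustration, and the allowed methods are one plausible set rather than the configuration CZ ID uses. The `x-amz-checksum-sha256` entry is only needed when SHA-256 checksums are enabled, which the next paragraph covers.

```typescript
// Shape of one rule in an S3 CORS configuration (JSON form).
interface CorsRule {
  AllowedOrigins: string[];
  AllowedMethods: string[];
  AllowedHeaders: string[];
  ExposeHeaders: string[];
}

// Hypothetical origin; replace with the web application's real domain.
const APP_ORIGIN = "https://app.example.org";

const corsRules: CorsRule[] = [
  {
    AllowedOrigins: [APP_ORIGIN],
    // PUT uploads parts; POST creates and completes the multipart upload;
    // GET lists parts that were already uploaded.
    AllowedMethods: ["GET", "PUT", "POST"],
    AllowedHeaders: ["*"],
    // Headers the browser is allowed to read from S3 responses.
    ExposeHeaders: ["ETag", "x-amz-checksum-sha256"],
  },
];

// JSON that could be pasted into the bucket's CORS configuration.
const corsJson: string = JSON.stringify(corsRules, null, 2);
console.log(corsJson);
```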
As part of our multipart uploads feature, we leveraged AWS’s built-in support for several common checksum algorithms in S3. When using an AWS SDK, checksums are enabled by passing the `ChecksumAlgorithm` parameter when creating the S3 Client. However, as with the `ETag` header, the checksums are returned in headers that S3 does not expose by default, so the S3 CORS configuration must be updated with the desired checksum header. See below for a sample `ExposeHeaders` section of an S3 JSON CORS configuration:
"ExposeHeaders": ["x-amz-checksum-sha256"]
Supporting Resumable Multipart Uploads in the AWS Upload Class
One benefit of using multipart uploads is the ability to resume uploads that fail. Since the CZ ID CLI uploader implements multipart uploads, our goal was to bring the web application in line with the CLI.
In the AWS JS SDK, the `Upload` class uses multipart uploads for files that are large enough (greater than 5 MB) to be split into parts, but it does not let users resume an upload after it has been interrupted. We still wanted to use the `Upload` class, since it provides a clean and well-tested TypeScript implementation of file uploads.
To meet this challenge, we created a fork of the AWS JS SDK, which allows for resumable uploads. In our fork, the `Upload` class optionally accepts an upload ID. When a new multipart upload is created, the multipart upload ID is passed back to the caller via the callback function. If an upload ID is passed in when instantiating the `Upload` class, then it will query S3 for existing uploaded parts and upload only the parts that have not already been uploaded. This pull request in our fork implements the resuming functionality.
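The core of the resume step is deciding which parts are still missing. The sketch below illustrates that idea only and is not the fork's actual code. The function name `remainingParts`, the `PartRange` type, and the assumption of a fixed part size are all hypothetical. In practice the list of uploaded part numbers would come from querying S3 for the existing upload ID.

```typescript
// Part numbers are 1-based, matching S3's multipart API.
interface PartRange {
  partNumber: number;
  start: number; // inclusive byte offset into the file
  end: number; // exclusive byte offset into the file
}

// Given the file size, a fixed part size, and the part numbers S3 already
// holds, return the byte ranges that still need to be uploaded.
function remainingParts(
  fileSize: number,
  partSize: number,
  uploadedPartNumbers: number[],
): PartRange[] {
  if (partSize <= 0) throw new Error("partSize must be positive");
  const uploaded = new Set(uploadedPartNumbers);
  const totalParts = Math.ceil(fileSize / partSize);
  const remaining: PartRange[] = [];
  for (let partNumber = 1; partNumber <= totalParts; partNumber++) {
    if (uploaded.has(partNumber)) continue; // already in S3, skip it
    const start = (partNumber - 1) * partSize;
    // The final part is whatever is left over, so clamp to the file size.
    const end = Math.min(start + partSize, fileSize);
    remaining.push({ partNumber, start, end });
  }
  return remaining;
}
```

For example, `remainingParts(25, 10, [1])` returns parts 2 and 3, covering bytes 10 to 20 and 20 to 25 (end exclusive), so only those two slices of the file would be sent again.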
For better file integrity, our forked `Upload` class also verifies that the SHA256 checksum for each uploaded part matches the file part locally, which protects against modifications to the local file after the original upload attempt.
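A minimal sketch of that per-part check follows, assuming Node's built-in `crypto` module is available. S3 reports SHA-256 part checksums as base64 strings, so the local part is hashed and encoded the same way before comparing. The helper names `sha256Base64` and `partMatchesRemote` are hypothetical and do not come from the fork.

```typescript
import { createHash } from "node:crypto";

// Hash the bytes with SHA-256 and encode the digest as base64,
// the encoding S3 uses for its checksum values.
function sha256Base64(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("base64");
}

// True when the local bytes for a part still match the checksum that S3
// recorded for the same part during the original upload attempt.
function partMatchesRemote(
  localPart: Uint8Array,
  remoteChecksum: string,
): boolean {
  return sha256Base64(localPart) === remoteChecksum;
}
```

If any previously uploaded part fails this comparison, the local file has changed since the first attempt, and resuming would produce a corrupt object.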
One last thing to note about the resume feature is that the AWS entity (user, group, or role) that performs the upload needs permission to the action `s3:ListMultipartUploadParts` to get the list of previously uploaded file parts.
Downloading Your Uploads
At the outset of implementing these capabilities through AWS, uploads were working perfectly. However, when we started testing the feature end-to-end, the checksums failed when our pipeline downloaded the files for processing. After some investigation, we determined it was a problem with how we were downloading the files to begin with. To make downloads faster internally, we had already begun leveraging multipart downloads with s3parcp. s3parcp is a CLI tool that we made as a wrapper around the AWS Go SDK’s parallelized multipart downloader. We have had excellent performance by leveraging s3parcp, especially on machines with high bandwidth.
That said, s3parcp was downloading file parts that were a different size from the parts we initially uploaded, which is an issue because of how checksums are computed. When dealing with large files, we want to avoid computing the checksum of the whole file at once. To accomplish this, S3 stores a checksum for each part, then combines the part checksums into a single checksum for the whole object (a checksum of the part checksums rather than of the raw bytes). For the final value to be the same, the parts must have the same boundaries on download as on upload, so that the checksum for each part is consistent.
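To see why part size matters, consider how the whole-object value is built for a multipart upload with SHA-256: S3 hashes the concatenated raw digests of the parts, then appends a dash and the number of parts. The sketch below reproduces that scheme locally, assuming Node's built-in `crypto` module. The function name `compositeSha256` is hypothetical, and the in-memory `Uint8Array` stands in for a file that would normally be streamed.

```typescript
import { createHash } from "node:crypto";

// Compute a composite checksum: the base64 SHA-256 of the concatenated
// per-part SHA-256 digests, followed by "-<number of parts>".
function compositeSha256(data: Uint8Array, partSize: number): string {
  if (partSize <= 0) throw new Error("partSize must be positive");
  const outer = createHash("sha256");
  let partCount = 0;
  for (let offset = 0; offset < data.length; offset += partSize) {
    // subarray clamps to the end of the data, so the last part may be short.
    const part = data.subarray(offset, offset + partSize);
    // Feeding each raw digest in order is the same as hashing their concatenation.
    outer.update(createHash("sha256").update(part).digest());
    partCount++;
  }
  return `${outer.digest("base64")}-${partCount}`;
}
```

Calling this on the same bytes with two different part sizes yields two different values, which is the mismatch described above: the data is identical, but the part boundaries are not.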
The easiest way to keep these checksums consistent would be to set the same part size for downloading as we were using for uploading. But the problem with this approach is that we have multiple ways to upload files — some go through the flow described earlier while others are multipart uploads from other clients and even direct S3 transfers that have different part sizes.
To account for this challenge, we updated the s3parcp downloader to determine the correct part size to use for downloading. Before downloading an object, the downloader first retrieves the size of the object’s first part using the `ListParts` endpoint in S3, limiting the number of parts listed to one. That size is then used as the part size for the whole multipart download. We only need the size of the first part because every object we upload uses the same size for all of its parts except the final part, which is just the remainder of the object.
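Once the first part's size is known, the download plan follows directly. s3parcp itself is written in Go, so the TypeScript sketch below is only an illustration of the arithmetic, not its implementation. The names `downloadRanges` and `ByteRange` are hypothetical, and the ranges use inclusive end offsets as in an HTTP `Range` header.

```typescript
// Inclusive byte range, as used in an HTTP "Range: bytes=start-end" header.
interface ByteRange {
  start: number;
  end: number;
}

// Split an object into download ranges that mirror the upload's parts:
// every range has the first part's size except the last, which is the remainder.
function downloadRanges(objectSize: number, firstPartSize: number): ByteRange[] {
  if (firstPartSize <= 0) throw new Error("firstPartSize must be positive");
  const ranges: ByteRange[] = [];
  for (let start = 0; start < objectSize; start += firstPartSize) {
    const end = Math.min(start + firstPartSize, objectSize) - 1;
    ranges.push({ start, end });
  }
  return ranges;
}
```

For a 25-byte object whose first part is 10 bytes, `downloadRanges(25, 10)` gives the ranges 0-9, 10-19, and 20-24, matching the boundaries used at upload time.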
Out-of-the-box solutions like AWS S3 multipart uploads can make large file uploads far more reliable, but unforeseen challenges are par for the course with any platform integration. By sharing a summary of the primary challenges we encountered while implementing multipart uploads, we hope to provide a helpful resource for others who run into similar issues.
Lastly, we’d also like to thank the CZ ID engineering and quality assurance teams for contributions in the design, review, and testing of this feature.