How to upload to S3

Sougata Khan · DigIO Australia · Apr 1, 2021
Uploading files to S3 with putObject, from memory and from an intermediary file, using Localstack and AWS. Multipart and async multipart uploads follow in Part 2.

Introduction

I recently had to work with S3 and the AWS Java SDK for a few different file operation scenarios. I was aware of a few approaches and learnt a few more; surprisingly, there is more than one right way of using the SDK. I'll be talking about the AWS SDK for Java 2, running on Java 11.

S3 Localstack configurations

Localstack is the best tool for working with AWS locally. It provides a local test framework for developing against AWS. In my experience, using actual AWS resources for local development is hard to maintain and work with, due to permission issues, costs and the need to be connected to AWS all the time. Using Localstack provides a good developer experience, both during onboarding and in day-to-day work.

Setting up Localstack for basic scenarios is straightforward using docker-compose; stay tuned for a future post on an advanced setup. However, on the application side, we have to ensure that the S3Client is initialised correctly to work with Localstack. As you would expect, the SDK connects to AWS by default, whereas we want it to connect to Localstack on localhost:4566 when developing locally.
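As a rough sketch, a minimal docker-compose service for Localstack with only S3 enabled could look something like the snippet below. The image tag, port mapping and environment values are assumptions to adapt to your own setup.

version: "3.8"
services:
  localstack:
    image: localstack/localstack
    ports:
      - "4566:4566"            # Localstack edge port that all AWS APIs are served through
    environment:
      - SERVICES=s3            # only start the S3 service
      - DEFAULT_REGION=ap-southeast-2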

Here I have overridden the endpoint of the S3 client to point to Localstack instead of AWS. The accessKey and secretKey can be any value but must be set. The region can be any valid AWS region, but it must be consistent across all your applications.

.pathStyleAccessEnabled(true) is an additional configuration we have to set for Localstack only, as the SDK uses DNS-style access to buckets on AWS, e.g. bucket-name.s3.amazonaws.com. On your local machine this would translate to bucket-name.localstack, which is not a valid hostname, so we switch to the older behaviour where access is path-based, like s3.amazonaws.com/bucket-name, which translates neatly to localstack/bucket-name.
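A minimal sketch of that client configuration is shown below; the endpoint, credentials and region values are placeholders rather than the exact code from this project.

import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.S3Configuration;

// Build an S3Client that talks to Localstack instead of AWS
S3Client s3 = S3Client.builder()
        .endpointOverride(URI.create("http://localhost:4566"))          // Localstack edge endpoint
        .credentialsProvider(StaticCredentialsProvider.create(
                AwsBasicCredentials.create("accessKey", "secretKey")))  // any non-empty values work
        .region(Region.AP_SOUTHEAST_2)                                  // any valid region, kept consistent
        .serviceConfiguration(S3Configuration.builder()
                .pathStyleAccessEnabled(true)                           // path-style access for Localstack
                .build())
        .build();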

Data generation

While most S3 upload and download operations can be performed to and from disk, I am interested in the flow where the data is generated by code. To achieve this, I have used a library called EasyRandom, which creates objects from a given class using random values for the fields.

This is an example of data generated by simply calling new EasyRandom().nextObject(SampleData.class).
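As an illustration, a hypothetical SampleData class and the call that populates it might look like this; the fields are made up for the example.

import org.jeasy.random.EasyRandom;

// A hypothetical payload class; EasyRandom fills every field with random values
public class SampleData {
    private String id;
    private String name;
    private int quantity;
    // getters and setters omitted for brevity
}

SampleData data = new EasyRandom().nextObject(SampleData.class);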

Simple S3 upload

The simplest way to upload to S3 is to use the putObject method.
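A minimal sketch of such a method is shown below; the bucket name, key, SampleData class and the use of Jackson for JSON serialisation are assumptions rather than the exact code from this project.

import java.util.ArrayList;
import java.util.List;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.jeasy.random.EasyRandom;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public long uploadFromMemory(S3Client s3, String bucket, String key, int count) throws Exception {
    // Generate N random objects and hold them all in memory
    EasyRandom easyRandom = new EasyRandom();
    List<SampleData> items = new ArrayList<>();
    for (int i = 0; i < count; i++) {
        items.add(easyRandom.nextObject(SampleData.class));
    }

    // Serialise the whole list into a single JSON array string
    String json = new ObjectMapper().writeValueAsString(items);

    // Upload the in-memory string with putObject
    s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
            RequestBody.fromString(json));

    // Verify by asking S3 for the size of the object we just uploaded
    return s3.headObject(HeadObjectRequest.builder().bucket(bucket).key(key).build())
            .contentLength();
}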

This method creates N objects and then serialises them into a JSON array. The final output is then sent to S3 via putObject. I then query the size of the uploaded object and return it; this step isn't necessary, but I'm using it as verification. While this is a simple technique, the downside is that we are limited by the available heap size and are certain to hit an Out of Memory (OOM) exception for large files.

S3 upload with an intermediary file

Since the SDK supports uploading from a file, we can always use an intermediary file between the data generation and the upload. This solves the memory problem to some extent. The total time increases, because we are now additionally writing to a file and then reading from it before uploading to S3. However, this could be a quick solution if you have access to enough disk space. It is not ideal when running on Lambda or Kubernetes, or if you want to optimise for time. With this change, I was able to upload larger files to Localstack without hitting an Out of Memory (OOM) exception.
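A sketch of the generation step, again with SampleData and Jackson as assumed placeholders, could look like this:

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.jeasy.random.EasyRandom;

// Stream each generated object straight to a temporary file, one JSON line at a time.
// 'count' is the number of objects to generate, defined elsewhere.
Path tempFile = Files.createTempFile("sample-data", ".json");
EasyRandom easyRandom = new EasyRandom();
ObjectMapper objectMapper = new ObjectMapper();

try (BufferedWriter writer = Files.newBufferedWriter(tempFile)) {
    for (int i = 0; i < count; i++) {
        writer.write(objectMapper.writeValueAsString(easyRandom.nextObject(SampleData.class)));
        writer.newLine();
    }
}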

By writing each object into the file one line at a time, we never hold more than a single object in memory.

Then we use the SDK to upload the file to S3. While this ensures that we will not encounter an Out of Memory (OOM) exception, it introduces the unnecessary steps of file writes and reads.
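The upload itself could then be as simple as the sketch below, using the same bucket, key and tempFile placeholders as before.

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// Let the SDK stream the file from disk instead of loading it into memory
s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
        RequestBody.fromFile(tempFile));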

Conclusion

Localstack is really good for development and basic validation of your code flow, but anything beyond a few megabytes starts getting slow to test and doesn't reflect how AWS will behave in terms of performance. I ran both the simple and the file-backed uploads from a single t3.medium instance, and here are the average response times over multiple iterations.

It's quite clear that file-based uploads outperform simple uploads, but even they fall short beyond 5GB of data. We encounter the following exception:

software.amazon.awssdk.services.s3.model.S3Exception: Your proposed upload exceeds the maximum allowed size (Service: S3, Status Code: 400, Request ID: 3CMX1G0GQ9338S7E)

This is well-documented and expected behaviour, as AWS only supports uploads of up to 5GB in a single putObject call. The only solution beyond this is to use multipart uploads.

In the next blog in the series, I will demonstrate how to efficiently upload files larger than 100MB using the Multipart upload feature provided in the SDK.

Update: Part 2 is now up; feel free to continue reading this series at How to multipart upload to AWS S3.

Originally published at https://codeiscanon.com/how-to-upload-to-aws-s3/ on April 25, 2021.
