What are the options for using AWS S3 from my Java App?

Satyajit Paul
8 min read · Mar 16, 2019


AWS S3 gives your Java application multiple options for reading and writing files in the S3 object store. The question is: which option is best for a Java application? The answer is: it depends. We will look at a simple example implementation first and then go over the options one by one. By the end of this article you should have a fairly good idea of which option suits your needs.

The simplest option is to use S3Client to upload a single file.

Exhibit 1
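For reference, here is a minimal sketch of this kind of upload (not necessarily identical to Exhibit 1), using the AWS SDK for Java v2 S3Client. The region, bucket name, key, and file path are placeholders, and credentials come from the default provider chain.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.nio.file.Paths;

public class SimpleS3Upload {
    public static void main(String[] args) {
        // Credentials are picked up from the default provider chain
        // (env variables, ~/.aws/credentials, or the EC2 instance profile).
        try (S3Client s3 = S3Client.builder()
                .region(Region.US_EAST_1)          // region of your bucket (placeholder)
                .build()) {

            PutObjectRequest request = PutObjectRequest.builder()
                    .bucket("my-bucket")           // placeholder bucket name
                    .key("uploads/sample.dat")     // object key in S3
                    .build();

            // Single PUT request; the whole file goes up in one shot.
            s3.putObject(request, RequestBody.fromFile(Paths.get("/tmp/sample.dat")));
        }
    }
}
```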

The example above is good for simple use cases, but things get complicated as you start considering factors like:

- the time it takes to read or write a file (latency)
- the number of files to be transferred in a given time (throughput)
- the sizes of the files to be transferred (transfer rate)
- whether multiple Java apps need to write to the same file
- immediate availability in S3
- whether multiple Java apps must have the same view of the files written (consistency vs. eventual consistency)
- whether files are written once or updated multiple times
- whether all Java apps are in the same network or spread across geographies (distributed application)
- whether your app runs in the same AWS region/availability zone as the S3 bucket or in a different environment (same cloud availability zone vs. on-premise)
- whether you want to avoid vendor lock-in and keep the AWS S3 bucket replaceable by a similar service from Google, IBM, or Microsoft
- and finally, whether it is an existing application trying to leverage S3 or a net-new application being developed.

As you add each of these factors to your list of requirements, the nature of the solution changes. For this article, I will look only at file write or upload scenarios. So, let's go over the available options for uploading files to AWS S3.

Write a single file or string content to S3 using the AWS SDK for Java

This is good for single-file operations, but it will not scale if you need to upload a large file (> 5 GB) or a large number of files, and it doesn't support multipart upload. I tried uploading a 224 MB file from my home MacBook and it took around 60–90 sec to complete. Results were not much different when I tried from servers running outside the specific AWS region. However, when I tried from an EC2 instance in the same region as the S3 bucket, it was much better: 3.7 to 3.8 sec.

Exhibit 2
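For the string-content case, here is a similar minimal sketch, again with placeholder names and using the v2 S3Client:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class StringContentUpload {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {
            String content = "order-id=12345,status=SHIPPED";   // any in-memory content

            PutObjectRequest request = PutObjectRequest.builder()
                    .bucket("my-bucket")                        // placeholder bucket name
                    .key("orders/order-12345.txt")              // placeholder key
                    .contentType("text/plain")
                    .build();

            // The string becomes the object body in a single PUT.
            s3.putObject(request, RequestBody.fromString(content));
        }
    }
}
```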

Multipart file upload to S3

This is good for large file uploads and gives you the flexibility of a low-level implementation where you split a large file into multiple smaller parts. Note that AWS S3 doesn't support uploading a file larger than 5 GB in a single operation, so you need multipart upload if you have to deal with large files. However, it may not be an optimal choice for a large number of small files or files of mixed sizes. If you don't know the sizes of the files upfront, you will have to handle that optimization in your code. I tried uploading the same 224 MB file using the multipart upload API from my MacBook; it took around 95–98 sec to complete. The file was broken into parts of 5 MB each for upload, and I am guessing my home network was the bottleneck here. Results were better when I tried uploading from servers, but still not good enough: it took more than 40 sec. From an EC2 instance in the same region as the S3 bucket, results were much better at 9.1 to 16.3 sec, while the S3Client-based simple upload from the same EC2 instance was still much faster at ~4 sec.

Exhibit 3
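A minimal sketch of a low-level multipart upload with the v2 SDK, splitting the file into 5 MB parts as described above. The bucket, key, and path are placeholders, and production code should also abort the upload on failure.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class MultipartS3Upload {
    private static final int PART_SIZE = 5 * 1024 * 1024;   // 5 MB, S3's minimum part size

    public static void main(String[] args) throws IOException {
        String bucket = "my-bucket";          // placeholder
        String key = "uploads/big-file.dat";  // placeholder

        // Region and credentials come from the default chain.
        try (S3Client s3 = S3Client.create();
             FileInputStream in = new FileInputStream("/tmp/big-file.dat")) {

            // 1. Start the multipart upload and remember the upload id.
            String uploadId = s3.createMultipartUpload(
                    CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build())
                    .uploadId();

            // 2. Upload the file in 5 MB parts, collecting the returned ETags.
            List<CompletedPart> parts = new ArrayList<>();
            int partNumber = 1;
            byte[] part;
            while ((part = in.readNBytes(PART_SIZE)).length > 0) {   // Java 11+
                UploadPartResponse response = s3.uploadPart(
                        UploadPartRequest.builder()
                                .bucket(bucket).key(key)
                                .uploadId(uploadId)
                                .partNumber(partNumber)
                                .build(),
                        RequestBody.fromBytes(part));
                parts.add(CompletedPart.builder()
                        .partNumber(partNumber)
                        .eTag(response.eTag())
                        .build());
                partNumber++;
            }

            // 3. Ask S3 to stitch the parts together into the final object.
            s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                    .bucket(bucket).key(key)
                    .uploadId(uploadId)
                    .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build())
                    .build());
        }
    }
}
```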

Using AWS Transfer Manager for file upload to S3

This is the best option if you are using the AWS SDK for uploading to S3. It can handle a single file, a list of files, or a whole directory. Internally it takes care of the different scenarios and gives you options to track progress or pause an upload. Interestingly, for the same 224 MB file from the same home network, it took around 38–40 sec to upload. It performed much better from the regular servers, roughly 12 to 20 sec, and on an EC2 instance in the same region it was around 7 sec.

Exhibit 4
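The Transfer Manager behind these numbers most likely came from the AWS SDK for Java v1, which is where Transfer Manager lived at the time of writing; that is an assumption on my part. A minimal sketch along those lines, with placeholder bucket, key, and paths:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

import java.io.File;

public class TransferManagerUpload {
    public static void main(String[] args) throws InterruptedException {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1")          // region of your bucket (placeholder)
                .build();

        // TransferManager decides internally whether to use a single PUT
        // or a parallel multipart upload, based on the file size.
        TransferManager tm = TransferManagerBuilder.standard()
                .withS3Client(s3)
                .build();
        try {
            Upload upload = tm.upload("my-bucket",             // placeholder bucket
                    "uploads/sample.dat",                      // object key
                    new File("/tmp/sample.dat"));              // local file
            upload.waitForCompletion();                        // blocks until done

            // Directory upload works the same way:
            // tm.uploadDirectory("my-bucket", "uploads/", new File("/tmp/data"), true)
            //   .waitForCompletion();
        } finally {
            tm.shutdownNow();   // also shuts down the underlying S3 client
        }
    }
}
```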

Use a FUSE-based file system like s3fs-fuse

s3fs-fuse can map an S3 bucket to a local drive. Once that's done, you can do regular file operations like copy, list, and delete, and they will be reflected in your S3 bucket. There is an option to cache the files in a local temp directory. This is a pretty popular open-source option for mounting an S3 bucket as a local file system, but I was not that lucky: almost 50% of the time it didn't work, and when it did work it took around 65–90 sec to complete the upload of the same 224 MB file. Adding a cache location didn't help much for the upload scenario.
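Once the bucket is mounted, the application just uses ordinary file APIs. A sketch, assuming a hypothetical mount point of /mnt/s3bucket:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class MountedBucketCopy {
    public static void main(String[] args) throws IOException {
        // /mnt/s3bucket is a hypothetical s3fs mount point for the bucket.
        Path source = Paths.get("/tmp/sample.dat");
        Path target = Paths.get("/mnt/s3bucket/uploads/sample.dat");

        Files.createDirectories(target.getParent());   // "directories" become key prefixes
        // A plain local copy; s3fs translates it into S3 calls behind the scenes.
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);

        // Listing and deleting work like on any other file system.
        try (var entries = Files.list(target.getParent())) {
            entries.forEach(System.out::println);
        }
    }
}
```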

s3fs-fuse works with AWS S3 as well as other S3-API-compatible object stores, such as offerings from Google and IBM.

Use s3proxy for file upload

s3proxy makes the object store accessible over an HTTP endpoint. Users don't have to mount a drive on the servers where the application is hosted; instead, applications use the AWS S3 API against the s3proxy endpoint to access the objects. Because the actual storage backend sits behind the proxy, you can change the underlying object store at any time and your application code can continue to use the S3 API to access the files. I have not run any tests for this, but I would expect some additional overhead on top of the backend store's performance due to the extra HTTP layer.
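I have not tested s3proxy, but with the v2 SDK, pointing the client at a different S3-compatible endpoint is just a builder setting. In this sketch the endpoint URL is a hypothetical local s3proxy address and the bucket, key, and path are placeholders:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.S3Configuration;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.net.URI;
import java.nio.file.Paths;

public class S3ProxyUpload {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.builder()
                .region(Region.US_EAST_1)                              // still required by the SDK
                .endpointOverride(URI.create("http://localhost:8080")) // hypothetical s3proxy endpoint
                .serviceConfiguration(S3Configuration.builder()
                        .pathStyleAccessEnabled(true)                  // proxies usually expect path-style URLs
                        .build())
                .build()) {

            s3.putObject(PutObjectRequest.builder()
                            .bucket("my-bucket")                       // placeholder bucket
                            .key("uploads/sample.dat")
                            .build(),
                    RequestBody.fromFile(Paths.get("/tmp/sample.dat")));
        }
    }
}
```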

Use AWS Storage Gateway > File Gateway to upload files to S3

AWS File Gateway gives you a couple of different options for mounting your S3 bucket. All of them require a dedicated VM that acts as the File Gateway: you can use an EC2 instance if your application runs in the AWS cloud, or VMware ESXi or Microsoft Hyper-V if it runs on premise. I could only try the option of running the File Gateway on an EC2 instance. It took around 4 sec to upload the 224 MB file to an S3 bucket in the same region as the EC2 instance.

Use a hardware-based File Gateway for S3

There is an off-the-shelf solution: the AWS Storage Gateway > File Gateway hardware appliance, which is AWS Storage Gateway pre-loaded on a Dell EMC PowerEdge server. It supports NFS mounting like the other File Gateway options. I don't have latency data for my usual 224 MB file transfer, but given that this is a hardware-based commercial solution, I would expect it to be the top performer. If you have any data based on your experience with this appliance, I would love to update this post with your findings, needless to say with appropriate credit :).

Here is a summary of all the performance/latency data I gathered through testing.

Exhibit 5

When you weigh all the options above and the time taken to upload a 224 MB file, keep in mind that on an EC2 instance it takes less than 1 sec to copy a 224 MB file to a local drive using Java IO, and around 0.3–0.7 sec using Java NIO.
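Those local-copy numbers are straightforward to reproduce. Here is a rough sketch of the two approaches being compared; the paths are placeholders and this is a crude timing, not a proper benchmark:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class LocalCopyTiming {
    public static void main(String[] args) throws IOException {
        String src = "/tmp/sample-224mb.dat";   // placeholder input file

        // Classic stream-based Java IO copy.
        long start = System.nanoTime();
        try (FileInputStream in = new FileInputStream(src);
             FileOutputStream out = new FileOutputStream("/tmp/copy-io.dat")) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) > 0) {
                out.write(buffer, 0, read);
            }
        }
        System.out.println("Java IO copy : " + (System.nanoTime() - start) / 1_000_000 + " ms");

        // Java NIO copy via Files.copy.
        start = System.nanoTime();
        Files.copy(Paths.get(src), Paths.get("/tmp/copy-nio.dat"), StandardCopyOption.REPLACE_EXISTING);
        System.out.println("Java NIO copy: " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```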

So, if you are planning to move your Java app's file storage from a regular file system or NFS to AWS S3, and the app may run in your on-prem data center as well as in the cloud, the biggest factor to consider is latency. A jump from 0.5 or 2.5 sec to 30 or 80 sec is a big jump.

If you know your application will always run in AWS EC2 environments (including Lambda, Elastic Beanstalk, etc.), and you don't plan to write into an S3 bucket in a different region, then go with the appropriate AWS S3 API. If your need is writing a single small-to-medium (< 200 MB) file, you are fine with the simple S3Client put operation. However, if you have to deal with multiple files, files of varying sizes (including very large files > 5 GB), or directory uploads, go with the Transfer Manager option.

If your application needs to run in the AWS cloud as well as on premise and wants to leverage S3, then go with the Storage Gateway > File Gateway options; you have plenty of them, including an appliance. In terms of performance, using Storage Gateway in an EC2 environment is comparable to writing to S3 using the native AWS SDK APIs (~3.9 sec). It also allows you to keep using the Java File IO/NIO APIs in your application, and for an existing application you save precious engineering bandwidth by not having to rewrite it to use S3 APIs. This approach also lets you use offerings like Dell EMC ECS Object Storage as an NFS-mounted drive. Dell EMC ECS Object Storage (and others in this category) can work with object stores from multiple vendors, such as Amazon, Google, Azure, and IBM, so with solutions like it you can avoid vendor lock-in. Last, with all these solutions, your application can update the same file multiple times using the cached copy instead of reading the file (object) from the object store for every small change.

Conclusion:

If your application runs on EC2 instances in the same region as the S3 bucket, then writing natively to S3 and using the Storage Gateway > File Gateway perform at almost the same level: ~4 sec for a 224 MB file.

If your application runs in your own data center or in a different AWS region than the S3 bucket, then writing to S3 directly using the AWS SDK APIs is not an option due to the high network overhead. You must go with either the File Gateway or one of the appliance-based S3 storage options.

If your application requires rewriting an existing file multiple times, then DON'T use S3. A jump from 0.3 sec to 3.9 sec is a 13x jump in write time. Simply put, S3 is not meant for low-latency file writes. AWS S3 is good for write-once-read-many (WORM) usage. No one uses S3 for files used by an OS or a DB ;).

The source code used for this article is available at https://github.com/satyapaul/AWSS3Clients. It is all sample code shared on the AWS website; I made some minor changes.

I appreciate your comments and feedback.
