Upload files to AWS S3 using Apache Flume

Amit Sadafule
inspiringbrilliance
3 min read · Aug 21, 2020

When you choose Apache Flume, there is no out-of-the-box S3 sink available (at least as of the date of this post). There is, however, one option for uploading files to S3: the HDFS sink. This sink uses Hadoop’s AWS module to do the upload. You can explore more about this module here.

Apache Hadoop provides the following three filesystems for reading and writing data to S3.

  • S3N (URI scheme: s3n): A native filesystem for reading and writing regular files on S3. S3N requires a suitable version of the jets3t JAR on the classpath. The maximum supported file size is 5 GB.
  • S3A (URI scheme: s3a): Hadoop’s successor to the S3N filesystem. It supports partitioned uploads for many-GB objects (more than 5 GB) and uses Amazon’s Java S3 SDK, with support for the latest S3 features, authentication schemes, and more. Stable support for S3A was added in Hadoop 2.7; the previous “s3” and “s3n” connectors are deprecated and/or deleted from recent Hadoop versions.
  • S3 (URI scheme: s3): Apache Hadoop’s implementation of a block-based filesystem backed by S3. Apache Hadoop has deprecated this filesystem as of May 2016.

The Pain

  • Even though the above sounds promising and encouraging, using the HDFS sink to upload files to S3 is very painful if you don’t know which versions of the AWS libraries, Hadoop libraries, and Flume to use. It took me almost a day to figure out the correct versions.
  • After the versions comes the issue of AWS secret keys. You can give the S3 path to the HDFS sink as below:
agent.sinks.sinkName.hdfs.path = s3a://<aws_access_key>:<aws_secret_key>@testbucket

If the secret key contains a /, this URL does not work. This is an open issue in Flume. You have to generate a secret key which does not have a slash in it.

  • AWS region signing protocol version issue: the latest AWS regions use the V4 signing process. Because of this, the above HDFS S3 path will not work out of the box; you have to provide the fs.s3a.endpoint setting to Hadoop, as in the snippet below. Details of this issue can be found here, and the list of regions supporting the older signing process can be found here. For the Mumbai region, I had to provide fs.s3a.endpoint. This new process was added by AWS for enhanced security.
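For reference, fs.s3a.endpoint is a regular Hadoop property; assuming the standard regional S3 endpoint for Mumbai (ap-south-1), the relevant snippet looks like this:

<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.ap-south-1.amazonaws.com</value>
</property>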
  • How to automate providing all the required libraries, with suitable versions, to Flume?

The Solution

After facing all of the above problems and countless Google searches, I was finally able to successfully configure Flume to upload files to S3 using the HDFS sink. The solution is as follows.

  • Download and extract Apache Flume 1.9.0.
  • Set JAVA_HOME. I used Java 11.0.2.
  • Add the pom file below in the root folder of Flume (alongside the /bin folder). This pom will help you download all the necessary libraries into Flume’s /lib folder.
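A sketch of such a pom is below: it declares the Hadoop/S3A dependencies and binds the maven-dependency-plugin’s copy-dependencies goal to the process-sources phase so that the jars land in /lib. The Hadoop version, artifact coordinates, and plugin version shown here are illustrative assumptions; use whichever combination you have verified against Flume 1.9.0.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>flume-s3-dependencies</artifactId>
  <version>1.0</version>
  <packaging>pom</packaging>

  <dependencies>
    <!-- hadoop-aws provides the s3a filesystem and pulls in a matching aws-java-sdk-bundle -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.1.2</version>
    </dependency>
    <!-- hadoop-common supplies the FileSystem API that Flume's HDFS sink builds on -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>3.1.2</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <!-- Copies every declared dependency (and its transitive jars) into ./lib during process-sources -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <version>3.1.1</version>
        <executions>
          <execution>
            <phase>process-sources</phase>
            <goals>
              <goal>copy-dependencies</goal>
            </goals>
            <configuration>
              <outputDirectory>${project.basedir}/lib</outputDirectory>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>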
  • Go to the Flume root directory from the command line (to the location where pom.xml is placed) and run the command below.
mvn process-sources
  • Add a core-site.xml file with the minimum configuration below and save it to the /conf folder under Flume (alongside the Flume properties file). This way, you get support for the AWS region V4 signing process and a solution for the slash in the AWS secret key.
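A minimal sketch of such a core-site.xml, using the standard s3a properties, looks like this; the key values are placeholders, and the endpoint assumes the Mumbai (ap-south-1) region mentioned earlier:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Credentials live here instead of in the s3a:// URI, so a slash in the secret key is no longer a problem -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_AWS_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_AWS_SECRET_KEY</value>
  </property>
  <!-- Pointing s3a at the region-specific endpoint enables the V4 signing process -->
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.ap-south-1.amazonaws.com</value>
  </property>
</configuration>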
  • Add the HDFS path as below (without secret and access keys) in the Flume config file:
agent.sinks.sinkName.type = hdfs
agent.sinks.sinkName.hdfs.path = s3a://testbucket
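The sink does not work in isolation, so for context here is a minimal flume-conf.properties sketch around it. The spooling-directory source, the file channel, and the local paths are illustrative assumptions; the agent name matches the -n agent flag used in the run command below.

# Component names (agent name matches the -n flag in the run command)
agent.sources = src1
agent.channels = ch1
agent.sinks = sinkName

# Spooling-directory source: picks up files dropped into a local folder (path is illustrative)
agent.sources.src1.type = spooldir
agent.sources.src1.spoolDir = /tmp/flume-spool
agent.sources.src1.channels = ch1

# Durable file channel (paths are illustrative)
agent.channels.ch1.type = file
agent.channels.ch1.checkpointDir = /tmp/flume/checkpoint
agent.channels.ch1.dataDirs = /tmp/flume/data

# HDFS sink writing to S3 via s3a; credentials and endpoint come from conf/core-site.xml
agent.sinks.sinkName.type = hdfs
agent.sinks.sinkName.channel = ch1
agent.sinks.sinkName.hdfs.path = s3a://testbucket
agent.sinks.sinkName.hdfs.fileType = DataStream
agent.sinks.sinkName.hdfs.rollInterval = 300
agent.sinks.sinkName.hdfs.rollSize = 0
agent.sinks.sinkName.hdfs.rollCount = 0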
  • The entire folder structure is as follows:
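Reconstructed from the steps above (the extracted folder name assumes the stock Flume 1.9.0 binary distribution):

apache-flume-1.9.0-bin/
├── bin/
├── conf/
│   ├── core-site.xml
│   ├── flume-conf.properties
│   └── ... (other default conf files)
├── lib/                 <- Flume's own jars plus the jars copied by mvn process-sources
└── pom.xml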
  • Now run Flume with the command below from the root folder of Flume:
bin/flume-ng agent -n agent -c conf -f conf/flume-conf.properties -Dflume.root.logger=INFO,console -Dflume.monitoring.type=http -Dflume.monitoring.port=34545
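Because the command enables HTTP monitoring on port 34545, Flume also exposes its counters (events received, events written by the HDFS sink, and so on) as JSON, which is handy for checking that files are actually flowing to S3:

curl http://localhost:34545/metrics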

If everything goes well, you should be able to upload files to S3 using Apache Flume.

Amit Sadafule
Solution Consultant at Sahaj Software Solutions. Tech enthusiast and strong believer in immutable applications.