File processing with Apache Camel & AWS services

Rakesh Ravlekar · Published in Globant · Dec 23, 2020

It was a regular sprint until we picked up some upcoming file processing work. We assumed it would be the kind of CSV or Excel file processing we had done in the past, and started checking the initial Jira tickets.

So we had:

  1. Multiple source systems, each generating a different input file once a day.
  2. Each file with a different structure, containing fixed-width character data, a header, a footer, and a row count not exceeding 30k.
  3. Source and generated output files to be uploaded to a designated location.
  4. The ability to trigger email or other notifications to the Operations team for failed records.

Initially we thought the solution would be something like the setup below: bundle the code as a jar and run it on a server, maybe with a cron job or as a standalone application. Maybe we could have added some messaging queues.

But we were supposed to build the app as cloud native, on the client's existing AWS infrastructure. The jar-on-a-server approach wasn't the right fit for security (directory permissions, SMTP, DB access), cost, and storage, and it didn't use any of the existing AWS services.

Also, as the files contained fixed-width character data without any delimiters, neither Apache POI nor Spring Batch was a good fit.

So how to solve this? And here we began learning new concepts about Apache Camel and AWS services.

To start with, this was a perfect case for EIPs (Enterprise Integration Patterns), and Camel was the best fit. Why?

  • Camel offers mature APIs supporting many different source/destination endpoints, so you can easily integrate and swap them. An endpoint can be an FTP location, a JMS endpoint, or even an AWS S3 bucket.
  • It has an AWS S3 component which offers excellent support for reading files from and writing files to AWS S3.
  • Camel Bindy supports different file formats including fixed width (our use case), which helped us avoid boilerplate code to parse the files, headers, footers, etc.; see the sketch after this list.
  • With Camel you can easily write JUnit tests for the file processing code.
  • We also found that Camel has excellent community support; we got a prompt response on one of our file encoding issues. Link
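
To give a flavor of what this looks like in code, here is a minimal sketch, assuming a made-up fixed-width layout (the field names, widths, bucket names, and endpoint options are illustrative, not from our actual project): a Bindy model class plus a Camel route that reads a file from S3, unmarshals it, processes the records, and writes the output back.

```java
import java.util.List;

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.dataformat.bindy.annotation.DataField;
import org.apache.camel.dataformat.bindy.annotation.FixedLengthRecord;
import org.apache.camel.dataformat.bindy.fixed.BindyFixedLengthDataFormat;

// Hypothetical 52-character record: account id, name, amount.
@FixedLengthRecord(length = 52, paddingChar = ' ')
class AccountRecord {

    @DataField(pos = 1, length = 10)
    private String accountId;

    @DataField(pos = 11, length = 30)
    private String name;

    @DataField(pos = 41, length = 12)
    private String amount;

    // getters/setters omitted for brevity
}

public class FileProcessingRoute extends RouteBuilder {
    @Override
    public void configure() {
        BindyFixedLengthDataFormat bindy = new BindyFixedLengthDataFormat(AccountRecord.class);

        from("aws2-s3://input-bucket?deleteAfterRead=false")
            .unmarshal(bindy)   // fixed-width lines -> List<AccountRecord>
            .process(exchange -> {
                List<AccountRecord> records = exchange.getIn().getBody(List.class);
                // per-record validation / transformation goes here
            })
            .marshal(bindy)     // records -> fixed-width text again
            .to("aws2-s3://output-bucket?keyName=processed-output.txt");
    }
}
```

Swapping S3 for a local directory or an FTP server is just a change of the `from`/`to` URIs, which is exactly the endpoint flexibility mentioned above.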

So far so good, but then we had to decide how to invoke the file processing code and deploy it on AWS. For deploying the application, we had two options with AWS ECS (Elastic Container Service): EC2 (Elastic Compute Cloud) or Fargate.

In general, ECS is a good choice for containerizing an application and running it on an EC2 cluster; you can refer to the AWS documentation link. ECS launches your containers in your own Amazon VPC, allowing you to use your VPC security groups and network ACLs. No compute resources are shared with other customers. You can also assign granular access permissions to each of your containers using IAM, restricting which services and resources a container can access.

Amazon EC2 offers the broadest and deepest compute platform with choice of processor, storage, networking, operating system, and purchase model.

It was decided to go with the EC2 approach, as the client team preferred to have control over the EC2 configuration for security reasons, and the EC2 pricing model is more cost-effective. This article perfectly describes the difference between them: link.

But the next question was: how do we run these ECS tasks? And here comes AWS Lambda with S3 notification events.

AWS Lambda is an event-driven, serverless computing platform provided by Amazon as part of Amazon Web Services. It is a computing service that runs code in response to events and automatically manages the computing resources required by that code. Lambda has a maximum execution time of 15 minutes, and as it is serverless we pay only for execution time, based on the number of requests and the memory allocated to the Lambda function.

For storing files the obvious choice was Amazon Simple Storage Service (S3). It is an object storage service that offers industry-leading scalability, data availability, security, and performance. It also supports versioning and offers various storage classes based on business needs. Check the details here.

The Amazon S3 notification feature enables you to receive notifications when certain events happen in your bucket. To enable notifications, you must first add a notification configuration that identifies the events you want Amazon S3 to publish and the destinations where you want Amazon S3 to send the notifications.
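
Setting this up is typically done through the console or infrastructure-as-code, but for illustration, here is a sketch using the AWS SDK for Java v2 (the bucket name and Lambda ARN are placeholders):

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.Event;
import software.amazon.awssdk.services.s3.model.LambdaFunctionConfiguration;
import software.amazon.awssdk.services.s3.model.NotificationConfiguration;
import software.amazon.awssdk.services.s3.model.PutBucketNotificationConfigurationRequest;

public class EnableS3Notifications {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            // Ask S3 to invoke our (hypothetical) Lambda on every object-created event.
            s3.putBucketNotificationConfiguration(PutBucketNotificationConfigurationRequest.builder()
                .bucket("input-bucket")
                .notificationConfiguration(NotificationConfiguration.builder()
                    .lambdaFunctionConfigurations(LambdaFunctionConfiguration.builder()
                        .lambdaFunctionArn("arn:aws:lambda:us-east-1:123456789012:function:file-processor-trigger")
                        .events(Event.S3_OBJECT_CREATED)
                        .build())
                    .build())
                .build());
        }
    }
}
```

(The Lambda function also needs a resource-based policy allowing S3 to invoke it.)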

This was exactly what we needed: instead of polling and waiting for files to arrive, we used the S3 'new object created' event. As soon as a file arrives in the S3 bucket, it triggers a Lambda function (we used Java, but it can be written in Python or any other language supported by AWS Lambda), which in turn runs the ECS task asynchronously.
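
A minimal sketch of such a handler, assuming hypothetical cluster, task definition, and container names (the real wiring depends on your ECS setup):

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import software.amazon.awssdk.services.ecs.EcsClient;
import software.amazon.awssdk.services.ecs.model.ContainerOverride;
import software.amazon.awssdk.services.ecs.model.KeyValuePair;
import software.amazon.awssdk.services.ecs.model.LaunchType;
import software.amazon.awssdk.services.ecs.model.RunTaskRequest;
import software.amazon.awssdk.services.ecs.model.TaskOverride;

public class FileArrivalHandler implements RequestHandler<S3Event, String> {

    private final EcsClient ecs = EcsClient.create();

    @Override
    public String handleRequest(S3Event event, Context context) {
        event.getRecords().forEach(record -> {
            String bucket = record.getS3().getBucket().getName();
            // Note: keys arrive URL-encoded; decode if your file names contain special characters.
            String key = record.getS3().getObject().getKey();
            context.getLogger().log("File arrived: s3://" + bucket + "/" + key);

            // Kick off the Camel container as an ECS task; runTask returns
            // immediately, so the Lambda stays well within its 15-minute limit.
            ecs.runTask(RunTaskRequest.builder()
                .cluster("file-processing-cluster")
                .taskDefinition("camel-file-processor")
                .launchType(LaunchType.EC2)
                .overrides(TaskOverride.builder()
                    .containerOverrides(ContainerOverride.builder()
                        .name("camel-app")
                        .environment(
                            KeyValuePair.builder().name("INPUT_BUCKET").value(bucket).build(),
                            KeyValuePair.builder().name("INPUT_KEY").value(key).build())
                        .build())
                    .build())
                .build());
        });
        return "OK";
    }
}
```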

Hope the diagram below explains the final end-to-end flow clearly.

For email notifications we are planning to use AWS Simple Email Service (SES), a cost-effective, flexible, and scalable email service that enables developers to send mail from within any application.
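
Sending the actual notification is only a little code; here is a sketch with the AWS SDK for Java v2 (the addresses and wording are placeholders, and while an SES account is in sandbox mode both addresses must be verified):

```java
import software.amazon.awssdk.services.ses.SesClient;
import software.amazon.awssdk.services.ses.model.Body;
import software.amazon.awssdk.services.ses.model.Content;
import software.amazon.awssdk.services.ses.model.Destination;
import software.amazon.awssdk.services.ses.model.Message;
import software.amazon.awssdk.services.ses.model.SendEmailRequest;

public class FailureNotifier {

    private final SesClient ses = SesClient.create();

    // Mail the Operations team a summary of failed records for a file.
    public void notifyOps(String fileName, int failedCount) {
        ses.sendEmail(SendEmailRequest.builder()
            .source("noreply@example.com")
            .destination(Destination.builder().toAddresses("ops-team@example.com").build())
            .message(Message.builder()
                .subject(Content.builder().data("File processing failures: " + fileName).build())
                .body(Body.builder()
                    .text(Content.builder().data(failedCount + " records failed in " + fileName).build())
                    .build())
                .build())
            .build());
    }
}
```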

Challenges we faced:

  1. Apache Camel has a learning curve if you are a first-time user, and there are different ways to achieve the same result (XML vs. annotations, different Bindy formats, exception handling), so it was a bit tedious to debug and arrive at a running model.
  2. It was time-consuming to test the entire flow on the AWS dev environment, but fortunately we used LocalStack to simulate the AWS setup on our local machines, which helped a lot in speeding up our testing. LocalStack has some issues if you have to run it on a Windows machine, but we managed with Docker Desktop.
  3. We had an issue retaining accented characters in the files; there was a bug in the Camel APIs which was fixed in the 3.6.0 release, but it still wasn't behaving correctly on AWS. After further testing we resolved it by changing the AWS platform's default locale to LANG=en_US.UTF-8.
  4. Earlier we didn't set any timeout for the Camel code to end once file processing was done, which left multiple ECS tasks running in the background. Once we specified the max idle time in the Camel configuration (see the sketch after this list), the issue was resolved.
  5. For some complex files we recently observed that serial file processing was a bit slow, so we added the parallel processing support Camel provides; read more on the threading model here.
  6. Recently we had to read and generate a large file (> 1.5 GB), for which we had to use the getByteRange/multiPartUpload S3 APIs. You can read more details in this article written by my colleague here.
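
For challenges 4 and 5 above, here is a sketch of how the two fixes can be wired up, assuming a camel-main standalone application (the idle timeout value and route class are illustrative):

```java
import org.apache.camel.main.Main;

public class Application {
    public static void main(String[] args) throws Exception {
        Main main = new Main();
        // Challenge 4: shut down after 5 minutes with no messages in flight,
        // so the ECS task exits instead of idling forever.
        main.configure().withDurationMaxIdleSeconds(300);
        main.configure().addRoutesBuilder(new FileProcessingRoute());
        main.run(args);
    }
}
```

Challenge 5 is then a one-line change in the route itself: `.split(body()).parallelProcessing()` hands the records to a thread pool instead of processing them serially.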

So this is how we finally simplified file processing with Camel and AWS services. I hope you liked the write-up.

Please leave a comment if you have any questions or feedback!
