Implementing a Serverless Batch File Processing Application

Udith Gunaratna
Published in Think Serverless
7 min read · Feb 11, 2018

Processing batches of files at scheduled time intervals is one of the most common automated tasks in today's enterprise application systems. This type of batch file processing can serve various use cases, such as creating backups, analyzing logs, and performing calculations.

In this article, we are going to build a batch file processing solution that fulfills the following requirements.

  1. An external application uploads files to a preconfigured location at random time intervals (this application is out of the scope of our implementation).
  2. Our application checks this file location at 1-hour intervals and processes all the files that currently exist there, one by one.
  3. After a file is successfully processed, it is deleted to prevent duplicate processing.
  4. After all the files in the current batch are processed, a notification with the processing details (total number of processed files, number of successes and number of failures) is sent to one or more pre-configured email addresses.

There are two main approaches to building this solution: the traditional approach and the modern serverless approach. Let's first briefly look at the traditional approach and its drawbacks, and then use the serverless approach to build the solution.

Traditional Approach

If we consider the traditional approach, the full solution will consist of the following components.

  • An SFTP server to which the files are uploaded by the external application and from which our application will read them.
  • An email server or a third-party email service provider, which is used by our application to send out notification emails.
  • An application server which will run our application.
  • Our application, which has a cron task scheduled to trigger a routine at 1-hour intervals; this routine fetches the file list from the SFTP server, processes each file, and finally sends the notification email.

Although this approach works very well and has been used for many years, there are a few drawbacks associated with it. The major one is cost. Here you need to maintain an SFTP server, an application server, and probably an email server as well. Even if the external application hasn't uploaded any files for a day or two, you still have to keep the SFTP server running 24x7 and pay for it, as you don't know when a file will be uploaded. The same goes for the application server. Depending on the kind of processing you do, a single batch run may take only a couple of minutes. So until the next batch process is triggered, your application server is not doing any effective work, unless you deploy some of your other applications on it as well.

In addition to this, you also need to develop a fairly complex application, which will contain not only your processing logic but also a significant amount of logic for interacting with other systems such as the SFTP server and the email server.

Serverless Approach

But the serverless paradigm, which is becoming more and more prominent nowadays, makes this scenario surprisingly simple and cost-effective. For example, if you use AWS S3 for file storage, you pay only for the storage your files actually consume. And if you use AWS Lambda for processing, you pay only for the actual processing time, and not a cent for the idle time.

So let’s see how we can develop the above solution with serverless components, specifically with AWS components.

[Diagram: Serverless batch file processing application architecture]

The above diagram shows how we can integrate the AWS components to build our solution.

  1. An S3 bucket is provisioned to store the files until they are processed.
  2. An SNS topic is configured to publish processing notifications, and the required email addresses are subscribed to it.
  3. A Lambda function is programmed, with the necessary permissions, to read the files from the S3 bucket, process them, delete the processed files, and finally send a notification to the SNS topic.
  4. A CloudWatch scheduled event is configured to trigger the Lambda function at 1-hour intervals.

Creating the S3 bucket

To create the S3 bucket, log in to your AWS account and go to the S3 management console.

If you don’t have an AWS account, no need to worry. You can create one absolutely for free!

Then click on the Create Bucket button, provide a unique name for the bucket and click Create.

After the bucket is created, select it to open the details panel. Then click on the Copy Bucket ARN button to get the ARN and note it down somewhere.
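If you prefer to script this step instead of using the console, a minimal sketch with the aws-sdk for Node.js could look like the following. The bucket name here matches the one used in the permission policy later in this article; bucket names are globally unique, so choose your own.

const AWS = require('aws-sdk');
const s3 = new AWS.S3({ region: 'us-east-1' });

// Bucket names are globally unique, so replace this with your own name
s3.createBucket({ Bucket: 'batch-process-bucket-udith' }, (err, data) => {
    if (err) return console.error('Bucket creation failed:', err);
    console.log('Bucket created at:', data.Location);
});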

Creating the SNS Topic

As the next step, go to the SNS (Simple Notification Service) management console and click on the Create Topic sub-heading. Then provide a topic name and a display name (which must be at most 10 characters long) and click the Create Topic button.

When the topic is created, note the ARN at the top of the topic details page and then click on the Create Subscription button to add your email address to receive notifications. In the Create Subscription form, select Email as the protocol and provide your email address as the endpoint. When you click Create Subscription, a confirmation link will be sent to the provided email address. Click on that link to confirm the subscription.
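This step can also be scripted if you prefer. The following is a minimal sketch using the aws-sdk; the topic name matches the ARN used later in this article, and the email address is a placeholder.

const AWS = require('aws-sdk');
const sns = new AWS.SNS({ region: 'us-east-1' });

sns.createTopic({ Name: 'BatchProcessNotifications' }, (err, data) => {
    if (err) return console.error('Topic creation failed:', err);
    console.log('Topic ARN:', data.TopicArn);

    // The recipient still has to click the confirmation link emailed by SNS
    sns.subscribe({
        TopicArn: data.TopicArn,
        Protocol: 'email',
        Endpoint: 'you@example.com'  // placeholder address
    }, (subErr) => {
        if (subErr) console.error('Subscription failed:', subErr);
    });
});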

Creating the Lambda

Then go to the Lambda console and click on the Create Function button to create a new Lambda function. Provide a suitable name for your function and choose Node.js 6.10 as the runtime (you can select any other runtime as well, but in this article I'm providing the code only for Node.js). Then, from the Role dropdown, select Create a custom role, which will open the AWS IAM console in a new tab.

There, provide a suitable name for the new role and click Allow to create a role with the basic Lambda execution permissions. Then select Choose an existing role from the Role dropdown and select the newly created role from the next dropdown. Finally, click Create Function to create the Lambda function.
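For completeness, here is a minimal sketch of creating the same function programmatically with the aws-sdk. The function name, role ARN and zip file path are placeholders; the role must already exist (e.g. the one created above), and the zip file must contain your handler code.

const fs = require('fs');
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({ region: 'us-east-1' });

lambda.createFunction({
    FunctionName: 'batch-file-processor',  // placeholder name
    Runtime: 'nodejs6.10',
    Role: 'arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_LAMBDA_ROLE>',  // the role created above
    Handler: 'index.handler',
    Code: { ZipFile: fs.readFileSync('function.zip') }  // zip containing index.js
}, (err, data) => {
    if (err) return console.error('Function creation failed:', err);
    console.log('Created function:', data.FunctionArn);
});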

Writing the lambda application logic

The next thing we need to do is write the application logic of the Lambda function to achieve our intended behaviour. The following is a minimal sample that does exactly that; the bucket name and topic ARN match the ones used in the permission policy later in this article, so replace them, and the per-file processing logic, with your own.
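const AWS = require('aws-sdk');

const s3 = new AWS.S3();
const sns = new AWS.SNS();

// Placeholders: replace with your own bucket name and topic ARN
const BUCKET_NAME = 'batch-process-bucket-udith';
const TOPIC_ARN = 'arn:aws:sns:us-east-1:<YOUR_ACCOUNT_ID>:BatchProcessNotifications';

exports.handler = (event, context, callback) => {
    // Get the list of files currently in the S3 bucket
    s3.listObjects({ Bucket: BUCKET_NAME }, (err, data) => {
        if (err) {
            return callback(err);
        }
        if (data.Contents.length === 0) {
            return notify('No files to process', callback);
        }

        let succeeded = 0;
        let failed = 0;
        let pending = data.Contents.length;

        data.Contents.forEach((file) => {
            // Custom processing logic goes here; this sample only logs the key
            console.log('Processing file:', file.Key);

            // Delete the processed file to prevent duplicate processing
            s3.deleteObject({ Bucket: BUCKET_NAME, Key: file.Key }, (delErr) => {
                delErr ? failed++ : succeeded++;

                // When the whole batch is done, send the summary notification
                if (--pending === 0) {
                    notify('Processed ' + (succeeded + failed) + ' file(s): ' +
                        succeeded + ' succeeded, ' + failed + ' failed', callback);
                }
            });
        });
    });
};

// Publish a notification message to the SNS topic
function notify(message, callback) {
    sns.publish({ TopicArn: TOPIC_ARN, Message: message }, callback);
}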

In this code, we first get the list of files that are currently in the S3 bucket using the AWS S3 SDK's listObjects() method. Then each file is processed according to custom logic (I have only logged the file name in this sample) and finally deleted using the deleteObject() method.

At each final stage (file processing finished, file processing failed or no files to process), we use AWS SNS SDK’s publish() method to send a notification to the SNS topic.

After adding this code to your lambda function, click Save button at the top to persist the changes.

Granting required permissions to Lambda

Although we have programmed our Lambda function to access the S3 bucket and publish to the SNS topic, it still doesn't have the necessary permissions to do so. If you remember, when we created the Lambda function, we also created a new role with the basic Lambda execution permissions. We need to add these additional permissions to the same role for the Lambda function to work correctly.

For that, go to the IAM console and edit the newly created role, adding a new inline policy with the following permissions.

  • For S3 Bucket — s3:ListBucket, s3:DeleteObject
  • For SNS Topic — sns:Publish

You can either use the AWS visual editor or directly edit the JSON if you are familiar with the structure. However, the final permission configuration should be similar to the following. Note that s3:ListBucket applies to the bucket itself, while s3:DeleteObject applies to the objects inside it, so both the bucket ARN and the corresponding object ARN (with the /* suffix) should be listed as resources.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "sns:Publish",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:sns:us-east-1:<YOUR_ACCOUNT_ID>:BatchProcessNotifications",
                "arn:aws:s3:::batch-process-bucket-udith",
                "arn:aws:s3:::batch-process-bucket-udith/*"
            ]
        }
    ]
}

Configuring the CloudWatch Event

The last step is to add a CloudWatch scheduled event which will trigger the Lambda function at 1-hour intervals. For that, go to the CloudWatch console, click on Events in the sidebar and click on the Create Rule button.

Then select Schedule as the event type and 1 Hour as the fixed rate. Add the previously created Lambda function as a target and click on the Configure Details button. After that, provide a name for the rule (and optionally a description) and click Create Rule.
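If you are scripting the setup, the following sketch creates an equivalent rule with the aws-sdk; the rule name and function ARN are placeholders. Note that, unlike the console, this route also requires a lambda.addPermission() call so that CloudWatch Events is allowed to invoke the function.

const AWS = require('aws-sdk');
const events = new AWS.CloudWatchEvents({ region: 'us-east-1' });
const lambda = new AWS.Lambda({ region: 'us-east-1' });

// Placeholder ARN of the previously created Lambda function
const FUNCTION_ARN = 'arn:aws:lambda:us-east-1:<YOUR_ACCOUNT_ID>:function:batch-file-processor';

events.putRule({
    Name: 'batch-process-schedule',
    ScheduleExpression: 'rate(1 hour)'
}, (err, rule) => {
    if (err) return console.error('Rule creation failed:', err);

    // Allow CloudWatch Events to invoke the function (the console does this
    // automatically when you add a Lambda target)
    lambda.addPermission({
        FunctionName: FUNCTION_ARN,
        StatementId: 'allow-cloudwatch-schedule',
        Action: 'lambda:InvokeFunction',
        Principal: 'events.amazonaws.com',
        SourceArn: rule.RuleArn
    }, (permErr) => {
        if (permErr) return console.error('Adding permission failed:', permErr);

        // Point the scheduled rule at the Lambda function
        events.putTargets({
            Rule: 'batch-process-schedule',
            Targets: [{ Id: 'batch-file-processor', Arn: FUNCTION_ARN }]
        }, (targetErr) => {
            if (targetErr) console.error('Adding target failed:', targetErr);
        });
    });
});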

Now we have successfully completed our serverless batch processing application. You can upload some files to the newly created S3 bucket and check whether you get an email notification after they have been processed. (For that, you may need to either edit the CloudWatch trigger to reduce the schedule to a shorter period, such as 2 minutes, or wait up to an hour for the event to be triggered. :))

If you are interested, you can refer to the second part of this article on how to build the same serverless solution with much less time and effort.

Call To Action

  • Clap. Appreciate and let others find this article.
  • Comment. Share your views on this article.
  • Follow me, Udith Gunaratna, to receive updates on articles like this.
  • Keep in touch. LinkedIn, Twitter
