Serverless ETL using AWS Lambda - Part 1
My intent with this article is to showcase a simple example of the AWS Lambda service and show how a lambda function can be used to get notified when files matching a specific pattern are put on an s3 bucket. In later parts of this series I will showcase applications of AWS Lambda for various tasks such as loading files into the DynamoDB service, refreshing visualizations for d3js charts, building image processing modules, and so on. The motivation for this article comes from a situation I faced at my workplace: a scheduled process that parsed files from a source s3 bucket suddenly started failing when the frequency of file uploads to s3 became erratic. The file processing script had to be taken off the schedule and run manually after checking that files were available on s3.
In similar situations one could use an AWS Lambda function to trigger actions whenever a file is put on s3 or modified. The steps are simple, and the end use of a lambda function is limited only by your imagination. For example, one could build an entire ETL process out of lambda functions and scale up as necessary. This part covers a basic notification on file upload that prints the file and bucket names.
What is AWS Lambda
AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume — there is no charge when your code is not running. For further reading visit AWS Lambda.
Prerequisites:
Since s3 is used to trigger the lambda function, we need an s3 bucket. If you need an easy tutorial on how to create one, follow the link here: Create an Amazon s3 bucket
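If you prefer to script this step, a bucket can also be created with boto3 (the Python AWS SDK). Here is a minimal sketch, assuming the us-east-1 region; the bucket name below is a placeholder and must be globally unique:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Bucket names are global across AWS, so pick something unique.
# Outside us-east-1 you would also pass a CreateBucketConfiguration.
s3.create_bucket(Bucket="my-lambda-etl-demo-bucket")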
Assuming you have set up an s3 bucket, let's see how the lambda function can be built. I think the easiest way to create AWS Lambda functions, especially if you are just looking to get your feet wet, is through the AWS Console. Go to your AWS console and click Lambda. In future articles I will take up creating lambdas programmatically using the AWS CLI. For now, the GUI is quick and easy.
In the Functions menu, click Create function.
You will be presented with three options: Author from scratch, Blueprints, and Serverless Application Repository.
Author from scratch gives you little more than a hello-world stub. Blueprints provide a set of templates with prebuilt code for common lambda functionality. In this case, choose the s3-get-object-python template.
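For reference, the handler this blueprint generates looks roughly like the sketch below (treat it as an approximation, not the exact template): it pulls the bucket and key out of the s3 event record and fetches the object's metadata.

import urllib.parse
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Each s3 event record carries the bucket name and the (URL-encoded) object key.
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        print("CONTENT TYPE: " + response['ContentType'])
        return response['ContentType']
    except Exception as e:
        print('Error getting object {} from bucket {}.'.format(key, bucket))
        raise e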
On the next screen there are a few things to take care of. In the basic information section, choose a name for your function. Make sure you choose Create new role from template; this role defines the access level for your lambda function. Since the lambda only needs to read information about an s3 bucket, the bare minimum is s3 read-only permission. Give the role a name; you may want to reuse it in the future.
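If you later want to attach the same permission to a role from code, the AWS-managed AmazonS3ReadOnlyAccess policy covers this use case. A minimal boto3 sketch; the role name is hypothetical:

import boto3

iam = boto3.client("iam")

# "lambda-s3-notify-role" is a placeholder; use whatever you named your role.
iam.attach_role_policy(
    RoleName="lambda-s3-notify-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)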
In the next section, specify the options for the s3 bucket. Pick the name of the bucket you created in the prerequisites, or refer to that section for more information. For Event Type, choose Put so that the lambda function is triggered only when a file is put on the bucket. You can use the optional parameters to narrow the trigger criteria, such as a prefix for the files you will upload and a filter on file extensions.
Make sure you select Enable trigger so that Lambda can create this trigger on s3 and automatically add the required permissions.
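Under the hood, enabling the trigger writes a notification configuration onto the bucket. The equivalent boto3 call looks roughly like this; the bucket name, function ARN, and filter values are all placeholders:

import boto3

s3 = boto3.client("s3")

# Substitute your own bucket name and the ARN shown on your function's page.
s3.put_bucket_notification_configuration(
    Bucket="my-lambda-etl-demo-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:s3-notify",
                "Events": ["s3:ObjectCreated:Put"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "data"},
                            {"Name": "suffix", "Value": ".json"},
                        ]
                    }
                },
            }
        ]
    },
)

Note that this call only succeeds if the function already allows s3 to invoke it; the console grants that permission for you when you enable the trigger.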
In the Lambda function code section, you may want to replace the print statement with a slightly more detailed version. This will print the bucket name, file name (key), and content type of each uploaded file that meets the filter criteria defined above.
print("Bucket name: " + bucket + ", key: " + key + ", CONTENT TYPE: " + response['ContentType'])
Save the lambda function; now it's time for a quick test. Create a small file and give it a name that matches the filter criteria, e.g. data123xyz.json. To upload this file to s3, go to the s3 console and click Upload.
Click through the next set of s3 windows, keeping the default options. If everything went alright, you should be able to see the results of the lambda function in the CloudWatch logs.
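If you prefer to upload from code instead of the console, a one-liner with boto3 does the same thing; the bucket name is again a placeholder:

import boto3

# The object key must match the prefix/suffix filters configured on the trigger.
boto3.client("s3").upload_file("data123xyz.json", "my-lambda-etl-demo-bucket", "data123xyz.json")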
Click on the log group for your lambda function and you should see its output. As you can see below, the third line of the logs shows the bucket name, file name (key), and content type from the print statement.
Now you can customize the lambda function to suit your requirements. Hope this helped, and please follow for future updates.