AWS Lambda: Easily process millions of files at low cost.

Or how we saved design and coding time by applying the right solution to our problem.


The Problem

At FairFly, like many other companies, we securely store our historical data in an AWS service called S3.

Amazon S3 (Simple Storage Service) is a web service offered by Amazon Web Services (AWS). Amazon S3 provides storage through web services interfaces. (source: Wikipedia)

Generally, you drop your files there, and after a while your buckets amass a huge number of files.

One of the FairFly buckets looks like this:

Already a few million files in just one bucket

Recently, we had a task to reprocess many of these files. Of course, we searched for the fastest and cheapest solution.

This is a walkthrough of our thinking process, which eventually led us to the conclusion that AWS Lambda would let us accomplish this super fast and at a significantly lower cost!

So, we started brainstorming how to process these millions of XML files…

1/ Basic — Download, Process, Upload

In essence, this option consists of downloading all files locally, processing them and putting them back to S3.

For this option, you would need a lot of free space on your hard disk (200 GB in our case) and some patience…

How much time would it take?

$ time aws s3 cp --quiet s3://test_bucket/test_smallfiles/file.xml .
0.42s user 0.22s system 28% cpu 2.294 total

Let’s say 2 seconds to copy one file locally and 2 more seconds to upload it back, text processing time being insignificant. So, for 4,500,000 files it would take around 200 days.
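
As a quick back-of-the-envelope check, here is that estimate in Python (a minimal sketch, assuming the ~4 seconds of round-trip transfer per file measured above):

# Rough sequential estimate: ~2 s to download + ~2 s to upload per file
files = 4_500_000
seconds_per_file = 4
days = files * seconds_per_file / 86_400  # 86,400 seconds in a day
print(f"{days:.0f} days")  # -> 208 days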

2/ Parallelize the Basic solution

We could scale the above solution horizontally, but of course this comes with a list of constraints and issues to deal with.

Parallelization

By which variable should we split the work? How many threads? Should the files be sorted first so they can be dispatched efficiently?

Reaching an acceptable solution by trial and error would take time. Moreover, it would require coding, which is something we try to avoid when working on non-feature-related tasks.

Failure recovery

If one of the tasks fails or gets stuck, you need to know which files have already been processed and continue from there once it has recovered. Again, more code is needed here… Have you ever tried to implement such a failure-recovery mechanism?

Time

First, there is the processing time. Let’s suppose that a pool of 100 threads divides the total time by 100 (it won’t, but let’s be generous): that still leaves 2 full days of processing.

Second, there is the development time. Between creating the infrastructure, choosing a programming language, and writing the script that parallelizes the whole thing, it could cost a few more days.

Memory & CPU

It is well known that text processing is resource-intensive. Even for simple processing, a machine’s resource limits can be reached quickly. So we would need either one powerful machine or multiple machines. Of course, multiple machines increase complexity and cost!

Cost

Under these conditions, a pretty powerful machine must be used. Given the processing time involved, it could be quite expensive.

Think about all this! 🤯

The main point is not the issues enumerated above; taken one by one, each of them can be handled. But why should you deal with all of this if you don’t have to?

3/ Use a distributed computing framework

It is comparable to the previous solution, but with a good framework it could be done much more efficiently. In our case, we could consider a framework like Apache Spark.

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. (source: Wikipedia)

Still, you would need a big cluster to get this done in a reasonable time, and it would cost boatloads of money. 💵💵💵

4/ Go Serverless with AWS Lambda!

AWS Lambda is an event-driven, serverless computing platform provided by Amazon as a part of the Amazon Web Services. It is a compute service that runs code in response to events and automatically manages the compute resources required by that code. (source: Wikipedia)

When we at FairFly encounter situations like this, where basically every solution sounds bad, we always try to take a step back and ask whether there is a simpler way to solve the problem.

Our preference is always writing less code, especially if it’s not part of the core problem we’re trying to solve.

So what could save us from:

  1. Glue code
  2. Provisioning resources
  3. Syncing with S3
  4. Retry mechanism

AWS Lambda can do that so easily! How?

All you need is to create two things:

  1. Your AWS Lambda function
  2. An AWS S3 trigger that will initiate the function each time a file is created in a given bucket.

That’s it. No infra, no processes, no parallelization logic.

You just write code that does what you want to do.

Using this approach, the timescale shrinks from days to hours.

Tutorial

  1. Create your Lambda function:
Click the “Create function” button to create your AWS Lambda function… 🙄
Write your (Python) code that defines how to process one event = one file.
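
Here is a minimal sketch of what such a handler could look like. The folder names and the process_xml helper are illustrative assumptions, not our production code:

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

def process_xml(body: bytes) -> bytes:
    # Placeholder for your actual text processing.
    return body

def lambda_handler(event, context):
    # Each S3 event carries one or more records describing created objects.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events.
        key = unquote_plus(record["s3"]["object"]["key"])

        # Download the file, process it, and write the result back.
        obj = s3.get_object(Bucket=bucket, Key=key)
        result = process_xml(obj["Body"].read())

        out_key = key.replace("processing-folder/", "origin-folder/", 1)
        s3.put_object(Bucket=bucket, Key=out_key, Body=result)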

2. Define an S3 trigger

In our case, we want this trigger to call our function on any new XML file created in the processing-folder.
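
We set this up in the AWS console, but for completeness, here is a sketch of the same configuration using boto3 (the bucket name and function ARN are placeholders):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="fairfly-bucket",  # placeholder bucket name
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-xml",  # placeholder ARN
            "Events": ["s3:ObjectCreated:*"],
            # Only fire for XML files under processing-folder/
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": "processing-folder/"},
                {"Name": "suffix", "Value": ".xml"},
            ]}},
        }]
    },
)

Note that if you configure the trigger this way rather than through the console, you also need to grant S3 permission to invoke your function (e.g. with aws lambda add-permission); the console does this for you automatically.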

Define an S3 trigger for processing

3. Move your files to process

Each time a file lands in the processing-folder, it will be processed. To make this happen, just run a command that moves all the files into the processing-folder:

aws s3 mv s3://fairfly-bucket/origin-folder/ s3://fairfly-bucket/processing-folder/ --recursive

Lambda processing flow

Let’s be on the safe side…

You may ask yourself whether this process is endless, since we put the files back into the origin-folder. Actually, the aws s3 mv command does not “reload” the list of files to move; the list is determined once, at command execution.

However, if for some reason the command is interrupted and you need to relaunch it, it would be smart not to re-process the already-processed files…

The best way to achieve this is to output your processed files not to the origin-folder but to a third folder, processed-folder. Then you can always move all your files back to the origin-folder if you wish.
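
In the handler sketch above, that is a one-line change (folder names still illustrative):

# Write results to a third folder instead of back to origin-folder:
out_key = key.replace("processing-folder/", "processed-folder/", 1)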

Output to processed-folder, then move all processed files back to origin-folder

We’re done! 🎰

It took only a few hours, which is essentially just the time needed to move all the files from the origin-folder to the processing-folder and back.

Conclusion

At FairFly, we always look for the simplest solution, with as few moving parts and as little coding from scratch as possible.

For this task, we were pleased to discover that AWS Lambda easily leveraged our existing use of S3 and helped us process millions of XML files quickly and cost-effectively. With other technologies, it would have been a pain, and it would have ended up costing us thousands of dollars!

AWS Lambda is especially convenient when you already use AWS services, but by setting the right hooks you can easily use it even if you have chosen another cloud platform.

Feel free to give your feedback, you can also contact me at noam@fairfly.com
