Processing Large S3 Files With AWS Lambda

HangC · Published in The Startup · Aug 5, 2020 · 4 min read


Despite having a runtime limit of 15 minutes, AWS Lambda can still be used to process large files. File formats such as CSV or newline-delimited JSON, which can be read iteratively or line by line, can be processed with this method.
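
For illustration, here is a minimal sketch (with placeholder bucket and key names) of reading an S3 object line by line with boto3; the object body is streamed, so the whole file never has to be held in memory:

import boto3

s3 = boto3.client("s3")

# The object body is streamed, so lines are read lazily instead of
# downloading the entire file into memory first.
response = s3.get_object(Bucket="YOUR_BUCKET_NAME", Key="YOUR_OBJECT_KEY")

for line in response["Body"].iter_lines():
    print(line.decode("utf-8"))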

Lambda is a good option if you want a serverless architecture and have files that are large but still within reasonable limits. With the approach below, we can write a Lambda function that processes a large CSV file even when the data size exceeds both its memory and runtime limits. The approach can also be easily extended to handle other formats such as newline-delimited JSON.

The main approach is as follows:

  1. Read and process the CSV file row by row until nearing the timeout.
  2. Trigger a new Lambda asynchronously that will pick up where the previous invocation stopped processing (a minimal sketch of this hand-off follows the list).
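
Step 2 relies on two Lambda features: the context object reports how much runtime is left, and a function can invoke another function (including itself) asynchronously. A minimal sketch of this hand-off, with a hypothetical time buffer and helper names, might look like this:

import json

import boto3

lambda_client = boto3.client("lambda")

# Hypothetical safety margin: stop processing when less than roughly
# one minute of runtime remains.
TIMEOUT_BUFFER_MS = 60 * 1000


def nearing_timeout(context):
    # The Lambda context object reports how many milliseconds are left
    # before the current invocation is terminated.
    return context.get_remaining_time_in_millis() < TIMEOUT_BUFFER_MS


def continue_in_new_lambda(context, event):
    # "Event" makes the invocation asynchronous (fire-and-forget), so the
    # current invocation can exit cleanly while the new one carries on
    # from the state recorded in the event.
    lambda_client.invoke(
        FunctionName=context.function_name,
        InvocationType="Event",
        Payload=json.dumps(event),
    )

The size of the buffer is a judgment call; it only needs to be large enough to finish the current row and issue the invoke call before the hard 15-minute limit is reached.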

We will define the following event, which will be used to trigger the Lambda function. The bucket_name and object_key fields identify the S3 object to be processed; the use of offset and fieldnames will be covered shortly.

{
    "bucket_name": "YOUR_BUCKET_NAME",
    "object_key": "YOUR_OBJECT_KEY",
    "offset": 0,
    "fieldnames": None
}
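
To kick things off, the function can be invoked once with this initial event, for example from a small boto3 script like the sketch below. The function name is a placeholder, and an offset of 0 with fieldnames set to None is read here as "start at the beginning of the file, header not yet parsed" (the exact semantics are covered shortly):

import json

import boto3

lambda_client = boto3.client("lambda")

# Kick off processing asynchronously with the initial event. The function
# name is a placeholder for wherever the handler is deployed.
lambda_client.invoke(
    FunctionName="process-large-csv",
    InvocationType="Event",
    Payload=json.dumps({
        "bucket_name": "YOUR_BUCKET_NAME",
        "object_key": "YOUR_OBJECT_KEY",
        "offset": 0,
        "fieldnames": None,  # header row has not been read yet
    }),
)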
