AWS Lambda provides only 512MB of disk space (the /tmp directory) per instance. This limitation rules it out for pipelines that require you to process single large files.
If your pipeline involves processing lots of data and you need a way to handle large (>512MB) files (in this example, large zip files) from AWS S3, then you’ve probably left AWS Lambda out of your solution.
Your large files may be zip files that are under 512MB in size but total more than that once extracted.
Either way, you’ve hit the limit of Lambda.
Have no fear, there is a solution.
Do not write to disk; stream to and from S3
Stream the zip file from the source bucket, and read and write its contents on the fly with Python, straight back to another S3 bucket.
This method uses no disk space at all, so the 512MB limit no longer applies. (The archive is buffered in memory instead, so in practice the ceiling becomes the memory you allocate to the function rather than its disk.)
The basic steps, in Python 3.6 using Boto3, are:
import boto3
import zipfile
from io import BytesIO

s3_resource = boto3.resource('s3')
zip_obj = s3_resource.Object(bucket_name="bucket_name_here", key=zip_key)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)  # per-entry metadata, e.g. uncompressed size
    # Stream each decompressed entry straight to the destination bucket (placeholder name)
    s3_resource.meta.client.upload_fileobj(z.open(filename), Bucket="target_bucket_here", Key=filename)
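For context, here is a minimal sketch of how this snippet might sit inside a Lambda handler triggered by an S3 upload. The event parsing follows the standard S3 notification structure; the destination bucket name is a placeholder.

import boto3
import zipfile
from io import BytesIO
from urllib.parse import unquote_plus

s3_resource = boto3.resource('s3')

def lambda_handler(event, context):
    # Bucket and key of the uploaded zip, taken from the S3 put-event notification
    record = event['Records'][0]['s3']
    source_bucket = record['bucket']['name']
    zip_key = unquote_plus(record['object']['key'])
    zip_obj = s3_resource.Object(bucket_name=source_bucket, key=zip_key)
    buffer = BytesIO(zip_obj.get()["Body"].read())
    extracted = 0
    with zipfile.ZipFile(buffer) as z:
        for filename in z.namelist():
            if filename.endswith('/'):  # skip directory entries
                continue
            s3_resource.meta.client.upload_fileobj(
                z.open(filename), Bucket="target_bucket_here", Key=filename)
            extracted += 1
    return {"extracted": extracted}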
There are always gotchas.
AWS Lambda's execution time is capped at 15 minutes, so can you process your HUGE files in that window? You can only know by testing.
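One way to avoid being killed mid-file is to check the clock as you go: the context object passed to every handler exposes get_remaining_time_in_millis(), which counts down towards the timeout. Here is a sketch of that guard dropped into the extraction loop from the handler above (the 60-second margin is an arbitrary safety buffer):

for filename in z.namelist():
    # context.get_remaining_time_in_millis() is part of the standard Lambda context API
    if context.get_remaining_time_in_millis() < 60_000:
        print(f"Less than 60s of execution time left; stopping before {filename}")
        break
    s3_resource.meta.client.upload_fileobj(
        z.open(filename), Bucket="target_bucket_here", Key=filename)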