Process large files line by line with AWS Lambda

Using serverless FaaS capabilities to process files line by line with boto3 and Python, and making the most of it

Shubham Jain
Analytics Vidhya
3 min read · Apr 28, 2020


Photo by Alfred on Unsplash

Think of large physical servers for executing your workloads, and the image above will come to mind. Now think of purchasing these huge servers just to process your data. Not really a good option, right?

Why not leverage servers from the cloud and run our workloads on them? A great idea, but there is another problem: now we have to manage our workloads and also make sure we shut the servers down at the right time to avoid additional cost. Nobody wants to pay for things unnecessarily. Why can't we have something we don't need to manage? Why can't we pay only for what we use, only for the time the servers are actually being utilized?

Well, this is where the serverless paradigm comes into the picture. You don't want to purchase huge servers. You don't want to be charged for the time your server was not utilized. You want only a specific amount of memory for a particular workload. Going serverless is the answer to all of these questions.

Serverless doesn't mean your programs run without servers. Instead, whenever you require a server, it is made available to you at minimal cost, and you are charged only for the time your program is actually executing. So, technically, servers are not going out of the picture; they are just abstracted away so that we can focus on our programs rather than on server management.

AWS Lambda is a serverless FaaS (Function as a Service) offering which gives you the capability to run your programs without provisioning physical servers or managing long-running cloud servers yourself.

Lambda functions, though very powerful, come with a few limitations of their own:

  1. A Lambda function cannot run for more than 15 minutes.
  2. A Lambda function cannot use more than 3 GB of memory.

To read the file from S3 we will be using boto3:

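A minimal sketch of such a Lambda handler, assuming a placeholder bucket name and key, might look like this:

```python
import boto3

# Create the S3 client once, outside the handler, so it is reused across warm invocations
s3_client = boto3.client("s3")

def lambda_handler(event, context):
    # Bucket and key are placeholders; in practice they could come from the event payload
    bucket = "my-example-bucket"
    key = "data/large-file.csv"

    # get_object does not pull the whole file into memory;
    # response["Body"] is a botocore StreamingBody
    response = s3_client.get_object(Bucket=bucket, Key=key)
    streaming_body = response["Body"]

    # ... process streaming_body line by line (shown later in this article)
    return {"statusCode": 200}
```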

Now, when we read the file using get_object, instead of returning the complete data it returns the StreamingBody of that object.

You can find it here.

This streaming body gives us various options, such as reading data in chunks or reading data line by line. For all the available options of StreamingBody, refer to this link.

When we run the command below, we read the complete data by default, which we need to avoid at all costs.

As per the documentation, I suggest avoiding:

read(amt=None): Read at most amt bytes from the stream. If the amt argument is omitted, read all data.

and instead prefer

iter_lines(chunk_size=1024): Return an iterator to yield lines from the raw stream. This is achieved by reading chunk of bytes (of size chunk_size) at a time from the raw stream, and then yielding lines from there.

iter_chunks(chunk_size=1024): Return an iterator to yield chunks of chunk_size bytes from the raw stream.
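Putting this together, a rough sketch of line-by-line processing with iter_lines (the bucket, key, and per-line logic are placeholders) might look like this:

```python
import boto3

s3_client = boto3.client("s3")

def count_lines(bucket, key):
    response = s3_client.get_object(Bucket=bucket, Key=key)
    body = response["Body"]

    # AVOID: body.read() would load the entire object into the Lambda's memory
    # data = body.read()

    # PREFER: iterate over the stream, keeping only one chunk of bytes in memory at a time
    line_count = 0
    for line in body.iter_lines(chunk_size=1024):
        # each `line` is a bytes object; replace the counter with real per-line processing
        line_count += 1

    return line_count
```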

Since the complete object is not returned as soon as we run get_object, this opens up a world of new possibilities for the Lambda. We can chain multiple Lambda functions with the help of Step Functions, or pass values from one Lambda to another by setting up an S3 bucket event.
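As one illustration of the S3 event approach, the bucket notification can be wired up with boto3; the bucket name and function ARN below are hypothetical, and S3 must separately be granted permission to invoke the target function:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and Lambda ARN, used only to illustrate the wiring
s3.put_bucket_notification_configuration(
    Bucket="my-example-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-next-chunk",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```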

This allows data engineers to perform many tasks at minimal cost. I hope you liked this article.

Stay tuned for more content.

References:
[1] Boto3 Documentation

[2] Response Reference Documentation
