How to enable pandas use with S3 in AWS Lambdas

Brandon Odaniel
Xylem | AI and Big Data
6 min read · Apr 24, 2024

by Yu Yang, Ph.D., PE, PMP

Cover photo: https://unsplash.com/photos/panda-bear-on-tree-trunk-Fmkf0HZPPsQ (Unsplash)

When you work with a large dataset (e.g., larger than 500 MB) locally in a Jupyter Notebook, it is fairly easy to handle once you have installed pandas and the other packages you need in the Jupyter environment. For, say, a web application, however, if you need to upload a large dataset and apply some transformations, an AWS Lambda function could be the right fit. This blog post describes how to do that by uploading a dataset to an AWS S3 bucket, invoking a Lambda function to convert the dataset (e.g., from a .xpx to a .csv file), and then saving the transformed dataset to another S3 bucket for your web app.

Instructions

We will need two S3 buckets for the dataset transformation: one to upload the dataset to, and one to save the transformed dataset in. Once the two S3 buckets are created (as a best practice, keep them private), we can begin creating the Lambda function.
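If you prefer the AWS CLI to the console, creating the two buckets looks roughly like the sketch below. The bucket names and region are placeholders, so substitute your own (regions other than us-east-1 also require a --create-bucket-configuration flag).

# Hypothetical bucket names; pick globally unique names of your own
aws s3api create-bucket --bucket my-dataset-input-bucket --region us-east-1
aws s3api create-bucket --bucket my-dataset-output-bucket --region us-east-1

# Keep both buckets private by blocking all public access
aws s3api put-public-access-block --bucket my-dataset-input-bucket \
    --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
aws s3api put-public-access-block --bucket my-dataset-output-bucket \
    --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true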

1. Create a Lambda Function with a Blueprint

After you log in to the AWS console, navigate to the Lambda page. Click the “Create function” button, highlighted in yellow in Figure 1, and it will take you to the function-creation page.

Figure 1

In Figure 2 below, choose “Use a blueprint,” then select “Get S3 object” from the dropdown menu under Blueprint name. We can use Python 3.10 as the runtime and, if necessary, change it to a lower version such as Python 3.9 later. We also need to assign an execution role to the Lambda function so that it can access the S3 buckets for read/write operations.

Figure 2
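For the execution role, the function needs read access to the input bucket, write access to the output bucket, and the basic Lambda logging permissions. As a quick (and deliberately coarse) sketch, you could attach the AWS managed S3 policy to the role from the CLI; for anything beyond a test, a custom policy scoped to just these two buckets is the better choice. The role name below is a placeholder.

# Placeholder role name; use the execution role attached to your function
aws iam attach-role-policy \
    --role-name my-lambda-s3-role \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess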

Since the Lambda function is triggered by an S3 bucket, we set up the S3 event trigger as shown in Figure 3 below. In our case, we used “Multipart upload completed” as the triggering event type.

Figure 3
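For reference, the event that S3 delivers to the function looks roughly like the (heavily trimmed) sketch below; the bucket name and key are placeholders. A dictionary like this also works as a test event in the Lambda console.

# Trimmed example of an S3 notification event (placeholder bucket and key)
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-dataset-input-bucket"},
                "object": {"key": "uploads/large_dataset.xpx"}
            }
        }
    ]
}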

In the code section of the Lambda function, you will then pull the bucket and object key out of the triggering event with something akin to the following:

import urllib.parse

def lambda_handler(event, context):
    # Extract the name of the S3 bucket from the event object.
    # When a file is uploaded to an S3 bucket, an event is generated and sent to Lambda.
    bucket = event['Records'][0]['s3']['bucket']['name']
    # Extract the object key from the event (URL-decoded)
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
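From there, one straightforward way to fetch the uploaded object is with boto3, which is available in the Lambda Python runtime by default. The sketch below assumes the object fits comfortably in the function’s memory; the s3fs approach in section 4 relaxes that assumption.

import urllib.parse
import boto3

s3 = boto3.client('s3')  # created once, outside the handler, so it can be reused

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    # Fetch the uploaded object into memory
    response = s3.get_object(Bucket=bucket, Key=key)
    raw_bytes = response['Body'].read()
    # ...transform the data (e.g., with pandas) and write the result
    # to the output bucket, as shown in section 4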

2. Add Pandas Layer to the Lambda Function

After typing your code into the Lambda function, you will need to add a layer if you use the pandas library in your code. I would personally recommend using a private AWS EC2 instance running Amazon Linux for this task. From the EC2 page in the console, you can launch a t2.micro instance (which is very cost-effective) with Amazon Linux. There is no need to set up an IAM role or Auto Scaling group for this instance. Once connected to the EC2 instance, we can proceed to create the pandas layer.

Once your EC2 instance is running, create a virtual environment and install pandas inside it. Briefly, we create a directory named python4lambda and then a virtual environment called layer4lambda using the following Linux commands:

mkdir python4lambda
cd python4lambda
python3 -m venv layer4lambda
source layer4lambda/bin/activate
pip install pandas
deactivate

We can now create another directory, which will hold the pandas package files in the folder structure that Lambda layers expect.

mkdir -p python/lib/python3.9/site-packages/
# Copy the pandas packages into the target directory
cp -r layer4lambda/lib/python3.9/site-packages/* python/lib/python3.9/site-packages/

We will zip the contents of the python directory and then publish it as a layer in the AWS account where your Lambda function is located. You will, of course, need to configure AWS credentials for that account on the EC2 instance (e.g., with aws configure).

# from inside the python4lambda directory
zip -r panda_layer.zip python
aws lambda publish-layer-version --layer-name pandas --zip-file fileb://panda_layer.zip --compatible-runtimes python3.9

The default Python version on Amazon Linux is 3.9. Therefore, we specify --compatible-runtimes python3.9. We can change the runtime for the Lambda function on its configuration page from Python 3.10 to Python 3.9.
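If you prefer to do this from the CLI rather than the console, the commands below are a sketch; the function name, region, account ID, and layer version are placeholders, and note that --layers replaces the function’s entire layer list, so include every layer you want attached.

# Switch the function's runtime to Python 3.9 (placeholder function name)
aws lambda update-function-configuration \
    --function-name my-transform-function \
    --runtime python3.9

# Attach the published layer (use the ARN returned by publish-layer-version)
aws lambda update-function-configuration \
    --function-name my-transform-function \
    --layers arn:aws:lambda:us-east-1:123456789012:layer:pandas:1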

3. Add S3fs Layer to the Lambda Function

Since we are handling large datasets from S3 buckets, our preference is to use the s3fs package. S3fs can be more memory-efficient for certain tasks because it allows for streaming data directly to and from S3 without needing to load everything into memory.
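As a small illustration of that streaming behavior (with a placeholder bucket and file name, and process() standing in for your own transformation logic), s3fs exposes S3 objects as file-like handles, so pandas can read them in manageable chunks rather than loading the whole file at once:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()

# Stream the object from S3 and handle it chunk by chunk
with fs.open('s3://my-dataset-input-bucket/uploads/large_dataset.csv', 'rb') as f:
    for chunk in pd.read_csv(f, chunksize=100_000):
        process(chunk)  # placeholder for your own transformation logic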

Following the same steps shown above in section 2, create another environment on the EC2 instance to install s3fs and upload it as another layer for the Lambda function, as sketched below. We build it separately from the pandas layer because of the 50 MB limit on zip files that can be pushed directly to Lambda. Don’t forget to attach both layers to your Lambda function by scrolling down on the Lambda page and clicking the “Add a layer” button; the added layers will appear in the “Function overview” diagram.
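The packaging commands mirror those in section 2; a condensed sketch (the directory, environment, and layer names are just examples) looks like this:

mkdir s3fs4lambda
cd s3fs4lambda
python3 -m venv layer4s3fs
source layer4s3fs/bin/activate
pip install s3fs
deactivate

mkdir -p python/lib/python3.9/site-packages/
cp -r layer4s3fs/lib/python3.9/site-packages/* python/lib/python3.9/site-packages/
zip -r s3fs_layer.zip python
aws lambda publish-layer-version --layer-name s3fs --zip-file fileb://s3fs_layer.zip --compatible-runtimes python3.9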

4. Use s3fs to Save the Transformed Dataset to Another S3 Bucket

Now, we can test the implemented Lambda function.

Unfortunately, when I tested it by uploading a large file to the S3 bucket, the function failed with a memory error. My initial response was to increase the Lambda function’s memory to the maximum of 10,240 MB and set the timeout to 15 minutes.

However, it still ran out of memory when the dataset I attempted to transform was approximately 700 MB.

Eventually, the solution was to write the dataset to the target S3 bucket in chunks. An example of the code is provided below.

import gc
import numpy as np
import pandas as pd
import s3fs

# Create an s3fs file system object
fs = s3fs.S3FileSystem()

# Example: process and write data in chunks to reduce memory usage
chunk_size = 5000  # Define a suitable chunk size
total_rows = len(value_for_cols)

for start in range(0, total_rows, chunk_size):
    end = min(start + chunk_size, total_rows)
    chunk_values = [YOUR_function(value) for value in value_for_cols[start:end]]
    chunk_data = np.array(chunk_values)
    chunk_df = pd.DataFrame(chunk_data, columns=column_name_unit)
    # Write the first chunk with a header, then append subsequent chunks
    mode = 'w' if start == 0 else 'a'
    header = start == 0
    with fs.open(s3_output_file_path, mode) as f:
        chunk_df.to_csv(f, index=False, header=header, encoding='utf-8')
    # Clean up the chunk to free memory
    del chunk_values, chunk_data, chunk_df
    gc.collect()

Conclusion

Congratulations! You have now created a process by which:

1) A large file is dropped into a private input S3 bucket.

2) Once the file has finished uploading to the bucket, a Lambda function is triggered.

3) That Lambda function has both a pandas layer and an s3fs layer, which let your handler code use pandas to read from and write to S3.

Data Scientists using AWS rejoice! 😊


Brandon O’Daniel, Data Geek, AI/Big Data Director @ Xylem, MBA, Proud Dad of three awesome kids :)