Manipulating a Large CSV File with Symfony, JS, and Python

Riya Joshi
Simform Engineering
4 min read · Mar 13, 2023

Transform and manage massive data sets efficiently by leveraging AWS services.

This post demonstrates how to use the AWS SDK and a Lambda function written in Python to work with large CSV files, both effectively and efficiently.

We recently worked with a client who needed to solve exactly this problem in their PHP-based web app.

Specifically, their requirement was to build a PHP-based web app that lets a user select a large CSV file, ranging from 500 MB to 1 GB or more, and reorganize its column headers while leaving the underlying data untouched. Ultimately, the user should be able to download the CSV file with the manipulated column headers to their device.

Prerequisites

Install the AWS SDK for PHP package in your Symfony project.

# composer.json
"require": {
    "aws/aws-sdk-php": "^3.0"
}

Let the user upload a large CSV file

Selecting a file, sending it to the PHP backend, and then dealing with it there is extremely time-consuming. Instead, a more efficient way is to upload the file directly to an AWS S3 bucket from the browser, with the help of the super cool AWS SDK for JavaScript. 👀
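
The post does this with the AWS SDK for JavaScript running in the browser. As a rough Python sketch of the same "browser talks to S3 directly" idea (shown in Python to stay consistent with the rest of the examples here), the backend can hand the browser a short-lived presigned PUT URL; the helper, bucket, and key names below are hypothetical.

# presign_upload.py (hypothetical helper; the post itself uses the AWS SDK for JavaScript in the browser)
import boto3

s3 = boto3.client("s3")

def presigned_upload_url(bucket, key, expires=900):
    # The browser can PUT the file body straight to this URL,
    # so the large CSV never passes through the PHP backend.
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )

# Example: presigned_upload_url("my-upload-bucket", "uploads/big.csv")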

But before the browser script can access the Amazon S3 bucket, you must first set up its CORS (Cross-Origin Resource Sharing) configuration in the Amazon S3 console, as described in the official documentation.
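
As a minimal sketch, the same CORS rules can also be applied with boto3's put_bucket_cors; the bucket name, origins, and methods below are assumptions, so adjust them for your app (or paste the equivalent JSON rules in the S3 console).

# set_cors.py (a minimal sketch of the CORS rules the browser upload needs)
import boto3

s3 = boto3.client("s3")
s3.put_bucket_cors(
    Bucket="my-upload-bucket",  # hypothetical bucket name
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["GET", "PUT", "POST"],
                "AllowedOrigins": ["https://your-app.example.com"],
                "ExposeHeaders": ["ETag"],
            }
        ]
    },
)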

Getting the column names from the S3 bucket

In your controller, initialize your AWS parameters in the constructor. Replace the AWS-related values with your environment variables.

The form will be submitted to path('app_upload_file'), where you can register the stream wrapper for S3 to access buckets and objects stored in Amazon S3 using the s3:// protocol. Open the object URL, read the CSV to extract the column headers, and pass the columns to render them on a new page.
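
The post does this step in the Symfony controller via the s3:// stream wrapper; the same idea, sketched in Python with boto3, is to fetch only the first chunk of the object and parse the header row. The bucket and key are placeholders, and the header is assumed to fit in the first 64 KB.

# read_headers.py (sketch of the "read only the header row" idea)
import boto3

s3 = boto3.client("s3")

def csv_column_names(bucket, key):
    # Fetch only the first 64 KB instead of streaming the whole 1 GB file.
    chunk = s3.get_object(Bucket=bucket, Key=key, Range="bytes=0-65535")
    first_line = chunk["Body"].read().decode("utf-8").splitlines()[0]
    return [column.strip() for column in first_line.split(",")]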

Using Lambda to fetch the column names is not recommended, as it takes time and no one wants to make users wait while the data is being fetched.

Present the columns in the front end and let the user adjust them as required.

Now, generate a new CSV with the user-modified columns using an AWS Lambda function. Pass the JSON-encoded columns to the Lambda function as its payload, read the URL of the newly created file from the response, and pass it to the template, which will later start the download.
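
The post makes this call from the Symfony backend with the AWS SDK for PHP; an equivalent call sketched with boto3 looks like the following (the function name and payload shape are assumptions).

# invoke_lambda.py (equivalent of the backend call, sketched with boto3)
import json

import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.invoke(
    FunctionName="manipulate-csv-columns",     # hypothetical function name
    InvocationType="RequestResponse",          # wait synchronously for the result
    Payload=json.dumps({
        "bucket": "my-upload-bucket",
        "key": "uploads/big.csv",
        "columns": ["email", "first_name", "last_name"],  # user-chosen order
    }),
)

result = json.loads(response["Payload"].read())
download_url = result["url"]                   # handed to the download template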

Eager to know how the AWS Lambda function can be implemented for this purpose? Let’s find out!

Using AWS Lambda

As many of you might know, AWS Lambda is a serverless compute service that charges according to the function’s execution time and the memory preassigned to the function.

Let’s start from the beginning: first, create a simple Lambda function from your AWS account, configuring things like the compatible runtime (Python 3.8), compatible architecture (x86_64), memory, ephemeral storage, and timeout.

You have to add additional libraries such as Dask (a parallel computing library for Python) and s3fs (a Pythonic file interface to S3) as Lambda layers, and then attach these two layers to your Lambda function.

Note: Make sure you create each layer with the same configuration as the function (for example, the same Python version and architecture); otherwise, it cannot be added to the function later. A Python version mismatch is a typical example.

A new Lambda function comes with basic Hello World code. Take a look at the handler’s event parameter: it is a dictionary carrying the values passed when the Lambda is invoked, such as the bucket name, file key, and content type.

You can save up to 10 test events for a Lambda function. The Configure test event option lets you define test payloads that arrive in the event parameter as a dictionary. Here is our example:
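
A minimal handler skeleton, with the event shape assumed to mirror the payload sent from the backend:

# lambda_function.py (skeleton; the exact event keys are an assumption)
def lambda_handler(event, context):
    # A saved test event for this handler could look like:
    # {"bucket": "my-upload-bucket",
    #  "key": "uploads/big.csv",
    #  "columns": ["email", "first_name", "last_name"]}
    bucket = event["bucket"]
    key = event["key"]
    columns = event["columns"]
    ...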

You also have to grant the necessary permissions to your function’s IAM role, such as CloudWatch logging, AmazonS3FullAccess, and AWSLambda_FullAccess.

Now let’s get to the coding part for the CSV

Dask is faster than Pandas when it comes to loading and processing larger CSV files. Dask is lazy: at first it only builds a task graph, and subsequent changes to the dataframe simply extend that graph. The real work happens only at the end, when compute() is applied to generate the final CSV file. So we will use Dask here instead.

Reading a CSV file from S3 with the help of Dask in a Lambda function:
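
A minimal sketch, assuming the Dask and s3fs layers are attached and that bucket and key come from the event as above:

# Inside lambda_handler: read the CSV lazily from S3 with Dask
import dask.dataframe as dd

# s3fs (added as a layer) lets Dask read s3:// paths directly;
# blocksize controls how the file is split into partitions.
df = dd.read_csv(f"s3://{bucket}/{key}", blocksize="64MB")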

Now, update the column order in the Dask dataframe, generate a new CSV, and upload it to the S3 bucket.
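
Continuing the sketch, reordering the columns only extends Dask’s task graph; the actual computation happens when the CSV is written back to S3 (the output key below is hypothetical):

# Still inside lambda_handler: reorder columns and write one CSV back to S3
df = df[columns]                                  # new column order; the data itself stays untouched
output_key = "processed/" + key.split("/")[-1]    # hypothetical output location
df.to_csv(
    f"s3://{bucket}/{output_key}",
    single_file=True,                             # triggers compute() and writes a single file
    index=False,
)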

Generate the S3 object URL for the new file and return it in the Lambda function’s response.
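
Still inside the handler, the response can carry the object URL of the generated file (the region is an assumption; the front end later turns this into a signed, downloadable link):

# Still inside lambda_handler: build the object URL and return it
region = "us-east-1"                              # assumption: the bucket's region
url = f"https://{bucket}.s3.{region}.amazonaws.com/{output_key}"

return {
    "statusCode": 200,
    "url": url,
}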

Finally, download.html.twig creates an AWS S3 reference, gets the object using getSignedUrl with the bucket and file details as parameters, and programmatically clicks the URL returned from the function mentioned above to start the download.
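
The template does this with the JavaScript SDK’s getSignedUrl; for reference, the boto3 equivalent of that signed-URL step looks like this (bucket and key are placeholders):

# Python equivalent of the template's getSignedUrl call
import boto3

s3 = boto3.client("s3")
signed_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-upload-bucket", "Key": "processed/big.csv"},
    ExpiresIn=300,   # the link the browser "clicks" expires after 5 minutes
)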

And there… you have it!

Key Takeaways

  1. Uploading a large file on the S3 bucket directly from the browser is more efficient than passing it to the backend.
  2. Python is a popular programming language for data processing, thanks to its extensive library support, ease of use, and flexibility.
  3. AWS provides a range of powerful services that are ideal for storing and performing operations on large files, including Amazon S3 and AWS Lambda.

Conclusion

Together, Amazon S3 buckets and Lambda functions make it easy to handle large CSV files. Columns can be modified via JavaScript and are easily accessible in Symfony’s controller, from which AWS Lambda can be invoked without any hurdles to generate the modified file.

To stay up-to-date with the latest trends in the development ecosystem, follow Simform Engineering.

Happy Coding! :)
