Optimizing Bulk Load of CSV Data in AWS Aurora RDS

Epsilon India
Epsilon Engineering Blog
5 min read · Dec 27, 2023

Rohit Yadav, Lead Software Engineer, Product Engineering

Introduction: In data engineering applications, we often come across situations where we need to load large amounts of data into a relational database system. In this context, we are considering Amazon Aurora. Amazon Aurora is a relational database service developed and offered by Amazon Web Services. It is a fully managed relational database engine that's compatible with PostgreSQL.

Bulk data loads can be time-consuming and challenging, especially when dealing with large volumes of data. The process requires significant effort and expertise to ensure that the data is loaded correctly and accurately.

Although modern ETL pipelines support many data formats such as JSON, Parquet, and pickle, CSV remains the most widely used format in enterprise applications for data migration from legacy systems to Aurora.

AWS Aurora allows users to load bulk data into the database quickly using its extensions, which is a requirement for many applications.

This blog will provide a step-by-step guide to load bulk CSV data into Aurora RDS.

Problem Scenarios: Let's look at the common scenarios where a bulk load operation on Aurora RDS is needed the most:

• Enriching operational data with insights

• Performance Benchmarking for the Bulk Data Load

• Data migration from on-premises databases to Aurora RDS

• Loading large amounts of data from external sources into Aurora RDS

• Creating a data warehouse for reporting and analytics purposes

• Populating test and development databases with sample data

Pre-Requisites and System Setup:

• AWS Aurora RDS > 11.1

• RAM: 16GB

• Storage Volume: General Purpose SSD

• Instance class: db.r6i.4xlarge

• Number of Nodes: 3 (Multi-AZ)

• S3 bucket to store CSV/.gzip files.

• AWS IAM user with appropriate permissions to access the S3 bucket and the Aurora RDS instance.

Let us now detail the creation of the resources mentioned above and look at the process to optimize bulk load in Aurora RDS.

IAM Policy Setup for the S3 bucket:

An AWS S3 bucket containing the CSV file is needed. In order to access the S3 bucket and the Aurora RDS instance, we need to create an IAM policy with appropriate permissions. The policy should allow the IAM user to read from the S3 bucket and write to the Aurora RDS instance.
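A minimal sketch of such a policy, assuming a hypothetical bucket named my-data-bucket (the s3:GetObject and s3:ListBucket actions cover the read side of the import):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ImportRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-data-bucket",
        "arn:aws:s3:::my-data-bucket/*"
      ]
    }
  ]
}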

Create a role to access the S3 bucket files.

Create Trust Relationship with RDS Instance:
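The trust policy should allow the RDS service to assume the role; a minimal sketch:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "rds.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}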

Attach the created IAM role to the RDS writer instance in the AWS console:
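Besides the console, the same attachment can be done from the AWS CLI; a sketch, assuming a cluster identifier my-aurora-cluster and a hypothetical role ARN:

aws rds add-role-to-db-cluster \
  --db-cluster-identifier my-aurora-cluster \
  --role-arn arn:aws:iam::123456789012:role/aurora-s3-import-role \
  --feature-name s3Import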

Data load Process:

For the data load, we need to install the Postgres extensions aws_s3 and aws_commons. From the Postgres command prompt, we need to execute the following commands:

CREATE EXTENSION aws_s3 CASCADE;

SHOW rds.extensions;
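To verify the installation, the pg_extension catalog can also be queried:

SELECT extname, extversion FROM pg_extension WHERE extname IN ('aws_s3', 'aws_commons');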

To execute the SQL, we need AWS account connection credentials. Get the AWS account access credentials as below:

<aws_access_key_id>: access key ID of the AWS account

<aws_secret_key>: secret key for the AWS account connection

<aws_session_token>: session token for the AWS account connection
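If you are working with temporary credentials, one way to obtain all three values (an assumption about your setup, not the only option) is AWS STS; the response contains AccessKeyId, SecretAccessKey, and SessionToken:

aws sts get-session-token --duration-seconds 3600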

Below is the SQL command format for the data load, which we need to execute from the Postgres command prompt:

SELECT aws_s3.table_import_from_s3(
   'POSTGRES_TABLE_NAME',
   'COMMA_SEPARATED_COLUMNS_LIST',
   '(format csv, header true)',
   'BUCKET_NAME',
   'FOLDER_NAME(optional)/FILE_NAME',
   'REGION',
   'AWS_ACCESS_KEY', 'AWS_SECRET_KEY', 'AWS_SESSION_TOKEN'
);
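For illustration, here is a hypothetical call for an employees table loaded from data/employees.csv in a bucket named my-data-bucket (all names are placeholders). Because the IAM role created earlier is attached to the cluster, the credential arguments can be replaced with an S3 URI built via aws_commons.create_s3_uri:

SELECT aws_s3.table_import_from_s3(
   'employees',
   'id,name,department',
   '(format csv, header true)',
   aws_commons.create_s3_uri('my-data-bucket', 'data/employees.csv', 'us-east-1')
);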

Operational Monitoring: After executing this data load process, we monitored a set of metrics, comparing the extension-based bulk import against JDBC connection-based write SQL. Metrics such as CPU utilization, write IOPS, and write throughput are useful for comparing resource utilization.

CPU Utilization: When we compare the CPU utilization metrics for a JDBC connection-based INSERT command vs the bulk import, we can see utilization drop from about 80% to 35%.

Fig1: CPU Utilization with bulk extension import
Fig2: CPU Utilization with JDBC based write SQL

Write IOPS: For better utilization of the disk write capacity, we want a high number of I/O operations per second. With the extension-based bulk import, write IOPS roughly doubled from 5K to 10K, a 100% improvement in utilization.

Fig3: Write IOPS with bulk extension import
Fig4: Write IOPS with connection based write SQL

Write Throughput: Similarly, write throughput (bytes/sec) roughly doubled, as shown in the figures below.

Fig5: Write throughput with bulk extension import
Fig6: Write throughput with connection based write SQL

Anti-Patterns: The following are anti-patterns that should be avoided.

• Do not disable Multi-AZ

Amazon RDS for PostgreSQL with the Multi-AZ feature enabled increases availability by synchronously replicating data to a standby instance in a different Availability Zone.

The database can fail over to this standby instance if the primary instance becomes unreachable. Disabling Multi-AZ will yield better bulk loading performance, but it is not recommended for production workloads due to the durability and availability risk.

• Do not disable table logging

PostgreSQL truncates all unlogged tables on startup, including failovers and snapshot restores.

Unlogged tables are also not replicated to PostgreSQL replicas. Due to the risk of data loss, use of unlogged tables is not recommended for production workloads.
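For reference, this is the pattern being warned against; a minimal sketch with a hypothetical staging table:

-- Anti-pattern for production: unlogged tables skip WAL and are truncated on crash recovery
CREATE UNLOGGED TABLE staging_employees (id int, name text, department text);

-- An existing unlogged table can be switched back to logged (durable) mode:
ALTER TABLE staging_employees SET LOGGED;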

Limitations:
a) Data load is not supported for Aurora versions earlier than 10.7.

b) IAM Role cannot be associated with Aurora Serverless v1 DB cluster.

c) The S3 import extension supports only CSV and gzip-compressed data files for bulk load.

d) Data security should be ensured during the import process. SSL encryption is recommended to secure the data transfer between S3 and Aurora, along with properly scoped IAM roles.

Conclusion: Bulk CSV data load in AWS Aurora is a powerful and efficient way to load large amounts of data into a database quickly and easily.

By following the steps outlined in this process, users can successfully load their data into Aurora and begin analyzing it immediately. With the benefits of Aurora's scalability, performance, and security, users can feel confident that their data is in good hands.
