Split File using ADF COPY and AWS script

In this post, we explore how to split a large file into smaller ones based on record count. Recently, we encountered a file in our Blob storage container that contained approximately 20 million records, and the client needed to process this file in Snowflake.

While it’s possible to process this file as-is in Snowflake using the COPY command, doing so may lead to performance issues because of the file’s substantial size. Snowflake itself recommends against this approach, since the warehouse stays occupied for the entire load and warehouse costs increase accordingly. As per Snowflake’s documentation:

“Split larger files into a greater number of smaller files to distribute the load among the compute resources in an active warehouse. The number of data files that are processed in parallel is determined by the amount of computing resources in a warehouse. We recommend splitting large files by line to avoid records that span chunks.”

To address this, we used the ADF Copy activity to divide the large file into smaller, more manageable files. Our approach was to segment the data so that each resulting file would contain approximately 2 million records.

Below, you can see the original file that was located in our Container area.

ADF Details:

Azure File

We developed the simple ADF pipeline below, leveraging the Copy activity to split and copy the data into the destination container.

ADF Pipeline

In the Sink settings, we can set the maximum number of rows per file; in this case, I set it to 2,000,000 rows for each file. This splits the source file based on the defined number of records.

Row count Split

After executing the pipeline, 10 files were created inside the destination container.

Small Files 2M each

To load these files into Snowflake, we need to create a storage integration, stage, file format, etc.
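A minimal sketch of those objects, assuming the stage demo_db.public.az_stage used below; the integration name az_int, the file format csv_ff, the tenant ID, and the storage account/container URL are placeholders for illustration:

-- Placeholder tenant ID and storage location; replace with your own values.
create or replace storage integration az_int
  type = external_stage
  storage_provider = 'AZURE'
  enabled = true
  azure_tenant_id = '<tenant-id>'
  storage_allowed_locations = ('azure://<storage-account>.blob.core.windows.net/<container>/');

-- Simple CSV file format for the split files.
create or replace file format demo_db.public.csv_ff
  type = csv
  field_delimiter = ','
  skip_header = 1;

-- External stage pointing at the container that holds the split files.
create or replace stage demo_db.public.az_stage
  url = 'azure://<storage-account>.blob.core.windows.net/<container>/'
  storage_integration = az_int
  file_format = demo_db.public.csv_ff;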

Now load the data using the COPY command with a pattern:

copy into cust_tbl
from @demo_db.public.az_stage
on_error = CONTINUE
pattern='LOAD0000005_x.*.csv';
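
As a quick check that every split file loaded, the table's load history can be inspected (a sketch; the one-hour window below is just an example):

-- One row per loaded file, with its row count and load status.
select file_name, row_count, status
from table(information_schema.copy_history(
  table_name => 'CUST_TBL',
  start_time => dateadd(hours, -1, current_timestamp())));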

AWS implementation:

What if we encounter a similar file in an AWS S3 bucket, and the customer doesn’t have access to Azure Data Factory (ADF) or Azure licenses? In such cases, we can write a UNIX shell script that splits the file by line count using the AWS CLI.

AWS Big file

Technical implementation:

Log in to the EC2 machine and create the shell script (Split_huge_file.sh) with the content below. Here the file is split at 20 million lines per file, so 9 files should be generated.

#!/bin/bash
# Split a large CSV stored in S3 into 20,000,000-line chunks without landing it on local disk.
S3_BUCKET=ec2s3
FILENAME=LOAD00000005.csv
INFILE=s3://"${S3_BUCKET}"/"${FILENAME}"
OUTFILE=s3://"${S3_BUCKET}"/"${FILENAME%%.*}"   # output prefix: the file name without its extension
echo "$S3_BUCKET, $FILENAME, $INFILE, $OUTFILE"
echo "Starting S3 File Splitter using line count of 20000000 lines"
echo "Splitting file: ${INFILE} ..."
# Stream the object down from S3, split it by line count, and stream each chunk straight back to S3.
# Inside --filter, GNU split sets $FILE to the generated chunk name (x00, x01, ...).
FILES=($(aws s3 cp "${INFILE}" - | split -d -l 20000000 --filter "aws s3 cp - \"${OUTFILE}_\$FILE.csv\" && echo \"\$FILE.csv\""))

Execute Shell

Split File in AWS

Now load these files into Snowflake in parallel.

Snowflake Load
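
For the S3 case, a minimal sketch of that load, assuming an external stage demo_db.public.s3_stage has been created over the ec2s3 bucket (the stage name is illustrative, and the pattern matches chunk names like LOAD00000005_x00.csv produced by the script above):

copy into cust_tbl
from @demo_db.public.s3_stage
on_error = CONTINUE
pattern = '.*LOAD00000005_x.*[.]csv';

Because the stage now holds several smaller files, the warehouse can load them in parallel instead of working through a single 20-million-line file.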

Originally published at:

https://cloudyard.in/2023/09/split-file-using-adf-copy-and-aws-script/
