How Merging Files Has Reduced Our Data Access Costs by 90%
TL;DR: If you just want to know the optimisations that we performed over the years on our Data Lake, please skip to the very end!
The Start (Day 1)
As a company, we had decided on a microservices-based architecture utilising a Postgres AWS RDS database. In a single instance, we created multiple schemas and tables, where every microservice had a corresponding schema. The structure looks somewhat like the tree below-
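(The schema and table names below are purely illustrative, not our actual ones- the point is simply one schema per microservice, each holding that service’s tables.)

```
Postgres RDS instance
├── schema: payments_service
│   ├── transactions
│   └── refunds
├── schema: lending_service
│   ├── loans
│   └── repayments
└── schema: onboarding_service
    ├── customers
    └── kyc_documents
```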
Every customer interaction or internal computation results in the insertion of a change in the database. This transactional database, however, cannot be used for analytical purposes, for multiple reasons-
◆ Transactional tables in this Postgres database only capture the latest state of a particular entry. For instance, suppose a row was inserted and then updated five times. Only the latest version would be present in the transactional table, whereas all six versions would be present in the Data Lake- making it far more useful for understanding how the data has evolved.
◆ Given that transactional tables are used for customer-facing interactions, using them for generating charts and reports eats into database bandwidth, thus affecting the actual business. Hence it’s always better to have customer transactions and analytical or ML-based queries running on separate databases.
Let’s have a Data Lake (Day 365)
Within a short period of time, we took the decision to copy all the CDC (change data capture) events generated by our database (Postgres on AWS RDS) to a data lake. Although we were at a nascent stage and did not even have an Analytics Team, we decided to do this beforehand so that as and when our risk and analytics team was set up, they could start working on the data and building models from Day 1 without affecting our transactional systems. This would ensure a clear line of demarcation between the activities of the two teams.
For this, we decided to use AWS Database Migration Service (DMS), which turned out to be quite simple to set up. We pointed its source to the Postgres instance and its target to an S3 bucket.
Every change in the database is captured and stored in the S3 bucket. So if a single row is inserted into a table and then updated 5 times, the data lake will have 6 entries. Some major configuration changes that we added-
- We saved the data in S3 in Parquet format, which takes up almost 87% less space than CSV, and queries scan about 99% less data. Both factors reduce costs significantly.
- We added a time_stamp column so that whenever a row is written to the data lake, it carries the time at which that change landed, letting us track the timing of every change going in. (Both settings are illustrated in the sketch after this list.)
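For reference, both behaviours map to settings on the DMS S3 target endpoint: DataFormat switches the output to Parquet, and TimestampColumnName is DMS’s built-in way of appending a timestamp column (your setup may achieve this differently). A minimal boto3 sketch, where the bucket name, role ARN and identifier are placeholders:

```python
import boto3

dms = boto3.client("dms")

# Sketch of an S3 target endpoint for DMS (identifiers, bucket and ARN are placeholders).
dms.create_endpoint(
    EndpointIdentifier="datalake-s3-target",   # hypothetical name
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-role",  # placeholder
        "BucketName": "my-data-lake-bucket",    # placeholder bucket
        "DataFormat": "parquet",                # write Parquet instead of the default CSV
        "TimestampColumnName": "time_stamp",    # DMS appends this column to every record
    },
)
```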
We used Glue Crawlers to crawl the data being written to S3 by AWS DMS. These crawlers created catalogue tables in the Glue Catalog. The data is then queried via AWS Athena, which internally uses the Glue Catalog tables along with the data sitting in S3.
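As a rough illustration of that pipeline (crawler name, role, database and table names are all placeholders, and we use the awswrangler library here purely for brevity):

```python
import boto3
import awswrangler as wr

glue = boto3.client("glue")

# Crawl the DMS output prefix so the tables appear in the Glue Catalog.
glue.create_crawler(
    Name="datalake-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",  # placeholder
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/public/"}]},
)
glue.start_crawler(Name="datalake-crawler")

# Once catalogued, the data can be queried through Athena.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM customers LIMIT 10",  # hypothetical table
    database="datalake",
)
print(df.head())
```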
Let’s Partition our Data Lake (Day 730)
Over time, due to rising costs caused by hundreds of Athena queries running every day, we decided to partition the data sitting in our data lake (S3 bucket). While there are multiple parameters you can partition your data by, the most commonly used one, and the one we chose, was date.
So instead of sitting directly under the table-level folder, the data would now sit in date-wise folders under it.
To ensure that fresh data lands in the partitioned folders instead of the root folder, the configuration change below needs to be added to the ExtraConnectionAttributes parameter of your target endpoint. In our case, we set the DatePartitionEnabled parameter to “true”.
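In boto3 terms this is a one-line change on the existing target endpoint (the ARN below is a placeholder; the same flag can also be set via the endpoint’s S3 settings):

```python
import boto3

dms = boto3.client("dms")

# Enable date-based folder partitioning on the existing S3 target endpoint.
dms.modify_endpoint(
    EndpointArn="arn:aws:dms:ap-south-1:123456789012:endpoint:EXAMPLE",  # placeholder
    ExtraConnectionAttributes="DatePartitionEnabled=true",
)
```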
However, there is no automated way to move the old data into date-wise folders, so you will need to write a small Python script to get it done. One critical point to remember here is that partitioning the data only helps if the queries run on Athena include the partition in the WHERE clause. If not, partitioning the data will bear no fruit.
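A minimal sketch of such a backfill script is below. It assumes the yyyy/mm/dd folder layout that DMS date partitioning produces by default, and it uses each object’s LastModified date to decide its partition- if you need the actual change timestamp, you would read it out of the Parquet files instead. Bucket and prefix names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-data-lake-bucket"       # placeholder
TABLE_PREFIX = "public/customers/"   # placeholder: one table's root folder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=TABLE_PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".parquet"):
            continue
        # Skip files already sitting inside a yyyy/mm/dd sub-folder.
        if len(key[len(TABLE_PREFIX):].split("/")) > 1:
            continue
        ts = obj["LastModified"]  # approximation of the change date
        new_key = f"{TABLE_PREFIX}{ts:%Y/%m/%d}/{key.split('/')[-1]}"
        s3.copy_object(
            Bucket=BUCKET,
            Key=new_key,
            CopySource={"Bucket": BUCKET, "Key": key},
        )
        s3.delete_object(Bucket=BUCKET, Key=key)
```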
Let’s merge the parquet files! (Day 1825)
Over the years, our data grew by leaps and bounds, and so did the number of queries we run and reports we generate on a daily basis. All of this was taking a toll on Athena and driving up our S3 GET request costs. It made no sense that our S3 bills were hitting tens of thousands of dollars per month. We set out to find a solution that would save costs with minimal disruption. Here, minimal disruption is the key phrase, given that multiple departments were now dependent on our Data Lake ecosystem and any major change would have massive business impact.
When using AWS DMS to capture CDC from your Postgres instance, you can set parameters like BatchApplyTimeoutMin, BatchApplyTimeoutMax, BatchApplyMemoryLimit and BatchSplitSize. These parameters define how frequently DMS creates new parquet files in the S3 bucket. Most companies that want near-real-time data flow keep these parameters low, so that CDC flows from Postgres to S3 within 1–2 minutes. The problem with this is that far too many files get generated even for a single day, which is not recommended at all- “Avoid having too many files” is something AWS explicitly states.
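For context, these knobs live in the replication task’s settings JSON (under ChangeProcessingTuning, to the best of our knowledge- double-check against the current DMS documentation). A sketch of adjusting them via boto3, with a placeholder task ARN and purely illustrative values:

```python
import json
import boto3

dms = boto3.client("dms")

# Illustrative batch-apply tuning; the ARN and values are placeholders, not production settings.
task_settings = {
    "ChangeProcessingTuning": {
        "BatchApplyTimeoutMin": 1,     # seconds: flush batches quickly for near-real-time CDC
        "BatchApplyTimeoutMax": 30,
        "BatchApplyMemoryLimit": 500,  # MB
        "BatchSplitSize": 0,           # 0 = no forced split
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:ap-south-1:123456789012:task:EXAMPLE",  # placeholder
    ReplicationTaskSettings=json.dumps(task_settings),
)
```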
Before we go ahead and discuss the solution we implemented, let’s look at the constraints we had-
- Cannot change AWS-DMS configuration. Data still needs to arrive in Data Lake ASAP.
- Cannot update the AWS-S3, Glue Catalog and Athena setup since that may require changes for too many downstream consumers.
- Need to ensure that the solution itself is cheap and can scale long term when the data size increases exponentially.
While the problem does sound complex, the solution we decided on was simple. The biggest challenge was that we had no control over the latest day’s data- it would keep arriving in new files. However, the number of files present in “yesterday’s” partition folder was constant. We could easily merge them into a single file (or multiple files with n rows each).
Let’s look at the price difference between having a single file versus 2,000 files per day, for 500 tables-
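The exact numbers depend on your region and query patterns, but a back-of-the-envelope calculation makes the gap obvious. Every input below is an illustrative assumption (a typical S3 GET price of $0.0004 per 1,000 requests, one query per table per day, each scanning a year of daily partitions), not our actual bill:

```python
# Back-of-the-envelope S3 GET cost comparison -- all inputs are illustrative assumptions.
GET_PRICE_PER_1000 = 0.0004       # USD per 1,000 GET requests (check current S3 pricing)
TABLES = 500
QUERIES_PER_TABLE_PER_DAY = 1     # assumed
PARTITIONS_PER_QUERY = 365        # assumed: each query scans a year of daily partitions

def monthly_get_cost(files_per_partition: int, days: int = 30) -> float:
    # Athena issues at least one GET request per object it has to open.
    gets = TABLES * QUERIES_PER_TABLE_PER_DAY * PARTITIONS_PER_QUERY * files_per_partition * days
    return gets / 1000 * GET_PRICE_PER_1000

print(f"2000 files per partition: ${monthly_get_cost(2000):>10,.2f} / month")
print(f"   1 file  per partition: ${monthly_get_cost(1):>10,.2f} / month")
```

Even under these modest assumptions, the GET request bill differs by a factor of 2,000- exactly the ratio of files per partition.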
Parquet Merging DIY
We’ve discussed what we did; now let’s explain how we did it, so that you can implement the same in your own Data Lake setup. For the daily runs, we are using a small Step Function of just 4 steps.
In the first step, we fetch the list of tables whose data will be merged, and get the “T-1” date’s folder for each table. Next we move to the map state, which runs in a loop, once for each table. Here you can set a concurrency of 5, 10 or more, based on how many tables you will be merging data for. Inside this map state, we have 3 critical steps.
Step 1- We copy the files for that day into a backup bucket (you can skip this step if you don’t need the raw files).
Step 2- We merge the files and create the merged file in the original folder.
Step 3- We keep the merged file and delete all other files from the “T-1” date’s folder.
The process repeats in a loop until this activity has been performed for all the tables in your list. If the number of tables is very high, you may need to set a higher concurrency.
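To make those three steps concrete, here is a rough sketch of what the per-table worker inside the map state could look like. The exact code we run will follow in part 2; this version uses the awswrangler library for brevity, the bucket names and prefixes are placeholders, and in practice you would split the work across the Step Function’s states and handle retries and failures properly.

```python
from datetime import date, timedelta

import awswrangler as wr

BUCKET = "my-data-lake-bucket"         # placeholder
BACKUP_BUCKET = "my-datalake-backup"   # placeholder

def merge_partition(table_prefix: str, day: date) -> None:
    """Backup, merge and replace the Parquet files of one table's T-1 partition."""
    partition = f"s3://{BUCKET}/{table_prefix}/{day:%Y/%m/%d}/"
    backup = f"s3://{BACKUP_BUCKET}/{table_prefix}/{day:%Y/%m/%d}/"

    original_files = wr.s3.list_objects(partition)
    if len(original_files) <= 1:
        return  # nothing to merge

    # Step 1: copy the raw files to a backup bucket (optional).
    wr.s3.copy_objects(paths=original_files, source_path=partition, target_path=backup)

    # Step 2: read the whole partition and write it back as a single file.
    df = wr.s3.read_parquet(path=partition)
    wr.s3.to_parquet(df=df, path=f"{partition}merged.parquet")

    # Step 3: delete everything except the merged file.
    wr.s3.delete_objects(original_files)

# Example: merge yesterday's partition for one (hypothetical) table.
merge_partition("public/customers", date.today() - timedelta(days=1))
```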
We are running this at 10 o’clock in the morning, with the goal of ensuring that all of the previous day’s data gets merged within 1–2 hours (so by noon at the latest).
One point to note is that this flow only takes care of the live data flow. You will also need to run the merge over all the older data, so that each (partitioned) date folder for a table ends up with a single parquet file.
In case live data is not a requirement for you, you could instead merge the data, dump it into a new bucket, and let people query that bucket. This can be done on a daily basis for the “T-1” day’s data, as long as users are happy with slightly delayed data. Given that we are a financial institution where every second counts, this isn’t a luxury we had, so we went with the backup, merge and delete route.
Now if you want to recap and get the gist of how we optimised our data lake over the years, this is what it looks like-
- Day 1: A single Postgres (AWS RDS) instance, with one schema per microservice, serving all transactional workloads.
- Day 365: AWS DMS streams every change (CDC) into an S3 data lake in Parquet format, with a time_stamp column added, catalogued by Glue Crawlers and queried via Athena.
- Day 730: The data lake is partitioned by date, via DatePartitionEnabled for fresh data and a backfill script for historical data.
- Day 1825: Each day’s partition is merged into a single parquet file per table by a scheduled Step Function, slashing S3 GET request and Athena costs.
I joined the Tech Team at Credit Saison India on Day 1 and have been working on the Data Platform for almost 5 years. While data infrastructure costs have always been a concern whenever we decided to build new features, this last optimisation has helped our Athena query bills plummet by more than 90%, or almost half a million dollars (and still decreasing every day).
We will be sharing the exact code and infrastructure that we used for this system in part 2 of this article! Till then, get grinding- and if you are using parquet files and querying them via Athena, we hope this saves you a ton of AWS budget!