Save on storage costs with Databricks on AWS
The most common Databricks on AWS deployments use EC2 instances as the compute layer and S3 as the primary storage layer. Data pipeline stages, a.k.a. data lakes, are often stored in S3 in formats like Parquet, JSON, and CSV, which have native integration support with Spark on Databricks. Data lakes evolved into Delta lakes: Delta Lake adds ACID transactions, scalable metadata handling, unified streaming and batch processing, time travel, and more on top of existing data lakes. This is the story of how the S3 storage costs associated with Delta lakes on AWS S3 can be reduced significantly, in our case by 95%.
Internals of Delta Lake (Delta Log)
Before we jump to cost savings, it is important to understand the internals of a Delta lake to get insight into how the data is stored. This is an excellent article about the delta table’s transaction log, a.k.a. the delta log. For brevity, I will only provide a summary of the delta log concept here.
- Delta tables are parquet tables with a transactional log.
- DeltaLog is a transaction log (changelog) of all the changes (actions) to (the state of) a delta table, stored in JSON format. The delta log reflects the current state of the delta table at any given time.
- Each commit is written out as a JSON file starting with 00000.json, and as more commits happen the file names are incremented to 00001.json and so on.
- All transactions performed on the table can be broken into 6 actions, namely add file, remove file, update metadata, set transaction, change protocol, and commit info.
- ACID transactions are achieved in a delta table by making the data files (.parquet) immutable. For example, if X rows residing in file-1.parquet are affected by a commit (update/delete), the file is never modified in place. Instead, all the rows in file-1.parquet are loaded into memory, the updates are applied, and the result is written out as a new file, say file-2.parquet. The delta log then records this commit as file-1.parquet being removed from the state and file-2.parquet being added to it.
- The model also enables time traveling through the table’s commit history, both for auditing and for restoring the table to any point in time, which can be a lifesaver on large data sets (a short sketch follows this list). Restoration is possible because Spark, or the delta engine, does not eagerly remove files that are no longer part of the table’s state (like file-1.parquet).
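To make the log and time-travel behavior concrete, here is a minimal PySpark sketch. The table path /mnt/datalake/events is a hypothetical placeholder, and it assumes a Delta-enabled Spark session such as a Databricks notebook.

```python
# Minimal sketch: inspect a delta table's commit history and time travel.
# The path /mnt/datalake/events is hypothetical; `spark` is the active session.
from delta.tables import DeltaTable

table_path = "/mnt/datalake/events"

# Each history entry corresponds to one commit (one NNNNN.json file in _delta_log).
history = DeltaTable.forPath(spark, table_path).history()
history.select("version", "timestamp", "operation").show(truncate=False)

# Read the table as of an earlier commit (time travel).
old_snapshot = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load(table_path)
)

# Restore the live table to that version (available in recent Delta Lake releases).
spark.sql(f"RESTORE TABLE delta.`{table_path}` TO VERSION AS OF 0")
```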
Data Duplication
These are cool features of the delta table, but they come with side effects that can seriously inflate your costs if left unchecked. Let us dig deeper into points 5 and 6 from the previous part. Assuming file-1.parquet has 10K rows and is ~150 MB, after the commit the storage footprint grows to at least ~280 MB. Even if the commit modified only 1K rows, the other 9K rows are duplicated into file-2.parquet. With this happening on every commit, imagine how the storage cost snowballs. Before imagining, see our actual S3 cost build-up for a table of 500M rows receiving 3M updates per day.
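As a rough illustration (not our production pipeline), the duplication is visible in a commit’s operation metrics. The table path, predicate, and column values below are hypothetical, and the exact metric names can vary between Delta Lake versions.

```python
# Hypothetical illustration of how a small UPDATE rewrites whole data files.
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/mnt/datalake/events")  # hypothetical path

# Update a small fraction of rows; the parquet files containing them are
# rewritten in full, not modified in place.
dt.update(
    condition="event_date = '2021-01-01'",   # hypothetical predicate
    set={"status": "'processed'"},           # hypothetical column/value
)

# The latest commit's metrics show files removed from and added to the state,
# even though only a few rows changed (metric names vary by Delta version).
(dt.history(1)
   .select("version", "operation", "operationMetrics")
   .show(truncate=False))
```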
The S3 Effect
The only way to delete files that are no longer needed for the table’s state is to run the VACUUM command on the table. The command takes a retention threshold (days) and only removes files older than that threshold, so time travel and restore remain available within the retention period. This is great and appears to solve the problem of runaway file duplication, but there is one more caveat. Unless versioning is disabled on the S3 bucket backing the delta table, the deletes issued by VACUUM only add a Delete Marker to the files; the actual versions are still retained for X days for general-purpose restoration, where X is defined in the S3 bucket’s Lifecycle Policy.
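For reference, a typical invocation looks like the sketch below. The table path is hypothetical, and the 168 hours shown is simply the Delta default of 7 days.

```python
# Sketch: vacuum a delta table with a 7-day (168-hour) retention threshold.
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/mnt/datalake/events")  # hypothetical path

# Issues S3 deletes for data files older than the retention window that are no
# longer referenced by the table's state. On a version-enabled bucket this only
# places delete markers; the non-current versions remain and keep costing money.
dt.vacuum(retentionHours=168)

# Equivalent SQL form:
spark.sql("VACUUM delta.`/mnt/datalake/events` RETAIN 168 HOURS")
```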
Bottom line: if the delta tables are backed by a version-enabled S3 bucket with a very generous lifecycle policy, the storage cost will keep climbing, and mitigation through VACUUM has very little effect. The bitter truth is that, after a while, much of your S3 cost goes toward files that no longer make up your table.
Finally, Where Are the Savings?
Dedicated Version-Disabled Bucket
The easiest solution to a problem is to not create the problem in the first place, right? Even though a delta table is naturally designed to duplicate data files, running VACUUM on a table stored in a dedicated bucket with versioning disabled effectively nullifies the duplication without hindering the table’s best features. As the previous cost timeline shows, vacuuming a table in a version-enabled bucket offers only a slight, insignificant reduction in cost, and having the lifecycle policy move the data files to the S3 Glacier class offers only a little relief.
While this is the easiest solution, it is often not favored in an enterprise because of the organization’s data retention, security, and versioning policies.
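If you do control the bucket, checking its versioning state is straightforward with boto3. The bucket name below is hypothetical, and suspending versioning is an organizational decision that should be cleared with your platform and security teams.

```python
# Sketch: check (and optionally suspend) versioning on the bucket backing the tables.
# The bucket name is hypothetical.
import boto3

s3 = boto3.client("s3")
bucket = "my-delta-lake-bucket"

status = s3.get_bucket_versioning(Bucket=bucket).get("Status", "Disabled")
print(f"Versioning status for {bucket}: {status}")

# Versioning cannot be fully disabled once enabled, only suspended:
# s3.put_bucket_versioning(
#     Bucket=bucket,
#     VersioningConfiguration={"Status": "Suspended"},
# )
```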
Lifecycle Policy
Having a lifecycle policy that gives the non-current versions (the deleted S3 objects) a short life will reduce the cost significantly. The cost timeline graph shows the difference in the cost drop with a span of 90 days versus 7 days. This lifetime should match the threshold used for the VACUUM command; it is the period within which you can time-travel, audit, or restore the table. The longer the period, the more you pay for storage.
Lifecycle policies are applied at the bucket level, so if the bucket is shared with other objects that require a longer lifetime, the two requirements conflict. All the more reason to choose a dedicated bucket for the delta tables.
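As a sketch, a rule that expires non-current versions after 7 days can be applied with boto3 roughly as shown below. The bucket name is hypothetical, and in practice the rule may be managed through Terraform or CloudFormation instead; the 7 days should match your VACUUM retention threshold.

```python
# Sketch: lifecycle rule expiring non-current (deleted) object versions after 7 days.
# Bucket name is hypothetical; align NoncurrentDays with the VACUUM threshold.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-delta-lake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-noncurrent-delta-files",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # applies to the whole bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": 7},
                # Also clean up delete markers left with no versions under them.
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            }
        ]
    },
)
```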
Programmatic S3 Vacuum Using Python
While the previous suggestions are not very practical in an enterprise, a few lines of code and a purging strategy will get us the best of all worlds. A simple Python program within the pipeline that filters the non-current versions under the delta table’s folder that were last modified more than X days ago (the purging strategy) and deletes them permanently is equivalent to, or even better than, the previous solutions, because it solves the problem without losing the benefits of a shared bucket, better data protection, etc. The AWS APIs support deletion in batches of up to 1,000 objects.
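Our actual program is not reproduced here, but a minimal sketch of the idea with boto3 could look like this. The bucket, prefix, and the 7-day threshold are hypothetical placeholders; the threshold should match the VACUUM retention so that time travel within that window is preserved.

```python
# Sketch: permanently delete non-current S3 object versions under a delta table's
# prefix that were last modified more than RETENTION_DAYS ago.
# Bucket, prefix, and retention are hypothetical placeholders.
from datetime import datetime, timedelta, timezone

import boto3

BUCKET = "my-shared-data-bucket"
PREFIX = "delta/events/"
RETENTION_DAYS = 7

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

paginator = s3.get_paginator("list_object_versions")
batch = []

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    # Non-current versions have IsLatest == False; delete markers are listed separately.
    for v in page.get("Versions", []) + page.get("DeleteMarkers", []):
        if not v["IsLatest"] and v["LastModified"] < cutoff:
            batch.append({"Key": v["Key"], "VersionId": v["VersionId"]})
        # delete_objects accepts at most 1,000 keys per request.
        if len(batch) == 1000:
            s3.delete_objects(Bucket=BUCKET, Delete={"Objects": batch, "Quiet": True})
            batch = []

if batch:
    s3.delete_objects(Bucket=BUCKET, Delete={"Objects": batch, "Quiet": True})
```

Deleting a specific version ID (rather than the key alone) is what makes the removal permanent on a version-enabled bucket, which is exactly the behavior VACUUM alone cannot achieve there.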