The Glacial Lakehouse: Glacier + Delta for Low(er) Cost S3 Backup

Greg Wood
Databricks Platform SME
9 min read · May 9, 2024

Using Delta Archival Support plus S3 Glacier to achieve reliable, long-term backups at lower cost

Intro to Tiers

Ask any cloud admin which component of their platform gives them the most cost anxiety, and you’ll likely hear one answer over and over: storage. What starts out seeming like a bargain (what’s a fraction of a cent per GB, after all?) ends up becoming one of the most difficult services to control and forecast. On top of the cost for the storage itself, you also pay each time you access, list, or move an object (and sometimes even when other people try to do so!). A strong governance strategy, restrictive read/write policies, and overall organizational awareness can go a long way towards slowing cost creep, but for true long-term control, they alone are not enough. This is where storage tiers come into the picture: services like Azure Archive and S3 Glacier allow resilient and lower-cost storage of files. These storage tiers are especially impactful when combined with lifecycle policies that automatically transition files of a certain age into different tiers. For example, if I know that certain logs in a given S3 bucket are accessed heavily for the first month, accessed lightly for the next month, and then stay indefinitely archived for regulatory purposes, I might create a policy like this:
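Expressed as an S3 lifecycle configuration, that policy might look something like the sketch below (via boto3; the bucket name and prefix are illustrative, and the specific classes chosen, Standard-IA at day 30 and Deep Archive at day 60, are one reasonable mapping of "lightly accessed" and "archived indefinitely"):

import boto3

# Sketch of the policy described above: keep logs in S3 Standard for the first
# month, move them to a cheaper tier for the second month, then archive them
# indefinitely for compliance. Bucket name and prefix are illustrative.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-log-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},   # lightly accessed
                    {"Days": 60, "StorageClass": "DEEP_ARCHIVE"},  # regulatory archive
                ],
            }
        ]
    },
)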

For the buckets under this policy, the cost will drop by about 50% after the first month, and then to $0.00099 per GB-month, or about 5% of the S3 Standard rate. Sounds like a steal! In fact, why not just put everything in Glacier? As always, there are asterisks when it comes to pricing.
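If you want to see where those percentages come from, the arithmetic is short; the Standard and Standard-IA rates below are assumed us-east-1 list prices, so check current pricing for your region:

standard = 0.023        # assumed S3 Standard list price, $/GB-month
standard_ia = 0.0125    # assumed S3 Standard-IA list price, $/GB-month
deep_archive = 0.00099  # S3 Glacier Deep Archive, $/GB-month (quoted above)

print(f"Standard-IA is {standard_ia / standard:.0%} of Standard")    # ~54%, i.e. roughly half
print(f"Deep Archive is {deep_archive / standard:.1%} of Standard")  # ~4.3%, i.e. roughly 5%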

The Fine Print of Tiering

Everyone would love to pay 95% less for storage, but this would, unsurprisingly, be a poor business model for cloud providers. Glacier, like every other storage class, has to be weighed on total cost of ownership versus suitability for a given application. For example, S3 Standard can be accessed at any time and provides moderate performance for most applications; S3 Standard-IA (Infrequent Access) offers a similar performance profile, but charges more per access and penalizes objects deleted within 30 days of creation. Glacier Deep Archive, on the other hand, may take up to 12 hours to retrieve a file and has a minimum retention period of 180 days.

As with most cloud services, there is no one right answer when it comes to storage, and you’ll more than likely do some trial-and-error testing before you find the most appropriate blend of tiers for your applications.

Backup & Restore on S3

Backup & Restore refers to a process where data is moved to a secondary location as a backup copy. Often, this backup copy of data is logically and physically separated from the primary copy to provide the highest possible resiliency. In the case of a disaster, accidental deletion, or malicious attack, the backup data can be used to restore the primary data, which will ideally limit any business impact and downtime. Backup & Restore is similar to, but distinct from, other operations such as Disaster Recovery and Data Archival; all of these play into an organization’s overall Business Continuity (BC) plan.

Although backing up data is a critical piece of a sound BC plan, it can easily become the plan’s most expensive aspect. Maintaining multiple copies of data across regions is, after all, just about the worst case scenario in terms of storage cost! This is where tiering becomes essential; storing backup data in a cost-effective tier can mean the difference between doubling or tripling your storage bill and maintaining a steady, forecastable spend level.

Let’s take the example of backing up 10TB of data from an S3 bucket in us-east-1 to us-west-2. We need to maintain the primary copy of data in us-east-1 for 90 days, after which it can be deleted. In the most naive case, we’d set up a pipeline that replicated data between regions, and a lifecycle policy that deleted the primary copy after 90 days; the backup copy would persist indefinitely. Here, both copies just remain in S3 Standard.

The one-year cost of this setup would be about $3,500: you’d pay roughly $235/month for raw storage in each bucket, and since the primary copy drops away after the first 3 months, you’re left paying 15 months’ worth of storage, plus about $25 for API and data transfer costs. Ongoing costs after year 1 would level out to about $2,800 for the backup copy in us-west-2. Now, let’s consider a similar setup that uses lifecycle policies and takes advantage of tiering: we’ll transition the primary data to S3 Standard-IA after 30 days and delete it after another 60 days; in the secondary region, we’ll transition immediately to Glacier Instant Retrieval.

In this scenario, we achieve very similar data availability characteristics, but with a one-year cost of about $1,000. After the first year, the cost drops off significantly to about $500/year to maintain the ongoing Glacier Instant Retrieval storage in us-west-2. That’s less than a third of the cost for year 1, and less than 25% of the cost in years 2+, compared to the non-tiered option above; we could push these costs even lower using Glacier Deep Archive or Flexible Retrieval.
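Here’s a back-of-the-envelope sketch of where those numbers come from. The per-GB-month rates are assumed list prices (Standard $0.023, Standard-IA $0.0125, Glacier Instant Retrieval $0.004), and API and transfer costs are left out:

# 10 TB expressed in GB, plus assumed per-GB-month list prices
gb = 10 * 1024
price = {"standard": 0.023, "standard_ia": 0.0125, "glacier_ir": 0.004}

# Naive setup: primary in Standard for 3 months, backup in Standard for 12 months
naive_year1 = gb * price["standard"] * (3 + 12)
naive_ongoing = gb * price["standard"] * 12

# Tiered setup: primary spends 1 month in Standard + 2 months in Standard-IA;
# backup sits in Glacier Instant Retrieval for all 12 months
tiered_year1 = gb * (price["standard"] + price["standard_ia"] * 2
                     + price["glacier_ir"] * 12)
tiered_ongoing = gb * price["glacier_ir"] * 12

print(f"Naive:  year 1 ~ ${naive_year1:,.0f}, ongoing ~ ${naive_ongoing:,.0f}/yr")   # ~$3,533 / ~$2,826
print(f"Tiered: year 1 ~ ${tiered_year1:,.0f}, ongoing ~ ${tiered_ongoing:,.0f}/yr") # ~$983 / ~$492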

It’s worth noting that this is an overly simplistic use case; it’s very unlikely that you’ll have a one-time backup with no additional writes. We’ll tackle a more realistic scenario below.

Using DEEP CLONE for Backup

Delta Lake has become a leading storage format for Lakehouse architectures. Because it colocates metadata with data and has useful compaction features built in, it is also an ideal format for performing backup and restore workloads, especially on Databricks where DEEP CLONE is available. It’s incredibly easy to write a Delta table to an arbitrary location, such as a bucket in another region, in a transactional, incremental, and automatable way. For example, to write a table named important_data to a bucket named my-backup-bucket, we can use this command:

CREATE OR REPLACE TABLE delta.`s3://my-backup-bucket/backup/path`
DEEP CLONE catalog_name.schema_name.important_data

We can repeat this operation as often as we like to write incremental updates to the same location, with no changes to the code needed. The bucket can be in the same region and account, or an entirely different region and/or account. Now, say we want to restore this table to a new metastore where it doesn’t exist at all. We’d just use the following command:

CREATE TABLE catalog_name.schema_name.important_data
DEEP CLONE delta.`s3://my-backup-bucket/backup/path`

In other words, we just run the clone in reverse to restore the table. Because the metadata and data are colocated, we don’t need to specify a schema, worry about version history, or do any prior setup: we just get an exact copy of the table as of the last time we performed a deep clone.
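To automate the backup side, you could wrap the clone in a small scheduled job. Here’s a minimal sketch in PySpark, assuming it runs on Databricks where DEEP CLONE is supported; the table names and paths are the illustrative ones from the example above:

from pyspark.sql import SparkSession

# Minimal sketch of a scheduled backup task; on Databricks this picks up the
# existing Spark session.
spark = SparkSession.builder.getOrCreate()

tables_to_backup = {
    "catalog_name.schema_name.important_data": "s3://my-backup-bucket/backup/path",
}

for source_table, target_path in tables_to_backup.items():
    # CREATE OR REPLACE + DEEP CLONE is incremental: only files added or changed
    # since the previous clone are copied to the target location.
    spark.sql(
        f"CREATE OR REPLACE TABLE delta.`{target_path}` DEEP CLONE {source_table}"
    )

# Restore is just the reverse direction, re-registering the backed-up copy:
# spark.sql("CREATE TABLE catalog_name.schema_name.important_data "
#           "DEEP CLONE delta.`s3://my-backup-bucket/backup/path`")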

Tiering with Delta Archival Support

So, we now have a backup mechanism for Delta Lake tables: a relatively simple DEEP CLONE operation that drops tables into a secondary bucket. How does this work in practice when we combine it with storage tiering such as Glacier? To answer that, it helps to understand a little about how Delta tables interact with S3. The Delta manifest contains most of the important metadata about a table, such as its schema, its history, and the files that make up each table version. When you query a Delta Lake table, the engine first consults the manifest to determine which files the query needs to scan, and then retrieves those files for processing. Those files are expected to be immediately accessible; if one of them is missing, any query that tries to read it will fail.

Note: this is why we generally do not recommend relying on geo-redundant storage, such as S3 cross-region bucket replication or ADLS GRS, for Delta tables. Files are not guaranteed to be copied in order or transactionally, so you may end up with files that are missing but that the manifest expects to be available.

For Delta, files in Glacier Flexible Retrieval or Deep Archive are considered inaccessible. Although the files do exist, they must be retrieved before they can be read, so a query that touches them will still fail as if they had been deleted. This is where Delta Archival Support comes in. It lets you set an archival window that matches your underlying lifecycle policy; as files age out of this window, the manifest treats them as if they had been deleted. This means a table can be partially or wholly transitioned to Glacier for backup without breaking queries against it, which is particularly useful for backup & restore workflows.
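For example, if your lifecycle policy archives objects after 7 days (an illustrative number), you’d set a matching window on the table, using the same table property we’ll use in the workflow below:

from pyspark.sql import SparkSession

# Sketch: align the table's archival window with the bucket's lifecycle rule.
# The 7-day window and the path are illustrative; use whatever your transition
# rule actually specifies.
spark = SparkSession.builder.getOrCreate()
spark.sql("""
    ALTER TABLE delta.`s3://my-backup-bucket/backup/path`
    SET TBLPROPERTIES (delta.timeUntilArchived = '7 days')
""")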

Example Backup & Restore Workflow with Delta

Putting it all together, let’s look at what a Backup workflow would look like using Delta, Glacier, and Archival Support. Here, we’ll assume a slightly more complex and realistic scenario, where we write 200GB of data to the primary bucket every day. This could be a mix of appends, updates, and deletes; since Delta rewrites files for each new table version, we don’t ever alter existing files, which is important for using Glacier.

Let’s assume we have the following lifecycle policies on our primary and backup buckets: for the primary bucket, we transition objects to S3 Standard-IA after 30 days, and delete them after 60 days. In the backup bucket, we transition objects to S3 Glacier Flexible Retrieval after 1 day. We’ll assume we keep the Glacier objects backed up for 1 year.
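These two policies might look something like the following boto3 sketch. Bucket names are illustrative, and depending on your table layout you may want to scope the rules more narrowly than the whole bucket:

import boto3

# Primary bucket: Standard -> Standard-IA at 30 days, delete at 60 days.
primary_rules = [{
    "ID": "tier-then-expire-primary",
    "Filter": {"Prefix": ""},
    "Status": "Enabled",
    "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
    "Expiration": {"Days": 60},
}]

# Backup bucket: Glacier Flexible Retrieval after 1 day, expire after 1 year.
backup_rules = [{
    "ID": "archive-backup",
    "Filter": {"Prefix": ""},
    "Status": "Enabled",
    "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
    "Expiration": {"Days": 365},
}]

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-primary-bucket",
    LifecycleConfiguration={"Rules": primary_rules},
)
s3.put_bucket_lifecycle_configuration(
    Bucket="my-backup-bucket",
    LifecycleConfiguration={"Rules": backup_rules},
)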

Let’s assume that the 200GB written each day is incremental, such that at the end of the first month, there’s 6TB in the primary bucket in S3 Standard. The cumulative costs per region and storage tier break down roughly as follows (we’ll ignore API and transfer costs for simplicity, and assume our regions are us-east-1 and us-west-2):
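As a sketch, assume list prices of about $0.023/GB-month for S3 Standard, $0.0125 for Standard-IA, and $0.0036 for Glacier Flexible Retrieval (actual rates vary by region and change over time):

# Rough model of the steady state described above; prices are assumed list rates.
daily_gb = 200

# Primary bucket: each object spends ~30 days in Standard, then ~30 days in IA.
primary_standard_gb = 30 * daily_gb   # ~6 TB
primary_ia_gb       = 30 * daily_gb   # ~6 TB
primary_monthly = primary_standard_gb * 0.023 + primary_ia_gb * 0.0125

# Backup bucket: objects land in Glacier after a day and expire after a year,
# so the archive grows toward ~73 TB (365 * 200 GB) and then plateaus.
backup_steady_gb = 365 * daily_gb
backup_monthly_steady = backup_steady_gb * 0.0036
# During year 1 the archive is still filling, so on average it holds about half that.
backup_monthly_year1_avg = backup_monthly_steady / 2

year1 = 12 * (primary_monthly + backup_monthly_year1_avg)
year2 = 12 * (primary_monthly + backup_monthly_steady)
ending_tb = (primary_standard_gb + primary_ia_gb + backup_steady_gb) / 1000

print(f"Year 1 ~ ${year1:,.0f}, Year 2+ ~ ${year2:,.0f}/yr")  # ~$4,100 and ~$5,700
print(f"Ending volume ~ {ending_tb:.0f} TB")                  # ~85 TB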

The total cost for year 1 here would be about $4,000, with an ending volume of 85TB across all regions and tiers. Assuming the workload remains stable in year 2 and beyond, the yearly cost thereafter would level out around $5,800, since objects more than 1 year old age out of Glacier.

In region 2, we would also run ALTER TABLE mytable SET TBLPROPERTIES(delta.timeUntilArchived = '1 days'); to enable archival support on the table; this ensures that if you do need to query the data in the backup region, the query won’t fail. That’s especially useful if you intend to maintain more than 1 day of data in S3 Standard in the backup location; for example, you might keep 30 days in Standard before transitioning to Glacier.

For comparison, keeping all data in S3 Standard and using a similar workflow to replicate across regions would result in a year 1 cost of more than $20,000 and a year 2+ cost of more than $40,000. That’s roughly 5x and 7x the cost of the tiered approach, respectively!

Conclusions

In this article, we discussed basic tiering considerations for cloud storage. We covered how to think about backup and effectively leverage storage tiers. We also explained how to use DEEP CLONE and Delta Archival Support to create a cost-efficient backup strategy. We mostly focused on AWS S3 in this article, but everything here applies equally to Azure. Like all things cloud, storage tiers take some thought and planning to get right, but they can deliver substantial cost reductions. Keep in mind that this only solves for data; if you have applications, such as Databricks, that also need to be backed up, you’ll need to consider the options specific to that app (for Databricks, the excellent Terraform Exporter is a sound choice). This also assumes you have somewhere to restore data to; in the case of a regional outage, you’ll most likely need a DR strategy. Fortunately, we’ll be writing more on DR in this channel, so keep an eye out for more stories soon. Until then, keep calm and tier on!

Greg Wood
Databricks Platform SME

Master of Disaster (Recovery), Purveyor of Platform. @Databricks since 2018.