S3 Retention Rules for Dummies
Introduction
Uploading files to S3 is a breeze, no matter which mechanism you use (Spark, the command line, etc.). S3 is a great way to store data in the Cloud. But if you're not careful, your S3 costs can climb very quickly. Let's explore how to keep them under control.
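Just to show how little friction there is, here's what an upload looks like with Python's boto3 library — the bucket and file names below are placeholders, not real resources:

```python
import boto3

# A minimal upload sketch using boto3. The bucket name and paths are
# hypothetical; credentials are assumed to be configured in the environment.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="local/path/data.parquet",  # hypothetical local file
    Bucket="my-data-bucket",             # hypothetical bucket name
    Key="raw/2024/data.parquet",         # destination key in the bucket
)
```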
S3 Costs Explained
New AWS customers can get up to 5 GB stored in S3 for free. That can go far for people maintaining basic files in the Cloud; for big data, it's nothing. Currently, storing data in the us-east-1 region in the Standard storage class, which is the default, costs $0.023 per GB per month for the first 50 TB. Assuming you process 1 TB's worth of data and store it all in S3, you're paying $23 per month for that data. That doesn't seem like much, but don't forget all the data you've stored up to this point: you're paying $23 per TB per month for that too, and the bill will just keep climbing if nothing is done.
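To make that math concrete, here's a quick back-of-the-envelope sketch in Python using the Standard-class rate quoted above (prices vary by region and change over time, so treat the numbers as illustrative):

```python
# Back-of-the-envelope S3 Standard storage cost, using the us-east-1 rate
# quoted above ($0.023 per GB-month). Uses 1 TB ~= 1000 GB to match the
# $23 figure in the text; actual AWS billing details may differ.
RATE_PER_GB_MONTH = 0.023

def monthly_bill(stored_gb: float) -> float:
    """Monthly storage cost in USD for data sitting in the Standard class."""
    return stored_gb * RATE_PER_GB_MONTH

# Ingest 1 TB per month and never delete anything: the bill keeps growing.
for month in range(1, 7):
    stored_gb = month * 1000
    print(f"Month {month}: {stored_gb} GB stored -> ${monthly_bill(stored_gb):.2f}")
```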
S3 Storage Classes
A storage class in S3 represents how the data is stored in the Cloud. Standard storage is the default, and it's meant for data that will be accessed fairly frequently. It offers the fastest, most consistent retrieval, and as a result it's the most expensive option.
Infrequent Access is for data that doesn't need to be accessed as often as the rest. For example, consumers may not be as interested in data from 6 months ago as they are in the current snapshot. That data can still be retrieved easily, just not quite as quickly as from Standard storage (though quickly enough that you won't notice a huge difference). In exchange, storage costs only $0.0125 per GB. An extension of Infrequent Access (or IA, as it's more commonly known) is One Zone IA, where the data resides in a single availability zone instead of being replicated across the others in the same region (only $0.01 per GB).
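If you already know up front that an object will rarely be read, you can write it straight into IA rather than transitioning it later. A quick boto3 sketch, with hypothetical bucket and key names:

```python
import boto3

s3 = boto3.client("s3")

# Write an object directly into the Infrequent Access class.
# Bucket, key, and body here are placeholders for illustration.
s3.put_object(
    Bucket="my-data-bucket",
    Key="archive/2023-snapshot.csv",
    Body=b"col_a,col_b\n1,2\n",
    StorageClass="STANDARD_IA",  # or "ONEZONE_IA" for the single-zone variant
)
```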
Glacier, as its name might imply, offers the slowest retrieval. It's essentially for data you'll only need to bring out for auditing purposes or reprocessing. The data will take a few hours to bring back to "hot" storage, but it is really cheap compared to Standard storage (around $0.004 per GB, depending on the option you choose).
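Getting an object back out of Glacier is a two-step dance: you request a temporary restore, wait for it to complete, then download as usual. A rough boto3 sketch (bucket and key are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to restore a Glacier-class object to "hot" storage for 7 days.
# Bucket and key are hypothetical; the "Standard" retrieval tier typically
# takes a few hours, while "Expedited" is faster and "Bulk" is cheaper.
s3.restore_object(
    Bucket="my-data-bucket",
    Key="audit/2019/events.parquet",
    RestoreRequest={
        "Days": 7,  # how long the restored copy stays available
        "GlacierJobParameters": {"Tier": "Standard"},
    },
)
# Poll head_object on the key until the restore finishes, then download.
```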
Lifecycle Policies
Not all data in the same S3 bucket needs to be in the same storage class. You can configure lifecycle policies to transition (or even eventually delete) data at various points. For example, you might decide that after 3 months, data in a certain folder is no longer as important and can move to IA instead of sitting in Standard storage. Maybe after a year, it can go to Glacier. You can also configure data for deletion after some specified period of time, at which point you stop paying for its storage entirely.
Lifecycle policies can be configured under the Management tab of an S3 bucket in the S3 console. You can also apply them at the prefix level if you want to handle different folders in the same bucket differently.
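If you'd rather script it than click through the console, the same example policy (IA after 3 months, Glacier after a year, deletion down the road) can be applied with boto3. A sketch with a hypothetical bucket name and prefix; note that this call replaces any lifecycle rules already on the bucket, so include every rule you want to keep:

```python
import boto3

s3 = boto3.client("s3")

# Apply the example policy from above: transition to IA after ~3 months,
# to Glacier after a year, and delete after ~5 years. The bucket name and
# "logs/" prefix are hypothetical. This REPLACES the bucket's existing rules.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-logs",
                "Filter": {"Prefix": "logs/"},  # prefix-level targeting
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 1825},  # stop paying for it entirely
            }
        ]
    },
)
```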
Intelligent Tiering
Sometimes, it’s easy to know the access patterns of your data. You either know the data like the back of your hand or your data governance team has default policies that should be applied so that PII is handled appropriately. But what if you have no clue?
AWS released the Intelligent-Tiering storage class a few years back to handle this scenario. Intelligent-Tiering monitors the access patterns of your data and automatically moves each object to the appropriate tier. It costs a little extra on top of the storage costs you'd be paying normally (a small per-object monitoring charge), but it is a great solution for those who want that analysis work automated.
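Opting in can be as simple as writing objects with the Intelligent-Tiering storage class and letting S3 take it from there. A minimal boto3 sketch (bucket and key are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Let S3 decide: objects written to Intelligent-Tiering move between access
# tiers on their own as access patterns change. Names here are hypothetical.
s3.put_object(
    Bucket="my-data-bucket",
    Key="unknown-access/metrics.json",
    Body=b"{}",
    StorageClass="INTELLIGENT_TIERING",
)
```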
Recommendations
Definitely work not only within your team, but also with the external teams who access your data on a regular basis, to figure out the best lifecycle policies for it. If you apply policies without consultation, it can get messy trying to restore something that was actually needed but has since been deleted. Also make sure these policies are reviewed by your governance team, so that sensitive data is handled appropriately.
Conclusion
I can speak from experience that S3 can get really expensive if it's not handled appropriately. When looking at the Cloud costs for a project I was working on a few years back, I noticed that our S3 spend, compared to other services, was huge. I realized we had few lifecycle policies configured and were essentially paying for everything to be stored in the Standard storage class. After working with the team to put lifecycle policies in place on the remaining buckets that needed them, we were able to save tens of thousands of dollars on our S3 spend.
When working with data that will be stored in S3, establish how it will be stored in the early stages of the project rather than later. The sooner the appropriate policies are put in place, the less you'll have to pay, both upfront and down the line. Don't let yourself get nickel-and-dimed.