How to maintain order in the versioning and lifecycle policies of your S3 buckets

Jacek Małyszko
Published in Fandom Engineering
Feb 24, 2023

I know that topics such as cloud storage may not seem exciting, but I find S3 just about as cool as a storage service can be. With S3, you can store as much data as you need and add an analytics layer on top to create a fully operational Data Lake. This has proven very useful for the Data Engineering team at Fandom.

Still, as you accumulate numerous datasets and pipelines, things can become complicated and expensive. S3 provides options for object versioning and lifecycle policies, but if you do not pay attention to them, at some stage you may find yourself with no idea what data actually still resides in your S3 buckets and what is driving the ever-growing costs on your AWS bills. Not to mention losing your data due to misconfigured buckets and human errors.

In this post, I aim to describe our efforts to create a management framework for S3 buckets using the two S3 configuration settings just mentioned: versioning and object lifecycle policies. The outline of this article is as follows: we will begin with some basic information on the S3 configuration options that were the focus of our cleanup effort. Subsequently, we will present the framework we have defined for S3 bucket management, which includes a labeling scheme for S3 buckets, along with a mapping of the defined labels to specific S3 configuration settings.

The described approach is not very sophisticated or complicated, but we believe it can successfully limit the confusion around the management of S3 bucket settings and actually help prevent data loss and unnecessary AWS costs.

Versioning and lifecycle rules — why it’s important to get these settings right

As per the official AWS S3 documentation, versioning in Amazon S3 is a means of keeping multiple variants of an object in the same bucket. You can use the S3 Versioning feature to preserve, retrieve, and restore every version of every object stored in your buckets. What is important from a cost perspective is that all versions of each object are stored on S3, which means that we need to pay for the storage of those files as long as they remain on the platform. These previous versions are only somewhat hidden from the default view — either by newer versions of the objects, or by so-called delete markers in the case of deleted objects. To help manage these costs, AWS offers Object Lifecycle Management. Among other things, it allows you to set expiration for previous versions of objects (whether overwritten or deleted), that is, to define when these previous versions are permanently removed from S3.
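
To make this concrete, here is a minimal sketch of what these two settings look like in Terraform (the IaC tool we use later in this post), assuming the AWS provider v4+; the bucket name and the 30-day retention period are purely illustrative, not recommendations:

```hcl
resource "aws_s3_bucket" "example" {
  bucket = "my-example-bucket" # hypothetical name
}

resource "aws_s3_bucket_versioning" "example" {
  bucket = aws_s3_bucket.example.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "example" {
  bucket = aws_s3_bucket.example.id

  rule {
    id     = "expire-noncurrent-versions"
    status = "Enabled"
    filter {} # apply to all objects in the bucket

    # Permanently delete previous (noncurrent) versions 30 days after
    # they are overwritten or hidden by a delete marker.
    noncurrent_version_expiration {
      noncurrent_days = 30 # illustrative value
    }

    # Clean up delete markers once no noncurrent versions remain.
    expiration {
      expired_object_delete_marker = true
    }
  }
}
```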

On the technical side, applying configuration settings for versioning and lifecycle rules is straightforward. You just need to find the appropriate configuration dialog in the AWS console and fill it in appropriately (or, preferably, use some IaC solution for that). However, ensuring that a consistent set of these settings is applied across all your buckets can be challenging, particularly when you have multiple buckets storing data with varying characteristics. Without proper attention, your bucket configurations can quickly become disorganized, making it difficult to track which rules are assigned to which bucket and why.

For instance, we once faced a situation where some data that should have been “deleted” from our S3 was still being stored for many months. This happened because, as previously mentioned, when versioning is enabled, removing objects does not actually delete them but only hides them behind delete markers. If there is no lifecycle rule in place to remove these historical versions of objects, they will be stored indefinitely. With dozens of buckets to manage, it’s easy to overlook unnecessary data when you have petabytes of it in your data lake. However, Amazon never forgets, and this will result in higher AWS bills. Even if the monthly cost of storing the data doesn’t seem significant, if you continue to pay for that storage every month for a year or more (until someone discovers the forgotten historical versions), the costs can pile up, sometimes resulting in surprisingly high expenses.

On another occasion, one of our engineers was using a test bucket to store intermediate datasets for a large, multi-step backfill task of production-grade data. We lost several days of work when a different engineer wiped out the contents of the entire test bucket, removing that data with it. Unfortunately, versioning wasn’t enabled for the bucket at all. This made us wonder whether buckets that we use for storing test data should perhaps have versioning enabled as well, probably with a lifecycle rule that removes obsolete object versions quickly (i.e., in a matter of a few days). As your team and dataset sizes increase, situations like this become more common.

Our improvement procedure

After encountering these kinds of situations, we decided it was a good idea to deal with the mess we had in a systematic way. We followed this procedure:

  1. We created a simple labeling scheme for our buckets. You can think of it as defining certain labels that describe the most important characteristics of the datasets stored in each S3 bucket. Our goal was to make the labeling scheme as straightforward as possible while still covering all possible scenarios.
  2. Next, we mapped the defined labels to the appropriate S3 configuration settings.
  3. We reviewed our buckets and assigned appropriate labels to each one based on the datasets stored within them. Each bucket may have only one label from each set, and we should choose the one that best describes the kind of data stored in the bucket.
  4. Next, we created Terraform code for the lifecycle rules defined in the previous steps. Each lifecycle rule was named according to the set of labels it corresponded to (see the sketch after this list).
  5. Finally, we applied the Terraform configuration to our buckets.
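
For illustration, a lifecycle rule named after a label set might look like the sketch below, reusing the example bucket from the earlier sketch; the label names come from the scheme described in the next section, and the 30-day value is again just an example:

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "labeled" {
  bucket = aws_s3_bucket.example.id

  rule {
    # The rule id encodes the label set, so a glance at the bucket's
    # lifecycle configuration tells you which labels drove the settings.
    id     = "long-lasting-production--business-critical"
    status = "Enabled"
    filter {} # apply to all objects in the bucket

    noncurrent_version_expiration {
      noncurrent_days = 30 # illustrative value
    }
  }
}
```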

Below we describe the first two steps in detail. Steps 3–5 are more technical and follow logically from the first two, so they do not need additional description.

Bucket labeling scheme

We conceptually defined the following labels for our datasets, depending on the type of data stored in each bucket:

  1. Long-lasting production data: losing this data would either incur significant costs (due to the effort and processing power required to backfill it) or be irreversible, and there is a business requirement to retain the data for an extended period.
  2. Transient production data: this data has clear business value and is used by some teams within the organization, but its loss is not particularly problematic. The data is only valid for a short period of time and is recreated regularly, or it can be easily regenerated in another way.
  3. Test data: datasets for testing and/or development purposes. As with production buckets, here we may also distinguish between long-lasting and transient datasets (i.e., a test dataset may inherit this characteristic from the corresponding production dataset).
  4. Analytical data: data created by data analysts, usually for the needs of temporary analytical tasks. If the data needs to be stored for a longer time, it should be categorized as “Long-lasting production data” instead.
  5. Binary data: used to store e.g. application code required for running ETL jobs (jar files, py files, etc.); this may not be relevant for other organizations, but it was important in the context of the Fandom Data Engineering team.
  6. Secrets: used to store e.g. certificate keys. These should be extra-safe as far as access rights are concerned, but that is out of the scope of this post.

Additionally, we defined two importance labels to give more context for each dataset. These labels are:

  • Business-critical: This label is assigned to all buckets related to data that we know is still in use by some teams at Fandom (both to production and testing buckets). Losing this data would significantly impact our company.
  • Non-business-critical: This label is assigned to buckets that store less important data. Losing this data would not affect us as severely. In many cases, we apply this label to buckets that store obsolete data that is being kept only for historical purposes (e.g., when a pipeline delivering this data has been deactivated and we keep this data “just in case”).
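
To make the scheme concrete, label assignments can be recorded as plain data in Terraform. In the sketch below the bucket names are hypothetical; each bucket gets one data-type label and one importance label:

```hcl
locals {
  # Hypothetical bucket names and assignments, for illustration only.
  bucket_labels = {
    "dwh-core-tables"   = { data_type = "long-lasting-production", importance = "business-critical" }
    "etl-scratch-space" = { data_type = "test",                    importance = "non-business-critical" }
    "jobs-artifacts"    = { data_type = "binary",                  importance = "business-critical" }
  }
}
```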

Versioning settings based on the assigned labels

Having these two sets of labels, we wanted one of two situations to be true for each bucket:

  1. Versioning is enabled.
  2. If versioning is not enabled, we need to know the exact reason why.

We began by going through each label and discussing whether versioning is needed for buckets with that particular label. To help make the final decision, we tried to justify whether versioning is relevant for the given class or not. At this stage, we came up with the table presented below. Please note that it is valid for a specific company in a specific context and may not apply to your case, but it illustrates the kind of reasoning you may want to reproduce for your own needs.

[Table: versioning decision (Yes/No) and justification for each data-type label]

As a next step, we included the data importance labels in our analysis. As previously mentioned, we assigned this label to each bucket as well. When considering this label, decisions were made based on a table similar to the one below (note that this is just an example and the Yes/No values may differ based on specific needs):

[Table: versioning decision (Yes/No) for each importance label]

As a result, we can now simply tell what versioning settings each bucket should have just by checking which labels were assigned to it.
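
Since the actual tables are specific to our organization, here is just one way such a decision table might be encoded in Terraform. Only the first and last entries correspond to label combinations named explicitly later in this post; the others are made-up examples:

```hcl
locals {
  # Keys are "<data-type label>.<importance label>"; values say whether
  # versioning should be enabled. Illustrative, not our actual table.
  versioning_enabled = {
    "long-lasting-production.business-critical" = true
    "transient-production.business-critical"    = true
    "binary.business-critical"                  = true
    "test.non-business-critical"                = false
  }
}
```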

Proposed mapping of labels to lifecycle rules

Next, for each set of labels where versioning is enabled, we needed to define a lifecycle rule that specifies when historical versions of objects (overwritten or hidden by a delete marker) should be removed. On the one hand, the more conservative the rule (i.e., the longer we wait before an object version is finally removed), the safer we are against removing data that may turn out to be still needed. On the other hand, we’ll need to pay more for storage, as we keep things longer on S3.

Our main goal was to ensure that there is no risk of storing historical versions indefinitely and generating costs for potentially many months or years. Therefore, it’s unacceptable for ANY bucket to have versioning enabled without a defined lifecycle rule for removing historical versions.

We came up with the following mapping. As a general rule of thumb, for buckets with higher importance, we may want to keep historical versions of objects longer. Again, in the context of your organization you may come up with completely different rules.

[Table: retention period for historical object versions for each label set]
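
Expressed in Terraform, such a mapping might boil down to a map of retention periods, plus a check that surfaces any versioned bucket left without one. The day counts are illustrative, and `check` blocks require Terraform 1.5+ (they report failures as warnings during plan):

```hcl
locals {
  # Illustrative retention periods (in days) for noncurrent object versions;
  # higher-importance data keeps historical versions longer.
  noncurrent_retention_days = {
    "business-critical"     = 30
    "non-business-critical" = 7
  }
}

# Surface any violation of the rule above: a bucket with versioning enabled
# must always map to a defined retention period.
check "versioned_buckets_have_retention" {
  assert {
    condition = alltrue([
      for name, labels in local.bucket_labels :
      !lookup(local.versioning_enabled, "${labels.data_type}.${labels.importance}", false)
      || contains(keys(local.noncurrent_retention_days), labels.importance)
    ])
    error_message = "Every bucket with versioning enabled must have a retention period defined."
  }
}
```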

Next steps and summary

After preparing the labeling scheme, we can proceed to applying labels to the buckets. For example, we may label some buckets as “Long-lasting production data + business-critical” or “binary data + non-business-critical.” Next, using the chosen Infrastructure as Code (IaC) solution, such as Terraform, we define lifecycle policies for each label configuration as defined in the table above. Finally, based on the assigned labels, we configure each bucket with the following settings (see the sketch after this list):

  • Enable or disable versioning for each bucket based on its assigned label. For example, “Long-lasting production data + business-critical” buckets will have versioning enabled, while “transient test data + non-business-critical” buckets will have it disabled.
  • Assign an appropriate lifecycle policy to each bucket based on its assigned label.
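
Putting the pieces together, the final wiring might look like the sketch below, building on the hypothetical locals from the earlier sketches: versioning follows the decision map, and the retention period follows the importance label.

```hcl
# Enable versioning only for buckets whose label combination calls for it.
resource "aws_s3_bucket_versioning" "managed" {
  for_each = {
    for name, labels in local.bucket_labels : name => labels
    if lookup(local.versioning_enabled, "${labels.data_type}.${labels.importance}", false)
  }

  bucket = each.key
  versioning_configuration {
    status = "Enabled"
  }
}

# Every versioned bucket gets a lifecycle rule named after its label set.
resource "aws_s3_bucket_lifecycle_configuration" "managed" {
  for_each = aws_s3_bucket_versioning.managed

  bucket = each.key

  rule {
    id     = "${local.bucket_labels[each.key].data_type}--${local.bucket_labels[each.key].importance}"
    status = "Enabled"
    filter {} # apply to all objects in the bucket

    noncurrent_version_expiration {
      noncurrent_days = local.noncurrent_retention_days[local.bucket_labels[each.key].importance]
    }
  }
}
```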

In summary, after completing these steps, we achieve the following outcomes:

  • Our S3 buckets are managed according to defined rules, making it easier to configure new buckets. We simply need to label them based on the data they store.
  • Each bucket managed according to this framework incurs no unnecessary AWS costs related to forgotten historical versions of objects that are no longer needed.
  • If we decide to change the rules for a set of labels, such as the period of removal of historical versions of objects, we can modify the defined lifecycle rule once in our IaC code, and the changes will be automatically applied to all buckets with corresponding labels.

One final disclaimer: any such changes need to be planned and applied carefully, reviewed by the relevant teams and team members, and well adjusted to the context of your organization. Still, we hope that the effort put into applying such a framework may prove useful in the long run for many organizations. Take care and keep your AWS bills low!

Originally published at https://dev.fandom.com.
