Cloud Data Platforms: Data Retention and Minimization by Design

Alan Pointing
TotalEnergies Digital Factory
Jan 29, 2024

Data platforms and digital transformation

The use of data platforms has become a critical component of the cloud technology companies need to deliver digital transformation and to enable easier data sharing across enterprises.

Ensuring that the content on the data platform is fully managed is fundamental to minimizing costs, mitigating financial and legal risks, and preventing the datalake from turning into a data swamp.

Image generated by Bing Chat Enterprise

In my previous Medium article about Dark Data, I explored the issues companies face from the deluge of data and from not knowing what data they have or how useful it is. Cloud data platforms can exacerbate this problem, since they often hold copies of data from other systems in order to centralize, process and share it.

In this article I will explore the options available to us to minimize the data footprint on cloud data platforms.

Data Lifecycle management on data platforms — data retention

It is critical that companies apply sound data management and data governance policies and practices to control and manage the content on cloud data platforms. One of these policies is that of data lifecycle management, where data is tracked through various stages as shown in the diagram below.

Simplified data lifecycle

The key stage is the disposal of the data at the end of the lifecycle. In order to know what to delete and when to schedule it, the data has to be classified and catalogued, and the appropriate data retention rule applied to it, depending on the type of data and the business context of its origin and usage. These retention rules are needed for regulatory, legal, contractual or practical reasons.

For example, in TotalEnergies, a data protection and privacy process is activated for all new digital initiatives and projects, to ensure that all data being used and generated by the new solution is analysed and the relevant data retention periods documented so they can be acted upon.

However, implementing a practical and functional data lifecycle management process is easier said than done. In fact, according to a survey done by the EDM Council in 2023, only 10% of companies have implemented a data lifecycle plan in which data retention, archiving and purging are managed according to a defined retention schedule.

Why is the practical implementation of data lifecycle management so difficult?

A fundamental characteristic of data is that it can be copied, transformed, moved, distributed, iterated and shared — this makes tracking it a very difficult task.

Tracking data can be like “herding cats” (image generated by Bing Chat Enterprise)

Without proper classification and the capture of metadata (the contextualization of data) along the data lifecycle, data cannot be identified and tracked, and hence the data lifecycle retention rules cannot be applied to delete it.

The second major difficulty is to know what data is being used by what business use case and how the data lifecycle retention rules triggered by each business use case interact and overlap each other. For example, if one business use case is finished with the data, the data cannot be deleted if another business use case requires the continued use of the same data. However, both business use cases may need to comply with a default or overall data retention rule for legal, commercial or other reasons.
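To make this interaction concrete, here is a minimal Python sketch, with hypothetical use-case names and dates, of how the effective deletion date for a shared dataset could be resolved: the data may only be deleted once the last business use case has finished with it, but never later than an overall legal deadline.

```python
from datetime import date

# Hypothetical retention requirements for one shared dataset: each
# business use case declares how long it still needs the data.
use_case_retention_ends = {
    "predictive_maintenance": date(2025, 6, 30),
    "quarterly_reporting": date(2026, 3, 31),
}

# Overall rule (e.g. legal or contractual): delete by this date at the latest.
legal_deletion_deadline = date(2027, 1, 1)

def effective_deletion_date(use_case_ends, legal_deadline):
    """Data may only be deleted once every use case is finished with it,
    but never kept beyond the overall legal deadline."""
    last_business_need = max(use_case_ends.values())
    return min(last_business_need, legal_deadline)

print(effective_deletion_date(use_case_retention_ends, legal_deletion_deadline))
# -> 2026-03-31: reporting still needs the data after maintenance is done
```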

Cloud native data lifecycle management

Let’s explore what features native cloud data platforms can provide in terms of managing data lifecycles.

On most cloud data platforms, lifecycle management policies can be applied to the content on the datalake. Rules can be set up to filter files and apply a retention period to them, such as moving them to a different data storage “cost” layer and eventually deleting them.

Storage classes on Amazon AWS and Microsoft Azure data platforms
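As an illustration, here is a hedged boto3 sketch of such a rule on AWS S3, where the bucket name, prefix and periods are all hypothetical: files under a prefix move to a cheaper storage class after 30 days and are deleted after a year.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical lifecycle rule: raw sensor files move to the cheaper
# Standard-IA class after 30 days and are deleted after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-datalake",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-sensor-retention",
                "Filter": {"Prefix": "raw/sensors/"},  # hypothetical prefix
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```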

The dates for the data retention period usually come from the native file/object properties, such as the created, modified or last-accessed date. However, these dates may not be reliable indicators on which to base retention rules. For instance, cloud data storage may be upgraded or migrated to new datalake environments and files “copied” during this process, resetting the native created and modified dates to the date of the copy, so any retention rule based on these dates will now be incorrect. More reliable approaches are to record the date in native metadata or in data catalog metadata, or to encode the creation date in a standard file naming pattern.
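For example, here is a minimal sketch, assuming a hypothetical naming convention of <dataset>_<YYYYMMDD>_<sequence>.parquet, that recovers the business creation date from the file name itself so that it survives copies and migrations:

```python
import re
from datetime import datetime

# Assumed naming convention: <dataset>_<YYYYMMDD>_<sequence>.parquet
# e.g. "sensor-readings_20240115_001.parquet"
NAME_PATTERN = re.compile(r"_(\d{8})_")

def creation_date_from_name(filename):
    """Recover the business creation date from the file name itself,
    so it survives copies/migrations that reset native timestamps."""
    match = NAME_PATTERN.search(filename)
    if match is None:
        return None  # fall back to catalog metadata instead
    return datetime.strptime(match.group(1), "%Y%m%d").date()

print(creation_date_from_name("sensor-readings_20240115_001.parquet"))
# -> 2024-01-15
```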

A common filtering feature within cloud data platform retention policies is a folder/file/prefix search pattern. This implies that the datalake will need standard folder and file naming conventions to allow for logical filtering.

The best time to set up and configure data retention rules is when data is ingested or created on the datalake. On Amazon AWS S3, object tags are a good way to drive data lifecycle management, and these should be specified when the object is created. For Azure Blobs, the metadata headers should likewise be set carefully at object creation, because any future modification fully overwrites the existing metadata values.
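The sketch below illustrates both points with hypothetical bucket, container, blob and tag names: tagging an S3 object at creation, and reading and merging existing Azure blob metadata before writing so that earlier values are not lost.

```python
import boto3
from azure.storage.blob import BlobClient

# AWS S3: attach retention tags at object creation (hypothetical values).
s3 = boto3.client("s3")
s3.put_object(
    Bucket="example-datalake",
    Key="raw/sensors/sensor-readings_20240115_001.parquet",
    Body=b"...",
    Tagging="retention=365d&use-case=predictive-maintenance",
)

# Azure Blob Storage: set_blob_metadata replaces ALL existing metadata,
# so read and merge before writing to avoid losing earlier values.
blob = BlobClient.from_connection_string(
    "<connection-string>", "datalake", "raw/sensors/readings.parquet"
)
existing = blob.get_blob_properties().metadata or {}
existing.update({"retention_days": "365", "created_date": "2024-01-15"})
blob.set_blob_metadata(existing)
```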

In practice, setting up cloud native data lifecycle management needs careful thought, design, automation and maintenance, and communication with all technical and business groups responsible for their part of handling and managing the data.

Cost lifecycle management and personal data limitations — data minimization

So far, we have been discussing the application of data lifecycle management according to regulatory, legal and contractual data retention rules.

However, other factors that limit and minimize the footprint of data also need to be considered, including:

The cost of storing data. Although storage on cloud data platforms is relatively cheap, as data volumes increase the cost of storing them steadily mounts up. For example, the costs of storing sensor data for a preventive-maintenance digital solution in TotalEnergies built up steadily during just one year of operation; scaled up to multiple use cases across various engineering assets, the storage cost becomes significant. In this case, the situation was highlighted by FinOps processes that capture and monitor costs, and subsequently obsolete data was removed and future housekeeping procedures set up (see the cost sketch after this list).

The environmental impact of storing data. Currently, data centers are estimated to account for 1% of energy-related GHG emissions, and this is expected to rise as more aspects of our working and personal lives are digitalized. In the US, for example, power demand is predicted to grow by 4.7% over the next five years, compared with a previous estimate of 2.6%, largely due to the growth in data centers. The big tech companies are doing their utmost to transition to green sources of energy and to recycle and preserve as much water as possible to cool their data centers; TotalEnergies, for example, provides renewable energy to Amazon to power its operations. One of the ways that organizations themselves can minimize their carbon footprint on cloud data platforms is to consider the impact of storing unnecessary data (“dark data”). Implementing data retention policies can be part of this approach.

Storing data only for the length of time it is needed. In some data protection guidelines and laws, such as GDPR, personal data should only be kept long enough for it to be processed for its stated purpose.
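To illustrate the cost point above, here is a back-of-the-envelope sketch with an assumed ingest rate and per-GB prices (illustrative figures, not actual cloud list prices), comparing a year of accumulating sensor data with and without tiering:

```python
# Illustrative arithmetic only: assumed ingest rate and per-GB prices,
# not actual cloud list prices.
GB_PER_MONTH = 500     # assumed new sensor data per month
RATE_HOT = 0.023       # assumed $/GB-month, hot tier
RATE_COOL = 0.010      # assumed $/GB-month, cooler tier

total_gb = 0
cost_no_tiering = 0.0
cost_with_tiering = 0.0
for month in range(1, 13):
    total_gb += GB_PER_MONTH
    cost_no_tiering += total_gb * RATE_HOT
    # With tiering, only the most recent 3 months stay on the hot tier.
    hot_gb = min(total_gb, 3 * GB_PER_MONTH)
    cost_with_tiering += hot_gb * RATE_HOT + (total_gb - hot_gb) * RATE_COOL

print(f"12-month spend without tiering: ${cost_no_tiering:,.2f}")
print(f"12-month spend with tiering:    ${cost_with_tiering:,.2f}")
# -> roughly $897 vs $605 under these assumptions: each month is charged
#    on the full accumulated volume, not just the new data.
```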

Detailed knowledge of the data lifecycle is needed, and the data must be contextualized, to know what data can be deleted or moved to lower-cost storage tiers, and after what period of time, so that storage costs and environmental impact are minimized and data is kept only for as long as it is needed.

Data retention and minimization by design

Building digital solutions that use the data platform is a great place to start contextualizing data and implementing detailed data retention rules.

Since the data feeding the digital solution, and any data it generates, will have to be rigorously analysed, scoped and understood, the appropriate data retention rules can be formulated at the same time. Also, if the same data is being used by other applications or users, data retention rules can be compared and contentions resolved. This implies that information about what data is being consumed by whom needs to be captured, for example through data catalogs or data observability platforms.

This process can be called “data retention and minimization by design”, as illustrated in the diagram below.

A good example of the importance of contextualizing data in TotalEnergies is the data needed to train data science and machine learning models for digital solutions being developed in the TotalEnergies Digital Factory. After a period of time, a model may have to be re-trained, so the previous training dataset can be deleted if it is not required as part of the new training dataset. Clearly, a regular retraining schedule would help automate the application of the data retention rule; otherwise, a manual activity would be needed to delete the unwanted datasets.

A second example from TotalEnergies is a digital solution that manages chemicals being used within a facility; once the raw data on the datalake is processed, it is no longer needed so it can be deleted after 1 month, even though the official data retention period for this type of data is based on the lifetime of the facility. Since the raw data on the datalake is a copy, it can be safely removed, as the original data can still be accessed if necessary.

Once the data retention rules are clear, these rules can be coded inside the digital solution itself to delete the data when necessary.
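Here is a hedged sketch of what such in-solution retention code could look like for the chemicals example, assuming the raw copies live under a hypothetical raw/chemicals/ prefix (in practice, the business creation date captured in metadata or the file name would be a more reliable trigger than the native LastModified date):

```python
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

# Delete raw copies older than 30 days; the originals remain in the
# source system, so the datalake copy can be safely removed.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="example-datalake", Prefix="raw/chemicals/"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] < cutoff:
            s3.delete_object(Bucket="example-datalake", Key=obj["Key"])
```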

Any data that the digital solution needs to keep, for example, generated data or results, can then be treated by the native cloud lifecycle management policies.

Normally, the data retention rules that enable data minimization will operate on a shorter time frame than the rules required for regulatory, legal or contractual reasons.

Focus on Generative AI

Data lifecycle management and minimization are needed more than ever given the recent hype around Generative AI and the huge amounts of data it uses and generates, especially as LLMs are integrated into data platforms, also known as data intelligence platforms.

This will be a challenge for companies, as they will need to strike a balance between minimizing data for the reasons explored in this article and keeping enough data to feed and fine-tune Generative AI models.

A note on databases

The same policy for applying data retention to files should also apply to data within databases in the cloud. Clearly, native cloud lifecycle management policies only apply to files, not to databases, so database archiving and deletion has to be done as part of database management, whilst the deletion of data itself can follow the data retention and minimization by design methodology described above. The NoSQL database vendor MongoDB has an interesting feature called “tag-aware sharding” (now zone sharding), which allows data to be moved between storage tiers of different cost based on tags alone: non-active data can be moved to a lower-cost tier while all the data stays in the database, so application code continues to work even as data moves between tiers.
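As a sketch of how tag-aware (zone) sharding is configured, here are the admin commands issued via pymongo against a mongos router; the shard, zone, database and collection names are all hypothetical, and the collection is assumed to be sharded on a year field.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.com:27017")  # hypothetical router

# Assign shards (assumed to sit on different storage tiers) to zones.
client.admin.command({"addShardToZone": "shard-ssd", "zone": "hot"})
client.admin.command({"addShardToZone": "shard-hdd", "zone": "archive"})

# Route documents by shard-key range: recent data stays on the hot tier,
# older data migrates to the cheaper tier, transparently to applications.
client.admin.command({
    "updateZoneKeyRange": "plantdb.sensor_readings",
    "min": {"year": 2024}, "max": {"year": 9999}, "zone": "hot",
})
client.admin.command({
    "updateZoneKeyRange": "plantdb.sensor_readings",
    "min": {"year": 0}, "max": {"year": 2024}, "zone": "archive",
})
```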

Wrapping it up

The concept behind data lifecycle management is clear, and the reasons for doing it are easily justified from a regulatory, financial and legal point of view, but implementing it in practice is complex and something companies struggle with.

To help companies manage the data footprint on their cloud data platforms more successfully, a two-step approach is recommended:

  • Applying data retention and minimization by design, using detailed knowledge of the data lifecycle and fine-grained contextualization for specific use cases and digital solutions, to delete data or move it to lower-cost storage layers depending on business usage.
  • Using cloud native lifecycle management policies to apply regulatory, legal and contractual data retention rules, sweeping up and deleting data that the first activity did not act upon.
