Apache Ozone Snapshots: Addressing different use cases

Jun 26, 2024

Overview

This blog post highlights the use cases that a snapshot feature addresses for an object store. It describes each use case and explains how snapshots make it easier to address it in an application-consistent manner. It also highlights the performance and space efficiency that snapshots bring to these use cases. This post is part of a series on Apache Ozone Snapshots; subsequent posts will cover more details and other aspects of the feature.

Apache Ozone Snapshots

Apache Ozone is a highly scalable, highly available, distributed, and secure object store that can handle billions of keys. It is a fully n-way replicated and strongly consistent system that offers both object-storage and file-system semantics. Apache Ozone has no single point of failure for either the metadata or the data. It is compatible with the Amazon S3 APIs as well as the Hadoop Compatible FileSystem (HCFS) interface. It integrates with YARN, Hive, Impala, Spark, and more, out of the box, and is a preferred choice for on-prem storage at large enterprises for analytics and machine learning workloads. It is also seeing accelerated adoption for a variety of use cases including backup, storage, archival, and incremental scale-out storage.

The Apache Ozone Snapshot feature was released with Cloudera Data Platform (CDP) 7.1.9. It allows users and applications to take snapshots at bucket granularity. A snapshot captures a point-in-time image of the active object store bucket at the time of snapshot creation. Snapshot creation is an instantaneous operation.
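As a rough sketch of what taking a bucket snapshot might look like from a script, the Python below builds and optionally runs the Ozone shell command `ozone sh snapshot create`. The volume, bucket, and snapshot names are placeholders, and the helper function names are made up for this illustration. Please check the exact command syntax against the Ozone documentation for your release.

```python
import subprocess


def snapshot_create_cmd(volume, bucket, name):
    """Build the argv for: ozone sh snapshot create /<volume>/<bucket> <name>."""
    # Hypothetical helper; the names passed in are placeholders.
    return ["ozone", "sh", "snapshot", "create", f"/{volume}/{bucket}", name]


def create_snapshot(volume, bucket, name):
    """Run the command. Assumes the `ozone` client is on PATH and a cluster is reachable."""
    subprocess.run(snapshot_create_cmd(volume, bucket, name), check=True)


if __name__ == "__main__":
    # Only print the command here, so the sketch runs without a cluster.
    print(" ".join(snapshot_create_cmd("vol1", "bucket1", "snap1")))
```

Separating command construction from execution keeps the sketch easy to try on a machine that has no Ozone client installed.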

Snapshots for Data Protection

Data protection is an integral part of any data lifecycle management policy for an organization. This section discusses the protection mechanism that is required for application data and how snapshots make it easier to achieve this.

Preserving the Application State

An application typically modifies its data and metadata through various transformations as the business logic demands. The application, or the system itself, can crash in the middle of these transformations. If the system is rebooted and the application restarted, the application may find its data and metadata in a state that it cannot recover from. This is because the state persisted on stable storage may not be application-consistent. Snapshots make it easier to instantaneously capture a point-in-time image of a consistent application state. An application can therefore periodically checkpoint its state before initiating the next set of business logic transformations. After a failure or crash, the application can be restarted from the last saved snapshot of the consistent application state. This prevents the application from ending up in an unwanted, inconsistent state with respect to its data and metadata.

Protection against Failed Transactions

Snapshots offer a mechanism for transaction-aware applications to checkpoint the application state before initiating any data transformation. If a transaction or the business logic fails, the application can roll back to the last good snapshot of the system.
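The checkpoint-then-transform pattern described in this section can be sketched with a toy in-memory bucket. `ToyBucket`, `run_with_checkpoint`, and the key names are hypothetical and exist only for this illustration; this is not the Ozone client API, and the deep copy stands in for what a real snapshot does without copying data.

```python
import copy


class ToyBucket:
    """In-memory stand-in for a bucket (hypothetical, not the Ozone API)."""

    def __init__(self):
        self.keys = {}
        self._snapshots = {}

    def create_snapshot(self, name):
        # A deep copy keeps the toy simple; real snapshots avoid copying data.
        self._snapshots[name] = copy.deepcopy(self.keys)

    def restore(self, name):
        self.keys = copy.deepcopy(self._snapshots[name])


def run_with_checkpoint(bucket, snapshot_name, transform):
    """Checkpoint, run the transform, and roll back if it raises."""
    bucket.create_snapshot(snapshot_name)
    try:
        transform(bucket)
        return True
    except Exception:
        bucket.restore(snapshot_name)
        return False


def failing_transform(bucket):
    bucket.keys["orders/1"] = "processed"  # partial update...
    raise RuntimeError("crash midway")     # ...then a simulated failure


bucket = ToyBucket()
bucket.keys = {"orders/1": "raw", "orders/2": "raw"}
ok = run_with_checkpoint(bucket, "before-batch-1", failing_transform)
print(ok, bucket.keys)
```

After the simulated failure, the bucket contents match the checkpoint, so the half-applied update to `orders/1` is not visible.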

Protection against Ransomware and Malware State

Apache Ozone Snapshots are inherently immutable and read-only. A good data lifecycle management (DLM) framework can leverage them to protect business assets against malware attacks. In case of such an attack, the DLM framework can decide to restart the whole system and feed it the last known good state through the Snapshot APIs.

The figure below illustrates how Apache Ozone Snapshots can be used to address the above-mentioned use cases.

Snapshots for Time Travel

More recently, time travel across an application data set has emerged as a use case. Snapshots make it easier to preserve consistent states of the system at different times. Furthermore, snapshots preserve all these states in a very space-efficient way: the snapshots of a system share their common data and record only the differences.
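To illustrate why sharing common data keeps snapshots space-efficient, here is a small toy model. It deduplicates by content hash purely to keep the example short; that is an assumption of this sketch and not a description of how Ozone tracks shared data internally. The `SharedStore` class and its method names are hypothetical.

```python
import hashlib


class SharedStore:
    """Toy store in which snapshots copy metadata only (hypothetical model)."""

    def __init__(self):
        self.blocks = {}     # digest -> bytes, each distinct block stored once
        self.live = {}       # key -> digest for the active bucket
        self.snapshots = {}  # snapshot name -> {key: digest}

    def put(self, key, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(digest, data)
        self.live[key] = digest

    def snapshot(self, name):
        # Only the key-to-block mapping is copied, never the data itself.
        self.snapshots[name] = dict(self.live)

    def physical_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = SharedStore()
store.put("a", b"x" * 1000)
store.put("b", b"y" * 1000)
store.snapshot("s1")
store.put("b", b"z" * 1000)  # only key "b" changes between snapshots
store.snapshot("s2")

full_copies = 2 * 2000       # what two independent full copies would cost
print(store.physical_bytes(), full_copies)
```

Both snapshots point at the same block for key `a`, so the store holds three blocks in total rather than four.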

An application suite can leverage these snapshots to carry out business logic as of a specific timestamp. Please note that a snapshot represents a point-in-time image of the object store bucket that is application-consistent. Snapshots can be taken at any arbitrary time, as long as the application state is consistent at that instant.
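One way an application might pick the snapshot to read "as of" a given timestamp is to choose the latest snapshot created at or before that time. The sketch below assumes the application keeps its own sorted list of (creation time, snapshot name) pairs; the list, the function name, and the snapshot names are all hypothetical.

```python
from bisect import bisect_right


def snapshot_as_of(snapshots, timestamp):
    """Return the name of the latest snapshot taken at or before `timestamp`.

    `snapshots` is a list of (creation_time, name) tuples sorted by time.
    Returns None if no snapshot existed yet at `timestamp`.
    """
    times = [created for created, _ in snapshots]
    index = bisect_right(times, timestamp)
    if index == 0:
        return None
    return snapshots[index - 1][1]


history = [(100, "snap-mon"), (200, "snap-tue"), (300, "snap-wed")]
print(snapshot_as_of(history, 250))
```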

An application suite can further carry out incremental analytics using the snapdiff mechanism offered by the Snapshot feature. Snapdiffs make it possible to precisely and efficiently identify the set of objects and files that have changed between two snapshots. This allows incremental analytics of the system state to run efficiently, even on a multi-petabyte data set.

The figure below illustrates how Apache Ozone Snapshots can be used to do time travel between data sets.

Snapshots for DR and Remote Replication

Remote data replication is one of the key ingredients of data lifecycle management. It provides business continuity even in the presence of a disaster. It has two key elements:

Stable Source

As we are replicating from a data source, the data source itself must remain stable. Think about the case when a large file is getting replicated, and in the middle of replication, it gets truncated. Similarly, think about the case when a large directory is getting replicated while some applications on the source side are deleting files from that directory.

Snapshots make it easier to capture the replication source at any given point in time and then use it to transfer the stable data to a remote destination.

Delta Replication

A replication source is always changing. To keep the remote replica in sync with the source, we need to keep replicating at regular intervals. An efficient replication mechanism typically involves identifying differences between the last replication cycle and the new one, and only transferring the delta changes between the two.

Snapshots offer a feature called snapdiff that can very efficiently identify the changes between two snapshots of the system. The time this computation takes is proportional to the amount of churn between the two snapshots, not to the size of the dataset. This is very helpful for large multi-petabyte datasets. If the change between two snapshots of such a dataset is limited to, say, only a few GBs, the snapdiff computation would complete in a few seconds.
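The delta replication cycle can be sketched as "diff two snapshots, then apply only the diff to the replica". The toy diff below compares two in-memory key maps and therefore scans every key; it is only a stand-in to show the shape of the result and does not reflect how Ozone computes snapdiff, which the post describes as proportional to churn. All names here (`toy_snapdiff`, `apply_delta`, the operation labels) are hypothetical.

```python
def toy_snapdiff(old, new):
    """Return (operation, key) pairs for keys that differ between two snapshots.

    `old` and `new` map key -> version. This naive version scans all keys.
    """
    diff = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            diff.append(("CREATE", key))
        elif key not in new:
            diff.append(("DELETE", key))
        elif old[key] != new[key]:
            diff.append(("MODIFY", key))
    return diff


def apply_delta(replica, source_snapshot, diff):
    """Bring the replica up to date by touching only the changed keys."""
    for operation, key in diff:
        if operation == "DELETE":
            replica.pop(key, None)
        else:
            replica[key] = source_snapshot[key]


snap1 = {"a": 1, "b": 1, "c": 1}
snap2 = {"a": 1, "b": 2, "d": 1}
replica = dict(snap1)  # the replica was last synced at snap1

delta = toy_snapdiff(snap1, snap2)
apply_delta(replica, snap2, delta)
print(delta, replica == snap2)
```

Because the data is read from the stable `snap2` image rather than the live bucket, concurrent writes at the source do not affect the transfer.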

The picture below illustrates this.

Fast snapdiff computation is extremely important in achieving an optimal RPO (recovery point objective) for businesses, because it allows the replication cycle to be executed frequently.

Moreover, efficient snapdiff is also important in achieving an optimal RTO (recovery time objective) for businesses. This is because the recovery mechanism only needs to identify the changes since the last known good state of the system. An efficient restore mechanism can go through just these delta changes to get the system back to the last known good state.

Snapshots for Archival and Compliance

Snapshots offer an efficient storage mechanism for various checkpointed states of the data set. All the common data between multiple snapshots is shared and only one physical copy of this shared data is stored. This allows organizations to efficiently archive copies of the data set as per the data lifecycle management policy. For long-term data retention, these copies can also be moved to a different location and cheaper data storage tiers.

Snapshots for Incremental Analytics

Organizations often deal with multi-petabyte datasets. Repeatedly running analytics over the whole of such a large dataset wastes compute resources, network bandwidth, and time. This also directly impacts the budget of an organization.

Snapshots and Snapdiffs make it possible to run delta analytics only on the part of the multi-petabyte dataset that changes. This offers an efficient and economical option for organizations to deal with their big data.
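As a minimal illustration of delta analytics, the sketch below keeps a running total of object sizes and updates it by visiting only the keys reported as changed between two snapshots. The size maps, the list of changed keys, and the function names are made up for this example; in practice the changed keys would come from a snapdiff.

```python
def full_total(sizes):
    """Baseline: recompute the aggregate by scanning every object."""
    return sum(sizes.values())


def incremental_total(previous_total, old, new, changed_keys):
    """Update the aggregate by visiting only the changed keys."""
    total = previous_total
    for key in changed_keys:
        # A missing key counts as size 0 (deleted in `new` or absent in `old`).
        total += new.get(key, 0) - old.get(key, 0)
    return total


old = {"logs/1": 10, "logs/2": 20, "logs/3": 30}
new = {"logs/1": 10, "logs/2": 25, "logs/4": 5}
changed = ["logs/2", "logs/3", "logs/4"]  # assumed to come from a snapdiff

previous = full_total(old)
updated = incremental_total(previous, old, new, changed)
print(updated, full_total(new))
```

The incremental path does work proportional to the number of changed keys, while the full scan grows with the size of the dataset.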

Snapshots for Generative AI

Generative AI applications often need to deal with large datasets. When these datasets change, the training models have to account for the changes. Snapshots and Snapdiffs make it easier for the training models and engines to account for just the delta changes instead of iterating over the whole data set all over again.

What’s next?

The next blog in this series will cover the details of using the snapshot feature for the Apache Ozone object store. It will also act as a quick start guide for Apache Ozone Snapshots.