Introducing Apache Ozone Snapshots

Prashant Pogde
Jun 24, 2024


What is Apache Ozone?

Apache Ozone is a highly scalable, highly available, distributed, and secure object store that can handle billions of keys. It is a fully N-way replicated and strongly consistent system that offers both object-storage and file-system semantics. Apache Ozone has no single point of failure for either the metadata or the data. It is compatible with Amazon S3 APIs as well as Hadoop Compatible FileSystem (HCFS) interfaces. It integrates seamlessly with YARN, Hive, Impala, Spark, and other compute engines out of the box. It is the preferred choice for on-prem storage at large enterprises for analytics and machine learning workloads. It’s also gaining accelerated adoption for a variety of use cases including backup, storage, archival, and incremental scale-out storage.

Apache Ozone is architected to scale horizontally as well as vertically, which makes it well-suited for large and growing datasets. It can accommodate dense nodes with 500+ TB of storage attached to a single datanode. Denser nodes mean fewer servers for the same capacity, which allows for cost optimization on top of query optimization.

Having been built for modern, large enterprises that put an emphasis on data isolation and data security, Apache Ozone has security and isolation features baked into its core architecture. It offers data encryption at rest with unique encryption keys per object as well as strong access controls. This helps to ensure that sensitive data is protected from unauthorized access at all times.

What are Snapshots?

Snapshots allow storage administrators to take an app-consistent, point-in-time image of an underlying storage container, e.g., a file system, a volume, or a specific directory. A snapshot is a read-only image, frozen in time, that can drive several use cases, including data protection, backup/restore, archival, and compliance. Traditional file systems have offered a snapshot feature for some time. However, such a feature was non-existent for object stores until now.

Introducing Apache Ozone Snapshots

With the release of Cloudera Data Platform CDP 7.1.9, Apache Ozone now offers the snapshots feature. Apache Ozone is the very first object storage system to offer this capability at bucket-level granularity. Some of the salient features of this capability include:

  • Instantaneous capture of a point-in-time image of the entire bucket
  • Ability to read/restore from these point-in-time snapshot images
  • Images that are inherently read-only and isolated from the active object store
  • Ability to safeguard against malware or any other corruption of the active object store
  • Ability to delete snapshots out of order
  • Efficient background space reclamation from deleted snapshots
  • Ability to efficiently identify the differences between two snapshots of a given bucket (the SnapDiff capability)
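
To make the snapshot and SnapDiff ideas above concrete, here is a minimal Python sketch of a toy in-memory bucket. It is a conceptual model only and not how Ozone implements snapshots: the `Bucket` class, the `snap_diff` function, and the `+`/`-`/`M` markers are all hypothetical names chosen for illustration, and the toy copies the whole key map on every snapshot, which a real system would avoid.

```python
from types import MappingProxyType


class Bucket:
    """Toy in-memory bucket mapping key names to object contents (hypothetical)."""

    def __init__(self):
        self._keys = {}
        self._snapshots = {}

    def put(self, key, data):
        self._keys[key] = data

    def delete(self, key):
        self._keys.pop(key, None)

    def create_snapshot(self, name):
        # Simplification: copy the key map and expose the copy read-only.
        frozen = MappingProxyType(dict(self._keys))
        self._snapshots[name] = frozen
        return frozen

    def snapshot(self, name):
        return self._snapshots[name]

    def delete_snapshot(self, name):
        # Snapshots are independent here, so they can be deleted in any order.
        del self._snapshots[name]


def snap_diff(older, newer):
    """List (marker, key) pairs: '+' created, '-' deleted, 'M' modified."""
    diff = []
    for key in sorted(set(older) | set(newer)):
        if key not in older:
            diff.append(("+", key))
        elif key not in newer:
            diff.append(("-", key))
        elif older[key] != newer[key]:
            diff.append(("M", key))
    return diff


bucket = Bucket()
bucket.put("a.txt", b"v1")
bucket.put("b.txt", b"v1")
bucket.create_snapshot("snap1")

# Changes to the active bucket do not affect snap1.
bucket.put("a.txt", b"v2")
bucket.delete("b.txt")
bucket.put("c.txt", b"v1")
bucket.create_snapshot("snap2")

print(snap_diff(bucket.snapshot("snap1"), bucket.snapshot("snap2")))
# [('M', 'a.txt'), ('-', 'b.txt'), ('+', 'c.txt')]
```

Because each snapshot is exposed through a read-only view, an attempt to write to it (for example `bucket.snapshot("snap1")["a.txt"] = b"x"`) raises a `TypeError`, which mirrors the read-only, isolated nature of the images described in the list.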

Apache Ozone Snapshots and Replication Manager

With the release of Cloudera Data Platform CDP 7.1.9, we introduced a tight integration between Ozone Snapshots and the CDP Replication Manager. This allows efficient synchronization between multiple CDP Ozone clusters by leveraging the snapshot and SnapDiff capabilities.
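
The reason a snapshot diff makes synchronization efficient can be sketched in a few lines of Python. This is a conceptual illustration under simplifying assumptions, not the Replication Manager's actual implementation: snapshots are modeled as plain dictionaries, and `snap_diff`, `incremental_sync`, and the `CREATE`/`MODIFY`/`DELETE` labels are hypothetical names.

```python
def snap_diff(older, newer):
    """Map each changed key to CREATE, MODIFY, or DELETE (hypothetical labels)."""
    diff = {}
    for key in older.keys() | newer.keys():
        if key not in newer:
            diff[key] = "DELETE"
        elif key not in older:
            diff[key] = "CREATE"
        elif older[key] != newer[key]:
            diff[key] = "MODIFY"
    return diff


def incremental_sync(target, diff, newer):
    """Apply only the changed keys to the target; return how many objects were copied."""
    copied = 0
    for key, op in diff.items():
        if op == "DELETE":
            target.pop(key, None)
        else:
            target[key] = newer[key]
            copied += 1
    return copied


# Source cluster: two snapshots of the same bucket.
snap1 = {"logs/1": b"a", "logs/2": b"b", "tmp/x": b"c"}
snap2 = {"logs/1": b"a", "logs/2": b"B", "logs/3": b"d"}

# Target cluster: already in sync with snap1.
replica = dict(snap1)

copied = incremental_sync(replica, snap_diff(snap1, snap2), snap2)
print(copied, replica == snap2)  # 2 True
```

Only the two created or modified objects are copied, and the unchanged `logs/1` is never touched, so the cost of a sync in this model scales with the size of the change rather than the size of the bucket.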

More on Apache Ozone Snapshots

This is the first post in the Apache Ozone Snapshots blog series. Subsequent posts in this series will cover the feature in more depth.

Next in the Series:

Object Stores: The Case for Snapshots vs Object Versioning

Exploring Apache Ozone Snapshots

Apache Ozone Snapshots: Addressing different use cases

Apache Ozone: Using the Snapshot feature
