Introduction to REST Catalogs for Apache Iceberg
What is the Apache Iceberg Catalog?
While Iceberg primarily concentrates on its role as an open table format for lakehouse implementations, it still needs metadata to track its tables by name. The catalog serves this purpose: it holds a pointer to the current metadata file for each table and provides the atomicity needed to update that pointer safely. Different backends (e.g. Hive, Hadoop, AWS Glue) that can serve as the Iceberg catalog store the current metadata pointer in different ways. Iceberg catalogs are flexible and can be implemented using almost any backend system. They can be plugged into any Iceberg runtime, and they allow any processing engine that supports Iceberg to load the tracked Iceberg tables. Iceberg also ships with several catalog implementations that are ready to use out of the box.
These include:
- REST: a server-side catalog that’s exposed through a REST API, such as Apache Gravitino or Apache Polaris
- Hive Metastore: tracks namespaces and tables using a Hive metastore
- JDBC: tracks namespaces and tables in a relational database via JDBC
- Nessie: a transactional catalog that tracks namespaces and tables in a database with git-like version control
Catalogs are extremely useful for telling us where our Iceberg tables are and, subsequently, how we can access them safely. They are the backbone of data governance frameworks, and with regard to Iceberg, they are used for tracking tables and allowing external tools to interface with the metadata. For an Iceberg catalog to be production-ready, it must support atomic operations for updating the current metadata pointer. This helps ensure ACID compliance for table operations by making sure all readers and writers see the same state of the table at a given point in time. When two writers commit concurrently, the atomic swap of the metadata pointer ensures that only one commit succeeds while the other retries, so partial writes never become visible and data is not silently lost.
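To make the pointer swap concrete, here is a minimal, purely illustrative Python sketch, not Iceberg's actual implementation: a commit only succeeds if the catalog's current pointer still matches what the writer read, and a losing writer re-reads the new state and retries.

```python
# Illustrative pseudocode (not Iceberg's actual code) of the atomic
# compare-and-swap that a production-ready catalog must provide.


class CommitConflict(Exception):
    """Raised when another writer updated the metadata pointer first."""


class SimpleCatalog:
    def __init__(self):
        # table name -> location of its current metadata file
        self._pointers = {}

    def current_metadata(self, table):
        return self._pointers.get(table)

    def swap_pointer(self, table, expected, new_location):
        # In a real backend this check-and-set must be a single atomic step
        # (a database transaction, a conditional PUT, a lock, ...).
        if self._pointers.get(table) != expected:
            raise CommitConflict(f"{table} was updated by another writer")
        self._pointers[table] = new_location


def commit_with_retry(catalog, table, write_new_metadata, attempts=3):
    """What a losing writer does: re-read the table state, re-apply, retry."""
    for _ in range(attempts):
        base = catalog.current_metadata(table)
        new_location = write_new_metadata(base)
        try:
            catalog.swap_pointer(table, expected=base, new_location=new_location)
            return new_location
        except CommitConflict:
            continue  # someone else committed first; rebase and try again
    raise CommitConflict(f"gave up committing to {table}")
```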
How do catalogs work in Iceberg?
The catalog plays a central role in providing the source of truth for the metadata location of Iceberg tables and in operations like creating, dropping, or renaming tables. By grouping collections of tables into namespaces, the Iceberg catalog can keep track of each table's current metadata and return it when you load a specific table. It is important to note that the purpose of the Iceberg catalog is primarily technical, which means most of its functionality surrounds versioning, table management, and naming; this differs significantly from a product data catalog. To learn more about the difference between technical and product data catalogs, see “Technical vs Product Data Catalogs: Which one is best for you?”.
When you implement Iceberg in your setup, one of the first steps is to initialize and configure the catalog. The catalog enables the SQL commands that allow you to manage tables and load them by name. It is configured by passing certain properties to the processing engine at initialization. For instance, a Spark catalog is defined under the spark.sql.catalog.<catalog_name> property prefix, and the kind of Iceberg catalog to initialize (hive, rest, or hadoop, to name a few) is selected with spark.sql.catalog.<catalog_name>.type, depending on where you are loading tables from. It is important to note that not all engines are configured the same way, so it's best to always refer to the documentation for best practices.
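As a rough sketch, assuming the Iceberg Spark runtime jar is on the classpath and a REST catalog service is reachable at the placeholder URI, the PySpark configuration below registers an Iceberg catalog named rest_cat; the catalog name, endpoint, and table names are illustrative values, not anything prescribed by Iceberg.

```python
from pyspark.sql import SparkSession

# Sketch: register an Iceberg catalog called "rest_cat" backed by a REST
# service. The URI, namespace, and table names are placeholder values.
spark = (
    SparkSession.builder
    .appName("iceberg-rest-catalog-example")
    # Enables Iceberg's SQL extensions (for example CALL procedures).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.rest_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.rest_cat.type", "rest")
    .config("spark.sql.catalog.rest_cat.uri", "http://localhost:8181")
    .getOrCreate()
)

# Tables are then managed and loaded by name through the catalog.
spark.sql("CREATE NAMESPACE IF NOT EXISTS rest_cat.analytics")
spark.sql(
    "CREATE TABLE IF NOT EXISTS rest_cat.analytics.events "
    "(id BIGINT, ts TIMESTAMP) USING iceberg"
)
```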
Many pluggable services exist for the Iceberg catalog, such as Apache Gravitino or Apache Polaris, which utilize the REST catalog. Using REST as a de facto standard, these services are part of an effort to decouple catalogs from their underlying technologies. This is important because many of the Iceberg catalog clients contain logic that is specific to the backing store, so moving that logic to the catalog server allows for more flexibility and control. Iceberg 0.14.0 introduced the REST catalog OpenAPI specification, allowing server-side logic to be written in any language and use any custom technology as long as the API follows the specification.
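For example, with PyIceberg the client only needs the catalog's endpoint; everything behind it is the server's concern. This is a minimal sketch assuming a REST catalog is reachable at the placeholder URI and that a table named analytics.events already exists; authentication properties, which vary by service, are omitted.

```python
from pyiceberg.catalog import load_catalog

# Minimal sketch: connect to a REST catalog. The URI and the namespace/table
# names are placeholders.
catalog = load_catalog(
    "rest_cat",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",
    },
)

# The client only speaks the REST specification; the server decides how and
# where the table metadata is actually stored.
for namespace in catalog.list_namespaces():
    print(namespace)

table = catalog.load_table("analytics.events")
print(table.metadata_location)
```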
Evolution of Catalogs: From Apache Hive Metastore to REST
Generally, you will find Iceberg catalogs in two flavors: service-based catalogs and file-system catalogs. Most service-based catalogs work by running a service, either self-managed or cloud-managed, that uses a backing store to maintain all the Iceberg table references, along with any locking mechanisms needed to ensure ACID compliance and prevent conflicts. This type of catalog is becoming a more common choice than file-system catalogs, which use a file to track tables instead of a dedicated backing store. File-system catalogs, such as the Apache Hadoop catalog, are compatible with any storage system, but as a result they are more prone to inconsistencies because atomicity guarantees differ between the many storage solutions that exist, and they are inherently split-state.
The Hive Metastore catalog was previously a widely used implementation for managing an Iceberg catalog due to the prevalence of the Hadoop ecosystem. It works by mapping a table to its current metadata file using the metadata_location property in the table's entry within the Hive Metastore. This property specifies the absolute path in the filesystem where the table's current metadata file is stored. This means that the metastore needs to stay in sync with the file storage to avoid failures where the metastore is written but not the data, or vice versa. The Hive Metastore catalog has been used to manage Iceberg tables particularly during migrations from Hive to Iceberg, while both systems are maintained side by side. However, it often encountered locking issues and conflicts that required resolution. For example, when concurrent writes occur, the writer that first successfully acquires the lock swaps in its snapshot, while the second writer retries applying its changes. Locks can also occasionally be orphaned when a process shuts down without cleaning them up.
As a result, the Iceberg community began exploring alternative lock implementations, though there remained a desire for a solution less heavyweight than the Hive Metastore. JDBC-based catalogs work by storing the catalog metadata in a dedicated table in the relational database they are connected to and then using that table to track changes and manage the Iceberg tables. Depending on the implementation, however, a JDBC-based catalog's consistency guarantees rest on the transactional capabilities of the underlying database. To address these challenges, the idea of using REST was introduced, where engines send HTTP requests to a REST endpoint and conflicts are handled server-side. This approach allows users to adopt Iceberg without needing an in-depth understanding of its intricacies.
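As an illustration of the database-backed approach mentioned above, PyIceberg's SqlCatalog, the Python counterpart to the JDBC catalog, keeps its table references in whatever SQLAlchemy-supported database you point it at. The SQLite URI, warehouse path, and names below are placeholders in this sketch.

```python
import os
from pyiceberg.catalog.sql import SqlCatalog

# Sketch of a database-backed catalog: table references live in a SQLite file
# here, but any SQLAlchemy-supported database could be used instead.
warehouse_path = "/tmp/warehouse"
os.makedirs(warehouse_path, exist_ok=True)

catalog = SqlCatalog(
    "db_cat",
    **{
        "uri": f"sqlite:///{warehouse_path}/catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

catalog.create_namespace("analytics")
print(catalog.list_namespaces())
```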
An introduction to REST
A REST API conforms to the principles of the Representational State Transfer (REST) architectural style, making it compatible with RESTful web services. REST is not a protocol or a standard but rather a set of architectural constraints that developers can implement in various ways. When a client makes a request via a RESTful API, it receives a representation of the resource's state, delivered over HTTP in formats including but not limited to JSON, XML, HTML, or plain text.
Because it would be hard to accommodate everyone's data infrastructure, and because clients written in languages like Java and Python now need to co-exist, it is important that catalog implementations behave consistently. It is ideal to have a central place to manage all the metadata so that all of these clients can interact with Iceberg seamlessly. This is especially true for long-running jobs, which need the catalog to remain backwards and forwards compatible across Iceberg versions. REST catalogs were meant to solve this core issue by shifting much of the logic from the client side to the server side.
Apache Iceberg’s REST implementation
The API definitions exist in the official Iceberg documentation as a specification, but Iceberg itself does not ship a production server implementation. This is what we mean by REST compliance: to fully make use of it, we would have to build the service ourselves or use a catalog that provides the REST service for us, like Apache Gravitino. At the very least, such a service requires a server to handle the requests and a backend to which the server delegates in order to fulfill them. We can also extend the functionality of the API at will depending on our needs, which, compared to the Hive Metastore, is a more pluggable approach. The service implementing the REST catalog interface can choose to store the mapping of a table to its current metadata file in any way it chooses; it could even store it in another catalog if it wanted to.
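To get a feel for what the specification looks like on the wire, here is a rough sketch using Python's requests library. The base URI, namespace, and table name are placeholders, authentication is omitted, and some servers additionally require a path prefix advertised by the config endpoint.

```python
import requests

# Sketch: talk to an Iceberg REST catalog directly over HTTP. The base URI
# and the namespace/table names are placeholders; real deployments usually
# also require authentication headers and may require a path prefix that is
# returned by the /config endpoint.
BASE = "http://localhost:8181/v1"

# The config endpoint returns server defaults and overrides for the client.
print(requests.get(f"{BASE}/config").json())

# List the namespaces known to the catalog.
print(requests.get(f"{BASE}/namespaces").json())

# Load a table: the response includes the current metadata location, which is
# exactly the pointer the catalog is responsible for tracking.
resp = requests.get(f"{BASE}/namespaces/analytics/tables/events").json()
print(resp.get("metadata-location"))
```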
A REST catalog offers several advantages that make it an appealing choice for many organizations. First, it requires fewer packages and dependencies compared to other catalogs, which simplifies deployment and management. This simplicity is largely due to its reliance on standard HTTP communication. Additionally, the REST catalog provides flexibility because it can be implemented by any service capable of handling RESTful requests and responses, and the service’s data store can vary widely. Another benefit is its support for multi-table transactions, which allows for complex operations across multiple tables. Moreover, a REST catalog is cloud-agnostic, making it suitable for organizations that are currently using a multi-cloud strategy, or plan to do so in the future, or want to avoid cloud vendor lock-in.
However, there are also some disadvantages to consider. Implementing a REST catalog requires running a process to handle and respond to REST calls from engines and tools. In production environments, this often necessitates an additional data storage service to store the catalog’s state. Furthermore, there is no public implementation of the backend service to support REST catalog endpoints, meaning developers will need to create their own or opt for a hosted service. Another limitation is that not all engines and tools support the REST catalog, though some, like Spark, Trino, PyIceberg, and Snowflake, do at the time of writing.
In terms of use cases, a REST catalog is ideal if you need a flexible, customizable solution that can integrate with a variety of backend data stores, require support for multi-table transactions, or aim to maintain cloud agnosticism. When choosing a catalog, key considerations include:
- whether it is recommended for production
- whether it requires an external system, and if so, whether that system is self-hosted or managed
- whether it has broad compatibility with engines and tools
- whether it supports multi-table and multi-statement transactions
- whether it is cloud-agnostic
Several examples of catalogs that follow the Iceberg REST specification and are available for use out of the box are Apache Gravitino, Apache Polaris, Project Nessie, and Unity Catalog.
The tool with the largest open source ecosystem support and connectors is Apache Gravitino. Gravitino implements a metalake approach across data and AI assets (although in the next release it will modularize its Iceberg REST service) and can also aggregate from other catalogs. Along with Iceberg, Gravitino also has native connectors to streaming sources, filesets, and relational stores, and supports querying with Flink, Trino, Spark, or StarRocks. Learn more about Apache Gravitino here.
Apache, Apache Iceberg, Apache Hive, Apache Hadoop, Apache Polaris and Apache Gravitino are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.