Introduction to Apache Hudi

Andrew Savchyn
Blue Orange Digital
8 min read · Jan 9, 2024

Apache Hudi is an open-source data management framework that emerged to address specific challenges in handling large-scale data lakes. Developed at Uber in 2016, Hudi focuses primarily on making updates and deletes efficient in columnar file formats such as Parquet and ORC, which are commonly used in big data ecosystems.

The framework introduces capabilities for fast, incremental data processing in environments where traditional data lakes are challenged by the need for frequent updates and deletions. These operations in conventional systems often lead to significant overhead due to the necessity of rewriting large datasets. Hudi’s approach to handling upserts (updates and inserts) and deletions enables more efficient data pipeline management.

[Image: a data stream flowing into boxes on data warehouse shelves]
Leveraging the efficient UPSERT capabilities of Apache Hudi transforms the experience of dealing with streaming data in a data warehouse

In the landscape of data management solutions, Apache Hudi is one among several, including Apache Iceberg and Delta Lake. Each of these frameworks has its own set of features and use cases. Hudi, for instance, is known for its record-level insert, update, and delete capabilities, as well as its timeline-based data management which facilitates point-in-time data snapshots.

Apache Hudi’s design also emphasizes compatibility and integration with existing big data tools and platforms. This attribute is particularly relevant in complex data architectures where interoperability can streamline data processing workflows and support various analytical requirements.

In this article, we explore Apache Hudi’s inner workings to uncover the technical foundations of its data management capabilities. Our focus will be on dissecting its architecture, core components, and operational mechanisms, providing insights into how it efficiently handles data processing challenges. The discussion will encompass both the basic and advanced features of Hudi, highlighting its application in real-world data scenarios.

Core Concepts and Features

At the heart of Apache Hudi’s architecture are two distinct modes of operation: Copy-On-Write (COW) and Merge-On-Read (MOR). These modes define how Hudi handles data writes and updates, each with its own approach to balancing performance and resource utilization.

Copy-On-Write mode mirrors traditional practices for managing data in columnar formats: each update or insertion produces a new version of the affected file. This approach ensures data immutability and consistency, but it also increases compute and storage costs for frequently changing data, since even a small update can force an entire file to be rewritten. Despite these drawbacks, COW can be highly suitable for certain data architectures. Environments that prioritize immutable data, or that operate on inherently static data, stand to benefit from this approach: the simplicity of operations and assured data integrity make COW a fitting and effective choice.

Merge-On-Read mode in Apache Hudi presents an innovative approach for managing data with frequent updates. In this mode, Hudi stores incoming updates in separate delta files, and read queries are answered by merging the immutable base files with those delta files. This enhances write performance, as it avoids rewriting complete files for every small update. It also makes data modification appear seamless: as far as the reader is concerned, each update becomes visible in near real time (with a few minutes of delay in the worst case). MOR is therefore especially beneficial for workloads with frequent updates, such as streaming ingestion. The trade-off lies in the complexity of read operations, since merging the delta and base files at read time requires additional computational resources.
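
To make the choice concrete, here is a minimal PySpark sketch of selecting the table type when writing through Hudi's Spark datasource. The table name, columns, and path are illustrative placeholders, and a SparkSession named spark launched with the Hudi Spark bundle is assumed.

    # Illustrative write: the same options work for both table types.
    df = spark.createDataFrame(
        [("t-1", "berlin", 9.90, 1704790000)],
        ["trip_id", "city", "fare", "ts"],
    )

    hudi_options = {
        "hoodie.table.name": "trips",
        "hoodie.datasource.write.recordkey.field": "trip_id",
        "hoodie.datasource.write.partitionpath.field": "city",
        "hoodie.datasource.write.precombine.field": "ts",  # duplicate resolution, discussed later
        # COPY_ON_WRITE rewrites base files on every update;
        # MERGE_ON_READ logs updates to delta files and merges them on read.
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    }

    (df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://my-bucket/lake/trips"))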

Apache Hudi’s support for multiple query types enables further optimization of data processing pipelines to suit specific use cases. A prime example is the implementation of Read-Optimized Queries, which exclusively operate on data files and bypass delta files. This configuration achieves a balance, resulting in a data store that is quick for both writes and reads. However, this efficiency comes at the cost of data freshness, as the most recent updates in the delta files are not immediately reflected in the read results. For scenarios where accessing the freshest data is crucial, and a slight compromise on performance is acceptable, Snapshot Queries are the more suitable option.
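
As a hedged sketch of how this looks in practice with the Spark datasource (the path is a placeholder, and a SparkSession with the Hudi bundle is assumed), the query type is selected per read:

    # Read-optimized: scans only base files; updates still sitting in
    # delta files are not visible yet, but reads stay cheap.
    ro_df = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "read_optimized")
        .load("s3://my-bucket/lake/trips"))

    # Snapshot: merges base and delta files at read time, returning the
    # freshest data at a higher read cost.
    snap_df = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "snapshot")
        .load("s3://my-bucket/lake/trips"))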

Apache Hudi offers two other notable types of queries: Incremental Queries and Time Travel Queries, each serving distinct data processing needs. Incremental Queries enable users to query only the data that has changed since a specific point in time. This approach is particularly valuable for efficiently building data pipelines that focus on processing only the delta, or changes, thereby saving computational resources and time. It’s ideal for scenarios like building out Change Data Capture (CDC) or real-time data synchronization. On the other hand, Time Travel Queries take advantage of Hudi’s ability to maintain historical data versions. This feature allows users to access and query data as it existed at any given point in time, providing a powerful tool for auditing, compliance, and historical analysis. Time Travel Queries enhance the data’s traceability and reliability, essential in scenarios where understanding data evolution is critical.
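
Both query types are again expressed as read options. A minimal sketch, with the commit instant and timestamp values as placeholders:

    # Incremental query: only records committed after the given instant.
    inc_df = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20240101000000000")
        .load("s3://my-bucket/lake/trips"))

    # Time travel query: the table exactly as it looked at a past instant.
    old_df = (spark.read.format("hudi")
        .option("as.of.instant", "2024-01-01 00:00:00")
        .load("s3://my-bucket/lake/trips"))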

Architecture

But how does Hudi achieve its impressive feature set? To answer this, we must delve into Hudi’s internal architecture.

Timeline

At the core of Apache Hudi’s architecture lies the concept of the Timeline. The Timeline can be likened to the binlog in MySQL or the Write-Ahead Log (WAL) in PostgreSQL, as it records the atomic operations performed on the underlying data. Each entry in the Timeline is structured into three key components:

  • Timestamp: This marks the start time of a specific event, providing a chronological reference for each operation.
  • Action Type: This denotes the nature of the event. For example, ‘COMMIT’ represents an atomic write of a batch of records (inserts and updates), while ‘COMPACTION’ refers to a background operation that merges updates from delta files into the base data files.
  • State: This indicates the current stage of the action, with possible values being REQUESTED, INFLIGHT, and COMPLETED, which track the progression of each operation through the lifecycle of data changes.

The Timeline in Apache Hudi plays a critical role in preserving data consistency by maintaining a chronological record of changes. It enables snapshot isolation, ensuring that queries view a consistent dataset, even amidst concurrent updates. In scenarios requiring data rollback or when errors occur, the Timeline proves invaluable in reverting to prior states, with each state recorded and identified by a unique commit ID.
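
The commit instants recorded on the Timeline are also stamped into every record through Hudi's metadata columns, which gives a quick way to see which commits are represented in the data being read. A small sketch (the path is a placeholder, and a SparkSession with the Hudi bundle is assumed):

    # _hoodie_commit_time is one of the metadata columns Hudi adds to every record;
    # listing its distinct values shows the commits present in the current snapshot.
    df = spark.read.format("hudi").load("s3://my-bucket/lake/trips")
    (df.select("_hoodie_commit_time")
        .distinct()
        .orderBy("_hoodie_commit_time")
        .show(truncate=False))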

The performance of Hudi, especially in cases of Incremental Queries and Time Travel Queries, is closely tied to the length of its Timeline. Hudi typically scans through the entire Timeline to gather the necessary metadata for processing these types of requests. Consequently, response latency increases as the Timeline grows. To counter this, Hudi segments the Timeline into two parts: active and archived.

The Active Timeline contains recent entries that are directly relevant to current operations. This segmentation ensures that the system takes into account only the most relevant and recent changes, thereby optimizing performance. The active timeline is kept concise and limited to a manageable size (configurable), focusing on the current state and recent history of the data.

Conversely, the Archived Timeline consists of older entries that are no longer relevant for data operations but may still be valuable for historical analysis, audit trails, and debugging.
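
The point at which instants move from the active to the archived timeline is controlled through ordinary table configuration. A sketch with arbitrary example values (not recommendations), passed alongside other write options:

    archival_options = {
        # Once more than hoodie.keep.max.commits instants accumulate on the active
        # timeline, the oldest ones are archived down to hoodie.keep.min.commits.
        "hoodie.keep.min.commits": "20",
        "hoodie.keep.max.commits": "30",
        # Cleaner retention is kept below the archival thresholds so that live
        # file versions always have their commits on the active timeline.
        "hoodie.cleaner.commits.retained": "10",
    }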

File Storage Organization

Data files in Apache Hudi are organized on the filesystem based on a configurable partitioning scheme. They can either be partitioned according to specific criteria or laid out flatly at a single level. Regardless of the organization, data from each table in Hudi maps to the filesystem through the following key concepts:

  • File Group: A collection of all base files and delta files that hold the versions of a particular group of records; a Hudi table consists of many File Groups. Each File Group is identified by a unique value called the File ID.
  • File Slice: A subset of a File Group that corresponds to a single version of that group of records. It comprises a base file (the columnar data file) and, optionally, delta files (sometimes called log files) written against it. To accurately identify a File Slice, both the File ID and the Commit ID are required.

In addition to these files, Hudi manages an administrative .hoodie directory, which plays a crucial role in storing various types of metadata, including:

  • Timeline: Each item in the timeline is stored as a separate file within this directory, ensuring atomic updates to the timeline.
  • hoodie.properties: This file contains global settings for the Hudi dataset, such as the file formats used, the type of table employed, and other essential configuration details.
  • Indexes: Hudi maintains indexes for efficient querying, which are crucial for optimizing data retrieval and manipulation.
  • Auxiliary Files: Additional files for tasks such as compaction planning, cleaner metadata, and more, are also stored here, ensuring comprehensive management and tracking of various operations.
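
To visualize how these concepts land on the filesystem, here is a hypothetical layout for a small partitioned table. The bucket, partition values, and identifiers are placeholders, and the exact file-name patterns vary between Hudi versions; the point is that each file name encodes the File ID and the commit time, which together identify a File Slice.

    s3://my-bucket/lake/trips/
        .hoodie/                                    <- timeline files, hoodie.properties, indexes, auxiliary metadata
        city=berlin/                                <- one directory per partition value
            <fileId-1>_<writeToken>_<commit-1>.parquet    <- base file: File Group fileId-1, slice at commit-1
            <fileId-1>_<writeToken>_<commit-2>.parquet    <- a newer slice of the same File Group
            .<fileId-1>_<commit-2>.log.1                  <- delta (log) file attached to that slice (MOR)
        city=paris/
            <fileId-2>_<writeToken>_<commit-1>.parquet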

Indexing

This file storage structure alone is not enough to enable efficient updates. The system also needs a way to quickly find which files are affected by any particular update. To do that, Hudi uses indexes.

Each record in Hudi has a primary key, which consists of a record key (either a field in the data itself or a value auto-generated by Hudi) and a partition path. Indexes in Hudi map this primary key to a File ID. As mentioned before, the File ID uniquely identifies a File Group, so this mapping effectively points to the File Group that holds the record.

Hudi comes with several index types built in and also allows you to bring your own implementation, which is useful at scale or in unusual environments.
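
The index implementation is chosen through configuration. A brief, hedged sketch of the option involved (only a few of the available types are shown):

    # Selecting an index implementation via a write option.
    index_options = {
        # BLOOM stores bloom filters in base-file footers to prune candidate files;
        # GLOBAL_BLOOM enforces key uniqueness across all partitions;
        # BUCKET hashes record keys directly to file groups, avoiding lookups.
        "hoodie.index.type": "BLOOM",
    }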

Bringing the Pieces Together

To understand how Apache Hudi’s architectural choices translate into a performant and flexible system, let’s examine the handling of an UPSERT operation.

The process begins with the precombine step. This step is crucial in environments where data delivery follows an at-least-once pattern, typical of high-throughput streaming frameworks. Precombine deduplicates the incoming batch using a configurable precombine field: when several incoming records share the same key, only the record with the highest precombine value (typically an event timestamp) is kept, ensuring that a single instance of each record is considered.
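
For example, with the precombine field pointed at a timestamp column, a re-delivered record collapses to its latest version before the write. The column names and values below are made up:

    # Two deliveries of the same logical record (same trip_id), e.g. from an
    # at-least-once stream. With hoodie.datasource.write.precombine.field = "ts",
    # only the row with the larger ts survives precombine and is upserted.
    incoming = spark.createDataFrame(
        [("t-1", "berlin",  9.90, 1704790000),
         ("t-1", "berlin", 12.40, 1704793600)],  # newer ts wins
        ["trip_id", "city", "fare", "ts"],
    )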

Next, Hudi uses its index to determine which files are affected by the UPSERT. The index guides Hudi to the specific File Groups that need modification based on the primary keys of the records being upserted. How those files are modified is dictated by the Record Payload implementation: this logic, which can be customized, defines how incoming records are merged with existing ones. In Copy-On-Write (COW) tables, the affected base files are rewritten as part of the write itself, while in Merge-On-Read (MOR) tables, updates are appended to delta files and merged into base files later by background compaction.

Finally, the process concludes with an update to the Index. This step incorporates the new records into the Index, maintaining the overall integrity and accuracy of the dataset.
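
Putting the steps together, a minimal upsert against the hypothetical table from the earlier sketches might look like this (it assumes the incoming DataFrame above and a SparkSession with the Hudi bundle):

    # Upsert the deduplicated batch: Hudi runs precombine, consults the index to
    # locate the affected File Groups, writes new file slices (COW) or delta files
    # (MOR), and finally updates the index and the timeline.
    (incoming.write.format("hudi")
        .option("hoodie.table.name", "trips")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "trip_id")
        .option("hoodie.datasource.write.partitionpath.field", "city")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .mode("append")
        .save("s3://my-bucket/lake/trips"))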

[Diagram: flow of operations — precombine, file ID lookup (reads the index), data update (writes the file slice), index update (updates the index)]
UPSERT Query Processing in Apache Hudi

Conclusions

Throughout this article, we have explored the intricacies of Apache Hudi’s architecture and its innovative approach to managing large-scale data lakes. Hudi’s unique features, such as Merge-On-Read table mode, efficient indexing, and flexible query types, collectively contribute to its capability to handle high-throughput data environments efficiently.

Apache Hudi stands out for its ability to balance performance, scalability, and data consistency, making it a compelling choice for organizations looking to optimize and standardize their data pipelines. Its most powerful characteristic is its ability to handle streaming data in an analytical environment while ensuring data integrity and enabling real-time analytics. In doing so, Hudi provides a comprehensive solution that addresses the evolving challenges of modern data architectures.
