Global vs Non-global index in Apache Hudi

Sivabalan Narayanan
4 min readApr 3, 2023

--

Hudi employs indexing on the write path to detect updates vs inserts and to route updates to the same file group deterministically. Hudi supports different indexing options out of the box like bloom, simple, hbase, bucket, global_bloom, global_simple etc. There have been some confusion or mis-conceptions about regular indexes and global indexes that hudi offers and I would like to go over the them in this blog.

Regular/Non-Global Index:

As we all know, hudi has a notion of primary key for every table which uniquely identifies a record. A pair of partition path and record key uniquely identifies a record in a hudi table. Let’s walk through a simple example to illustrate.

P -> denotes Partition

RK -> record key

Commit 1 is straightforward. We ingest 2 records and both are served during snapshot after ingestion completes.

In commit 2, we ingest 2 records. 1 of them is an update, and another one is a new insert. Even though the record key (RK1) matches w/ an already existing record, existing record belongs to a different partition. Since a pair of {record key , partition path} is guaranteed to be unique, {P3, RK1} is considered a new insert record and hence in the snapshot read of hudi table (after commit2 succeeds), we get 3 records.

Global Index:

Even with partitioned dataset, users can prefer to keep their record keys unique across entire table (across partitions). So, for such cases, they have to choose one of the GLOBAL indexes hudi has to offer, namely GLOBAL_BLOOM or GLOBAL_SIMPLE. In these tables, just the record key is guaranteed to be unique across the entire table. As you can imagine, even if a record migrates from one partition to another, hudi will ensure to update the new partition and delete the record in the older partition.

Let’s go over the same example as above.

Commit 1 is straightforward. We ingest 2 records and both are served during snapshot after ingestion completes.

In commit 2, we ingest 2 records. 1 of them is a clear update. But incase of {P2, RK1}, hudi will deduce that its a GLOBAL_INDEX and already RK1 existing in partition P1. And so, hudi will delete P1, RK1 and insert the new incoming record to P2. And so, in snapshot read of hudi table (after commit2 succeeds), we get only 2 records and RK1 is migrated to P2(from P1).

Cons:

Global index does come at an additional cost though. Let’s take a decent sized table, where we have 1000 partitions and 1000 data files(file groups) in each partition. And lets say your new batch of ingestion has records from P990 and P999(2 partitions). Incase of regular/non-global index, during index look up, only file groups from matching partitions are involved. To be precise, only file groups from P990 and P999 are involved during index lookup, which is 2000 file groups in this case. Where as, incase of global index, since a record could migrate from one partition to another, hudi does not have an option, but to involve every file group in the table. So, in this case, 1M file groups will be involved during index lookup. So, you will definitely see performance difference b/w regular index and global index. So, its recommended to use global index having these in mind. If its feasible to go w/ regular index, thats the best possible option since the index look up is proportional to the amount of data only in the matched partitions from incoming records.

Lets say in commit1, we added 1000 partitions and 1000 file groups. In commit2, we are going to update 10k records in P990, and P999. Everything is an update and all of them belong to the same partition.

So, even if the record does not migrate (in this example), hudi does not know unless index lookup is complete and so it has to involve all file groups from all partitions for index lookup. But during actual writes, the number of file groups will be pretty much close to what a regular index would touch. Only difference is, if there are records being migrated, those file groups in older partitions might also be involved.

Conclusion:

Hope this blog spelled out the difference b/w regular and global indexes in Hudi and calls out when one can use each of them and what’s the cost one has to pay for global uniqueness.

--

--