Curious case of DefaultHoodieRecordPayload vs default payload class in Hudi

Sivabalan Narayanan
Jan 22, 2023

In Hudi, you can configure a payload class of your choice for a given table. It is used to merge two versions of the same record during updates. Let’s look under the hood to understand the purpose of the payload class and the different options available.

Config to use: hoodie.datasource.write.payload.class

Note: With the new record merge API initiatives, this might change. The payload class details described here apply to Hudi versions up to 0.13.0. Future releases might deprecate the payload class.
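
As a rough illustration of where this config sits among the other writer options, here is a minimal sketch that only builds the options map, so it runs without Spark or Hudi on the classpath. The table name and the column names (my_table, id, updated_at) are hypothetical; the option keys and the fully qualified payload class name are the ones Hudi uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: assembles Hudi writer options as a plain map.
// Table and column names below are hypothetical placeholders.
class PayloadConfigSketch {

    static Map<String, String> hudiWriteOptions(String payloadClass) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.table.name", "my_table");                          // hypothetical table
        opts.put("hoodie.datasource.write.recordkey.field", "id");          // hypothetical column
        opts.put("hoodie.datasource.write.precombine.field", "updated_at"); // hypothetical column
        opts.put("hoodie.datasource.write.payload.class", payloadClass);
        return opts;
    }

    public static void main(String[] args) {
        Map<String, String> opts =
            hudiWriteOptions("org.apache.hudi.common.model.DefaultHoodieRecordPayload");
        for (Map.Entry<String, String> e : opts.entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

With Spark, a map like this is typically handed to the DataFrame writer via its options(...) call before saving in the "hudi" format.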

Payload class

Hudi has a payload class interface which will determine how two versions of the same record are merged together.

Excerpt of the interface that’s of interest to us:

/**
 * This methods lets you write custom merging/combining logic to produce new values as a function of current value on storage and whats contained
 * in this object. Implementations can leverage properties if required.
 * <p>
 * eg:
 * 1) You are updating counters, you may want to add counts to currentValue and write back updated counts
 * 2) You may be reading DB redo logs, and merge them with current image for a database row on storage
 * </p>
 *
 * @param currentValue Current value in storage, to merge/combine this payload with
 * @param schema Schema used for record
 * @param properties Payload related properties. For example pass the ordering field(s) name to extract from value in storage.
 * @return new combined/merged value to be written back to storage. EMPTY to skip writing this record.
 */
Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema, Properties properties) throws IOException;

Hudi internally represents a record as a HoodieRecord, which consists of a pair of HoodieKey and HoodieRecordPayload. HoodieKey, as we have seen in previous blogs, represents the primary key for a record (typically, partition path and record key). HoodieRecordPayload is the actual data passed in by the user.

Let’s walk through a typical example. You ingest 2 records in commit1, namely {HK1, payload1_1} and {HK2, payload2_1} (HK -> HoodieKey). In commit2, let’s say you ingest {HK1, payload1_2} and {HK3, payload3_1}.

Since we are seeing an update for HK1, Hudi will have to merge the two payloads (payload1_1 and payload1_2) to produce the final output for HK1. That’s where the combineAndGetUpdateValue() shown above comes into play.

Essentially, HK1.payload1_2.combineAndGetUpdateValue(HK1.payload1_1) deduces the final value for HK1 at the end of commit2.
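
To make the flow concrete, here is a minimal sketch that replays commit1 and commit2 from the example above. It is a simplified model and not Hudi code: keys and payloads are plain strings, the Payload interface is a hypothetical stand-in that only mirrors the shape of combineAndGetUpdateValue(), and the merge rule used (the incoming value wins) is just a placeholder.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Simplified stand-ins, not Hudi classes: keys and payloads are plain strings.
class UpsertFlowSketch {

    // Hypothetical interface that mirrors the shape of the real merge method.
    interface Payload {
        String insertValue();                                          // value written for a brand-new key
        Optional<String> combineAndGetUpdateValue(String currentValue); // empty means "skip writing"
    }

    // Placeholder merge rule: the incoming value always replaces the stored one.
    static Payload latestWins(final String value) {
        return new Payload() {
            @Override public String insertValue() { return value; }
            @Override public Optional<String> combineAndGetUpdateValue(String currentValue) {
                return Optional.of(value);
            }
        };
    }

    // Applies one commit: new keys are inserted, existing keys are merged via the incoming payload.
    static void applyCommit(Map<String, String> storage, Map<String, Payload> incoming) {
        for (Map.Entry<String, Payload> e : incoming.entrySet()) {
            String key = e.getKey();
            Payload payload = e.getValue();
            if (!storage.containsKey(key)) {
                storage.put(key, payload.insertValue());
            } else {
                Optional<String> merged = payload.combineAndGetUpdateValue(storage.get(key));
                if (merged.isPresent()) {
                    storage.put(key, merged.get());
                }
            }
        }
    }

    static Map<String, String> runTwoCommits() {
        Map<String, String> storage = new LinkedHashMap<>();
        Map<String, Payload> commit1 = new LinkedHashMap<>();
        commit1.put("HK1", latestWins("payload1_1"));
        commit1.put("HK2", latestWins("payload2_1"));
        applyCommit(storage, commit1);

        Map<String, Payload> commit2 = new LinkedHashMap<>();
        commit2.put("HK1", latestWins("payload1_2")); // update: merged with payload1_1
        commit2.put("HK3", latestWins("payload3_1")); // insert: nothing to merge with
        applyCommit(storage, commit2);
        return storage;
    }

    public static void main(String[] args) {
        System.out.println(runTwoCommits()); // {HK1=payload1_2, HK2=payload2_1, HK3=payload3_1}
    }
}
```

The point of the sketch is the call direction: the incoming payload is handed the stored value and decides what gets written back.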

With that context, let’s dive into some of the payload implementations Hudi offers. The default payload class is called OverwriteWithLatestAvroPayload.

OverwriteWithLatestAvroPayload

As the name suggests, when this payload class is used, we simply overwrite any existing value with the new incoming value. So, in the case of the above example, payload1_2 will be the final value for HK1 once commit2 completes. This is the simplest payload Hudi offers and has worked out well for most users in the community.

DefaultHoodieRecordPayload

We also have a payload class called DefaultHoodieRecordPayload. Compared to OverwriteWithLatestAvroPayload, which was available in Hudi right from the beginning, DefaultHoodieRecordPayload was introduced about 1.5 years ago. Let’s take a peek at what’s special about this payload class.

In general, you can configure a preCombine field for a Hudi table. Briefly, the preCombine field is used to pick the winner between two versions of the same record within the same batch. For example, if you ingest {HK1, payload1_1} and {HK1, payload1_2} in the same batch while writing to Hudi, Hudi will dedup the incoming records before it routes them internally. So, the preCombine field value will determine the winner among the multiple versions in such cases.

For example, you can elect the “updated_at” field in your table schema as the preCombine field. So, if there is more than one record in the incoming batch with the same HoodieKey, whichever record has the higher preCombine value will take precedence.
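
The within-batch dedup described above can be sketched as follows. This is a simplified model under stated assumptions, not Hudi’s implementation: Rec is a hypothetical record type with a string key and a numeric updated_at, the values 10, 20 and 5 are made up, and on a tie the sketch keeps the first record seen, which is an arbitrary choice made here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of dedup within one incoming batch using a preCombine (ordering) value.
class PreCombineSketch {

    // Hypothetical record: key, payload, and the preCombine value (e.g. updated_at).
    static final class Rec {
        final String key;
        final String payload;
        final long updatedAt;

        Rec(String key, String payload, long updatedAt) {
            this.key = key;
            this.payload = payload;
            this.updatedAt = updatedAt;
        }
    }

    // Keeps, per key, the record with the highest preCombine value.
    static List<Rec> dedupe(List<Rec> batch) {
        Map<String, Rec> winners = new LinkedHashMap<>();
        for (Rec r : batch) {
            Rec seen = winners.get(r.key);
            // Tie handling (first seen stays) is an assumption of this sketch.
            if (seen == null || r.updatedAt > seen.updatedAt) {
                winners.put(r.key, r);
            }
        }
        return new ArrayList<>(winners.values());
    }

    public static void main(String[] args) {
        List<Rec> batch = Arrays.asList(
            new Rec("HK1", "payload1_1", 10L),
            new Rec("HK1", "payload1_2", 20L),
            new Rec("HK2", "payload2_1", 5L));
        for (Rec r : dedupe(batch)) {
            System.out.println(r.key + " -> " + r.payload);
        }
        // HK1 -> payload1_2
        // HK2 -> payload2_1
    }
}
```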

Even though OverwriteWithLatestAvroPayload and DefaultHoodieRecordPayload might look similar, there is one key difference: the way combineAndGetUpdateValue() is implemented. DefaultHoodieRecordPayload also honours the preCombine value while merging an incoming record with the one in storage, while OverwriteWithLatestAvroPayload will blindly choose the incoming record over anything that’s in storage.

Let’s add commit2 with an insert record (HK3) and an update for HK1.

Both OverwriteWithLatestAvroPayload and DefaultHoodieRecordPayload updated HK1 with payload1_2. OverwriteWithLatestAvroPayload always chooses the newer incoming record, and hence payload1_2 was chosen. DefaultHoodieRecordPayload decides based on the preCombine field value. Since payload1_2’s preCombine field value (20) is higher than payload1_1’s preCombine field value (10), DefaultHoodieRecordPayload also chose payload1_2 as the final snapshot for HK1.

Now, let’s go for commit3, which updates HK1 with a lower preCombine value to mimic late-arriving data.

OverwriteWithLatestAvroPayload chooses the new incoming payload irrespective of the preCombine value, and hence it chooses payload1_3 as the final value for HK1. But DefaultHoodieRecordPayload chooses the final winner based on the preCombine value, and hence it keeps payload1_2 as the final snapshot value for HK1.
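
The three commits for HK1 can be replayed with a small sketch that contrasts the two behaviours. It is a simplified model rather than the actual Hudi classes: Version is a hypothetical type, the preCombine values 10 and 20 come from the walkthrough above, and 15 for payload1_3 is an assumed value (any value lower than 20 would do). Letting the incoming record win on a tie is also an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

// Replays the HK1 timeline under the two merge behaviours described above.
class LateArrivalSketch {

    // Hypothetical stand-in for one version of a record.
    static final class Version {
        final String payload;
        final long preCombine;

        Version(String payload, long preCombine) {
            this.payload = payload;
            this.preCombine = preCombine;
        }
    }

    // OverwriteWithLatestAvroPayload-style: the incoming version always wins.
    static Version overwriteWithLatest(Version stored, Version incoming) {
        return incoming;
    }

    // DefaultHoodieRecordPayload-style: the incoming version wins only if its
    // preCombine value is not lower than the stored one (tie rule assumed here).
    static Version honourPreCombine(Version stored, Version incoming) {
        return incoming.preCombine >= stored.preCombine ? incoming : stored;
    }

    // Folds the versions commit by commit and returns the final payload.
    static String finalPayload(List<Version> commits, boolean useOrdering) {
        Version stored = commits.get(0); // commit1 is a plain insert
        for (Version incoming : commits.subList(1, commits.size())) {
            stored = useOrdering
                ? honourPreCombine(stored, incoming)
                : overwriteWithLatest(stored, incoming);
        }
        return stored.payload;
    }

    public static void main(String[] args) {
        List<Version> hk1 = Arrays.asList(
            new Version("payload1_1", 10L),  // commit1
            new Version("payload1_2", 20L),  // commit2
            new Version("payload1_3", 15L)); // commit3, late-arriving (15 is assumed)
        System.out.println(finalPayload(hk1, false)); // payload1_3
        System.out.println(finalPayload(hk1, true));  // payload1_2
    }
}
```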

There are a few other payload classes available for ease of use for the community. To name a few, we have OverwriteNonDefaultsWithLatestAvroPayload, AWSDmsAvroPayload, MySqlDebeziumAvroPayload, PostgresDebeziumAvroPayload, etc. Interested folks can check out the source of each of these payload classes.

Having such customizable options to merge two versions of a record gives great flexibility to lakehouse users. Apart from spark-sql writes (MERGE INTO), not many systems give you this flexibility, but Hudi users have enjoyed it right from the start :)
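
As an illustration of that flexibility, here is a minimal sketch of the counter use case mentioned in the interface’s Javadoc (add the incoming count to the stored count). It is a plain-Java model with a hypothetical Payload interface and Long values; a real custom payload would implement Hudi’s HoodieRecordPayload and work with Avro records, which this sketch deliberately leaves out.

```java
import java.util.Optional;

// Sketch of a custom merge rule: accumulate counters instead of overwriting them.
class CounterPayloadSketch {

    // Hypothetical interface mirroring the shape of combineAndGetUpdateValue().
    interface Payload<T> {
        Optional<T> combineAndGetUpdateValue(T currentValue);
    }

    // Incoming payload carries a delta that is added to the stored count.
    static final class CounterPayload implements Payload<Long> {
        private final long delta;

        CounterPayload(long delta) {
            this.delta = delta;
        }

        @Override
        public Optional<Long> combineAndGetUpdateValue(Long currentValue) {
            return Optional.of(currentValue + delta);
        }
    }

    public static void main(String[] args) {
        long stored = 10L;
        Optional<Long> merged = new CounterPayload(5L).combineAndGetUpdateValue(stored);
        System.out.println(merged.get()); // 15
    }
}
```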

Conclusion

Hudi strives to favour flexibility for the end user, as use-cases come in different shapes and sizes. The payload class is one such offering. You can also define your own payload class based on your requirements instead of confining yourself to the ones provided by Hudi. Hope this blog helps you understand the purpose of the payload class and the commonly used payload implementations. Catch you later with another interesting topic.
