Cleaner and Archival in Apache Hudi

Sivabalan Narayanan
6 min read · Jun 11, 2023


We have already covered the purpose of the cleaner and how to configure it in this blog. The cleaner goes hand in hand with archival, and understanding the underpinnings of both helps in operationalizing Hudi. Hudi is generally credited for its platformized components like the automatic cleaner, automatic file sizing, etc. Basic users may never need to tweak any of these configs, but advanced users operating Hudi at large scale, with hundreds or thousands of pipelines, are better off understanding them in detail. So, let’s do a deep dive into the cleaner and archival in Apache Hudi.

Timeline

Hudi maintains a timeline of actions in the “.hoodie” directory under the table base path. The timeline, as the name suggests, maintains the sequence of events that happen in a table. It tracks different actions like commits, rollbacks, savepoints, replace commits, etc. Each action goes through different states: requested, inflight, and completed. By listing the “.hoodie” directory you can get a sense of what’s going on in a given table. Hudi-cli has a lot more options to slice and dice the timeline and visualize it in different ways as needed.
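To make this concrete, here is a minimal sketch (plain Python, with a hypothetical local table path) that lists the timeline files under “.hoodie”. Completed actions appear as plain files named by their instant time and action, while pending states carry a “.requested” or “.inflight” suffix.

```python
import os

table_base_path = "/tmp/hudi_trips"  # hypothetical table location
timeline_dir = os.path.join(table_base_path, ".hoodie")

# Timeline files are named "<instant time>.<action>[.state]",
# e.g. "20230611120000000.commit" (completed)
# or "20230611121500000.commit.requested" (requested).
for f in sorted(os.listdir(timeline_dir)):
    print(f)
```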

Here is a simple illustration of the Hudi timeline:

We have 3 commits in the timeline. The first one is at time t1 and is completed, the 2nd commit is at time t20 and is completed, and the 3rd commit is at time t30 and is in progress. We can discuss the timeline’s sequence of events in detail in some other blog, but for the purpose of this blog, this should suffice.

Cleaner

New commits updating existing data result in newer versions of data files. As more and more updates are ingested, more versions of data files are created. The cleaner takes care of deleting older versions that may not be required anymore. The rationale for retaining more than 1 version is to serve incremental queries and time travel queries. Users also maintain versions for backup purposes, so that in case of any data corruption or mis-steps in the ingested data, we can restore to older versions. If a user does not have any of these requirements, we can retain just 1 version of data. Whatever number of versions is configured to be retained, the cleaner is responsible for deleting older versions of data files beyond that.
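As a concrete example, here is a minimal PySpark sketch using a hypothetical table name, schema, and path. It sets the cleaner configs discussed above; with KEEP_LATEST_COMMITS, the cleaner retains the data file versions needed to serve the latest N commits.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Hudi bundle on the classpath.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("u1", "2023-06-11 12:00:00", 7.5)],  # hypothetical trip record
    ["uuid", "ts", "fare"],
)

hudi_options = {
    "hoodie.table.name": "trips",                       # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.precombine.field": "ts",
    # Cleaner: retain the data file versions needed by the latest 4 commits.
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "4",
    # Alternative if none of the retention requirements apply:
    # "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
    # "hoodie.cleaner.fileversions.retained": "1",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("/tmp/hudi_trips"))
```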

So, let’s say the timeline for a Hudi table is as follows:

We have 5 commits, where a new commit happens every 5 mins. For illustration purposes, we can assume the cleaner is configured based on num_commits (more details here), with “4” as the number of commits to retain. So, the cleaner will kick in and clean up data files pertaining to any commits except the last 4.

Please do note that the cleaner only cleans up data files and not timeline files. This means that just by looking at the timeline (“.hoodie”) we may not be able to easily comprehend which commits are already cleaned up and which are yet to be cleaned up. Even in the above scenario, the timeline might still show the “t5” commit, but any incremental query or time travel query involving the “t5” commit might fail.

As and when new commits happen, the cleaner keeps attempting to clean. So, after a new commit t30, the cleaner will clean up data pertaining to commit t10. And if there is a new commit t35, the cleaner will clean up data pertaining to commit t15.
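Here is a toy simulation of that rolling behavior (plain Python, not Hudi internals): with 4 commits retained, each new commit makes the next-oldest commit’s data files eligible for cleaning.

```python
def eligible_for_clean(commits, num_commits_retained=4):
    """Commits whose data file versions the cleaner may delete."""
    return commits[:-num_commits_retained]

timeline = ["t5", "t10", "t15", "t20", "t25"]
print(eligible_for_clean(timeline))            # ['t5']
print(eligible_for_clean(timeline + ["t30"]))  # ['t5', 't10']
```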

Archival

The timeline in Hudi is categorized into an active timeline and an archived timeline. Since the active timeline is consulted by read queries to determine the valid data files to serve, we want to keep the active timeline bounded; if not, it might add to read latency. So, after a certain threshold, timeline events are moved from active to archived. That’s where the archived timeline comes into play. In general, the archived timeline is never consulted for regular operations of the table; it is just for book-keeping purposes and used only during investigations. In reality, there is no need to maintain commits in the active timeline beyond what’s been cleaned up, but for now we have separate sets of configs for the cleaner and archival. The configs of interest for archival are “hoodie.keep.min.commits” and “hoodie.keep.max.commits”. These two work like a window: when the number of commits in the active timeline reaches the configured max value, archival chimes in to trim it down to the configured min value. For example, if you have configured (6, 10) for the min and max commits to keep, then whenever the active timeline grows to 10 commits, archival will move the earliest 4 commits to the archived timeline, leaving only 6 commits in the active timeline.
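A toy sketch of that (min, max) window (plain Python, not Hudi internals): archival fires only once the active timeline reaches the max, and then trims it back down to the min in one go.

```python
def archive_if_needed(active, min_commits=6, max_commits=10):
    """Return (new_active, newly_archived) lists of commit times."""
    if len(active) >= max_commits:
        return active[-min_commits:], active[:-min_commits]
    return active, []

active = [f"t{i}" for i in range(1, 11)]  # 10 commits -> max threshold hit
active, archived = archive_if_needed(active)
print(archived)  # ['t1', 't2', 't3', 't4'] move to the archived timeline
print(active)    # ['t5', ..., 't10'] remain active
```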

Let’s walk through an illustration to simplify our understanding.

Here is a sample table’s timeline, and let’s assume we have disabled the cleaner and archival for now.

The table has 1 commit every 10 mins and has accrued 15 commits so far.

Here are the configured values (see the write-options sketch after the list):

Cleaner commits retained: 5

Min commits for archival: 10

Max commits for archival: 14
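Expressed as Hudi write options, the walkthrough’s settings would look like this (the config keys are real Hudi configs; the numbers are just this example’s values):

```python
walkthrough_options = {
    "hoodie.cleaner.commits.retained": "5",  # cleaner: keep last 5 commits' files
    "hoodie.keep.min.commits": "10",         # archival window: min
    "hoodie.keep.max.commits": "14",         # archival window: max
}
```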

Let’s enable the cleaner and see what happens. There are 15 commits in the timeline, and the cleaner will clean up data pertaining to all commits except the latest 5.

Now let’s invoke archival and see what happens.

The configured values are (10, 14) for min and max commits. Since the number of entries in the active timeline has crossed 14, archival will trim it down to 10 commits. So, archival leaves the last 10 commits (t60 to t150, with 1 commit every 10 mins) and archives the rest. You can find a directory called “archived” under “.hoodie” where all the archived entries are stored. Archived entries are stored within log files; each log file by default contains 10 entries. You can tune this with configs like “hoodie.commits.archival.batch”.
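For instance, a minimal sketch (plain Python, same hypothetical table path as before) to peek at the archived timeline:

```python
import os

archived_dir = "/tmp/hudi_trips/.hoodie/archived"  # hypothetical table path

# Archived instants are packed into log files, 10 entries per file by default
# (tunable via "hoodie.commits.archival.batch").
for f in sorted(os.listdir(archived_dir)):
    print(f)
```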

So, to summarize the current state of things, showing the archived, cleaned, and uncleaned data:

On every new commit, the cleaner will clean up older versions of data if applicable. But archival might kick in only when the active entries reach 14 (4 more commits). You can also configure the cleaner to run only once every N commits using the “hoodie.clean.max.commits” config.
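For example (the config key is a real Hudi config; the value is illustrative), this would make the cleaner attempt a clean only once every 4 commits instead of after every commit:

```python
cleaner_frequency_options = {
    "hoodie.clean.max.commits": "4",  # attempt a clean only every 4 commits
}
```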

Just a reminder: the cleaner takes care of deleting data files (without touching any timeline files), while archival takes care of cleaning up (or rather, archiving) timeline files (without touching any data files).

Conclusion

The cleaner and archival are background services that take care of cleaning up older versions of data, cleaning up partially failed commits, and keeping the timeline in bounds to ensure read latencies are not affected. Understanding how to configure them and how they work in detail can come in handy when operationalizing Hudi at large scale.
