How to configure Cleaner configs with Apache Hudi

Sivabalan Narayanan
5 min read · Oct 31, 2022


Many times, we have seen support requests from users in the community reporting FileNotFound issues while queries are executing against Hudi tables. Oftentimes, users have missed configuring their cleaner configs based on their needs. In this blog, we will discuss how to configure them and avoid such query failures.

Hoodie Cleaner

Hoodie Cleaner is a utility that assists in bounding your storage costs by reclaiming space. Apache Hudi provides snapshot isolation between writers and readers by managing multiple file versions with MVCC concurrency. These file versions provide history and enable time travel and rollbacks, but there is a trade-off between how much storage space you can afford and how far back in the past you need to support time travel.

You can check out more info on Cleaning in the Hudi documentation.

But one of the common questions we see in our support channels is that users hit FileNotFound issues with queries against Hudi tables and want to know how to mitigate them. Most likely the cleaner configs were not tuned to their needs, and hence they end up with FileNotFound issues.

Let’s take a look at what the cleaner actually does to your data files before diving into configuring it the right way.

We have a few different cleaner policies in Apache Hudi, and for explanation purposes, let’s go with KEEP_LATEST_COMMITS.

“hoodie.cleaner.commits.retained” is set to 2.
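As a quick reference, here is a minimal sketch of how these cleaner configs could be passed on a Hudi write from PySpark. It assumes a Spark environment with the Hudi bundle on the classpath; the table name, base path, record key and precombine fields are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("id1", 1, "2022-10-31")], ["uuid", "ts", "date"])

hudi_options = {
    "hoodie.table.name": "demo_table",                      # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "uuid",      # hypothetical record key
    "hoodie.datasource.write.precombine.field": "ts",       # hypothetical precombine field
    # Cleaner policy and retention used in this walkthrough
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "2",
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/demo_table"))   # hypothetical base path
```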

Every commit touches a subset of the file groups. Some could result in new file groups and some might update existing file groups.

Let’s go over an example scenario.

Commit C1:

Adds Data file 1 V1, Data file 2 V1, Data file 3 V1.

Commit C2:

Updates Data file 2 and so creates Data file 2 V2, and updates Data file 3 and so creates Data file 3 V2.

Commit C3:

Updates Data file 2 and so creates Data file 2 V3, and updates Data file 3 and so creates Data file 3 V3. It also creates a new file group and so creates Data file 4 V1.

Here is a pictorial representation of data files after 3 commits.

After 3 commits, our cleaner might kick in and remove some of the data files.

As you can see, after commit C3, the cleaner kicks in, and we have set “hoodie.cleaner.commits.retained” = 2. So, the cleaner will first determine the earliest commit to retain, which is C2 in this case (since there are 3 commits in the active timeline and the cleaner commits retained value is 2). Once determined, the cleaner will clean up all data file versions created by commits earlier than C2, which is just C1 in this case. As seen in the picture, the cleaner deletes Data file 2 V1 and Data file 3 V1 (since these file versions were created by C1). Please do note that Data file 1 V1 is left as is, since there are no newer file versions for this file group.

Let’s add one more commit and see what the cleaner does after that. Commit C4 adds a new version to Data file 2, Data file 3 and Data file 4.

This time, the cleaner deletes the 2 file versions that were created by commit C2, leaving the 2 latest commits (C3, C4) untouched in the active timeline.
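To make the walkthrough concrete, here is a small, purely illustrative Python simulation (not Hudi code) of the KEEP_LATEST_COMMITS bookkeeping above with “hoodie.cleaner.commits.retained” = 2. It reproduces, under the simplified rules of this example, which file versions get cleaned after C3 and C4.

```python
from collections import defaultdict

retained = 2  # hoodie.cleaner.commits.retained

# (commit, file group, version) written in the walkthrough, C1 through C4
writes = [
    ("C1", "Data file 1", "V1"), ("C1", "Data file 2", "V1"), ("C1", "Data file 3", "V1"),
    ("C2", "Data file 2", "V2"), ("C2", "Data file 3", "V2"),
    ("C3", "Data file 2", "V3"), ("C3", "Data file 3", "V3"), ("C3", "Data file 4", "V1"),
    ("C4", "Data file 2", "V4"), ("C4", "Data file 3", "V4"), ("C4", "Data file 4", "V2"),
]

commits = sorted({c for c, _, _ in writes})
deleted = set()

# Run the cleaner after each commit, once more than `retained` commits exist
for i in range(retained + 1, len(commits) + 1):
    timeline = commits[:i]
    earliest_retained = timeline[-retained]  # C2 after C3, C3 after C4

    groups = defaultdict(list)
    for c, fg, v in writes:
        if c in timeline and (c, fg, v) not in deleted:
            groups[fg].append((c, v))

    for fg, versions in sorted(groups.items()):
        latest = versions[-1]  # newest surviving version of this file group
        for c, v in versions:
            # Versions created before the earliest retained commit are cleaned,
            # unless they are the latest (only) version of the file group
            if c < earliest_retained and (c, v) != latest:
                deleted.add((c, fg, v))
                print(f"clean after {timeline[-1]}: {fg} {v} (written by {c})")
```

Running this prints the C1 versions of Data file 2 and Data file 3 being cleaned after C3, and their C2 versions being cleaned after C4, while Data file 1 V1 survives throughout.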

How to configure your cleaner configs so that your queries don’t hit FileNotFound issues

For this, you need to know the maximum time any query against the table of interest could take. This might vary depending on your table scale and the query predicates involved.

For example, say your table gets a new commit every 5 mins (commit throughput). And let’s assume that the maximum time any query would take (right from when it starts reading to when it completes) against the Hudi table of interest is 1 hour (exaggerated so that it’s easier to explain).

This means that if a query starts at time t100, it could complete at t160. By then, your table could have gotten 12 new commits ingested. In this case, you might want to set your “hoodie.cleaner.commits.retained” value to 12 + 1 (buffer) = 13. This means the cleaner will never touch any file versions created in the last 1 hour and 5 mins, and this will ensure none of your queries result in a FileNotFound exception. If you configure “hoodie.cleaner.commits.retained” to 10 (the default value), a query that started at t100 could hit a FileNotFound exception between t150 and t160, since a cleaner that kicks in at time t155 could delete data file versions created at times t95, t100 and t105.

So, you need two pieces of information in order to determine the right value for “hoodie.cleaner.commits.retained”: the commit throughput for your table and the maximum time a query can take against your table.
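As a back-of-the-envelope calculation using the hypothetical numbers above (one commit every 5 minutes, a worst-case query of 1 hour), the sizing looks like this:

```python
import math

commit_interval_mins = 5          # one new commit every 5 minutes
max_query_duration_mins = 60      # longest-running query against the table
buffer_commits = 1                # safety margin

commits_during_query = math.ceil(max_query_duration_mins / commit_interval_mins)
retained = commits_during_query + buffer_commits
print(retained)  # 13 -> hoodie.cleaner.commits.retained = 13
```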

Incremental queries

If you have consumers issuing incremental queries, you may need to consider that as well while configuring your cleaner configs. For example, say you wish to go back 1 day with your incremental queries, with the same commit throughput of one commit every 5 mins. You may need to set a minimum of (12 * 24) = 288 commits, plus the commits pertaining to the maximum time for a single incremental query (say 5 mins), => 289. If you set any value lower than 289, your incremental queries might also hit FileNotFound issues. There is some nuance here w.r.t. archival configs as well. For instance, if archival has already archived some commits, an incremental query may not be able to read those commits incrementally in the first place. But anyway, I hope you got the gist of how cleaner configs interplay with your snapshot queries and incremental queries.
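The same kind of estimate for the incremental-query case above (one commit every 5 minutes, a 1-day lookback, and a single incremental query taking up to 5 minutes) works out as follows:

```python
import math

commit_interval_mins = 5           # one new commit every 5 minutes
lookback_hours = 24                # incremental consumers may go back 1 day
max_incr_query_mins = 5            # longest single incremental query run

lookback_commits = lookback_hours * 60 // commit_interval_mins             # 288
query_commits = math.ceil(max_incr_query_mins / commit_interval_mins)      # 1
print(lookback_commits + query_commits)  # 289 -> minimum hoodie.cleaner.commits.retained
```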

Conclusion

Hope we were able to demystify how to set the right cleaner configs so that you don’t hit any exceptions with queries reading from Hudi tables.
