Table service deployment models in Apache Hudi

Sivabalan Narayanan
Feb 12, 2023

In one of our previous blogs, we looked at the different ways you can ingest data into Hudi. We also covered the table services Apache Hudi offers, like compaction and clustering, to help manage your storage layout efficiently. While those blogs discussed the purpose of these table services and the different strategies you can employ, we did not get a chance to cover deployment models. This blog covers them: the different ways you can deploy these table services, depending on operational complexity, how you want to manage compute resources, and so on.

Different ways to deploy these table services:

  1. Inline
  2. Separate async service
  3. Async service alongside continuous deltastreamer and streaming ingest
  4. Schedule inline, execute async (a hybrid, covered at the end)

Inline:

As the name suggests, the simplest way to deploy these table services is inline. Say you configure compaction to run every 5 commits: after every 5th commit completes, compaction is triggered immediately within the same writer. Many users prefer this since it is the simplest option and the easiest to operate. Making sense of the timeline is also a lot easier since everything happens inline, which makes it an easy way for beginners to get started.

You can enable inline compaction and clustering for spark datasource writes, spark-sql writes and deltastreamer in sync-once (non-continuous) mode. It is not feasible to enable inline compaction or clustering with continuous mode deltastreamer or streaming ingest.

For example, you can use the below configs to enable inline compaction with your writes.

hoodie.compact.inline=true
hoodie.compact.inline.max.delta.commits=4

Once every 4 delta commits, both scheduling and execution of compaction will be triggered inline.

Similarly, to trigger inline scheduling and execution of clustering:

hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4

Once every 4 commits, both scheduling and execution of clustering will be triggered inline.

Note: I am not including other relevant params like strategy, small-file limits, target file size, etc., since the focus of this blog is deployment models and not the different strategies.

You just need to set these configs as part of your regular writes, and once every N commits these table services will kick in.
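
To make that concrete, here is a minimal sketch of a spark datasource write (Scala) w/ both inline compaction and inline clustering enabled. The DataFrame df, the table name, the record key / precombine fields and the path are placeholders; adjust them for your table.

// Minimal sketch: upsert into a MOR table w/ inline compaction and clustering.
// df, table name, key/precombine fields and paths below are placeholders.
df.write
  .format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  // inline compaction: schedule and execute once every 4 delta commits
  .option("hoodie.compact.inline", "true")
  .option("hoodie.compact.inline.max.delta.commits", "4")
  // inline clustering: schedule and execute once every 4 commits
  .option("hoodie.clustering.inline", "true")
  .option("hoodie.clustering.inline.max.commits", "4")
  .mode("append")
  .save("/path/to/hudi_table")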

Separate Async service:

As you might have inferred, running table services inline takes a hit on your write latency, depending on the cadence w/ which they get executed. If you want to guarantee a strict SLA on your write performance, then you might want to consider going for a separate async service.

Also, some advanced users might want to separate out the regular writers and the table services like compaction and clustering so that they can manage resources efficiently. Otherwise, you might have to allocate more resources to your writers just to accommodate table services that execute only once every N commits, in which case you might end up paying for resources that are not used most of the time.

Hudi offers standalone async services to cater to such users. Each is a separate spark-submit job in itself whose sole purpose is to run these table services.

Compaction:

You can make use of the HoodieCompactor that comes with the utilities bundle. A sample command can be found in the Hudi docs.
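
For illustration, the invocation looks roughly like the sketch below. The jar name and paths are placeholders, and exact flags can vary across Hudi versions, so verify against the docs for your release.

spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  hudi-utilities-bundle_2.12-<version>.jar \
  --base-path /path/to/hudi_table \
  --table-name my_table \
  --mode scheduleAndExecute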

Clustering:

You can make use of the HoodieClusteringJob that comes w/ the utilities bundle. More info can be found in the Hudi docs.
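
Again, just as an illustration (jar name, paths and properties file are placeholders), a run that both schedules and executes clustering might look like:

spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  hudi-utilities-bundle_2.12-<version>.jar \
  --props /path/to/clusteringjob.properties \
  --base-path /path/to/hudi_table \
  --table-name my_table \
  --mode scheduleAndExecute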

You need to configure lock provider configs for these standalone services and also for your regular writers so that they can coordinate among themselves. If not, it becomes an operational burden to ensure your write pipeline is down whenever your standalone table services are running, which in general gets tricky if you are managing 100s of tables.
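
For example, a ZooKeeper based lock provider can be wired in w/ properties like the ones below, set on both the regular writer and the standalone table service job. The ZooKeeper host, port and paths here are placeholders.

hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=zk-host
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.lock_key=my_table
hoodie.write.lock.zookeeper.base_path=/hudi/locks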

Async service alongside continuous deltastreamer and streaming ingest:

If you are using continuous mode in deltastreamer or streaming ingest to write Hudi tables, Hudi offers yet another option to make your life easier. How about executing these table services asynchronously, w/o impacting your regular writes, and w/o needing to spin up a lock provider or manage a separate spark job? How cool is that? Yes, if you are running deltastreamer in continuous mode or using streaming ingest to write to Hudi, Hudi can execute these table services asynchronously within the same process, but in a separate thread. From our experience w/ the community, many users love this and it just works w/o much hassle.

Compaction:

By default, async compaction is enabled for MOR Hudi tables if you are using deltastreamer in continuous mode or streaming ingest. You just need to set any additional params in your property file, like the trigger strategy, execution strategy, etc.

hoodie.compact.inline.max.delta.commits=4

You can add this config to your property file so that async compaction will kick in once every 4 delta commits.

Clustering:

Similar to compaction, you can enable async clustering along with your continuous mode deltastreamer or streaming ingest.

You need to add these configs when starting your deltastreamer spark-submit job in continuous mode:

--hoodie-conf hoodie.clustering.async.enabled=true
--hoodie-conf hoodie.clustering.async.max.commits=4
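
For context, here is a sketch of where these flags sit in a continuous-mode deltastreamer invocation. The source class, ordering field, paths and properties file are placeholders.

spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle_2.12-<version>.jar \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path /path/to/hudi_table \
  --target-table my_table \
  --props /path/to/deltastreamer.properties \
  --continuous \
  --hoodie-conf hoodie.clustering.async.enabled=true \
  --hoodie-conf hoodie.clustering.async.max.commits=4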

In case of streaming ingest, you can set the configs along w/ your streaming job:

option("hoodie.clustering.async.enabled","true")
option("hoodie.clustering.async.max.commits","4")

You don’t need to set up a lock provider if you have only one writer. Hudi automatically enables the InProcessLockProvider in such cases (the async service runs in a separate thread but within the same process). But if you wish to set up an external lock provider, you are most welcome to.

Schedule inline, Execute async:

If you are using a standalone spark-submit job to schedule and execute compaction while your writers are writing concurrently, you might occasionally find scheduling fail, because there are certain conditions that need to be met while scheduling compaction, and meeting them can be tricky when a concurrent writer is running alongside your async table service. So, if you are using spark datasource or spark-sql writes and prefer to execute compaction asynchronously, we do have an additional option for you. You can enable inline scheduling along with your write pipeline and delegate only the execution to your standalone spark job. In this case, there won’t be any issues while scheduling compaction, since it is done inline and Hudi knows exactly when to schedule it so that all constraints are met.

You need to set the below configs on your regular writer to achieve this:

hoodie.compact.inline=false
hoodie.compact.schedule.inline=true
hoodie.compact.inline.max.delta.commits=4

While executing HoodieCompactor, you need to set “--mode execute” so that only execution is taken care of by this standalone job, since scheduling is done by the regular writer.
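
So the standalone job from earlier now only runs in execute mode, for example (same caveat: jar name and paths are placeholders):

spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  hudi-utilities-bundle_2.12-<version>.jar \
  --base-path /path/to/hudi_table \
  --table-name my_table \
  --mode execute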

Conclusion:

The core creators of Hudi have always been mindful of the different sets of users w/ different requirements around managing such table services. Hence Hudi offers 3 to 4 deployment models depending on your needs and requirements. Hope this blog gave you a good illustration of the different ways you can deploy table services with Apache Hudi.
