How to get your data into Apache Hudi

Sivabalan Narayanan
5 min read · Dec 18, 2022

As we all know, Apache Hudi provides an abstraction over your cloud storage to write transactional lakehouse tables. In this blog, we are going to cover the different ways you can write Hudi tables.

Engines

Apache Hudi supports different engines on the writer side. As of this writing, Spark, Flink and Java are supported. Maturity and feature set may differ slightly from one engine to another, so please watch out for that when trying out any feature.

Spark Engine

This is how Apache Hudi started at Uber: with Spark writes. Any new feature is typically built and tested on the Spark engine first before being expanded to the others.

Within Spark, we have four different ways to write Hudi tables.

1. Spark datasource writer

This is one of the most common ways to write to Hudi, and the approach should be familiar to Spark users. If you have a dataframe df, you can write it using:

df.write.format("hudi").options(hudiOptions).mode(SaveMode.Append).save(path)

This is mostly operated in batch mode, and users typically schedule their ingestion via some external scheduling framework. For example, you can schedule your ingestion pipeline to run once every 15 minutes, consuming data from some other source and ingesting it into Hudi.

Since this is a batch or write-once mode, you do not get async table services with it. So you might have to go with inline table services if you wish to leverage one.
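Putting it together, here is a minimal Scala sketch of a datasource write with one inline table service enabled. The table name, key/precombine/partition columns and path below are illustrative placeholders, not something prescribed by Hudi.

import org.apache.spark.sql.SaveMode

// df is the DataFrame from the snippet above; option values are placeholders.
val hudiOptions = Map(
  "hoodie.table.name" -> "trips",
  "hoodie.datasource.write.recordkey.field" -> "uuid",
  "hoodie.datasource.write.precombine.field" -> "ts",
  "hoodie.datasource.write.partitionpath.field" -> "city",
  "hoodie.datasource.write.operation" -> "upsert",
  // inline clustering, since a batch writer has no long-running process for async services
  "hoodie.clustering.inline" -> "true"
)

df.write
  .format("hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save("s3://bucket/tables/trips")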

2. Spark streaming sink

If you have a streaming source, you can still write to Hudi using the Spark structured streaming sink in Apache Hudi. As you might imagine, this is a continuously running Spark application. The streaming sink keeps consuming data from the streaming source and ingesting it into Hudi. Based on the options set, it might automatically trigger cleaning, archival, etc. You also get async table services along with it, which is one of the best offerings in Hudi: you have a single Spark job wherein your regular writer keeps writing to your Hudi table, while an async service keeps performing compaction/clustering as and when needed without impacting or blocking your ingestion. This is very critical to ensure you stream data into your lake faster while also optimizing the storage layout, without needing to schedule an additional Spark job. With an additional Spark job comes the extra burden of coordinating the writes and table services, and hence lock providers become a must.
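As an illustration, here is a minimal Scala sketch of the streaming sink, assuming streamingDf was already created via spark.readStream from your source; the option values, checkpoint location and path are placeholders.

import org.apache.spark.sql.streaming.Trigger

// For MOR tables, compaction can run asynchronously inside this same application.
val query = streamingDf.writeStream
  .format("hudi")
  .option("hoodie.table.name", "trips_streaming")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("checkpointLocation", "s3://bucket/checkpoints/trips_streaming")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start("s3://bucket/tables/trips_streaming")

query.awaitTermination()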

3. Deltastreamer

We have already covered this in one of my previous blogs; feel free to give it a read. This is one of the coolest platform services that Hudi offers. It's a self-managed ingestion tool for Hudi: all you need to do is point it to your source, and the tool will take care of keeping your lakehouse table in sync with that source. It offers a lot of flexibility from a deployment standpoint. For example, it has a write-once mode and a continuous mode. You can schedule it using an external framework at a regular cadence, or choose to run it continuously. If you have auto scaling enabled, you could even save cost by going with continuous mode with a minimum sync interval of around 15 minutes and avoid any external scheduling framework.

It has its own checkpointing mechanism, so the tool guarantees no data loss and no duplicates with respect to the data ingested into Hudi. It also comes with async table services when you run it in continuous mode, and we know from our experience supporting the community that it is very widely used and battle tested.

4. Spark-sql writes

Apache Hudi added spark-sql DML support in 0.9.0, so you can write to Hudi using spark-sql. It supports all the common SQL syntaxes: CTAS (create table as select), INSERT INTO, MERGE INTO, UPDATE and DELETE. Many have started using this mode, and it has been especially popular for incremental ETLs. For example, if you want to build a silver/gold table that consumes from N source tables, this could be handy; for a one-to-one pipeline, you could use Deltastreamer instead, since it has a Hudi incremental source.
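To illustrate, here is a hedged Scala sketch using spark.sql; the table and column names are made up, and the exact DDL properties can vary slightly across Hudi versions.

// Create a Hudi table via SQL (names and columns are illustrative).
spark.sql("""
  CREATE TABLE IF NOT EXISTS silver_trips (
    uuid STRING,
    fare DOUBLE,
    city STRING,
    ts BIGINT
  ) USING hudi
  TBLPROPERTIES (primaryKey = 'uuid', preCombineField = 'ts')
""")

// Incremental-ETL-style upsert from a staging table into the silver table.
spark.sql("""
  MERGE INTO silver_trips AS t
  USING (SELECT uuid, fare, city, ts FROM bronze_trips_increment) AS s
  ON t.uuid = s.uuid
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")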

I have left out PySpark, but you can also use PySpark to write to Hudi (at least for Spark datasource writes).

Flink Engine

Flink integration was added to Hudi in 0.7.0, and from there it has grown by leaps and bounds. The Flink Hudi community is very vibrant and has contributed a lot to Hudi's feature set and ecosystem. We really appreciate Hudi being embraced by everyone. Also, many in the Flink landscape use Hudi for their lakehouse, since Flink support in other data lake or lakehouse systems is either not available or not as rich as it is in Hudi.

There are two ways you can write using Flink.

1. Flink SQL client

This is the most widely used one, and it is available from 0.8.0. It supports CREATE TABLE and INSERT INTO (used for both insert and update use-cases). You can refer to our quick start guide for an example.
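The statements below are the kind you would type into the SQL client; to keep the sketch self-contained (and consistent with the other Scala snippets in this post), they are wrapped in Flink's Table API. The table name, columns and path are placeholders, and connector options can vary by Hudi/Flink version.

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

val tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())

// DDL: declare a Hudi-backed table (path and columns are illustrative).
tableEnv.executeSql(
  """CREATE TABLE trips_flink (
    |  uuid STRING,
    |  fare DOUBLE,
    |  ts TIMESTAMP(3),
    |  PRIMARY KEY (uuid) NOT ENFORCED
    |) WITH (
    |  'connector' = 'hudi',
    |  'path' = 'file:///tmp/trips_flink',
    |  'table.type' = 'MERGE_ON_READ'
    |)""".stripMargin)

// DML: insert (and, thanks to the primary key, update) rows.
tableEnv.executeSql(
  "INSERT INTO trips_flink VALUES ('id1', 12.5, TIMESTAMP '2022-12-18 00:00:00')")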

2. FlinkStreamer

This is similar to the Deltastreamer that we saw earlier for the Spark engine. It's an auto-ingestion tool for Apache Hudi using the Flink engine. It has been gaining popularity in recent times, and I would recommend you give it a try.

Java Engine

Compared to the other two engines, the feature support matrix is on the lighter side for the Java engine. We are looking for more contributors who can help us build out the Java engine.

HoodieJavaWriteClient helps you write to Hudi tables (there is only one way to write to Hudi using Java :)). You can refer to some examples of its usage here. COW should be reasonably caught up with the other engines, but MOR table support might be missing some features in comparison. One thing to call out here is that it's the Java write client that came in handy when we developed the Kafka Connect support for Hudi.
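To give a rough idea of the shape of that API, here is a minimal sketch around HoodieJavaWriteClient, written in Scala only to stay consistent with the other snippets in this post. The path, table name, schema string and the buildRecords helper are hypothetical, and the builder options can differ slightly across Hudi versions.

import java.util.{List => JList}
import org.apache.hadoop.conf.Configuration
import org.apache.hudi.client.HoodieJavaWriteClient
import org.apache.hudi.client.common.HoodieJavaEngineContext
import org.apache.hudi.common.model.{HoodieAvroPayload, HoodieRecord}
import org.apache.hudi.config.HoodieWriteConfig

// Hypothetical placeholders.
val tablePath = "file:///tmp/trips_java"
val tableName = "trips_java"
val avroSchema = "{...}" // Avro schema of your records, as a JSON string

val context = new HoodieJavaEngineContext(new Configuration())
val writeConfig = HoodieWriteConfig.newBuilder()
  .withPath(tablePath)
  .withSchema(avroSchema)
  .forTable(tableName)
  .build()

val client = new HoodieJavaWriteClient[HoodieAvroPayload](context, writeConfig)

// buildRecords() is a hypothetical helper that turns your source data into
// HoodieRecord[HoodieAvroPayload] instances (a HoodieKey plus an Avro payload).
val records: JList[HoodieRecord[HoodieAvroPayload]] = buildRecords()

val instantTime = client.startCommit() // open a new commit on the timeline
client.upsert(records, instantTime)    // write the batch
client.close()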

Conclusion

Hope this blog gave you a high-level overview of the different ways to write to Apache Hudi. This blog intentionally covered only the OSS options; there could be more with commercial offerings. For example, with AWS there are more options like the Glue connector, which are not covered in this blog, but any commercial offering will boil down to one of the write options above.
Anyway, to summarize: depending on the engine or framework your organization operates with, and on your requirements and workload types, you have many different options to choose from to write to Hudi. If you wish to contribute to Hudi, please check out here. We are always looking for more contributors.
