Dev series on Apache Hudi: A data lake technology

Sivabalan Narayanan
2 min read · Aug 22, 2021

Apache Hudi is a data lake technology that helps you build and manage big data for analytical purposes. I have been working with Apache Hudi for more than a year, and thought I would write a developer series on it.

Before we dive into the series, let's go over a high-level introduction.

Apache Hudi provides a storage abstraction over HDFS and Hadoop-compatible cloud stores like AWS S3. It uses open formats to store the actual data: Parquet for base files and Avro for delta logs. You can use it to store and process large analytical workloads scaling to 1000s of TBs and 1000s of tables.

Some of the cool features of Apache Hudi:

  • Transactionality: Writers and readers can concurrently read and write without stepping on each other. Partial writes are never visible to readers.
  • Incremental updates: You can apply just the delta to your dataset, which can save a huge amount of resources. Think about re-loading 200 GB of data every day vs. just 5 GB.
  • First-class CDC-from-database support with DeltaStreamer, a tool that fetches data from a source and applies the updates to your Hudi dataset. It supports both run-once and continuous modes.
  • Supports de-duping of events, so users don't need to worry about duplicates in their dataset.
  • Handles late-arriving data in your pipeline.
  • Automatically manages small files, which could otherwise become a bottleneck for keeping your read latencies under control.
  • Assists in GDPR compliance by deleting data within a bounded SLA.
  • Easier recovery of your data with savepoints and rollback.
  • Enforces schematization of data to ensure data is structured and queryable.
  • Supports three engines: Spark, Flink, and Java.
  • Offers three query types (snapshot/real-time, read-optimized, and incremental) to cater to the needs of different users.
  • Point-in-time queries to go back in time and view your entire table snapshot as of a past instant.
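To make the incremental query type concrete, here is a minimal sketch of pulling only the records committed after a given instant via Hudi's Spark DataSource (this assumes a Spark session with the Hudi bundle on the classpath; the table path and commit time are illustrative):

```scala
// Hedged sketch: incremental query fetching only records written
// after the given begin instant (commit time is illustrative).
val deltas = spark.read
  .format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20210822000000")
  .load("s3://my-bucket/hudi/trips")   // hypothetical table path
```

Downstream jobs can then process `deltas` instead of re-reading the whole table, which is where the 200 GB vs. 5 GB savings mentioned above comes from.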

The list goes on and on. Feel free to read more about Apache Hudi on the official site. For the purpose of this developer series, I plan to use AWS EMR, but as called out earlier, Hudi can be used with any Hadoop-compatible store.
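As a taste of what's ahead in the series, here is a minimal sketch of writing a DataFrame as a Hudi table with the Spark DataSource API (assumes a Spark session with the Hudi bundle available; the table name, field names, and paths are illustrative):

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical input data.
val df = spark.read.json("s3://my-bucket/input/trips/")

// Write it out as a Hudi table; record key, precombine, and
// partition path fields are illustrative choices.
df.write
  .format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "event_ts")
  .option("hoodie.datasource.write.partitionpath.field", "date")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/trips")
```

The precombine field is what powers the de-duping behavior listed above: when two records share a key, Hudi keeps the one with the larger precombine value.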
