Streaming Change Events Building Iceberg(s)

Tim Spann
Cloudera
Published Dec 15, 2023

CDC, Debezium, GoldenGate, Change Data Capture, Real-Time Events, Apache Iceberg, Apache NiFi, Cloudera DataFlow, Cloudera Data Platform

Source: https://github.com/tspannhw/FLaNK-Ice/tree/main


Apache Iceberg is a high-performance format for huge analytic tables, ideal for building Open Data Lakehouses. The Cloudera Apache NiFi PutIcebergCDC processor applies CDC (Change Data Capture) operations to Iceberg tables using the Hive Iceberg catalog.

Supported operation types

  • c (Debezium create) or I (GoldenGate insert): the record specified in the after field is inserted into Iceberg.
  • d (Debezium delete) or D (GoldenGate delete): the record specified in the before field is deleted from Iceberg.
  • u (Debezium update) or U (GoldenGate update): the record specified in the before field is replaced with the new content specified in the after field.
  • r (Debezium read): read records are handled as create records, so they are inserted into Iceberg.
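
To make the operation semantics above concrete, here is a minimal Python sketch that applies the same rules to an in-memory dictionary standing in for a table. It is not how PutIcebergCDC is implemented; the function name apply_change, the OP_ACTIONS map, and the "id" key column are all hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Optional

Row = Dict[str, Any]

# Map operation codes to a common action, following the list above.
# Debezium: c, u, d, r. GoldenGate: I, U, D.
OP_ACTIONS = {
    "c": "insert", "I": "insert", "r": "insert",
    "u": "update", "U": "update",
    "d": "delete", "D": "delete",
}


def apply_change(table: Dict[Any, Row], op: str,
                 before: Optional[Row], after: Optional[Row],
                 key: str = "id") -> None:
    """Apply one change event to `table` (a dict keyed by the `key` column).

    `key` is an assumed primary-key column, not something the processor requires.
    """
    action = OP_ACTIONS.get(op)
    if action is None:
        raise ValueError(f"unsupported operation type: {op!r}")
    if action == "insert":
        # Create and read events carry the new row in `after`.
        table[after[key]] = dict(after)
    elif action == "delete":
        # Delete events identify the row through `before`.
        table.pop(before[key], None)
    else:
        # Update: remove the `before` row, then write the `after` row.
        table.pop(before[key], None)
        table[after[key]] = dict(after)


table: Dict[Any, Row] = {}
apply_change(table, "c", None, {"id": 1, "name": "a"})
apply_change(table, "u", {"id": 1, "name": "a"}, {"id": 1, "name": "b"})
apply_change(table, "r", None, {"id": 2, "name": "c"})
apply_change(table, "D", {"id": 2, "name": "c"}, None)
print(table)  # {1: {'id': 1, 'name': 'b'}}
```

Treating an update as delete-then-insert mirrors the description above, where the before record is replaced by the after record.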

Let’s start landing data in our Open Data Lakehouse. This is critical for many use cases and for powering our Generative AI applications.

NiFi to Iceberg

Depending on where your table is stored, you may need to set some permissions.

Here is an example. To give a user write access to an Iceberg table, we need to do two things:

  • Create a Ranger policy that grants write access to the table object.
  • Create a Ranger policy that grants “RW Storage” access to the storage URL.

For the storage policy, set:

Storage type: iceberg
Storage URL: DBname/Table*, or
Storage URL: DBname/*

Cloudera Data Platform Group Rules

  • DataCatalogCspRuleViewer
  • DFCatalogAdmin
  • DFCatalogViewer
  • DFFunctionMachineUser
  • EnvironmentUser

[Screenshots: data to Iceberg tables; querying tables as regular SQL tables; create your table]
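
The table definition itself is not shown here, so the following is only a hedged sketch of what the "create your table" step might look like. It builds a Hive-style DDL statement using STORED BY ICEBERG. The helper name iceberg_ddl, the table demo_db.customers, and its columns are hypothetical. Setting format-version to 2 is an assumption on my part: Iceberg row-level deletes need the v2 table format, which CDC deletes and updates would most likely rely on.

```python
from typing import Dict


def iceberg_ddl(table: str, columns: Dict[str, str], format_version: int = 2) -> str:
    """Build a Hive CREATE TABLE statement for an Iceberg table.

    `table` and `columns` are illustrative inputs, not names from the article.
    """
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "STORED BY ICEBERG\n"
        f"TBLPROPERTIES ('format-version'='{format_version}')"
    )


# Hypothetical table used only for this example.
ddl = iceberg_ddl("demo_db.customers", {"id": "BIGINT", "name": "STRING"})
print(ddl)
```

Once a table like this exists, it can be queried like any regular SQL table, for example with SELECT * FROM demo_db.customers.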

RESOURCES

https://www.youtube.com/watch?v=R2T6_eOnV8Y

https://www.youtube.com/watch?v=Q9Cys_N4iQQ

https://www.youtube.com/watch?v=aPSG8hmzbmc

Tim Spann
Cloudera

Principal Developer Advocate, Zilliz. Milvus, Attu, Towhee, GenAI, Big Data, IoT, Deep Learning, Streaming, Machine Learning. https://www.datainmotion.dev/