An exercise in Discovery: Streaming data in the analytical world - Part 5

George Leonard

Apache Kafka, Apache Flink, Apache Iceberg on S3, Apache Paimon on HDFS, and Apache Hive with either an internal standalone metastore (DerbyDB) or an external one on PostgreSQL, with storage on HDFS.

(See: Part 4)

(15 August 2024)

Data Persistence

Well, once we’ve calculated everything, we now want to store the data.

One way is to publish all the data back onto the Kafka cluster as topics and then utilize the Kafka Connect framework; with it we can push any value/topic from Kafka into a data store such as MongoDB Atlas.
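As a rough sketch of that route (a hypothetical example, not my exact setup: the connector name, topic, Atlas URI, database, and collection are all invented), a MongoDB sink connector can be registered straight from ksqlDB:

```sql
-- Hypothetical ksqlDB sketch: register a MongoDB Atlas sink connector
-- through the Kafka Connect framework. All names/URIs below are
-- placeholders for illustration only.
CREATE SINK CONNECTOR `mongo-sink` WITH (
  "connector.class" = 'com.mongodb.kafka.connect.MongoSinkConnector',
  "topics"          = 'calculated_values',                        -- assumed source topic
  "connection.uri"  = 'mongodb+srv://user:pass@cluster0.mongodb.net', -- assumed Atlas URI
  "database"        = 'analytics',
  "collection"      = 'calculated_values'
);
```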

Another option, more suited to the data warehouse/data lake/lakehouse/analytics world, is to push it directly from Apache Flink.

To explore this further, I ended up doing a few additional mini PoVs:

  1. Apache Kafka => Apache KSQL => Connect Framework => Storage (sketched above).
  2. Apache Flink pushing into Apache Iceberg tables, with storage provided on AWS S3 via a local MinIO container (a sketch follows this list).
  3. Apache Flink pushing into Apache Paimon-based tables, with storage provided on Apache Hadoop DFS (HDFS) via a local Hadoop cluster deployed as containers.
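For the Iceberg PoV, the Flink-side catalog could look roughly like this. This is a minimal sketch only; the warehouse bucket/path, MinIO endpoint, and credentials are assumptions, not the exact setup from my repo:

```sql
-- Hypothetical Flink SQL: an Iceberg catalog whose warehouse lives on
-- "S3", here a local MinIO container standing in for AWS S3.
CREATE CATALOG iceberg_cat WITH (
  'type'                 = 'iceberg',
  'catalog-type'         = 'hadoop',                      -- file-based catalog, no metastore
  'warehouse'            = 's3://warehouse/iceberg',      -- assumed bucket/path
  'io-impl'              = 'org.apache.iceberg.aws.s3.S3FileIO',
  's3.endpoint'          = 'http://minio:9000',           -- assumed local MinIO endpoint
  's3.path-style-access' = 'true'                         -- MinIO prefers path-style URLs
);
```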

For all of these, the file format can be selected as either Avro, Parquet, or ORC.

Lesson: little catch here. You can set a default file format by specifying it when you create the catalog (which simplifies the CTAS statements), or you can additionally override it at table create time.
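To make that concrete, here is a minimal Paimon-flavoured sketch (catalog name, HDFS warehouse path, and table/source names are all hypothetical): the catalog declares a default file format, and one table still overrides it at create time.

```sql
-- Hypothetical Flink SQL: a Paimon catalog with a catalog-wide default
-- file format, overridden for one table at create time.
CREATE CATALOG paimon_cat WITH (
  'type'                      = 'paimon',
  'warehouse'                 = 'hdfs://namenode:9000/paimon', -- assumed HDFS path
  'table-default.file.format' = 'parquet'                      -- default for new tables
);

USE CATALOG paimon_cat;

-- Inherits the catalog default (parquet), which keeps the CTAS short...
CREATE TABLE trades_parquet AS SELECT * FROM source_trades;

-- ...while this table overrides the format at create time.
CREATE TABLE trades_avro WITH ('file.format' = 'avro')
AS SELECT * FROM source_trades;
```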

As we’re working on Apache Paimon, Apache Parquet is generally accepted as the “industry” default Open Table Format (OTF).

But, but, but… Parquet does not seem to handle complex (nested) JSON objects. I don't know if this is documented anywhere; I just figured it out by trial and error. Avro and ORC seem to work, and all three formats did fine with a flat table. TO BE CONFIRMED!
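For what it's worth, this is roughly the shape that tripped Parquet up for me. A hypothetical repro sketch (table and column names invented, and the behaviour itself still to be confirmed, as noted above):

```sql
-- Hypothetical Paimon table with a complex, JSON-like nested column.
-- With 'file.format' = 'parquet' the nested writes misbehaved for me;
-- 'avro' and 'orc' handled the same shape. (TO BE CONFIRMED.)
CREATE TABLE nested_events (
  id      BIGINT,
  payload ROW<name STRING, tags ARRAY<STRING>>  -- nested/complex structure
) WITH (
  'file.format' = 'avro'  -- swap to 'parquet' to reproduce the issue
);
```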

See my Git repo for the entire article and the accompanying code.

About Me

I’m a techie, a technologist, always curious, love data, have for as long as I can remember always worked with data in one form or the other, Database admin, Database product lead, data platforms architect, infrastructure architect hosting databases, backing it up, optimizing performance, accessing it. Data data data… it makes the world go round.

In recent years, I've pivoted into a more generic Technology Architect role, capable of full-stack architecture.

George Leonard

georgelza@gmail.com
