An exercise in Discovery: Streaming data in the analytical world - Part 1

George Leonard

Apache Kafka, Apache Flink, Apache Iceberg on S3, Apache Paimon on HDFS, and Apache Hive with either an internal standalone metastore (DerbyDB) or an external metastore on PostgreSQL, with storage on HDFS.

(11 August 2024)

Overview

A harebrained idea… with good intentions, that ended with “some…!!!” scope creep and lots of learning along the way.

This is as informal as the work itself; it’s a blog (or make that blogs, plural), a bit longer than normal because of all the things done and learned. Reading time? Well, that varies: I read slowly, others read fast. Enjoy the time; I promise, from my viewpoint, it will be of value.

Well, originally this started as a very simple idea: let’s create some data, publish it onto some Kafka topics, sink that into a MongoDB Atlas database/collection, and then use the new MongoDB stream processing to extract some value via aggregations and push (emit) that back onto Apache Kafka topics… to be displayed in terminal windows via simple Python consumers. In other words, an end-to-end flow. Well, that was the original concept sold to the MongoDB Creator Community.

First, I realized that, due to work, this would take significantly longer than the 1 month of free Confluent Cloud access/credits, so the plan pivoted to deploying Confluent Platform locally via docker-compose.

I wanted the data to have association and relevance, not simple fake random data, so a small Golang application (the language picked just because) was created that constructs the source data from provided seed data/options. Note to full-time coders: I am aware of various improvements that could be made; I know it could be split into a basket creator and a separate payment creator and deployed as individual containers… The app was not the intent of the project, so allow me some peace ;)

The concept: simulate a day in the life of a store, following the well-known shopping basket and payments example:

  1. create a basket (constructed at a store selected at random from the set of stores defined in the seed file), comprising a random number of items (selected from the seed file) with a random quantity of each; once constructed, the basket is posted onto a salesbaskets topic, and then
  2. create a salespayment record, associated with the basket, and post it onto a separate salespayments topic (see the sketch after this list).
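
To make the above a bit more concrete, here is a minimal sketch in Go of what the generator does conceptually. To be clear, this is not the actual app from the repo: the struct fields, seed values and the Kafka client used (segmentio/kafka-go) are my own assumptions, purely for illustration.

```go
// Minimal sketch only: the struct fields, seed values and the use of
// segmentio/kafka-go are assumptions for illustration, not the repo's app.
package main

import (
	"context"
	"encoding/json"
	"math/rand"
	"time"

	"github.com/segmentio/kafka-go"
)

type BasketItem struct {
	Product  string  `json:"product"`
	Quantity int     `json:"quantity"`
	Price    float64 `json:"price"`
}

type SalesBasket struct {
	InvoiceNumber string       `json:"invoiceNumber"`
	Store         string       `json:"store"`
	Items         []BasketItem `json:"basketItems"`
}

type SalesPayment struct {
	InvoiceNumber string  `json:"invoiceNumber"`
	Paid          float64 `json:"paid"`
	PayTimestamp  int64   `json:"payTimestamp"`
}

func main() {
	// Stand-ins for the seed file: stores and products to pick from at random.
	stores := []string{"Store A", "Store B"}
	products := []BasketItem{{Product: "Milk", Price: 12.50}, {Product: "Bread", Price: 18.00}, {Product: "Eggs", Price: 35.00}}

	basketWriter := &kafka.Writer{Addr: kafka.TCP("localhost:9092"), Topic: "salesbaskets"}
	paymentWriter := &kafka.Writer{Addr: kafka.TCP("localhost:9092"), Topic: "salespayments"}
	defer basketWriter.Close()
	defer paymentWriter.Close()

	// 1. Build a basket: random store, random items, random quantity of each.
	basket := SalesBasket{InvoiceNumber: "INV-0001", Store: stores[rand.Intn(len(stores))]}
	total := 0.0
	for i := 0; i < 1+rand.Intn(len(products)); i++ {
		item := products[rand.Intn(len(products))]
		item.Quantity = 1 + rand.Intn(5)
		basket.Items = append(basket.Items, item)
		total += float64(item.Quantity) * item.Price
	}
	payload, _ := json.Marshal(basket)
	_ = basketWriter.WriteMessages(context.Background(),
		kafka.Message{Key: []byte(basket.InvoiceNumber), Value: payload})

	// 2. Post the associated payment onto its own topic.
	payment := SalesPayment{InvoiceNumber: basket.InvoiceNumber, Paid: total, PayTimestamp: time.Now().UnixMilli()}
	payload, _ = json.Marshal(payment)
	_ = paymentWriter.WriteMessages(context.Background(),
		kafka.Message{Key: []byte(payment.InvoiceNumber), Value: payload})
}
```

The real app loops over something like this for a full simulated trading day, with the stores, products and quantities all driven by the seed file.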

At this point we have 2 input streams simulating a store of some kind. The idea: move the data along, create some real-time aggregations, sink it into a long-term persistent data store (MongoDB) with further enrichment, dashboard/chart the various values, and make it available for downstream consumers.
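
Just to illustrate the kind of value being extracted (the aggregations in this series are done with stream processing, as described above), here is a hedged sketch using the official MongoDB Go driver that groups basket value per store over a hypothetical salesbaskets collection; the database, collection and field names are assumptions.

```go
// Sketch only: the database/collection/field names here are assumptions, and
// the series does its aggregations in the streaming layer; this just shows
// the shape of one such aggregation against MongoDB directly.
package main

import (
	"context"
	"fmt"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	coll := client.Database("retail").Collection("salesbaskets")

	// Per store: total basket value and number of basket line items.
	pipeline := mongo.Pipeline{
		{{Key: "$unwind", Value: "$basketItems"}},
		{{Key: "$group", Value: bson.D{
			{Key: "_id", Value: "$store"},
			{Key: "totalValue", Value: bson.D{{Key: "$sum", Value: bson.D{
				{Key: "$multiply", Value: bson.A{"$basketItems.quantity", "$basketItems.price"}},
			}}}},
			{Key: "lineItems", Value: bson.D{{Key: "$sum", Value: 1}}},
		}}},
	}

	cur, err := coll.Aggregate(ctx, pipeline)
	if err != nil {
		log.Fatal(err)
	}
	defer cur.Close(ctx)

	for cur.Next(ctx) {
		var row bson.M
		if err := cur.Decode(&row); err != nil {
			log.Fatal(err)
		}
		fmt.Println(row)
	}
}
```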

Added some more scope creep… how about using Apache Iceberg tables as the OTF (Open Table Format), writing the data out as Apache Parquet files, hosted on AWS S3-compatible object storage (provided locally using a MinIO container)?

Also included, or make that required, is a catalog store, provided by Apache Hive (with a standalone Metastore backed by a PostgreSQL DB). In other words, the current HOT topic, the go-to Lakehouse architecture.

But then, more scope creep; it never ends. I decided to have a look at the Apache Paimon table format, utilising HDFS for storage and storing the data as either Avro, Parquet or ORC. We're now talking streaming lakehouse, also known as a streamhouse. Very new!!!

Note: there is a master README.md in the root directory, and most of the subdirectories have README.md files too; they include further explanation and notes that might be helpful/insightful, or well, perhaps even confusing.

This document is built on top of a lot of work by other people. Their work lays the initial groundwork and introduction to the various technologies used.

This document by no means takes anything away from their amazing work; this was purely me taking it further, for my own benefit/education.

Where desired/required, I expanded by making my payloads more complex to see what impact that has (e.g. the unnesting step of the salescompleted basket to flatten it for additional aggregations, sketched below).
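
For anyone unfamiliar with that unnesting step: the idea is simply to explode the nested basket into one record per line item so it can be aggregated per product or per store. In the pipeline itself this flattening happens in the streaming layer, but here is a rough Go illustration of the concept, reusing the hypothetical types from the earlier sketch and assuming salescompleted is the basket joined with its payment.

```go
// Illustration of the unnest/flatten idea only; in the pipeline this happens
// in the streaming layer, not in Go. SalesBasket/BasketItem are the
// hypothetical types from the earlier sketch; SalesCompleted assumes the
// basket has been joined with its payment.
type SalesCompleted struct {
	SalesBasket
	Paid         float64 `json:"paid"`
	PayTimestamp int64   `json:"payTimestamp"`
}

type FlattenedSale struct {
	InvoiceNumber string
	Store         string
	Product       string
	Quantity      int
	LineTotal     float64
}

// flattenSale explodes one nested salescompleted record into one row per
// line item, ready for per-product or per-store aggregations.
func flattenSale(s SalesCompleted) []FlattenedSale {
	rows := make([]FlattenedSale, 0, len(s.Items))
	for _, item := range s.Items {
		rows = append(rows, FlattenedSale{
			InvoiceNumber: s.InvoiceNumber,
			Store:         s.Store,
			Product:       item.Product,
			Quantity:      item.Quantity,
			LineTotal:     float64(item.Quantity) * item.Price,
		})
	}
	return rows
}
```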

As it stands, this will be a 6-part posting, but it’s by no means complete.

Good luck; this is all fraught with rabbit holes, so many, and you can disappear down them so easily…

Edit: this is purely an exploration of the various technologies in the article (starting from the original MongoDB Atlas + Change Streams intent); it is not a template for a target project, as there are bucketloads of overlap in what the tech can do… and that was part of the discovery: where the overlap is, and how good/bad/complicated/problematic it is.

See my Git repo for the entire document and code.

About Me

I’m a techie, a technologist, always curious, and I love data. For as long as I can remember I’ve worked with data in one form or another: database admin, database product lead, data platforms architect, infrastructure architect hosting databases, backing them up, optimising performance, accessing them. Data, data, data… it makes the world go round.

In recent years, I’ve pivoted into a more generic Technology Architect role, capable of full-stack architecture.

George Leonard

georgelza@gmail.com
