Iceberg + dbt + Trino + Hive: a modern, open-source data stack

Stefentaime
4 min read · Mar 7, 2024

To understand how the modern, open-source data stack of Iceberg, dbt, Trino, and Hive operates within a music streaming platform, let's walk through the workflow and the benefits each component brings.

Data Ingestion and Quality Assurance with Schema Contracts

The journey begins with data production: every interaction on the music platform is captured as an event and published to a Kafka topic. Schema contracts ensure that all data adheres to a predefined structure, enforcing quality and integrity from the outset. For example, auth_events include timestamps, session details, and user information; listen_events record which songs are played, along with artist and duration; and page_view_events capture web navigation data.
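As a rough sketch of what such a contract check might look like, the snippet below validates a listen_events record against a declared schema before it is produced. The field names and types here are illustrative assumptions, not the platform's actual contract; real deployments would typically use Avro or Protobuf with a schema registry instead.

```python
import json

# Hypothetical contract for listen_events (field names are illustrative).
LISTEN_EVENTS_SCHEMA = {
    "ts": int,          # event timestamp (epoch milliseconds)
    "user_id": str,
    "artist": str,
    "song": str,
    "duration": float,  # seconds listened
}

def validate(event: dict, schema: dict) -> list:
    """Return a list of contract violations; an empty list means the event conforms."""
    errors = []
    for field, expected_type in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for {field}: {type(event[field]).__name__}")
    return errors

good = {"ts": 1709769600000, "user_id": "u42", "artist": "Radiohead",
        "song": "Karma Police", "duration": 263.0}
bad = {"ts": "not-a-number", "user_id": "u42"}

print(validate(good, LISTEN_EVENTS_SCHEMA))   # no violations
print(validate(bad, LISTEN_EVENTS_SCHEMA))    # wrong type + missing fields
```

Rejecting (or dead-lettering) non-conforming events at produce time keeps bad records out of every downstream layer, which is the point of defining the contract at the start of the pipeline rather than at query time.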

Kafka and Kafka Connect for Real-Time Data Streaming

Kafka plays a crucial role in managing real-time data streams, efficiently handling high throughput and providing fault tolerance.
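One mechanism behind this throughput and ordering is Kafka's key-based partitioning: a producer hashes each message key to pick a partition, so all events for the same key (here, a user) land on the same partition in order. The sketch below mimics that routing with a simplified hash for illustration; real clients use murmur2, so actual partition numbers will differ, and the topic name and key choice are assumptions.

```python
import json

NUM_PARTITIONS = 6  # assumed partition count for an illustrative topic

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Deterministic stand-in for the client's key hasher (not murmur2).
    return sum(key) % num_partitions

def serialize(event: dict) -> tuple:
    """Build the (key, value) byte pair a producer would send to a topic."""
    key = event["user_id"].encode("utf-8")
    value = json.dumps(event).encode("utf-8")
    return key, value

events = [
    {"user_id": "u42", "song": "Karma Police"},
    {"user_id": "u42", "song": "No Surprises"},
    {"user_id": "u7",  "song": "Clair de Lune"},
]

partitions = [partition_for(serialize(e)[0]) for e in events]
print(partitions)  # both u42 events map to the same partition
```

Because each partition is replicated across brokers, a broker failure promotes a follower replica to leader and consumption continues, which is where the fault tolerance mentioned above comes from.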

