Read the full article for free at: https://www.arecadata.com.
High-performance open-source Data Lakehouse at home
Ever wanted to deploy your own Data Lake and, on top of it, a so-called Lakehouse architecture? The good news is that it’s now easier than ever with tools like Minio, Trino (with its multitude of connectors), and others. In this article we’ll cover how these components actually fit together to form a “Data Lakehouse”, and we’ll deploy an MVP version via Docker on our own machine to run some analytical queries.
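To give a feel for how small the MVP footprint is, here is a minimal sketch of a Docker Compose file with just the two core services. It assumes the official `minio/minio` and `trinodb/trino` images; the linked repository contains the full setup, including the catalog configuration that wires Trino to MinIO.

```yaml
# docker-compose.yml — minimal sketch, not the full stack from the linked repo
version: "3.9"
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minio
      MINIO_ROOT_PASSWORD: minio123
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
  trino:
    image: trinodb/trino
    ports:
      - "8080:8080"   # Trino coordinator / web UI
    # In a real setup, catalog properties files (e.g. an Iceberg catalog
    # pointing at MinIO) would be mounted under /etc/trino/catalog here.
```

With a file like this, `docker compose up -d` brings up an S3-compatible object store and a query engine side by side; everything else is configuration.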
Code showcased is available here: https://github.com/danthelion/trino-minio-iceberg-example
Data Lake? Lakehouse? What the hell?
The term “Data Lakehouse” was coined by Databricks. In short, a Data Lakehouse is an architecture that enables efficient and secure Artificial Intelligence (AI) and Business Intelligence (BI) directly on vast amounts of data stored in Data Lakes.
Basically, if you have a ton of files lying around in an object store such as S3 and you would like to run complex analytical queries over them, a Lakehouse can help you achieve exactly that.
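To make “complex analytical queries over files in object storage” concrete, here is a hypothetical Trino session against an Iceberg catalog backed by a MinIO bucket. It assumes a running stack like the one in the linked repository; the catalog name `iceberg`, schema `lakehouse`, and bucket `lakehouse` are illustrative, not part of the original setup.

```sql
-- Create a schema whose tables live in a MinIO bucket
CREATE SCHEMA iceberg.lakehouse
WITH (location = 's3a://lakehouse/');

-- Create an Iceberg table; the data files land in object storage
CREATE TABLE iceberg.lakehouse.events (
    user_id    BIGINT,
    event_type VARCHAR,
    ts         TIMESTAMP
);

INSERT INTO iceberg.lakehouse.events
VALUES (1, 'click', TIMESTAMP '2022-01-01 00:00:00');

-- Query it like any ordinary SQL table
SELECT event_type, count(*) AS cnt
FROM iceberg.lakehouse.events
GROUP BY event_type;
```

The point is that once the table format (Iceberg) and the engine (Trino) agree on the metadata, plain SQL is all you need — the fact that the data is a pile of files in a bucket becomes an implementation detail.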