Data Lakehouse with Apache Iceberg in Kubernetes (Minikube) — Ideation

kurangdoa
3 min read · Jun 15, 2024


Intro

One morning, I checked my email and got an invoice from one of the cloud providers. Not much, but enough to ruin my day. The silly reason: I forgot to turn off an engine that was running in the cloud doing nothing. :-)

Photo by Jp Valery on Unsplash

So, long story short, that personal project was about building my own data lakehouse for dashboarding, ML, or AI use cases. Since I work 9–5, building this personal project only happened during the night (and only if I was in the mood). So it would be a hassle turning the machine off and on again, etc.

After that unfortunate event, I got an idea: why not build the data lakehouse locally first, and deploy it to the cloud later?

Blueprint

Feeling inspired, I laid out my plan. Don’t worry, I will spare you the burden of the process and skip straight to this diagram as the outcome.

The architecture is pretty simple:

Storage Layer to store your data and catalog it in a certain format and standard (Iceberg).

Compute Layer to run computations on top of the data in the storage layer.

Client Layer to interact with the compute layer and send instructions.

Admin Layer to deploy the infrastructure and do admin tasks.
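One way to keep these layers tidy inside a single cluster is to give each layer its own Kubernetes namespace. A minimal sketch — the namespace names here are my own illustration, not something the architecture mandates:

```shell
# One namespace per layer (names are illustrative, pick your own)
kubectl create namespace storage   # e.g. MinIO, Nessie catalog
kubectl create namespace compute   # e.g. Spark
kubectl create namespace client    # e.g. JupyterHub, Python clients
```

This keeps each micro-service's resources grouped and easy to delete or redeploy independently.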

Installation

As you can see, there are a couple of components to stitch together. I won’t go into more detail in this post, but I do want to explain the prerequisites and my environment. I use a Mac M1 Pro running macOS Sonoma with the additional software below.

!! warning: please be aware that Minikube will create a node inside Docker on which Kubernetes is installed. This makes things more complicated and less straightforward than a direct installation.

After you install all of the software above, and if you can see the version output as below, Minikube is ready to be used.

minikube version
minikube version: v1.33.1
commit: 248d1ec5b3f9be5569977749a725f47b018078ff
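The other prerequisites can be sanity-checked the same way. Assuming Docker, kubectl, and Helm are among the additional software mentioned above (my guess at the usual toolchain for this setup), a quick check looks like:

```shell
docker --version          # Docker must be running for the docker driver
kubectl version --client  # client-side version only, no cluster needed yet
helm version --short      # if you plan to install charts with Helm
```

If any of these commands fails, fix that installation before starting the cluster.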

Create Kubernetes Cluster

We are going to use only one Kubernetes cluster and run all of the micro-services from the diagram above on it. To create a cluster called “datasaku-cluster”, simply type the command below:

minikube start -p datasaku-cluster \
--disk-size 20000mb \
--driver docker \
--memory=max \
--cpus=max

You can also follow this link https://minikube.sigs.k8s.io/docs/handbook/kubectl/ to link Minikube and kubectl.
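For example, Minikube ships its own kubectl, so even before linking anything you can run commands through it, or alias it so that plain `kubectl` works in your shell:

```shell
# Use the kubectl bundled with Minikube against our profile
minikube kubectl -p datasaku-cluster -- get pods -A

# Or alias it for convenience in the current shell session
alias kubectl="minikube kubectl -p datasaku-cluster --"
```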

You can now SSH into the node that Minikube created in Docker:

minikube ssh -p datasaku-cluster
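Once inside the node, a few standard Linux commands let you confirm it got the resources you asked for at start-up:

```shell
df -h    # disk available to the node (we requested ~20 GB)
free -m  # memory in MB
nproc    # CPU count
exit     # leave the node
```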

Also, let’s enable the add-ons we need.

minikube addons -p datasaku-cluster enable ingress 
minikube addons -p datasaku-cluster enable ingress-dns
minikube addons -p datasaku-cluster enable storage-provisioner
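To confirm the add-ons actually switched on, list them and look for the enabled status:

```shell
# Filter the addon list for the three we just enabled
minikube addons list -p datasaku-cluster | grep -E "ingress|storage-provisioner"
```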

Since I am using macOS, we need to tunnel into Minikube to reach services from the host. You will want to keep this window open so the tunnel stays up.

minikube tunnel -p datasaku-cluster
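In another window, you can check that the cluster and the ingress controller are healthy before deploying anything on top (this assumes your kubectl is pointed at the datasaku-cluster profile):

```shell
kubectl get nodes                  # the node should report Ready
kubectl get pods -n ingress-nginx  # ingress controller pods should be Running
```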

Now the Kubernetes cluster is ready for deploying the micro-services. In the following posts, I will explain them one by one:

  • PostgreSQL-HA in Kubernetes (Minikube)
  • MinIO in Kubernetes (Minikube)
  • Nessie in Kubernetes (Minikube)
  • Spark in Kubernetes (Minikube)
  • JupyterHub in Kubernetes (Minikube)
  • Python in Kubernetes (Minikube)

Conclusion

Deploying a data lakehouse in Kubernetes has benefits such as better provisioning and easier scaling. However, the complexity can sometimes be annoying. In the end, whichever path you choose, building a data lakehouse with Iceberg is a really fun experiment.

Github: https://github.com/kurangdoa/lakehouse_iceberg/tree/main
