Data Lakehouse with Apache Iceberg in Kubernetes (Minikube) — Ideation

kurangdoa
3 min read · Jun 15, 2024


Intro

One morning, I checked my email and got an invoice from one of the cloud providers. Not much, but enough to ruin my day. The silly reason: I forgot to turn off an engine that was running in the cloud doing nothing. :-)

Photo by Jp Valery on Unsplash

So, long story short, that personal project was about building my own data lakehouse for dashboarding, ML, or AI use cases. Since I work 9–5, building this personal project only happened during the night (and only if I was in the mood). So it would be a hassle turning the machine off and on again, etc.

After that unfortunate event, I got an idea: why not build the data lakehouse locally first, and deploy it to the cloud later?

Blueprint

Feeling inspired, I laid out my plan. Don’t worry, I will spare you the burden of the process and skip straight to this diagram as the outcome.

The architecture is pretty simple:

Storage Layer to store your data and catalog it in a certain format and standard (Iceberg).

Compute Layer to run computations on top of the data in the storage layer.

Client Layer to interact with the compute layer and send instructions.

Admin Layer to deploy the infrastructure and do admin tasks.
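One way to keep these layers tidy inside a single cluster is to give each layer its own Kubernetes namespace. A minimal sketch — the namespace names here are my own illustration, not something the architecture mandates:

```shell
# One namespace per layer (names are illustrative, pick your own)
kubectl create namespace storage   # e.g. MinIO, Nessie catalog
kubectl create namespace compute   # e.g. Spark
kubectl create namespace client    # e.g. JupyterHub, Python clients
```

This keeps each micro-service's resources grouped and easy to delete or redeploy independently.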

Installation

As you can see, there are a couple of components to stitch together. I won’t go into more detail in this post, but I do want to explain the prerequisites and my environment. I use a Mac M1 Pro running macOS Sonoma with the additional software below.

!! warning: please be aware that Minikube will create a node inside Docker on which Kubernetes is installed. This makes things more complicated and less straightforward than a direct installation.

After you install all of the software above, and if you can see the version output as below, Minikube is ready to be used.

minikube version
minikube version: v1.33.1
commit: 248d1ec5b3f9be5569977749a725f47b018078ff
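The other prerequisites can be sanity-checked the same way. Assuming Docker, kubectl, and Helm are among the additional software mentioned above (my guess at the usual toolchain for this setup), a quick check looks like:

```shell
docker --version          # Docker must be running for the docker driver
kubectl version --client  # client-side version only, no cluster needed yet
helm version --short      # if you plan to install charts with Helm
```

If any of these commands fails, fix that installation before starting the cluster.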

Create Kubernetes Cluster

We are going to use only one Kubernetes cluster and run all of the micro-services from the diagram above on it. To create a cluster called “datasaku-cluster”, simply type the command below:

minikube start -p datasaku-cluster \
--disk-size 20000mb \
--driver docker \
--memory=max \
--cpus=max

You can also follow this link https://minikube.sigs.k8s.io/docs/handbook/kubectl/ to link Minikube and kubectl.
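For example, Minikube ships its own kubectl, so even before linking anything you can run commands through it, or alias it so that plain `kubectl` works in your shell:

```shell
# Use the kubectl bundled with Minikube against our profile
minikube kubectl -p datasaku-cluster -- get pods -A

# Or alias it for convenience in the current shell session
alias kubectl="minikube kubectl -p datasaku-cluster --"
```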

You can now SSH into the node that Minikube created in Docker:

minikube ssh -p datasaku-cluster
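Once inside the node, a few standard Linux commands let you confirm it got the resources you asked for at start-up:

```shell
df -h    # disk available to the node (we requested ~20 GB)
free -m  # memory in MB
nproc    # CPU count
exit     # leave the node
```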

Also, let’s enable the add-ons we need.

minikube addons -p datasaku-cluster enable ingress 
minikube addons -p datasaku-cluster enable ingress-dns
minikube addons -p datasaku-cluster enable storage-provisioner
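To confirm the add-ons actually switched on, list them and look for the enabled status:

```shell
# Filter the addon list for the three we just enabled
minikube addons list -p datasaku-cluster | grep -E "ingress|storage-provisioner"
```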

Since I am using macOS, we need to tunnel into Minikube to reach services from the host. You will want to keep this window open so the tunnel stays up.

minikube tunnel -p datasaku-cluster
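In another window, you can check that the cluster and the ingress controller are healthy before deploying anything on top (this assumes your kubectl is pointed at the datasaku-cluster profile):

```shell
kubectl get nodes                  # the node should report Ready
kubectl get pods -n ingress-nginx  # ingress controller pods should be Running
```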

Now the Kubernetes cluster is ready for deploying the micro-services. In the following posts, I will explain them one by one:

  • PostgreSQL-HA in Kubernetes (Minikube)
  • MinIO in Kubernetes (Minikube)
  • Nessie in Kubernetes (Minikube)
  • Spark in Kubernetes (Minikube)
  • JupyterHub in Kubernetes (Minikube)
  • Python in Kubernetes (Minikube)

Conclusion

Deploying a data lakehouse in Kubernetes has benefits such as better provisioning and easier scaling. However, the complexity can sometimes be annoying. In the end, whichever path you choose, building a data lakehouse with Iceberg is a really fun experiment.

Github: https://github.com/kurangdoa/lakehouse_iceberg/tree/main
