BigData and Kubernetes: love at first sight

A couple of years ago, I fell in love with Kubernetes. I started to play with it and began to introduce it into the software architectures I designed and implemented.

I’ve never regretted that choice: the abstraction layer provided by k8s and its flexibility helped me a lot in putting together sophisticated platforms. I don’t want to discuss the good and the bad of k8s here; I think most people agree that k8s, even with its limitations, has been a significant advancement in infrastructure management.

What I started to investigate in the last year is how k8s could be used for deploying complex distributed platforms like the ones popular in the BigData world. For the last ten years, I’ve been mainly focused on architecting BigData platforms, including installing Hadoop on big clusters, so I know the burden of installing Hadoop on tens of machines, with Kerberos and all the SSL/TLS setup.

So, I thought that k8s could be the ideal system for mitigating all that complexity. I didn’t discover anything new: there is a flourishing “kubernetization” of the most popular BigData platforms, like Cassandra, Scylla, Kudu, Presto, etc. Moreover, a BigData giant like Cloudera has embraced this approach with its new product offerings.

Nevertheless, since I’m still a hands-on architect and I love to get my hands dirty, I decided to put together a full BigData stack completely deployable on k8s. My experiment is nothing more than a toy, but perhaps it could be used to quickly provide a BigData setup for a small group of people where security is not the primary concern.

My stack contains:

  1. OpenEBS: an extremely simple to use storage layer for k8s
  2. MinIO: a great S3 clone
  3. Dremio: a powerful parallel SQL engine fully integrated with S3/MinIO
  4. Kafka: one of the best parallel message brokers around
  5. NiFi: a simple and robust data pipeline automation tool
  6. OpenWhisk: a multilanguage serverless platform
  7. JupyterHub: a multi-user server for Jupyter notebooks

I played a lot with this stack on my tiny 4-node NUC cluster at home, and it works really well. I’m also using it with Spark running on k8s, and I have run lots of experiments with NiFi, Kafka, Spark Structured Streaming, etc. I find it very nice as a small BigData lab where I can try out whatever I have in mind.

The deployment is very simple. I’m using bash scripts and Helm, the popular k8s package manager, with some level of parametrization.

You can find the entire project here on GitHub. Let me try to explain a bit better how it works.

After cloning the project, you will find a config file where you can customize a couple of things depending on your environment. I put a description of each setting in the config file; I hope it’s descriptive enough.

This config file is used by a simple templating mechanism that generates the values YAML files in the directory helm/values. In that directory, you can find a values-<platform>.yaml.template file for each target platform.

If you look at any of those templates, you will see that the variables to be replaced with values from the config file are surrounded by {{ }}.
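To make the idea concrete, here is a minimal sketch of this kind of {{ }} substitution. It is not the project's actual code (the project uses bash scripts); it is written in Python only for readability. The KEY=VALUE config format, the key names STORAGE_CLASS and MINIO_REPLICAS, and the template fragment are all assumptions made up for illustration.

```python
import re

# Hypothetical config contents; the real config file's keys and format may differ.
CONFIG_TEXT = """
# illustrative settings, not taken from the project
STORAGE_CLASS=openebs-hostpath
MINIO_REPLICAS=4
"""

# Hypothetical fragment in the spirit of a values-<platform>.yaml.template file.
TEMPLATE_TEXT = """
persistence:
  storageClass: {{ STORAGE_CLASS }}
replicas: {{MINIO_REPLICAS}}
"""

# Matches {{ NAME }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def parse_config(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def render(template, values):
    """Replace each {{ NAME }} with its config value; fail on unknown names."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for placeholder {name!r}")
        return values[name]
    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    # Print the rendered values YAML for the hypothetical template above.
    print(render(TEMPLATE_TEXT, parse_config(CONFIG_TEXT)))
```

Failing loudly on an unknown placeholder is a design choice of this sketch: a leftover {{ }} in a generated values file would otherwise only surface later, when Helm tries to use it.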

Once you have modified the config file, just running ./ should put everything together.