Stackable Demo: nifi-kafka-druid-earthquake. Data Engineering on K8s

Alex McLintock
5 min read · Dec 6, 2023

This was my third Stackable demo, done immediately after the Airflow one.

https://docs.stackable.tech/home/stable/demos/nifi-kafka-druid-earthquake-data

It is less an instruction manual and more an engineer’s log: “I got it working and here is how I did it.”

Prerequisites

stackablectl, Linux, and a working K8s cluster. See my previous articles for more info.

nifi-kafka-druid-earthquake-data

#!/bin/sh -x 
stackablectl demo install nifi-kafka-druid-earthquake-data

This took a little longer to get working. Several components did not come up with external endpoints straight away and reported problems, which looked like timeouts or race conditions. For example, one run said it was unable to download the data, yet I could confirm the data was there using a plain old web browser. I suspect the cause is simply that I am running on a not terribly fast home file server with an average-speed ADSL internet connection.

If we assume that everything Stackable reports as installed really is installed correctly, then we can retry the failed components simply by rerunning the command above. Druid in particular took several attempts to get working. Sometimes I deleted components through the K8s dashboard to make sure the install script would reinstall them.
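Rerunning by hand gets tedious on a slow connection, so here is a minimal sketch of a retry wrapper. The function name `retry` and the attempt count and delay in the example are my own illustrative choices, not part of stackablectl; it only relies on the install command returning a non-zero exit code when it fails.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# (Hypothetical helper, not provided by stackablectl.)
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while true; do
        # Stop as soon as the command succeeds
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (assumed values: up to 5 attempts, 60 seconds apart):
# retry 5 60 stackablectl demo install nifi-kafka-druid-earthquake-data
```

The delay matters more than the attempt count here: if the failures really are timeouts, giving the cluster a minute to settle between attempts is probably more useful than retrying immediately.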

Unfortunately, I still had problems with Superset. Since it is the visualisation tool in these demos, I assumed that no other software depended on it, so I decided to delete all the Stackable content relating to Superset and reinstall it.

I worked out these commands by looking at the K8s dashboard:

#!/bin/sh -x
# Remove the Helm release for the PostgreSQL database used by Superset
helm uninstall postgresql-superset
# Remove the Superset operator and the Superset node itself
kubectl delete -n stackable-operators deployment superset-operator-deployment
kubectl delete -n default statefulset superset-node-default
kubectl delete -n stackable-operators configmap superset-operator-configmap
# Remove the cluster-wide RBAC objects (these are not namespaced)
kubectl delete clusterrolebinding superset-operator-clusterrolebinding
kubectl delete clusterrole superset-clusterrole
kubectl delete clusterrole superset-operator-clusterrole

Then I reran the install command for the whole demo; stackablectl is smart enough to install only the components which are missing.
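If you would rather not click through the dashboard to find what to delete, the same search can be sketched on the command line. The helper name `list_leftovers` is hypothetical, and the resource types it checks are only the ones I ended up deleting above, so treat it as a starting point rather than a complete inventory.

```shell
#!/bin/sh
# list_leftovers NAME
# Prints every deployment, statefulset, configmap, clusterrole and
# clusterrolebinding whose listing line mentions NAME (case-insensitive).
# (Hypothetical helper; it does not cover Helm releases or other resource types.)
list_leftovers() {
    name=$1
    {
        # Namespaced resources, searched across all namespaces
        kubectl get deployments,statefulsets,configmaps --all-namespaces --no-headers
        # Cluster-wide RBAC objects
        kubectl get clusterroles,clusterrolebindings --no-headers
    } 2>/dev/null | grep -i -- "$name" || true
}

# Example:
# list_leftovers superset
```

Helm releases do not show up this way; `helm list --all-namespaces` is the place to look for those.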

I have to say that it showed nothing for a while, but when I came back to it the next day it had run all its initial data jobs. This is not great, but it would have been easier if I had been more familiar with the demo.
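In hindsight, a quick way to see whether anything is still starting up or stuck is to list the pods that are neither running nor completed. This is a small sketch, assuming the demo lives in the current namespace (mine was in `default`); the function name `pending_pods` is my own.

```shell
#!/bin/sh
# pending_pods
# Prints "NAME STATUS" for every pod in the current namespace whose status
# is neither Running nor Completed. Empty output means nothing is obviously
# stuck. (Hypothetical helper; it reads the default `kubectl get pods` columns.)
pending_pods() {
    kubectl get pods --no-headers |
        awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example: check once a minute until nothing is pending
# while [ -n "$(pending_pods)" ]; do pending_pods; sleep 60; done
```

Note that a pod can be Running without being ready, so an empty list is a hint that things have settled, not a guarantee.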

The first thing I checked was Apache NiFi. It takes a CSV file of events (i.e. earthquakes) and feeds them to Kafka as individual events.

Remember the command to see the external endpoints for your installation:

stackablectl stacklet list

Note that the list also contains the Airflow demo, simply because I hadn’t deleted it before trying this one. There is no external endpoint for ZooKeeper because, I think, it is only used internally.

Anyway, that gives me the endpoint for NiFi.

Apache NiFi

NiFi essentially gets data from one place and puts it in another.

Apache NiFi

One NiFi process for fetching the data via HTTP, and one for publishing the records to Kafka.

The image might be a little confusing: zero bytes read in, zero bytes written out? The explanation is that I looked at this more than five minutes after it executed, when it had already done its work. Note the big spike of 500 MB it processed at the beginning of the run.
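To make the “one CSV file in, individual events out” idea concrete, here is a toy sketch in shell. It is not how NiFi does it internally, and the column names in the example are made up; it assumes a simple CSV with a header row and no quoted commas, and it emits every value as a string.

```shell
#!/bin/sh
# csv_to_events
# Reads a simple CSV (header row first, no quoted commas) on stdin and prints
# one JSON object per data row, i.e. one "event" per line.
# (Illustration only; NiFi uses its own record readers and writers.)
csv_to_events() {
    awk -F',' '
        # First line: remember the column names
        NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }
        # Every other line: pair each value with its column name
        {
            line = "{"
            for (i = 1; i <= NF; i++) {
                line = line sprintf("\"%s\":\"%s\"", header[i], $i)
                if (i < NF) line = line ","
            }
            print line "}"
        }
    '
}

# Example (made-up columns):
# printf 'time,place,mag\n2023-01-01T00:00:00Z,Alaska,4.5\n' | csv_to_events
```

Each output line is the kind of self-contained record that can be published to a Kafka topic on its own, which is what lets a downstream consumer treat the file as a stream.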

Apache Druid

Apache Druid is essentially a time series database, suited to storing and processing events where every item has a timestamp. I am not yet sure what you can do with the web interface, so I don’t know whether what it shows is correct, but you can see it is up and running.

Apache Druid web interface

“Error not found” is a bit worrying, but I don’t yet know whether that is expected for this demo. Anyway, let’s move on and have a look at the S3 storage and then the data visualisation tool.
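One way to check whether Druid actually holds any data, independent of the web console, is to send a query to its SQL endpoint (`/druid/v2/sql`, which accepts a JSON body with a `query` field). I did not run this as part of the demo, so treat it as a sketch: the datasource name `earthquakes` and the `DRUID_URL` variable are assumptions, and your endpoint may need TLS or credentials depending on how the stacklet is set up.

```shell
#!/bin/sh
# druid_sql_payload QUERY
# Wraps a SQL string in the JSON body expected by Druid's SQL API,
# escaping backslashes and double quotes so the JSON stays valid.
# (Hypothetical helper; it does not handle newlines in the query.)
druid_sql_payload() {
    escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"query":"%s"}\n' "$escaped"
}

# Example (DRUID_URL is an assumption: use the router endpoint reported by
# `stackablectl stacklet list`; "earthquakes" is an assumed datasource name):
# curl -s -X POST -H 'Content-Type: application/json' \
#      -d "$(druid_sql_payload 'SELECT COUNT(*) AS events FROM "earthquakes"')" \
#      "$DRUID_URL/druid/v2/sql"
```

A non-zero count would suggest that the NiFi to Kafka to Druid path worked end to end, whatever the console is complaining about.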

MinIO

I had a look at the MinIO web interface. There seems to be one S3 bucket created, but nothing in it. This is slightly suspicious. However, since this is a streaming demo, I guess MinIO may have been included but not actually used. Or maybe I accidentally left it around from a previous demo.

My MinIO install shows there is one S3 bucket “demo” with nothing in it.

Apache Superset

Apache Superset is a dashboard and charting tool. It has progressed quite a lot in recent years and looks quite impressive. Yes, there are commercial tools like Tableau and Power BI, but I still think this is worth investigating.

Here is the initial screen you get when looking at the Superset web interface for this demo:

So we have one dashboard which consists of three charts…

Selecting one of the charts gives you the details of how it is calculated:

A sample chart in the Apache Superset application

And here is that same chart in place on a dashboard:

Earthquakes dashboard

Conclusion

All in all, not bad for one command.

Credits:

Image of earthquake map taken from Stackable ( https://docs.stackable.tech/home/stable/demos/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data ) Thanks.

Words and mistakes from Alex McLintock https://bit.ly/m/alexmclintock

Alex McLintock

Big Data Enthusiast, Analytics/DS/ML Platform Consultancy in London