Data Pipelines with OpenFaas on K3s

Yoav Nordmann
Published in Israeli Tech Radar
Apr 18, 2020 · 7 min read

Lately, I was confronted with the problem of creating a data lake. This would not be such a big task considering you can use any of the big players such as AWS, Google, Azure, and others: each of them offers some basic setup that lets you create a starting model of a data lake with minimal effort. The catch was that I was supposed to build a simple working model of a data lake without using proprietary technology.

And the game was on.

Initially, I was inspired by AWS Glue. I understood that the serverless architecture would be a great solution to my problem. But would a serverless architecture be able to handle all of my needs?

The Problem

As part of creating this data lake, I needed to create a pipeline that would receive data, manipulate it in a number of ways, and then write it to some sort of storage. For this purpose, I wanted the following capabilities:

  1. Since the input would arrive over a REST interface, I needed support for that.
  2. I wanted the input to be agnostic to the pipeline, i.e. true decoupling. This calls for some sort of Pub/Sub capability.
  3. I wanted the manipulation steps to be agnostic of each other. I did not want to create one monolithic pipeline, but rather a few small moving parts working together in unison. Again, this calls for some sort of messaging capability.
  4. I wanted the whole pipeline to be asynchronous, for scalability and high-performance purposes.

The Design

Implementing this with a serverless architecture, I would need 4 functions, one for each phase of my pipeline:

  1. Ingest data
  2. Check data compliance
  3. Convert data to a different format
  4. Write data to storage

The first function would expose a REST interface through which the producer sends data into the pipeline. Once the data is received, it would write the data to a topic on the NATS messaging service, just like in a Pub/Sub model. Each of the other 3 functions listens for events on a specific topic on the same messaging service, firing when it is its turn to perform its task in the series. Thus we have successfully created a data pipeline for persisting data into a data lake.

As an example, if I am collecting data on click-events, the pipeline would look as follows:
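Sketched out in text, using the topic names that appear later in the NATS connector configuration, the flow looks roughly like this:

REST POST --> clickevent-ingest
                  | publishes to topic clickevent.ingest
                  v
              clickevent-compliance
                  | publishes to topic clickevent.compliance
                  v
              clickevent-conversion
                  | publishes to topic clickevent.conversion
                  v
              clickevent-persistence --> storage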

As it turns out, each function writes to a topic named after the processing step it has just performed. That sounds right to me.

The Implementation

After some digging and researching, I came across the OpenFaas framework: a mature serverless framework that comes with NATS, an open-source messaging system, built in. On top of that, a friend made me aware of a really cool project called K3s, a lightweight Kubernetes distribution built for IoT and Edge computing. To help me keep an overview of my K8s cluster, I used k9s, a terminal-based UI for interacting with your Kubernetes cluster.

The technological stack was set. I was hyped, hooked and ready to rock!

The default installation suggested by Rancher uses a curl script that installs K3s as a daemon service on the computer. I preferred not to do that, so I installed the K3s binary without the installation script, as described in Rancher's well-written documentation on installation options, specifically the section “Installing K3s from the Binary”.

After installing the binary, all I had to do was run the following commands:

# sudo k3s server --write-kubeconfig-mode 644 
# export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
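
To verify that the cluster is up, a quick node listing with the kubectl bundled into K3s should do:

# k3s kubectl get nodes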

Alternatively, I could have used the K3d utility, another really cool tool, which runs a K8s cluster inside a Docker container (!!!). To run K8s in this manner, I would use the k3d binary and run the following command:

# k3d create --publish 31112:31112

More details on how to use these utilities can be found on their respective sites.

Although I installed OpenFaas using the OpenFaas project's yaml files, I do recommend using the Helm chart. The exact way of installing the OpenFaas Helm chart can be found here. In short, the following commands will install OpenFaas on your K8s cluster:

# helm repo add openfaas https://openfaas.github.io/faas-netes/
# helm repo update \
&& helm upgrade openfaas --install openfaas/openfaas \
--namespace openfaas \
--set functionNamespace=openfaas-fn
# PASSWORD=$(kubectl -n openfaas get secret basic-auth -o jsonpath="{.data.basic-auth-password}" | base64 --decode) && \
echo "OpenFaaS admin password: $PASSWORD"

After a successful installation, you can use the password to log in to the OpenFaas administration console at http://localhost:31112

Another utility you will need is faas-cli, the official CLI for OpenFaas, which is used to build and deploy functions. It can be installed quite easily, and in order to get faas-cli working you need to set the OPENFAAS_URL environment variable like so:

export OPENFAAS_URL=http://localhost:31112
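
You will also need to authenticate the CLI against the gateway using the admin password retrieved earlier; something along these lines should do it:

# echo -n $PASSWORD | faas-cli login --username admin --password-stdin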

Functions

And now the real fun begins. I thought that Python was a natural fit for my purposes, so I went with that language.

For the data ingestion function, we first need to pull a newer OpenFaas template:

faas-cli template pull https://github.com/openfaas-incubator/python-flask-template

And now we create the function:

faas-cli new --lang python3-http clickevent-ingest

For the other three functions, we use the basic python3 template, as they do not listen for HTTP events but rather for events on a NATS topic:

# faas-cli new --lang python3 clickevent-compliance 
# faas-cli new --lang python3 clickevent-conversion
# faas-cli new --lang python3 clickevent-persistence

At this point, you will also have 4 different yaml files in the same folder, each one named after the function it was created for. I suggest you merge those files into one yaml file and call it “clickevent-pipeline.yml”. Don't worry, you can still manipulate individual functions even though they are defined in the same yaml file.
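For reference, the merged stack file might look roughly like this (the image names are placeholders for your own registry; the topic annotations described further down go under each function that listens on NATS):

version: 1.0
provider:
  name: openfaas
  gateway: http://localhost:31112
functions:
  clickevent-ingest:
    lang: python3-http
    handler: ./clickevent-ingest
    image: clickevent-ingest:latest
  clickevent-compliance:
    lang: python3
    handler: ./clickevent-compliance
    image: clickevent-compliance:latest
  clickevent-conversion:
    lang: python3
    handler: ./clickevent-conversion
    image: clickevent-conversion:latest
  clickevent-persistence:
    lang: python3
    handler: ./clickevent-persistence
    image: clickevent-persistence:latest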

NATS configuration

In order to work with NATS the way I intended, the following steps are necessary:

  1. Add a NATS client dependency to the functions
  2. Connect the functions to listen to specific topics
  3. Configure NATS to invoke a function for specific topics

NATS client

I used the official NATS Python client, asyncio-nats-client. This meant I had to add the following line to the requirements.txt file of each function:

asyncio-nats-client==0.10.0

Using this API, I was able to connect and send messages to NATS.
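
As a minimal sketch, assuming the default in-cluster NATS address used by OpenFaas (nats://nats.openfaas:4222), publishing a message from a function can look like this:

import asyncio
import json

from nats.aio.client import Client as NATS

NATS_URL = "nats://nats.openfaas:4222"  # assumed default OpenFaas NATS service; adjust if yours differs

async def publish(topic, payload):
    # connect, publish the JSON-encoded payload to the given topic, then close cleanly
    nc = NATS()
    await nc.connect(servers=[NATS_URL])
    await nc.publish(topic, json.dumps(payload).encode())
    await nc.flush()
    await nc.close()

# called from the synchronous OpenFaas handler, for example:
# asyncio.run(publish("clickevent.ingest", {"type": "clickevent"}))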

Connect Functions to topics

For those functions which listen on a topic from NATS, I had to add an annotations section to the function description in the yaml file, like so:

annotations:
  topic: <topic-name>

The topic name is different for each one of course.

Configure NATS Connector

This is the tricky part, which took me some time to figure out. Although OpenFaas supports different function triggers, the NATS trigger is not a standard one. To make this work, you need to deploy another component to K8s, the nats-connector. And to make your own topics work, you first need to download the nats-connector deployment file and make your changes there, specifically in the topics section:

- name: topics
  value: "clickevent.ingest,clickevent.compliance,clickevent.conversion,"

There are more options for controlling the connector's behavior in this file, which I will not get into. Also, it seems this is a deployment I would need to share between different pipelines, unless I create such a deployment for each separate pipeline.

Deploy this yaml file to K8s:

# kubectl apply -f connector-dep.yaml

and you are set to go.

Deployment

All you need to do now is write your logic into each function, making sure each function writes to its designated topic on NATS after completing its task. Once finished, you can deploy your functions by running the following command:
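As an illustration (not the demo's actual code), the handler of the compliance function might look roughly like this, checking the fields sent by the producer and forwarding valid events to the next topic:

import asyncio
import json

from nats.aio.client import Client as NATS

# handler.py of clickevent-compliance: an illustrative sketch only
NATS_URL = "nats://nats.openfaas:4222"  # assumed default, see the NATS client section above
REQUIRED_FIELDS = {"type", "mousebutton", "isnew"}

async def publish(topic, payload):
    nc = NATS()
    await nc.connect(servers=[NATS_URL])
    await nc.publish(topic, json.dumps(payload).encode())
    await nc.close()

def handle(req):
    event = json.loads(req)
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        # non-compliant events are not forwarded down the pipeline
        return "rejected, missing fields: %s" % sorted(missing)
    # forward compliant events to the next stage via its topic
    asyncio.run(publish("clickevent.compliance", event))
    return "forwarded to clickevent.compliance"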

# faas-cli up -f ./clickevent-pipeline.yml

Now, in order to invoke your pipeline, just send a POST request to your first function, the one listening on the REST trigger, and watch the magic happen:

# curl --request POST \ 
--url http://localhost:31112/function/clickevent-ingest \
--header 'content-type: application/json' \
--data '{ "type":"clickevent", "mousebutton": "left", "isnew": true }'

Conclusion

This has been a wonderful journey with a lot of new technologies, and seeing all of them come together is a real joy. Beyond that, I think the serverless architecture has proven itself to be a very effective and versatile approach that can be applied in many different ways, this being just one of them. Using serverless on Kubernetes frees you from any sort of vendor lock-in, which I can only recommend.

Not everything went as smoothly as described in this blog post, and I sometimes wished there was more information on this subject, so here is my contribution in that regard.

You can find my working demo on GitHub.

I would like to mention some great blogs which got me started:

Enjoy!


Yoav Nordmann
Israeli Tech Radar

I am a Backend Tech Lead and Architect for Distributed Systems and Data. I am passionate about new technologies, knowledge sharing and open source.