Numaflow: A Tool to Simplify Stream Processing on Kubernetes

David Seapy
4 min read · Nov 6, 2022

In recent years, certain open-source tools and languages have emerged as leaders in their respective spaces. It is common to find organizations using Kubernetes for container orchestration, Kafka as an event stream store, Prometheus for metrics, and Grafana for monitoring. Similarly, Python, Java/Scala, and Go are common in the ML, big data, and platform engineering spaces.

Recently I decided to try out Numaflow to see how well it could serve as a lightweight stream processing tool while integrating with these various open-source technologies. Here's the description from their GitHub repository:

Numaflow is a Kubernetes-native tool for running massively parallel stream processing. A Numaflow Pipeline is implemented as a Kubernetes custom resource and consists of one or more source, data processing, and sink vertices.

Numaflow installs in a few minutes and is easier and cheaper to use for simple data processing applications than a full-featured stream processing platform.

Note: Numaflow is a relatively new project, so information below may become out-of-date as more features are introduced.

Install

The Numaflow quick-start is very straightforward for those familiar with Kubernetes. It uses kubectl to create a few resources in your Kubernetes cluster:

  • Server (Deployment, Service, & RBAC resources)
  • Controller (Deployment, ConfigMap, Service, & RBAC resources)
  • CRDs (Pipeline, InterStepBufferService, Vertex)

kubectl is then used to create an example “Pipeline” and “InterStepBufferService”, using the CRDs that were just installed. After that, you can load up the UI or use kubectl to see it in action.
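To give a sense of what these resources look like, here is a minimal Pipeline along the lines of the quick-start example: a generator source, the built-in cat function, and a log sink. Treat the exact field values as illustrative and refer to the quick-start for the current version of the example.

```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: Pipeline
metadata:
  name: simple-pipeline
spec:
  vertices:
    - name: in
      source:
        generator:      # built-in test source that emits synthetic messages
          rpu: 5         # records per time unit
          duration: 1s   # the time unit
    - name: cat
      udf:
        builtin:
          name: cat      # built-in function that passes messages through unchanged
    - name: out
      sink:
        log: {}          # built-in sink that logs each message
  edges:
    - from: in
      to: cat
    - from: cat
      to: out
```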

The UI is currently read-only, with permissions managed by Kubernetes RBAC. It provides basic visibility into things like logs, timestamps, resource utilization, and processing rates. For additional metrics, you will want to query the provided Prometheus metrics (discussed in the Monitoring Metrics & Logs section below).

Eventually you will likely want to use Helm or Kustomize to tailor the install to your needs. Unfortunately, there doesn't appear to be a Helm chart available yet, so Helm users will need to create their own. The installation YAML is fairly minimal though, so it's not much effort to get it installed with either tool.
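For example, a minimal kustomization that wraps the upstream installation manifest might look like the sketch below, with the install.yaml from the quick-start vendored locally; environment-specific patches would be layered on top.

```yaml
# kustomization.yaml — a minimal sketch; install.yaml is the upstream
# Numaflow installation manifest, downloaded from the project repository.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - install.yaml
# Add patches here (images, resource limits, namespace, etc.)
# to adapt the install to your environment.
```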

Monitoring Metrics & Logs

Numaflow provides a number of metrics that can be scraped by your Prometheus instance. For the full list, see metrics.md.

Container logs are written to stdout/stderr, where they can be picked up by your log collector (e.g., Fluentd, Promtail) and ingested into your log storage (e.g., Elasticsearch, Loki).

Using these metrics and logs, it should be straightforward to create a fairly full-featured Grafana dashboard to monitor the Pipeline and Vertices. So far I've only viewed a couple of the provided metrics in Grafana, but as I find time over the next few days I'd like to explore monitoring in more depth: both what Numaflow provides out of the box and whether I can expose my own metrics from my user-defined functions and sinks.
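As a sketch for those running the Prometheus Operator, a PodMonitor along these lines could scrape the vertex pods. The label key and metrics port name below are assumptions on my part, so verify them against the labels and container ports of your running vertex pods (and against metrics.md).

```yaml
# Hedged sketch: create this in the namespace where the Pipeline runs.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: numaflow-pipeline-metrics
spec:
  selector:
    matchLabels:
      numaflow.numaproj.io/pipeline-name: simple-pipeline  # assumed label key
  podMetricsEndpoints:
    - port: metrics  # assumed name of the metrics port on vertex pods
```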

Event Storage

When working with streaming data, it's common to consume and produce data that resides on Kafka topics. Fortunately, it's very easy to configure Numaflow sources and sinks to point at a Kafka cluster.
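For illustration, a Kafka-backed source vertex and sink vertex might look roughly like the excerpt below. Broker addresses, topic names, and the consumer group are placeholders, and the Kafka source/sink docs list further options (TLS, consumer config, and so on).

```yaml
# Excerpt from a Pipeline spec: Kafka source and Kafka sink vertices.
# Brokers, topics, and the consumer group are placeholders.
vertices:
  - name: in
    source:
      kafka:
        brokers:
          - my-kafka-broker:9092
        topic: input-topic
        consumerGroup: my-numaflow-group
  - name: out
    sink:
      kafka:
        brokers:
          - my-kafka-broker:9092
        topic: output-topic
```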

For storing messages between Numaflow vertices, an InterStepBufferService is used. Currently it can be configured to use one of the following for storage:

  • JetStream (internal)
  • Redis (internal or external)

Both JetStream and Redis can be managed by Numaflow using StatefulSets and are configured via the InterStepBufferService custom resource. Redis can additionally point to an external Redis cluster that is managed outside of Numaflow.
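For example, a JetStream-backed InterStepBufferService managed by Numaflow looks roughly like the sketch below; the version string is a placeholder, so pick one of the versions listed in the Numaflow docs.

```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: InterStepBufferService
metadata:
  name: default
spec:
  jetstream:
    version: latest  # placeholder; use a JetStream version supported by your Numaflow release
    replicas: 3      # runs as a StatefulSet managed by Numaflow
```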

Redis is widely adopted but introduces a risk of data loss during failover. JetStream can guarantee no data loss but is relatively new and not as widely adopted. It would be nice to be able to configure an InterStepBufferService backed by an external Kafka cluster, similar to how Sources and Sinks work with Kafka. Maybe that capability will be added in the future.

Language Support

Currently there are three SDKs available (Go, Python, and Java) that can be used to implement custom processing for internal vertices or sink vertices. There isn't yet a way to provide custom code for source vertices (https://github.com/numaproj/numaflow/issues/19).

If you develop with one of these SDKs, the amount of code you have to write is minimal, and getting something written and deployed is easy since each SDK provides examples you can modify for your use case.

Even if the language you are using doesn't have an SDK available, it's not too much trouble to get things working: you just need to build a container that runs a gRPC server on a Unix domain socket. The numaflow-go Git repository contains .proto files you can use to generate code that satisfies the API.

In the Pipeline resource you can configure the Pods that run your container. Most of the Pod spec is available for you to customize, as well as labels and annotations. This includes things like resource requests/limits, args, env, volumes, init containers, and so on.
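As a rough sketch, here is how a vertex running your own container (for example, one built with an SDK) can be customized in the Pipeline spec. The field names reflect my reading of the CRD, so treat them as assumptions and check the Pipeline reference; image, env values, and labels are placeholders.

```yaml
# Excerpt from a Pipeline spec: a user-defined function vertex with a few
# of the customizations mentioned above. Values are placeholders.
vertices:
  - name: my-udf
    metadata:
      labels:
        team: data-platform                          # assumed vertex-level labels field
    udf:
      container:
        image: registry.example.com/my-udf:v0.1.0    # your SDK-built image
        env:
          - name: LOG_LEVEL
            value: debug
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
```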

Conclusion

What I like most about Numaflow is that it keeps things simple and integrates well with widely adopted open-source software. I can keep using my existing tools for metrics, logging, configuration, scaling, and GitOps delivery. The choice of programming language is no longer constrained by how well it's supported by the stream processing platform.

Numaflow seems like a good fit for teams:

  • Already using Kubernetes
  • Using programming languages that are not supported by other stream processing platforms
  • Looking to minimize cost, complexity, and learning curve
  • Needing basic stream processing capabilities
  • Not needing ordered messages

Numaflow may not be the best fit for teams:

  • Not familiar with Kubernetes
  • Only needing to develop applications that run on the JVM
  • Needing a more established, full-featured stream processing platform
