Monitoring Stream Processing in a Kubernetes-Native Environment

David Seapy
6 min read · Nov 14, 2022



When managing distributed applications that process streams of data, observability is important for ensuring overall system availability and performance. This is especially true as the topology becomes more complex. It’s useful to be able to quickly identify where problems are at a high level and then drill down far enough to identify the root cause.

A week ago I shared some of my initial impressions of using Numaflow as a stream processing tool and mentioned I was hoping to dive a bit deeper into what monitoring for it might look like. This past week I had some time to do just that. When considering which monitoring tools to use, I was looking for ones that were:

  • open-source
  • simple (both to maintain and use)
  • low-cost
  • language-agnostic
  • Kubernetes-native

The sections below summarize which software I’ve adopted, and how each is being used to gain visibility into the stream processing applications.

Note: Most of the monitoring setup does not depend on Numaflow specifically, but rather applies to Kubernetes in general. If you’re not currently using Numaflow, ensuring your pods have useful labels and exposing Prometheus metrics from your code can go a long way. That said, check out Numaflow because it makes the setup much simpler.

Software Choices

For visualizing metrics and logs I chose to use Grafana. It provides a wide variety of built-in panel types and data sources, while also allowing you to build your own. While Numaflow has a dedicated UI, it can be beneficial to have metrics in Grafana when wanting to:

  1. Provide visibility that extends beyond the scope of Numaflow. For instance, a team running a UDF that performs inference might benefit from viewing the processing time and resource usage alongside the model performance.
  2. Standardize how metrics and alerts are visualized for streaming and non-streaming applications.
  3. Query historical metrics or logs.

For metrics and alerts I installed the kube-prometheus-stack Helm chart, which includes the Prometheus Operator. With the operator, a single ServiceMonitor resource is all that is needed to get the Numaflow pipeline metrics ingested into Prometheus. Additionally, you can provide PrometheusRule resources to configure alerting.
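
As a rough sketch of what that might look like (the namespace, label selector, and port name below are assumptions and need to match your Numaflow installation, and depending on your chart values the resources may also need the release label kube-prometheus-stack uses for discovery):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: numaflow-pipeline-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    any: true                                # watch all namespaces running pipelines
  selector:
    matchLabels:
      app.kubernetes.io/part-of: numaflow    # assumed label on the vertex metrics services
  endpoints:
    - port: metrics                          # assumed name of the metrics port
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: numaflow-pipeline-alerts
  namespace: monitoring
spec:
  groups:
    - name: pipeline-pods
      rules:
        - alert: PipelinePodRestarting
          # kube-state-metrics metric shipped with kube-prometheus-stack;
          # add a label matcher to narrow it to your pipeline pods
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning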

For log aggregation I used Loki. It can use AWS S3 for storage, making it significantly cheaper and easier to manage than many alternative solutions. It has several deployment modes, allowing me to start simple and change deployment mode in the future if necessary.
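
As a sketch of what pointing Loki at S3 can look like when installing it via the loki-stack chart (key names follow the Loki 2.x config format and may differ between chart versions; the bucket and region are placeholders, and S3 credentials are assumed to come from an IAM role or explicit keys):

loki:
  config:
    storage_config:
      aws:
        bucketnames: my-loki-chunks          # placeholder bucket name
        region: us-east-1                    # placeholder region
      boltdb_shipper:
        shared_store: s3
    schema_config:
      configs:
        - from: "2022-01-01"
          store: boltdb-shipper
          object_store: s3
          schema: v11
          index:
            prefix: index_
            period: 24h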

Grafana Data Sources

Both the Prometheus data source and the Loki data source are included by default in Grafana. The Prometheus data source is what I used for most of the Grafana panels.

Grafana includes a node graph panel (beta), but it’s limited in the data sources it currently supports. As a fun side project I wrote a PoC data source plugin that supports viewing a Numaflow Pipeline as a Node Graph. While table panels are supported by the Prometheus data source, I used the plugin to populate some of the tables as well.

Grafana Dashboards

Using the data sources discussed previously, I created two Grafana dashboards for each Numaflow CRD to provide both high-level and detailed views.

A very useful feature of Grafana is chained variables, which allows for drilling down from namespaces to pipelines to vertices by restricting a variable’s options based on the current value of other variables. Additionally, variables may be configured to allow users to select more than one value. Using multiple values for chained variables is a quick and easy way to compare the performance of multiple versions of a vertex or an entire pipeline.
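
As a sketch of how that chaining can be set up with the Prometheus data source’s label_values() variable queries (the metric and label names here are assumptions; use whichever Numaflow metrics and labels your Prometheus actually ingests):

  namespace:  label_values(forwarder_read_total, namespace)
  pipeline:   label_values(forwarder_read_total{namespace="$namespace"}, pipeline)
  vertex:     label_values(forwarder_read_total{namespace="$namespace", pipeline="$pipeline"}, vertex)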

Notes for performance comparisons:

  • Configure vertex pods appropriately to ensure Kubernetes scheduling decisions don’t introduce unexpected performance differences.
  • For comparing pipelines reading from a single Kafka topic, configure their source vertices with different consumer groups so that each pipeline receives all of the same messages (see the sketch after this list).
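
Here’s a sketch of that second case (the broker and topic values are placeholders, and the Kafka source field names follow Numaflow’s examples but may differ between versions):

apiVersion: numaflow.numaproj.io/v1alpha1
kind: Pipeline
metadata:
  name: my-pipeline-v2                       # compared against an existing my-pipeline-v1
spec:
  vertices:
    - name: in
      source:
        kafka:
          brokers:
            - my-kafka-broker:9092           # placeholder broker
          topic: my-input-topic              # same topic as my-pipeline-v1
          consumerGroup: my-pipeline-v2      # distinct group, so this pipeline also sees every message
    # ... remaining vertices and edges omitted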

Grafana’s table panel has proven helpful in viewing the current state of things in Kubernetes. It allows for filtering and sorting columns, as well as setting cell color and links based on values in the cell. This makes it quick to identify where a problem currently is, and drill down using the links to other more specialized dashboards or to the Numaflow UI.

The node graph panel has been useful for visualizing the relationships between the vertices, while also providing some basic vertex and edge metrics. Getting it working with Numaflow as a side-project has been a lot of fun. That said, the node graph panel is still in beta and isn’t supported by any relevant built-in data sources. Because of this, for most teams I would instead recommend linking to the pipeline visualization provided in the Numaflow UI.

While I used the table and node graph panels to provide an overview of the current state of my applications, the Prometheus and Loki panels have been great for querying historical metrics and logs. Having installed kube-prometheus-stack, I was able to view vertex pod metrics in the same dashboard as all the Numaflow-provided metrics.

Both the Prometheus and Loki data sources come with a query builder that is useful for constructing queries for those not yet familiar with the PromQL or LogQL query languages.

Be sure any Kubernetes pod labels you want to query against are available as labels in Prometheus and Loki. Here’s an example override for doing that when installing the loki-stack Helm chart:

# ref: https://github.com/grafana/loki/issues/3519
# consider restricting pod labels passed through to Loki if high label cardinality is a concern
promtail:
  config:
    snippets:
      extraRelabelConfigs:
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
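
With that relabel config in place, pod labels can be used directly in LogQL selectors. For example, if your vertex pods carry a pipeline-name label (the label name here is an assumption; check the labels on your own pods), a query such as {namespace="my-namespace", numaflow_numaproj_io_pipeline_name="my-pipeline"} returns the logs for every vertex in that pipeline.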

Conclusion

Combining tools like Numaflow, Grafana, Prometheus and Loki can provide an excellent monitoring solution for streaming applications on Kubernetes. What I’m most excited about is being able to use the same tools and techniques to monitor stream processing as any other application.

In the coming weeks I’d like to explore:

  • Visualizing custom Prometheus metrics for udf and udsink containers
  • Numaflow’s recent and upcoming features, specifically how they can be leveraged for various use-cases
  • Autoscaling and resource optimization
  • Visualizing cost metrics for pipelines and vertices
