Machine Learning on Streaming Data run on the Edge using K3s, Seldon Core, Kafka and FluxCD

Tayfun Wiechert · Data Reply
Oct 21, 2021 · 15 min read

An architecture overview and documentation of our demo, which performs object detection on an edge-based K3s cluster. The stream of recognized objects is replicated into the cloud, where more sophisticated analytics on streaming data can be carried out and the data can be integrated into longer-term storage for use in model retraining.

Our Raspberry Pi based Kubernetes (K3s) cluster attached to a 4G router

We at Data Reply care about stream processing and the scalable operation of machine learning models. Moreover, we have recently observed an increased demand for hybrid cloud-edge solutions. Customers are seeking a unified platform to operate their business-critical applications and processes across network boundaries while exploiting the benefits of both worlds (e.g. proximity and independence vs. virtually endless scalability, elasticity, and the presence of managed services). We have therefore undertaken this project to showcase how ML and streaming can be combined to build a system that is able to both:

  1. Recognize objects in a stream of images with low latency at the edge
  2. Carry out more complex and near-real-time analytics in the cloud

Machine Learning (ML) use cases are especially well suited for a hybrid cloud-edge architecture when certain prediction outcomes need to trigger an immediate system reaction. However, the sheer volume of data generated by deploying a multitude of such edge devices can be too much for classical on-site infrastructure to handle. We therefore advocate for postprocessing, long-term storage and analytics to be done in the cloud.

The development and operation of production machine learning applications is an emerging hot topic. It is addressed through MLOps practices and technologies, which support taking applications from PoC to production and offer lifecycle best practices for their iterative improvement. This allows consistent, long-term business value to be generated from ML products. In particular, it ameliorates the following ML project pain points:

  • Experiment tracking and model versioning
  • Monitoring model performance and evolution
  • Scalable model deployment
  • Automated re-training
  • Automated testing and release processes

We consider the application of the MLOps principles a major success factor in the implementation of large-scale ML projects and we therefore also apply these fundamentals in the current demo.

Building this demo project sharpened our expertise in relevant technologies and platforms and therefore helped us to develop and refine a solution offering in this particular area. In the following sections, we aim to highlight the approach we took in building such a hybrid solution as well as some of the technical challenges we faced along the way.

Edge Architecture

Our edge cluster runs on three Raspberry Pi 4 nodes hosting a K3s installation, a lightweight, ARM-compatible Kubernetes distribution. This cluster provides more compute capacity than the use case needs, but this level of redundancy is intentional: it ensures that the edge location can sustain the loss of up to two of these nodes.

We decided to run most of the software components needed for our use case inside Kubernetes (K3s). We believe this approach brings a lot of benefits when operating software fleets on edge devices.

One core benefit is that Kubernetes removes many of the complexities involved in operating distributed systems, such as balancing the deployment and distribution of workloads, providing redundancy against failures, and identifying outages and shifting affected workloads automatically.

Moreover, relying on Kubernetes allows us to reuse bundled software artifacts published as Helm charts that can be adjusted through configuration values.

Architecture of the Edge-Cluster and installed software components

The cluster is connected to the Internet via a 4G router to simulate typical edge conditions (low bandwidth, varying latency, and temporary connection outages). As mobile network operators usually employ some form of NAT, exposing public services (e.g. a VPN server or webserver) directly does not work. Instead, we use the Amazon SSM Agent to create client-initiated tunnels from the Raspberry Pi nodes to AWS. These tunnels automatically recover in case of node failures or connection losses and can be used for remote SSH access and port forwarding. In fact, we use Amazon SSM to bootstrap the client configuration on our laptops that is required to interact with the Kubernetes API running on the cluster.

Moreover, the data flow on the edge can be summarized as follows:

(1,2) A simple Go app reads the video stream produced by the webcam, samples it, and writes the image representations into a Kafka topic.
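The ingestion app itself is written in Go; purely for illustration, a minimal Python sketch of the same sampling loop might look as follows (the topic name frames, the broker address and the sampling interval are made up for this example):

# Illustrative sketch only -- the actual ingestion app is written in Go.
# Assumes a local webcam, a reachable Kafka broker and a topic named "frames".
import time

import cv2  # OpenCV, used here to grab frames from the webcam
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
capture = cv2.VideoCapture(0)          # first attached camera
SAMPLE_INTERVAL_SECONDS = 1.0          # down-sample the video stream

while True:
    ok, frame = capture.read()
    if not ok:
        continue
    # Encode the raw frame as JPEG so the Kafka message stays reasonably small.
    ok, encoded = cv2.imencode(".jpg", frame)
    if ok:
        producer.produce("frames", value=encoded.tobytes())
        producer.poll(0)               # serve delivery callbacks
    time.sleep(SAMPLE_INTERVAL_SECONDS)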

(3) A Kafka Streams app we call the Kafka Seldon Bridge reads this topic, generates a Seldon request using gRPC, and forwards it to the Seldon deployment to obtain the list of recognized objects. Note that Seldon does support Kafka as a model input source natively: a topic is declared as the input to the Seldon deployment so that each record is fed into the model automatically and the output is stored in a target topic. However, we required additional means of pre- and post-processing plus state management and therefore opted for a custom implementation using the Kafka Streams framework.
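Our bridge is a Kafka Streams (Java) application that talks to Seldon via gRPC; the following Python sketch only illustrates the underlying consume-predict-produce pattern, here using Seldon's REST predictions endpoint with hypothetical topic names and a hypothetical in-cluster service URL:

# Illustrative consume-predict-produce loop; the real bridge is a Kafka Streams
# application calling Seldon via gRPC. Topic names and the endpoint URL are
# assumptions for this sketch.
import base64

import requests
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "kafka-seldon-bridge",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["frames"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

SELDON_URL = "http://yolo-v3-default:8000/api/v1.0/predictions"  # hypothetical

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Wrap the JPEG bytes into a Seldon prediction request (binData payload).
    payload = {"binData": base64.b64encode(msg.value()).decode("ascii")}
    response = requests.post(SELDON_URL, json=payload, timeout=5)
    if response.status_code == 200:
        # Forward the list of recognized objects to the output topic.
        producer.produce("recognized-objects", value=response.content)
        producer.poll(0)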

(4) The prediction response (which objects have been recognized and with what confidence) is written back to a Kafka topic and can serve local applications deployed at the edge location. Potential use cases include triggering alarms upon intruder detection, adjusting ventilation based on the number of people in a room, or interrupting an assembly line upon visual detection of manufacturing issues.

(5) The topic of recognized items, alongside parts of the captured input data, is replicated to the cloud using MirrorMaker 2. This setup does not require a constant internet connection but buffers data locally on the edge brokers until a given threshold (topic retention) is reached. The payload of the recognized items is Avro encoded and significantly less storage intensive than the original data. Another technique is used to further reduce storage and network demands; this is discussed in a later section. Both measures also allow this setup to be deployed in regions where internet access is unstable and average bandwidth is low.
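With Strimzi (the operator introduced in the Technology Stack section below), such a replication flow is declared as a KafkaMirrorMaker2 resource. The sketch below is illustrative only; cluster aliases, bootstrap addresses and the topic pattern are placeholders:

# Sketch of a Strimzi-managed MirrorMaker 2 resource; aliases, addresses and
# the topic pattern are placeholders.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: edge-to-cloud
spec:
  version: 2.8.0
  replicas: 1
  connectCluster: cloud            # the cluster the Connect workers run against
  clusters:
    - alias: edge
      bootstrapServers: edge-kafka-bootstrap:9092
    - alias: cloud
      bootstrapServers: cloud-kafka-bootstrap:9092
  mirrors:
    - sourceCluster: edge
      targetCluster: cloud
      topicsPattern: "recognized-objects"
      sourceConnector:
        config:
          replication.factor: 1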

Cloud Architecture

The system operated in our AWS VPC relies on the same configuration repository to manage the desired software and configuration artifacts through FluxCD. The differences in configuration compared to the edge cluster are expressed using Kustomize artifacts. This approach is elaborated in a later section.

K3s cluster running in a private VPC

In order to help ensure environment alignment, we decided to use ARM-based EC2 instances. That allowed us to infer Raspberry Pi compatibility before our edge cluster had actually been set up. Some of the involved software artifacts were not provided as ARM builds or otherwise came without official support, and the cloud-based cluster helped us test early on whether (custom) builds were functional.

The cluster can be accessed from our local machines via SSM and has been set up using Terraform, mostly relying on sagittaros’ Terraform module.

The system runs in a private VPC and exposes only the Kafka service publicly, through a load balancer. This is achieved without any specific infrastructure code or manual configuration. Instead, we leverage Kubernetes’ AWS infrastructure provisioning capabilities: Kubernetes recognizes the intent to allow ingress traffic proxied through a load balancer into the cluster.

The Strimzi Kafka configuration yields the creation of a Kubernetes Service of type LoadBalancer, which in turn creates the AWS infrastructure needed to route incoming Kafka traffic towards the brokers.

Traffic sent to the public endpoint is encrypted using TLS, and client applications are required to authenticate using SASL/SCRAM. Production usage, however, would require additional security measures such as a site-to-site VPN tunnel from edge to cloud. Besides enhanced security, the presence of a tunnel introduces advanced network capabilities such as bidirectional service availability (which would allow a metrics database in the cloud to fetch metrics from the edge clusters).
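In the Strimzi Kafka resource, this boils down to a listener declaration roughly like the following sketch (cluster name, listener name and port are illustrative):

# Excerpt of a Strimzi Kafka resource; roughly how the public listener is
# declared. Names and port are illustrative.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-cluster
spec:
  kafka:
    listeners:
      - name: external
        port: 9094
        type: loadbalancer      # Kubernetes provisions an AWS load balancer
        tls: true               # traffic to the public endpoint is TLS encrypted
        authentication:
          type: scram-sha-512   # clients authenticate via SASL/SCRAM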

Technology Stack

Kafka sits at the core of our platform and serves as the main storage & communication layer. Its primary purposes within our project are the following:

  • to act as an input buffer to our model deployment,
  • to act as an asynchronous replication buffer to the cloud (using MirrorMaker),
  • to provide lifecycle management of model data (retention-based clean-up)
  • to act as a framework for (stateful) stream processing.

Both Kafka and the MirrorMaker 2 installation (replication to the cloud) are managed through Strimzi, a Kubernetes operator that aligns the observed state with the desired state manifested in configuration. We adopted this approach since it significantly lowers the overhead compared to managing the Kafka installation ourselves.

Note that the resource overhead of running Kafka will be significantly reduced in the future, since ZooKeeper will be phased out with the next major release. This will make Kafka more lightweight and, from our perspective, even more viable to run on edge devices.

Similarly, Seldon Core follows the same pattern to deploy, operate and monitor our packaged machine learning model. It wraps the model and exposes a microservice interface that dependent applications invoke via gRPC or REST.

The actual machine learning model we have deployed is YOLOv3, which consists of stacked 2-D and 1-D Convolutional Neural Network (CNN) layers. The choice of an appropriate object detection model was crucial to the demo, as the deployed model must have:

  1. high accuracy, to identify objects correctly in each frame
  2. high frames-per-second (fps) processing speed, for low latency

When compared to other models such as RetinaNet, SSD & YOLOv2, YOLOv3 incorporates the following architectural improvements:

  1. DarkNet-53 feature extractor: This contains a mixture of convolutional neural network and residual neural network (ResNet) layers. The combination of these layer types allows for the construction of deeper neural networks, which helps in isolating the different objects in an image.
  2. Feature detector: This is a multi-scale detection architecture which, instead of using just a single network of CNN layers, utilizes multiple networks and aggregates their outputs to determine the final detection results. This ensures that even small objects in the frame are detected, which was a major issue with YOLOv2.

Additionally, YOLOv3 showed significantly faster fps processing. YOLOv3 therefore offers a balanced trade-off between detection accuracy and frame processing time.

Scene captured in our office, ingested into Kafka, with the YOLOv3 model applied

From Stateless to Stateful Stream Processing

Stream processing tends to be easier when it is stateless, i.e. when data is processed on a per-record basis and no knowledge of previously seen records is required to carry out the current computation. It becomes more complicated when historical state is involved. This holds for the use-case logic, but also for the involved frameworks and systems, which must handle more complex recovery scenarios and are now concerned with the temporal sequence of data records (e.g. counting the number of recognized bread rolls in the last 30 minutes).

We started this demo with the stateless approach: for each sampled image, we would predict the recognized objects and send the respective result to the output topic. However, this leads to many redundant model executions and additional I/O cycles (network and de-/serialization), especially if a static scene is being observed.

We have addressed this issue by sending two images to the Seldon deployment: the currently recorded image (i2) and the last accepted image (i1). If i2 is considered significantly different from i1, it is accepted, object recognition is applied, and the recognized items are returned alongside status code 200.

If both images are recognized as being very similar, the request is acknowledged with status code 204. In that case, the bridge does not update its internal state store and does not emit any output record.

Kafka Seldon Bridge sending most recent and latest accepted image to Seldon deployment

Not only does this reduce computational overhead, it also reduces network load towards the cloud, as records are only sent in case of relevant changes in the image. Moreover, the notion of the output records changes and becomes somewhat more interpretable: from recognized objects on a per-frame basis, we move towards changes in the set of recognized objects.

Image Similarity Measure and Seldon Transformers

The key success factor of this approach is, however, that the similarity measure can be computed sufficiently quickly. In fact, it must be considerably faster than the object recognition, especially if the analyzed frames tend to change frequently. More specifically, if the object recognition takes 100ms and the similarity calculation 20ms, a success rate (two images recognized as similar) of at least 20% is required for the routine to pay off.
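The break-even point follows directly from the expected processing time per frame. Denoting the detection time by t_d, the similarity-check time by t_s, and the probability that two consecutive frames are judged similar (so detection is skipped) by p, the gate pays off when

t_s + (1 - p) * t_d <= t_d, i.e. p >= t_s / t_d

which for t_s = 20ms and t_d = 100ms yields p >= 0.2, matching the 20% figure above.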

We used the peak signal-to-noise ratio (PSNR) as a similarity measure and varied the relevant threshold with the scenery being recorded. The image similarity function plus the acceptance logic described above was implemented inside the model class deployed with Seldon. In order to make the combined ML deployment de-coupled and re-usable, each component of the deployment runs in a separate container. Measures have also been taken to make the whole deployment independent of the frame size that is being received. This is done using Seldon Transformers.
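A minimal sketch of this gate (not our production code; function names are made up, and the PSNR threshold depends on the scenery being recorded) could look like this:

# Minimal sketch of the PSNR-based change gate inside the Seldon model class.
# Names and the threshold are illustrative.
import numpy as np


def psnr(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Peak signal-to-noise ratio between two equally sized 8-bit images."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * np.log10((255.0 ** 2) / mse)


PSNR_THRESHOLD = 30.0  # tuned per scenery; higher PSNR means "more similar"


def should_run_detection(last_accepted: np.ndarray, current: np.ndarray) -> bool:
    # Only invoke the (expensive) YOLOv3 detection if the new frame differs
    # sufficiently from the last accepted one.
    return psnr(last_accepted, current) < PSNR_THRESHOLD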

The deployment requires no pre-processing prior to data ingestion. The frames, irrespective of their size, are received by the input transformer, which resizes them to the dimensions expected by the model. Inside the model module, the similarity function determines whether the frames are dissimilar enough to call the model, and only then is object detection performed.
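Expressed as a SeldonDeployment, the composition might look roughly like this (component names and container images are placeholders, not our actual manifests):

# Rough sketch of the inference graph: an input transformer that resizes the
# incoming frame, chained in front of the YOLOv3 model.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: yolo-v3
spec:
  predictors:
    - name: default
      replicas: 1
      componentSpecs:
        - spec:
            containers:
              - name: resize-transformer
                image: example/resize-transformer:latest
              - name: yolo-v3-model
                image: example/yolo-v3:latest
      graph:
        name: resize-transformer
        type: TRANSFORMER          # runs transform_input() before the model
        children:
          - name: yolo-v3-model
            type: MODEL            # runs predict() including the change gate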

Seldon Model architecture with change detector logic depicted

However, we found that the execution time of the YOLOv3 model was already so fast that applying this optimization technique yielded significant benefits only in situations where few changes were observable in the image stream.

Moreover, having one or more transformers in the execution path leads to additional cycles of de-/serialization. While transformers are certainly the more elegant and flexible way of constructing a deployment (you can chain your model and append pre- and post-processing procedures), we feel they should be used carefully in latency-critical applications. Otherwise, you might end up in a situation where processing time is dominated by the aforementioned cycles of de-/serialization.

Edge-based Aggregation and Analytics

The bridge, or additional frameworks for processing Kafka topics (we used KSQL in our demo), can be used to further modify, downsample, or otherwise transform the contents of the output topic before it is replicated to the cloud.

This is useful if the throughput needs to be reduced even further in case of very poor internet connectivity. In addition, we could place a lightweight analytics platform locally at the edge that is able to visualize statistics on the streaming data as it is generated, without requiring massive compute resources, storage, or internet connectivity.

Multi-Cluster Configuration Management and Continuous Delivery

We manage the configuration of the software artifacts inside a GitHub repository. Most of it is expressed as Helm chart references together with configuration values tailored to our use case.

We decided to use the pull-based continuous delivery framework Flux alongside its Kustomize controller. The pull-based approach does not require any connectivity from a central location towards the edge locations. Instead, controller deployments running in Kubernetes on the edge cluster pull changes from the repository as they occur.

Moreover, the framework allows for DRY (Don’t Repeat Yourself) configuration management, since the releases and artifacts to be installed can be defined once and reused across environments. The configuration specific to a given environment is then applied using Kustomize resources. We use this technique to install the Kafka cluster with a largely identical configuration across our environments (edge, cloud with ARM processors, and cloud with Intel processors).
While most of the configuration is the same in all environments, we need to express certain environment-specific settings. Examples include:

  • overriding the Docker images used to deploy the Kafka cluster on the ARM platform
  • disabling the public Kafka ingress listener on the edge cluster
  • reducing memory limits on the edge-based Kafka cluster

The directory apps lists all applications to be installed on a per-environment basis alongside their configuration, while base holds the configuration that applies regardless of the environment.

├── apps
│   ├── base (configuration across clusters)
│   │   ├── seldon-core
│   │   │   ├── ...
│   │   ├── seldon-models
│   │   │   ├── kustomization.yaml
│   │   │   └── yolo-v3.yaml
│   │   ├── strimzi-operator
│   │   │   ├── kustomization.yaml
│   │   │   ├── namespace.yaml
│   │   │   └── release.yaml
│   │   └── strimzi-resources
│   │       ├── cluster.yaml
│   │       ├── kafka-to-seldon-bridge.yaml
│   │       ├── kustomization.yaml
│   │       ├── topic.yaml
│   │       └── user.yaml
│   ├── dev (configuration for Intel environments)
│   │   └── kustomization.yaml
│   └── dev-arm (configuration for edge and ARM environments)
│       ├── 2frame-model-patch.yaml
│       ├── kustomization.yaml
│       ├── remove-listener.yaml
│       ├── replicator.yaml
│       ├── seldon-patch.yaml
│       ├── strimzi-kafka-patch.yaml
│       └── strimzi-patch.yaml
├── clusters
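The environment overlays tie everything together. A hypothetical sketch of apps/dev-arm/kustomization.yaml (the actual file may differ, e.g. in whether replicator.yaml is a plain resource or a patch) could look like this:

# Hypothetical sketch of apps/dev-arm/kustomization.yaml: it references the
# shared base and layers the ARM/edge-specific patches on top.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base/seldon-core
  - ../base/seldon-models
  - ../base/strimzi-operator
  - ../base/strimzi-resources
  - replicator.yaml                 # MirrorMaker 2 replication only exists at the edge
patchesStrategicMerge:
  - 2frame-model-patch.yaml
  - remove-listener.yaml
  - seldon-patch.yaml
  - strimzi-kafka-patch.yaml
  - strimzi-patch.yaml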

Looking into apps/base/strimzi-operator/release.yaml, we find the following file, which instructs Flux to install the Strimzi operator in all environments.

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: strimzi-kafka-operator
spec:
  releaseName: strimzi-kafka-operator
  chart:
    spec:
      chart: strimzi-kafka-operator
      sourceRef:
        kind: HelmRepository
        name: strimzi
        namespace: flux-system
      version: "0.24.0"
  values:
    createGlobalResources: true

However, we needed to build custom Strimzi images for the ARM-based clusters, and we reference them by overriding the relevant Helm chart values inside apps/dev-arm/strimzi-patch.yaml.

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: strimzi-kafka-operator
  namespace: strimzi
spec:
  values:
    imageRegistryOverride: ghcr.io/twiechert
    imageRepositoryOverride: strimzi-kafka-operator
    imageTagOverride: arm-64

The approach we have implemented is heavily based on the reference implementation provided by Flux.

Cloud Replication and Real-Time Analytics

The Kafka Connect based mirroring and replication solution MirrorMaker 2 is used to replicate topics from the edge location to the Kafka cluster run in the cloud. Without any further tweaking of producer or Kafka Connect settings, we observed an average end-to-end latency of around 800 ms. That is, on average roughly 800 ms pass between the moment an image is captured and the moment the recognized objects are available for further analysis in the cloud.

The presented use case is simple as there is only one edge location. However, we envision this setup as being deployed in an n:1 fashion, where there are many edge sites replicating their relevant data into a central Kafka cluster run in the cloud. Utilizing a real-time OLAP system, we can visualize statistics and provide dashboards in real-time. Ad-hoc queries can be posed across sites to provide an indication of site performance, or as a means of anomaly detection relevant to production monitoring.

Ad-hoc, Real-Time, and Hybrid Queries

In our demo we have used Apache Druid and Imply’s Pivot product to build a real-time dashboard presenting the recognized items at our edge location. It runs in the cloud and reads the replicated data stream. As the video content changes, the visualized distribution is updated almost instantly.

Sample dashboard built using Imply Pivot on the output data

Using an SQL-like language, we can pose queries in an ad-hoc fashion and define the time scope as we desire. The scope can in fact be hybrid, as Druid hides the distinction between recent and historical data. Without spanning multiple system boundaries from a client perspective, we can perform analytics such as:

  • What are the four most recognized object classes during the last 4 weeks?
  • Obtain all edge locations where a person has been recognized in the last 5 minutes
  • Get the five most-frequently seen object classes on a weekday basis over the last year

Relying solely on open-source or free solutions, one could alternatively use Apache Superset for the visualization part. Similarly, one could build a very similar system by replacing Druid with Apache Pinot, which likewise belongs to the realm of fast, real-time OLAP query engines.

Key Results and Future Work

In our demo project, we have been able to embed the MLOps framework Seldon Core into a streaming environment. Moreover, we have transitioned from stateless to stateful stream processing within that environment.

We also showcased how a hybrid cloud-edge setup can be built and operated, exploiting the benefits of both environments: serving low-latency inference and analytics at the edge regardless of cloud connectivity, while being able to carry out compute-intensive, real-time OLAP queries across edge locations in the cloud.

Multiple improvements and additional features were left for future work as they fall outside of the scope of this demo. Here we list a few which could be integrated into the architecture for production-grade solutions.

  • Monitoring deployed ML models: A key consideration in production environments is the ability to observe and validate a model’s performance. This is measured in terms of relevant KPIs and performance metrics. For this purpose, Seldon Core exposes various standard metrics which can be leveraged in addition to custom-built, use-case specific metrics. These can be scraped with Prometheus and visualized in a Grafana dashboard.
  • Automated re-training and model fine-tuning: This must be accomplished without affecting the deployed service. In addition, a roll-back path must be provided for use when necessary. This can be achieved with frameworks such as Kubeflow or MLflow.
