Scalable Machine Learning with Kafka Streams and KServe


Use Case

Popular libraries such as PyTorch, TensorFlow, or scikit-learn have become the de facto standards for modern ML. However, applying ML models created with these libraries to streaming data is challenging. This is especially true if scalability, fault tolerance, and exactly-once processing must be guaranteed. Another requirement is that data scientists must be able to refine models continuously and without much friction.


KServe is a scalable, standardized model serving platform for Kubernetes. It provides strong abstractions that ease the delivery of ML models as services in Kubernetes. Deploying a model to a Kubernetes cluster is as easy as uploading the model to the object storage of your choice and providing a small Kubernetes YAML resource definition. KServe publishes the model via a standard inference protocol (REST/gRPC) and handles networking, authentication, and model versioning. It provides a solution not only for prediction but also for pre- and post-processing of data, monitoring, and explainability. Furthermore, KServe automatically scales the inference service depending on the load.
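To give an impression of the standard inference protocol, here is a minimal sketch of a V2 (Open Inference Protocol) REST request built with the Python standard library. The host, the model name `translator`, and the input tensor name `text` are illustrative assumptions, not taken from the demo:

```python
import json
import urllib.request


def build_v2_request(texts):
    """Build a request body following the V2 (Open Inference Protocol) REST schema."""
    return {
        "inputs": [
            {
                "name": "text",          # assumed input tensor name
                "shape": [len(texts)],
                "datatype": "BYTES",     # strings are sent as BYTES in V2
                "data": texts,
            }
        ]
    }


def infer(host, model_name, texts):
    """POST the request to the model's V2 infer endpoint and parse the response."""
    body = json.dumps(build_v2_request(texts)).encode("utf-8")
    req = urllib.request.Request(
        f"http://{host}/v2/models/{model_name}/infer",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

A call like `infer("translator.default.example.com", "translator", ["Hello world"])` would then return the translated strings in the response's `outputs` field.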


The goal of the demo is to translate English texts to Spanish using a deep learning model. First, English tweets from the Twitter API are produced into a topic. Then, the Kafka Streams app reads from there and calls our custom translation predictor to obtain a translation. Finally, the Spanish translation is written to an output topic.

[Diagram: data flow inside a Kubernetes cluster from the Text Producer to the Kafka Streams Translation App, which calls the MLServer Translator in KServe]
Architecture overview of the demo system

Building a Custom Predictor with MLServer

To obtain the translation, we use the Argos Translate library, which is an open-source offline translation library written in Python. It conveniently abstracts the translation process from preliminary NLP tasks such as tokenization. We wrap the Argos model and serve this through MLServer. For simplicity’s sake, we load the model from the project directory instead of from cloud storage.
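As a rough idea of what such a wrapper can look like, here is a hedged sketch of an MLServer custom runtime around Argos Translate. It assumes the English-to-Spanish Argos package is already installed and uses MLServer's `StringCodec`; the class name and tensor names are illustrative, not the demo's actual code:

```python
from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse
import argostranslate.translate


class ArgosTranslateRuntime(MLModel):
    """Sketch of an MLServer custom runtime wrapping Argos Translate (en -> es)."""

    async def load(self) -> bool:
        # Assumes the en->es Argos package was installed beforehand.
        languages = argostranslate.translate.get_installed_languages()
        en = next(lang for lang in languages if lang.code == "en")
        es = next(lang for lang in languages if lang.code == "es")
        self._translation = en.get_translation(es)
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the incoming string tensor, translate, and encode the result.
        texts = StringCodec.decode_input(payload.inputs[0])
        translated = [self._translation.translate(text) for text in texts]
        return InferenceResponse(
            model_name=self.name,
            outputs=[StringCodec.encode_output("translation", translated)],
        )
```

MLServer then takes care of exposing this class via the standard REST/gRPC inference endpoints once it is referenced in the model settings.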

Building the Kafka Streams Application

We now walk through the Java Kafka Streams app that calls the MLServer inference service. The app class forms the entry point of the Translation Kafka Streams app.
To connect to Kafka, it inherits functionality from the generic KafkaStreamsApplication of our Kafka Streams Bootstrap project. Applications created with this project automatically send a dead letter to a dedicated error topic if a record cannot be processed. To send requests to KServe, the app uses our open-source Java KServe Client. The library includes a mechanism that automatically retries requests to the inference service when a request times out, e.g. when the inference service is initially scaled to zero.
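The retry mechanism boils down to exponential backoff around the inference call, which bridges the cold start while KServe scales the service up from zero. Here is the idea sketched in Python for brevity (the actual client is Java); the function and parameter names are illustrative:

```python
import time


def retry_request(send, max_attempts=5, initial_backoff=1.0, backoff_factor=2.0):
    """Call `send` and retry with exponential backoff on timeouts.

    Sketch of retrying timed-out inference calls while the inference
    service scales up from zero; not the actual client implementation.
    """
    backoff = initial_backoff
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(backoff)
            backoff *= backoff_factor
```

The first requests after a scale-to-zero period may time out while the pod starts; with backoff the call succeeds as soon as the predictor is ready, without the Kafka Streams app having to handle cold starts itself.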

Running the Demo

Clone our demo from

Scaling Kafka Streams and KServe

With these basic components up and running, the question arises of how this system scales. Three quantities are of interest:

  • the number of incoming messages
  • the number of replicas of our Kafka Streams app
  • the number of predictors of the MLServer Translator

[Chart: number of ready Kafka Streams app replicas, number of MLServer Translator predictors, and messages in per second over time]


We have demonstrated how Kafka Streams and KServe can be combined to build a scalable system for ML inference. This way, we can leverage Kafka Streams’ scalability, fault-tolerance, and exactly-once stream data processing. KServe adds compatibility with current ML frameworks and auto-scaling features to the mix.


