Machine Learning Model Serving Overview (Seldon Core, KFServing, BentoML, MLFlow)

Yinon Data
Published in Israeli Tech Radar · Nov 25, 2020

TL;DR: I’m looking for a way to provide Data Scientists with tools to deploy a growing number of models independently, with minimal Engineering and DevOps effort for each deployment. After considering several model serving solutions, I found Seldon Core to be the most suitable for this project’s needs.

Edit August 27, 2021:

I’ve created a video tutorial for getting started with Seldon Core, watch it here:
ML Model Serving at Scale Tutorial — Seldon Core

Context

I’m currently building an ML-based system for my client.
To give you simplified context without getting too deep into the details: the goal of the ML system is to help the main business system by providing real-time predictions based on trained NLP models:

A deeper look inside the ML System will show multiple predictive models — each of them knows how to answer a specific question. The business system needs the ability to query any number of them in different permutations:

Orientation within the ML Space

The 2015 article Hidden Technical Debt in Machine Learning Systems featured a now-famous figure showing that the ML code itself is only a small fraction of a real-world ML system, surrounded by infrastructure such as data collection, feature extraction, monitoring, and serving:

In this post we’ll be focusing on the “Serving Infrastructure” part of it.

What is Model Serving?

To understand what model serving is, we’ll examine it from two perspectives: code and workflow.
Let’s start with the code. The following is a basic example along the lines of the scikit-learn tutorial:
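
(A rough reconstruction of that example; the dataset, classifier, and file name here are just illustrative.)

from sklearn import datasets, svm
import joblib

# Training phase: fit a classifier and dump it to a file
digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(digits.data[:-1], digits.target[:-1])
joblib.dump(clf, "model.joblib")

# Prediction phase: load the dumped model back and predict
loaded = joblib.load("model.joblib")
print(loaded.predict(digits.data[-1:]))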

We could conceptually divide the above code into two fragments: the training phase and the prediction phase.
The training phase ends when we dump the model to a file.
The prediction phase starts when we load it.

We can use the same phases when examining the ML development workflow.
The phases are characterized by which role in the data team is responsible for them, as well as which considerations are taken into account.

While the training phase is in the realm of Data Scientists, where the considerations are along the lines of which algorithm will produce the best recall and precision rates (and probably many more), the prediction phase is in the domain of Data Engineers and DevOps, where the considerations are along the lines of:

  • How to wrap the prediction code as a production-ready service?
  • How to ship and load the dumped model file?
  • Which API / Protocol to use?
  • Scalability, Throughput, Latency.
  • Deployments — How to deploy new model versions? How to rollback? Can we test it using Canary Deployments or Shadow Deployments?
  • Monitoring.
  • Which ML frameworks can we support? (e.g. SKLearn, TensorFlow, XGBoost, PyTorch, etc.)
  • How to wire custom pre and post-processing?
  • How to make the deployment process easy and accessible for Data Scientists?

Solutions

Thankfully there are several frameworks that provide solutions to some of the above considerations. We’ll present them in a high-level overview, compare them, and conclude with the one I chose for my project.

They are:

  • Just a REST API wrapper
  • “The K8s Model Serving Projects”: KFServing and Seldon Core
  • BentoML
  • MLFlow

Just a REST API wrapper

As simple as it sounds. For example, see the tutorial “Serving Machine Learning Models with FastAPI in Python” by Jan Forster on Medium; a minimal sketch of the idea follows.
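
For illustration, here is a minimal sketch of such a wrapper, assuming a scikit-learn model that was dumped to a file called model.joblib (the file name and request schema are hypothetical):

import joblib
from typing import List
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical path; loaded once at startup

class PredictRequest(BaseModel):
    features: List[List[float]]  # one row of feature values per sample

@app.post("/predict")
def predict(request: PredictRequest):
    predictions = model.predict(request.features)
    return {"predictions": predictions.tolist()}

Run it with something like uvicorn main:app and you have a prediction endpoint, but every item on the considerations list above (scaling, deployments, monitoring, and so on) is still entirely on you.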

The K8s Model Serving Projects

There are two popular model serving projects which are both built on Kubernetes:

KFServing

KFServing provides a Kubernetes Custom Resource Definition for serving machine learning (ML) models on arbitrary frameworks. It aims to solve production model serving use cases by providing performant, high abstraction interfaces for common ML frameworks like Tensorflow, XGBoost, ScikitLearn, PyTorch, and ONNX.
It encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting edge serving features like GPU Autoscaling, Scale to Zero, and Canary Rollouts to your ML deployments. It enables a simple, pluggable, and complete story for Production ML Serving including prediction, pre-processing, post-processing and explainability. KFServing is being used across various organizations.

Seldon Core

Seldon Core converts your ML models (Tensorflow, Pytorch, H2O, etc.) or language wrappers (Python, Java, etc.) into production REST/GRPC microservices.
Seldon handles scaling to thousands of production machine learning models and provides advanced machine learning capabilities out of the box including Advanced Metrics, Request Logging, Explainers, Outlier Detectors, A/B Tests, Canaries and more.

KFServing is a collaboration between several companies that are active in the ML space (namely Seldon, Google, Bloomberg, NVIDIA, Microsoft, and IBM) to create a standardized solution for common ML serving problems.
Hence it’s no surprise that the two share similar mechanisms and even code components.

Seldon seems more mature as a project, with more comprehensive documentation, more frequent releases, and a community with an active Slack channel, as well as bi-weekly working group calls. It has proved to be extremely useful for me (now you know which one I chose ;)

Here you can find a detailed comparison between the two.
Next, we’ll cover some of their main features:

Inference Servers

Seldon introduces the notion of Reusable Inference Servers vs. Non-Reusable Inference Servers.

It provides out-of-the-box Prepackaged Model Servers for standard inference using SKLearn, XGBoost, Tensorflow, and MLflow.

In addition, it offers a mechanism for creating a wrapper service for custom models in case you need it, as sketched below.
KFServing offers parallel options here.
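
To give a feel for the custom (non-reusable) route, this is roughly what a Seldon Core Python wrapper looks like: a plain class exposing a predict method, which Seldon’s Python wrapper turns into a REST/gRPC microservice (the model file name here is hypothetical):

import joblib

class MyModel:
    """A custom model class for the Seldon Core Python wrapper."""

    def __init__(self):
        # Load the trained model once, when the microservice starts
        self.model = joblib.load("model.joblib")

    def predict(self, X, features_names=None):
        # X arrives as an array built from the request payload
        return self.model.predict(X)

The class is then containerized (for example with Seldon’s s2i images) and referenced from a SeldonDeployment like the one shown in the next section.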

Deployments

Seldon Core and KFServing have a similar approach to deploying model prediction services, based on a Kubernetes CRD (Custom Resource Definition).
For example:


apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: seldon-model
spec:
  name: test-deployment
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - name: classifier
          image: seldonio/mock_classifier:1.0
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    name: example
    replicas: 1

And then you can deploy it with:

kubectl apply -f my_ml_deployment.yaml

Supported API Protocols

As both of these projects are in development, the protocols are changing too.
KFServing has its Data Plane (V1) protocol, while Seldon Core has its own Seldon protocol. Both support the Tensorflow protocol.
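
For a rough feel of the Seldon protocol, a prediction request is a JSON payload posted to the deployment’s predictions endpoint. A hedged sketch (the ingress host, namespace, and deployment name are placeholders):

import requests

# Placeholders: ingress host, "default" namespace, and the "seldon-model" deployment
url = "http://<ingress-host>/seldon/default/seldon-model/api/v1.0/predictions"
payload = {"data": {"ndarray": [[1.0, 2.0, 3.0, 4.0]]}}

response = requests.post(url, json=payload)
print(response.json())  # a JSON body whose "data" field holds the predictions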

The latest effort regarding protocols is KFServing’s proposal of a Data Plane (V2) protocol. It is still a work in progress, and so is the pre-packaged server implementing it, developed under Seldon (SeldonIO/MLServer).

Inference Graph

A notable difference between the two is that while KFServing focuses on the “simple” use case of serving a single model, Seldon Core allows more complex inference graphs, which may include multiple models chained together with ROUTER, COMBINER, and TRANSFORMER components.

BentoML

BentoML describes itself as “a simple yet flexible workflow empowering Data Science teams to continuously ship prediction services”, highlighting:

  • A unified model packaging format enabling both online and offline serving on any platform.
  • Up to 100x the throughput of a regular Flask-based model server, thanks to its advanced micro-batching mechanism (benchmarks at https://github.com/bentoml/benchmark).
  • High-quality prediction services that speak the DevOps language and integrate with common infrastructure tools.

Inference Servers

BentoML’s approach to creating a prediction service is similar to Seldon Core’s and KFServing’s approach to creating wrappers for custom models.
The gist of it is subclassing a base class and implementing your prediction code there, as sketched below.
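
A hedged sketch of what that subclassing looked like with the BentoML API at the time of writing (the artifact name and class are illustrative):

import pandas as pd

from bentoml import BentoService, api, artifacts, env
from bentoml.adapters import DataframeInput
from bentoml.frameworks.sklearn import SklearnModelArtifact

@env(infer_pip_packages=True)
@artifacts([SklearnModelArtifact("model")])
class MyClassifier(BentoService):

    @api(input=DataframeInput(), batch=True)
    def predict(self, df: pd.DataFrame):
        # The prediction code lives here, next to the packed model artifact
        return self.artifacts.model.predict(df)

# Roughly: after training, you pack the model into the service and save it,
# which produces the deployable "bento":
# svc = MyClassifier()
# svc.pack("model", trained_sklearn_model)
# saved_path = svc.save()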

The main notable difference between “The K8s Projects” and BentoML is the absence of the reusable inference servers notion. Although you could probably implement it using BentoML’s framework, its approach is generally oriented towards creating non-reusable servers, while “The K8s Projects” explicitly offer both options and have built-in framework support for reusable servers (for example, a container initializer that loads the model file from storage on boot).

Supported API Protocols

BentoML doesn’t have a standardized API protocol. You can implement whatever API functions you want in your custom service, as long as you use API InputAdapters as the input schema.

Deployments

This is another major difference between BentoML and “The K8s Projects”. While the latter are obviously built upon Kubernetes and have the streamlined deployment mechanism described above, BentoML is deployment-platform-agnostic and offers a wide variety of options. An interesting comparison point between the two can be gained by looking at BentoML’s guide to deploying to a Kubernetes cluster.

All in all, it’s not that different, except that the absence of reusable inference servers forces you to build and push a Docker image for every model you want to deploy. This could become an issue when dealing with a growing number of models, and it adds to the complexity of the process.

Another interesting synergy between the discussed solutions is Deploying a BentoML to KFServing.

MLFlow Models

An MLflow Model is a standard format for packaging machine learning models that can be used in a variety of downstream tools — for example, real-time serving through a REST API or batch inference on Apache Spark. The format defines a convention that lets you save a model in different “flavors” that can be understood by different downstream tools.

Similar to BentoML, we end up with a deployable unit (either a containerized REST API server or a Python function), and then we have to figure out how to deploy it.
BentoML’s docs go as far as comparing it to MLFlow.
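
For a concrete feel, saving a model in the MLflow Models format and loading it back through the generic pyfunc flavor looks roughly like this (the model and output path are illustrative):

import mlflow.pyfunc
import mlflow.sklearn
import pandas as pd
from sklearn import datasets, svm

# Train a small model, then save it in the MLflow Models format
digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001).fit(digits.data, digits.target)
mlflow.sklearn.save_model(clf, "mlflow_model")

# Any downstream tool can load it back through the generic "pyfunc" flavor
loaded = mlflow.pyfunc.load_model("mlflow_model")
print(loaded.predict(pd.DataFrame(digits.data[-1:])))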

Here too there is a synergy between solutions, as Seldon Core has a pre-packaged inference server for MLFlow Models.

Summary

As I’ve stated before, I chose Seldon Core for this project.
The reasons are:

  • Frequent releases.
  • Extensive documentation.
  • Active community.
  • Pre-packaged inference servers.
  • Simple deployment process.

In my use case, ideally, I would like to provide Data Scientists with tools to deploy a large number of models independently, with minimal Engineering and DevOps efforts for each deployment. The approach of re-usable inference servers along with CRD based deployment seems to be most suitable for that.

Further reading

If you’d like to read more about tools for additional aspects of the ML development lifecycle, such as metadata storage and data versioning, you can refer to this article by neptune.ai.
