BentoML Model Serving on Knative

Hirako2000
Kubernetes In Practice
3 min read · Oct 3, 2020

Deploying machine learning models for serving on Kubernetes provides useful auto-scaling features and out-of-the-box metrics gathering through the native metrics-server.

This article outlines the steps to deploy model servers onto Kubernetes, leveraging BentoML to package models and server instances, and Knative to make deployment simpler than managing pods and defining auto-scaling groups by hand.

BentoML steps

Getting started with BentoML is straightforward; the Quickstart guide is a good place to begin.

1/ Install BentoML and scikit-learn locally

pip install bentoml scikit-learn

2/ Clone the repo containing the example

git clone git@github.com:bentoml/BentoML.git

3/ Install the Python dependencies for this training and run it:

pip install -r ./BentoML/guides/quick-start/requirements.txt
python ./BentoML/guides/quick-start/main.py

The Python script runs the training and outputs an IrisClassifier model. To verify the outcome:

bentoml get IrisClassifier:latest

Containerise

BentoML ships with a feature to containerise the model server together with a given model. Use it like so to produce a container image:

bentoml containerize IrisClassifier:latest -t iris-classifier:0.0.1

This builds and saves the image, tagged iris-classifier with our first version.
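
To double-check the result, listing the local images should show the freshly built tag (a quick sanity check; the repository name here assumes the -t value used above):

docker images iris-classifier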

Push the image

Tag the image with your registry username so it matches the reference used in the Knative manifest below, then push it:

docker tag iris-classifier:0.0.1 {docker_username}/iris-classifier:0.0.1
docker push {docker_username}/iris-classifier:0.0.1

Knative deployment

Create a namespace named bentoml, then apply this YAML configuration to deploy the model server from the image we’ve pushed (the kubectl commands after the manifest show one way to do it):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: iris-classifier
  namespace: bentoml
spec:
  template:
    spec:
      containers:
        - image: docker.io/{docker_username}/iris-classifier:0.0.1
          livenessProbe:
            httpGet:
              path: /healthz
            initialDelaySeconds: 3
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /healthz
            initialDelaySeconds: 3
            periodSeconds: 5
            failureThreshold: 3
            timeoutSeconds: 60
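
Assuming the manifest above is saved as service.yaml (the file name is an arbitrary choice), the deployment boils down to creating the namespace and applying the manifest:

kubectl create namespace bentoml
kubectl apply -f service.yaml

# the URL column is the route assigned by Knative
kubectl get ksvc iris-classifier -n bentoml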

That’s all it takes to have the model server deployed. Knative provides a route, and you can invoke the model:

curl -i \
--header "Content-Type: application/json" \
--request POST \
--data '[[5.1, 3.5, 1.4, 0.2]]' \
https://iris-classifier.k8cluster.xyz/predict

Knative will spin up more instances under load, and will scale back down to zero when no requests flow in.
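
One way to watch the autoscaling in action is to keep an eye on the pods while sending a burst of requests. A rough sketch, reusing the hostname from the curl example above:

# watch pods appear under load and disappear once traffic stops
kubectl get pods -n bentoml -w

# in another terminal, generate some load
for i in $(seq 1 200); do
  curl -s \
    --header "Content-Type: application/json" \
    --request POST \
    --data '[[5.1, 3.5, 1.4, 0.2]]' \
    https://iris-classifier.k8cluster.xyz/predict > /dev/null
done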

Traffic splitting

Let’s create another iris classifier model and deploy it as an experimental instance.

First, let’s alter the training so that we generate a different model. Removing some entries from the training data set will do; here is a sed command to strip the first 12 rows:

sed -i 1,12d ./BentoML/guides/quick-start/iris_data.csv

Then train a model:

python ./BentoML/guides/quick-start/main.py

Then containerise and push the serving image containing this new model:

bentoml containerize IrisClassifier:latest -t iris-classifier:0.0.2
docker tag iris-classifier:0.0.2 {docker_username}/iris-classifier:0.0.2
docker push {docker_username}/iris-classifier:0.0.2

Update the Knative service; the only thing to change is the container’s image version, like so:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: iris-classifier
  namespace: bentoml
spec:
  template:
    spec:
      containers:
        - image: docker.io/{docker_username}/iris-classifier:0.0.2
          livenessProbe:
            httpGet:
              path: /healthz
            initialDelaySeconds: 3
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /healthz
            initialDelaySeconds: 3
            periodSeconds: 5
            failureThreshold: 3
            timeoutSeconds: 60

Once applied, Knative creates a new revision of the service.
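
The traffic block below needs the generated revision names; listing the revisions is one way to find them (the names will differ in your cluster):

kubectl get revisions -n bentoml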

Add the traffic splitting values under spec:

  traffic:
    - percent: 90
      revisionName: iris-classifier-first # revision names may vary
    - percent: 10
      revisionName: iris-classifier-second

Apply the change, and only 10% of the requests will get routed to the second model server.
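
To confirm the split took effect, the service status reports the live traffic targets and their percentages; a quick check might look like this:

kubectl get ksvc iris-classifier -n bentoml -o jsonpath='{.status.traffic}'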
