From Dev to Production: Deploying HuggingFace BERT with KServe
The Future of NLP Deployment: BERT Models and KServe in Action
In this post, I will demonstrate how to deploy a HuggingFace pre-trained model (BERT for text classification with the Hugging Face Transformers library) to run as a KServe-hosted model.
First, let’s understand what KServe is and why we need it.
🤔What is KServe?
KServe was initially called KFServing (Kubeflow Serving) and was designed so that model serving could be operated in a standardized way across frameworks right out of the box. There was a need for a model serving system that could easily run on existing Kubernetes and Istio stacks and also provide model explainability, inference graph operations, and other model management functions.
🤷♂️Why KServe?
- KServe is a standard Model Inference Platform on Kubernetes, built for highly scalable use cases.
- Provides a performant, standardized inference protocol across ML frameworks (TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX).
- Supports modern serverless inference workloads with autoscaling, including scale to zero on GPU.
- Offers simple, pluggable production ML serving, including prediction, pre/post-processing, monitoring, and explainability.
- Enables advanced deployments with canary rollouts, experiments, ensembles, and transformers.
🛠️Setting Up KServe
To demo the Hugging Face model on KServe we’ll use the local (Windows OS) quick install method on a minikube Kubernetes cluster. The standalone “quick install” installs Istio and Knative for us without having to install all of Kubeflow and the extra components that tend to slow down local demo installs.
Let’s start the minikube cluster once our local minikube installation is completed.
# Start the minikube cluster
minikube start
# Check the status of minikube cluster
minikube status
The second command should report the host, kubelet, and apiserver as Running if our cluster is healthy.
Next, we need to get a copy of the KServe repository on our local system. Use Git Bash to clone the KServe repository.
cd kubeflow
git clone https://github.com/kserve/kserve.git
If the automatic download of Istio 1.17.2 fails (as it did on my Windows setup), download Istio 1.17.2 manually from the release page for Windows. Extract the istio-1.17.2-win.zip file and place the istio-1.17.2 folder under the kubeflow directory.
cd kubeflow
# The script lives inside the cloned kserve repository
./kserve/hack/quick_install.sh
This installs KServe along with its core dependencies, such as Knative Serving, using a single install script. The install takes around 30–60 seconds, depending on your system.
Note: Sometimes the installer fails because a component has not finished installing. If you see failure logs in the console, just run the installer a second time.
Once the installation is complete, we can confirm that KServe is working on our minikube cluster with the following command:
kubectl get pods -n kserve
We can also list all the pods in the minikube cluster.
kubectl get pods -A
🚀Deploying the Custom HuggingFace Model Server on KServe
There are two main ways to deploy a model as an InferenceService on KServe:
- Deploy the saved model with a pre-built model server on a pre-existing image
- Deploy a saved model already wrapped in a pre-existing container as a custom model server
Most of the time we want to deploy on a pre-built model server as this will create the least amount of work for our engineering team.
There are many pre-built model servers included with KServe out of the box. With KServe our built-in model server options are:
- tensorflow
- sklearn
- pytorch
- onnx
- tensorrt
- xgboost
Sometimes we’ll have a model that will not wire up correctly with the pre-built images. The reasons this could happen include:
- Model built with different dependency versions than the model server
- Model not saved in the file format the model server expects
- Model was built with a new/custom framework not yet supported by KServe
- Model is in a container image whose REST interface differs from the TensorFlow V1 HTTP API that KServe expects
For any of the cases above we have 3 options for deploying our model:
- Wrap our custom model in our own container where our container runs its own web server to expose the model endpoint
- Use the KServe Model Server as the web server (with its standard TensorFlow V1 API) and then override the load() and predict() methods
- Deploy a pre-built container image with a custom REST API, bypassing InferenceService and sending the HTTP request directly to the predictor
Of the 3 options, using Model Server with custom overrides will likely be the most popular route for folks who just want to deploy a custom model.
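For reference, the V1 protocol mentioned above exchanges plain JSON: a client posts an object with an "instances" list to /v1/models/<name>:predict and receives an object with a "predictions" list. The sketch below only illustrates those two shapes. The helper names and sample values are hypothetical, and nothing here calls KServe.

```python
import json
from typing import Any, Dict, List


def build_v1_request(instances: List[Any]) -> str:
    """Serialize a V1-style predict request body (hypothetical helper)."""
    return json.dumps({"instances": instances})


def parse_v1_response(body: str) -> List[Any]:
    """Extract the predictions list from a V1-style response body."""
    payload: Dict[str, Any] = json.loads(body)
    if "predictions" not in payload:
        raise ValueError("V1 responses are expected to carry a 'predictions' key")
    return payload["predictions"]


# Sample values are illustrative only, not real model input or output.
request_body = build_v1_request([[6.8, 2.8, 4.8, 1.4]])
predictions = parse_v1_response('{"predictions": [1]}')
print(request_body)
print(predictions)
```

The custom model in this post keeps the "predictions" key in its response but reads its input from a "sequence" key instead of "instances".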
Given that Hugging Face has a unique Python API and a lot of dependencies, it does not work on KServe out of the box. In this case, we need to do 2 key tasks:
- Create a new Python class that inherits from the KServe Model class, with custom methods for load() and predict()
- Build a custom container image and then store it in a container repository
The remainder of this post will be focused on:
- Building a custom Python kserve.Model with the Hugging Face BERT model wired in
- Building a Docker container with the custom Python kserve.Model and pushing it to Docker Hub
- Deploying the custom InferenceService to our minikube Kubernetes cluster
- Testing the KServe-hosted HuggingFace model
Now let’s get to work building out our custom text classification InferenceService on KServe.
1. Building a Custom Python Model Server
In the code below we can see our custom Model with the Hugging Face code wired into the load() and predict() methods.
from typing import Dict
import logging

import kserve
import torch
from kserve import ModelServer
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class KServeBERTSentimentModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        KSERVE_LOGGER_NAME = 'kserve'
        self.logger = logging.getLogger(KSERVE_LOGGER_NAME)
        self.name = name
        self.ready = False

    def load(self):
        # Build the tokenizer and model from the pre-trained checkpoint
        name = "distilbert-base-uncased-finetuned-sst-2-english"
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        # torchscript=True makes the model return a tuple, so logits are at index 0
        self.model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True)
        self.ready = True

    def predict(self, request: Dict, headers: Dict = None) -> Dict:
        sequence = request["sequence"]
        self.logger.info(f"sequence:-- {sequence}")
        # Tokenize the input text into fixed-length PyTorch tensors
        inputs = self.tokenizer(
            sequence,
            return_tensors="pt",
            max_length=128,
            padding="max_length",
            truncation=True,
        )
        # Run prediction without tracking gradients
        with torch.no_grad():
            predictions = self.model(**inputs)[0]
            scores = torch.nn.Softmax(dim=1)(predictions)
        # Keep the highest-scoring label and its probability for each input
        results = [{"label": self.model.config.id2label[item.argmax().item()], "score": item.max().item()} for item in scores]
        self.logger.info(f"results:-- {results}")
        # Return a dictionary, which will be JSON serializable
        return {"predictions": results}


if __name__ == "__main__":
    model = KServeBERTSentimentModel("bert-sentiment")
    model.load()
    model_server = ModelServer(http_port=8080, workers=1)
    model_server.start([model])
There are two things happening in the above code with respect to integrating with the model server:
- The Hugging Face BERT model is loaded in the load() method
- The predict() method takes incoming inference input from the REST call and passes it to the Hugging Face AutoModelForSequenceClassification model instance
The Hugging Face model we’re using here is “distilbert-base-uncased-finetuned-sst-2-english”. The model and its associated tokenizer are loaded from pre-trained checkpoints that from_pretrained() downloads from the Hugging Face Hub on first use.
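To make the post-processing in predict() easier to follow, here is a minimal sketch of the same softmax-then-argmax step in plain Python, without torch. The ID2LABEL mapping and the logits are assumptions for illustration; in the real server the mapping comes from model.config.id2label and the logits come from the model.

```python
import math
from typing import Dict, List

# Assumed label mapping for illustration; the server reads model.config.id2label.
ID2LABEL: Dict[int, str] = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_results(batch_logits: List[List[float]]) -> List[Dict[str, object]]:
    """Mirror the list comprehension in predict(): top label and its probability."""
    results = []
    for logits in batch_logits:
        scores = softmax(logits)
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append({"label": ID2LABEL[best], "score": scores[best]})
    return results


# Made-up logits for two sentences, not real model output.
example = to_results([[-2.0, 3.0], [1.5, -0.5]])
print(example)
```

Each entry in the result list has the same "label" and "score" keys that the model server returns under "predictions".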
2. Building a new Docker image for the Model Server
Once our model serving code above is saved locally, we will build a new docker container image with the code and required dependencies packaged inside. We can see examples of the container build command and the container repository store command (here, docker hub) below.
Build the new container with our custom code and then send it over to the container repository of your choice:
# Build the image on your local machine, tagged for your repository
docker build -t {username}/kserve-model-repo:v1.0 .
# Push the image to the Docker registry
docker push {username}/kserve-model-repo:v1.0
For those that would prefer to use a pre-built version of this container and skip the coding + docker steps, just use my container up on the docker hub:
Now let’s move on to deploying our model server in our container as an InferenceService on KServe.
3. Deploying Custom Model Server on KServe with kubectl
Given that KServe treats models as infrastructure, we deploy a model on KServe with a yaml file to describe the k8s model resource (e.g., InferenceService) as a custom object. The code listing below shows our yaml file to create our custom InferenceService object on the local k8s cluster.
We need to set four fields to uniquely identify the model:
- apiVersion: “serving.kserve.io/v1beta1”
- kind: “InferenceService”
- metadata.name: [the model’s unique name inside the namespace]
- metadata.namespace: [the namespace your model will live in]
Here we’re using the generic kserve-custom-model as our metadata.name and our model will be created in the default namespace.
Towards the end of the spec we ask Kubernetes to schedule our container with 4 GB of RAM, as Hugging Face models tend to take up a lot of memory.
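As a rough sketch of what such a manifest contains, the snippet below builds the InferenceService object as a Python dictionary and prints it as JSON. The four identifying fields and the image name come from this post. The container name "kserve-container", the "4Gi" memory value, and the helper function itself are assumptions for illustration, not the exact file used here.

```python
import json
from typing import Any, Dict


def build_inference_service(image: str, name: str = "kserve-custom-model",
                            namespace: str = "default",
                            memory: str = "4Gi") -> Dict[str, Any]:
    """Return an InferenceService manifest as a plain dict (illustrative sketch)."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                "containers": [
                    {
                        # Assumed container name, following KServe custom-container examples.
                        "name": "kserve-container",
                        "image": image,
                        # Assumed mapping of "4 GB of RAM" to a 4Gi request and limit.
                        "resources": {
                            "requests": {"memory": memory},
                            "limits": {"memory": memory},
                        },
                    }
                ]
            }
        },
    }


# Replace {username} with your own registry account before applying.
manifest = build_inference_service("{username}/kserve-model-repo:v1.0")
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the printed output can be saved to a file and applied in the same way.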
Once we have our YAML file configured, we can create the Kubernetes object with kubectl as shown below.
kubectl apply -f deploy_bert_sentiment.yaml
Once we run the above kubectl command, we should have a working InferenceService running on our local Kubernetes cluster. We can check the status of our model with the kubectl command:
kubectl get inferenceservices
The model is ready to serve once the READY column in the output shows True.
Deploying a custom model on KServe is not as easy as using a pre-built model server, but it’s not terrible either as we’ve seen so far.🚀🚀
4. Test the KServe-hosted HuggingFace Model
Now let’s make an inference call to our locally hosted Hugging Face Sentiment Analysis model on KServe. First, we need to do some port forwarding work so our model’s port is exposed to our local system with the command:
kubectl port-forward -n istio-system service/istio-ingressgateway 8080:80
We’ll use the curl command to send the input JSON file as input to the predict method on our custom Hugging Face InferenceService on KServe with the command:
curl -v -H "Host: kserve-custom-model.default.example.com" -H "Content-Type: application/json" http://localhost:8080/v1/models/bert-sentiment:predict -d @./input.json
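The same request can be sketched in Python with only the standard library. The URL and Host header are taken from the curl command, and the "sequence" key is the one predict() reads. The example sentence and the send() helper are hypothetical, and the Content-Type header is set explicitly so the server parses the body as JSON.

```python
import json
import urllib.request

# Hypothetical example sentence; the "sequence" key is what predict() reads.
payload = {"sequence": "KServe makes model serving on Kubernetes straightforward."}

request = urllib.request.Request(
    url="http://localhost:8080/v1/models/bert-sentiment:predict",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Host": "kserve-custom-model.default.example.com",
        "Content-Type": "application/json",
    },
    method="POST",
)


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply (needs the port-forward running)."""
    with urllib.request.urlopen(req, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # This JSON string is also valid content for input.json.
    print(json.dumps(payload))
    # print(send(request))  # uncomment once the port-forward is active
```

Building the request separately from sending it makes it easy to inspect the payload before the cluster is available.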
The response is a JSON object with a "predictions" list, where each entry holds the predicted label and its score.
This example has shown how to take a non-trivial NLP model and host it as a custom InferenceService on KServe.
I hope you enjoyed this blog post. If you have any questions, feel free to contact me on LinkedIn and share your experience in the comment section.
The complete source code for this post is available in the following link.