Roll your own LLM on OKE with KServe

Ali Mukadam
Published in Oracle Developers · May 21, 2024

Most of the recent hype about Artificial Intelligence (AI) has been about ChatGPT and, for those slightly more technically inclined, OpenAI. Perhaps justifiably so, given how ridiculously easy the OpenAI developers have made it for people to use and build around their service.

That being said, the limitations of ChatGPT are well-known. Perhaps this is why companies are looking to run their AI workloads in the cloud, particularly on Kubernetes. As the compute requirements of AI workloads can range from simple to large and complex, deploying them on the cloud and on Kubernetes allows companies to benefit from economies of scale.

Lordy, I sound like an economist!

Anyway, by deploying AI workloads on Kubernetes, users can not only enjoy Kubernetes’ traditional benefits of scalability and extensibility but also build upon the extensive cloud native ecosystem, such as CI/CD (e.g. Tekton, ArgoCD), application infrastructure (service mesh) and observability (Prometheus, Fluentd, etc.). Interestingly, this also comes at a time when Kubernetes has become considerably better at managing state and data. Talk about convergence!

So, while working to ensure customers can use OKE as a great home for their AI workloads, it occurred to me that I should do more than maintain a passing interest if I want to make things even better for our users.

In this article, we’ll look at how you can use OKE to run your AI workloads. There are many ways you can do this but for the purpose of this article, I’m interested in something more specific: serving my own LLM using OKE. To that end, I’m going to use KServe.

What is KServe?

From the horse’s mouth, KServe:

  • is a Model Inference Platform that runs on Kubernetes, built for highly scalable use cases
  • provides a performant, standardized inference protocol across ML frameworks
  • supports serverless inference workload with Autoscaling, including Scale to Zero on GPU
  • provides high scalability, density packing, intelligent routing using ModelMesh
  • provides simple and pluggable production serving including prediction, pre/post processing, monitoring and explainability
  • supports advanced deployments with canary rollout, experiments, ensembles and transformers

Unless you’re close to or in the sanctum sanctorum of running your own AI workloads, most of this vernacular probably means nothing to you. It’s only when you have to run an AI workload that you start appreciating what the KServe community has built, and I have to say, it packs a wallop. I’ll unpack this for you but first, read Kaslin Fields’ excellent article where she distinguishes between the 3 types of AI workloads:

  1. training workloads: create and improve the model for intended use by feeding it massive amounts of data
  2. inference workloads: an actively running implementation of an AI model
  3. serving workloads: actively serving user requests

The author goes on to note that inference workloads can exist without serving users but a serving workload cannot exist without an inference workload.

Coming back to KServe, what the developers have neatly done is choose a number of open source cloud native and AI projects and assemble them into a coherent whole:

KServe (source: KServe website)

Knative Serving provides the serverless layer and determines how the workload behaves. Istio provides the networking layer, and Kubernetes provides the underlying foundation on which to run it all. This means you can run it on-premises, in the cloud or at the edge, as well as on a variety of hardware. As long as you can deploy Kubernetes in your target environment, you’re in business.

KServe is a Model Server that originated from the Kubeflow project. A Model Server allows you to expose an inference workload as a service through an API. In other words, it gives you the simplicity and convenience of running your own LLM, similar to OpenAI. By implementing different ClusterServingRuntimes, KServe is able to support a number of frameworks such as TensorFlow, PyTorch, XGBoost, ONNX and Scikit-learn, among others.
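
Once KServe is installed (we’ll get to that shortly), you can see which serving runtimes are available on your cluster:

kubectl get clusterservingruntimes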

By running KServe on Kubernetes, you can use the existing toolchain of the cloud native ecosystem. Want to secure your workload? Istio is here to handle authentication, authorization etc. Want to monitor your models? The metrics are captured and exposed in Prometheus format and visualized through Grafana dashboards.

Let’s take KServe for a spin on OKE.

Provision your cluster

First, let’s provision an OKE cluster. At the time of writing, use Kubernetes version 1.29.1. This will allow us to use the latest releases of KServe as well as its dependencies, such as Knative Serving and Istio. Provision a cluster with 3 worker nodes.
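
If you are using the terraform-oci-oke module to do this, the relevant inputs look roughly like the sketch below. Treat the variable names (kubernetes_version, worker_pools) and the shape values as assumptions to verify against the documentation of the module version you are using:

...
# Hypothetical excerpt of a terraform-oci-oke configuration.
# Verify variable names against the module version you are using.
kubernetes_version = "v1.29.1"

worker_pools = {
  np1 = {
    shape  = "VM.Standard.E4.Flex", # assumption: any shape with sufficient CPU/memory will do
    ocpus  = 4,
    memory = 64,
    size   = 3, # the 3 worker nodes mentioned above
  }
}
...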

If you use Oracle Linux 8 images for your worker nodes, you also need to configure a few kernel modules in order for Istio to run successfully. The terraform-oci-oke module handles that easily for you. Chuck the following inside the worker_cloud_init variable before you create the node pools:

...
worker_cloud_init = [
  {
    content      = <<-EOT
      runcmd:
        - 'echo "Kernel module configuration for Istio and worker node initialization"'
        - 'modprobe br_netfilter'
        - 'modprobe nf_nat'
        - 'modprobe xt_REDIRECT'
        - 'modprobe xt_owner'
        - 'modprobe iptable_nat'
        - 'modprobe iptable_mangle'
        - 'modprobe iptable_filter'
        - '/usr/libexec/oci-growfs -y'
        - 'timedatectl set-timezone Australia/Sydney'
        - 'curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode >/var/run/oke-init.sh'
        - 'bash -x /var/run/oke-init.sh'
    EOT
    content_type = "text/cloud-config",
  }
]
...

Install Knative Serving

Once your OKE cluster is ready, we can begin installing KServe and its components, with Knative Serving first cab off the rank. Create the custom resources for Knative:

kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.14.0/serving-crds.yaml

Then, install the Knative Serving component:

kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.14.0/serving-core.yaml
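
Before moving on, it’s worth checking that the Knative Serving pods are all Running:

kubectl get pods -n knative-serving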

Next, we’ll install Istio as the networking layer.

Install Istio

Let’s customize our Istio Operator installation and create the following manifest:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: kserve
      multiCluster:
        clusterName: kserve
      network: kserve
  components:
    egressGateways:
    - name: istio-egressgateway
      enabled: true
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        serviceAnnotations:
          service.beta.kubernetes.io/oci-load-balancer-internal: "false"
          service.beta.kubernetes.io/oci-load-balancer-shape: "flexible"
          service.beta.kubernetes.io/oci-load-balancer-shape-flex-min: "50"
          service.beta.kubernetes.io/oci-load-balancer-shape-flex-max: "100"
          service.beta.kubernetes.io/oci-load-balancer-security-list-management-mode: "None"
          oci.oraclecloud.com/oci-network-security-groups: "ocid1.networksecuritygroup..."

Replace the NSG OCID with yours and install Istio:

$ istioctl install -f istio.yaml
This will install the Istio 1.21.2 "default" profile (with components: Istio core, Istiod, Ingress gateways, and Egress gateways) into the cluster. Proceed? (y/N) y
✔ Istio core installed
✔ Istiod installed
✔ Egress gateways installed
✔ Ingress gateways installed
✔ Installation complete
Made this installation the default for injection and validation.

We can now integrate Istio with Knative Serving:

kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.14.0/net-istio.yaml

We also want to use Istio’s mTLS feature with Knative. First, let’s enable sidecar injection on the knative-serving system namespace:

kubectl label namespace knative-serving istio-injection=enabled

Then, set PeerAuthentication to PERMISSIVE:

apiVersion: "security.istio.io/v1beta1"
kind: "PeerAuthentication"
metadata:
name: "default"
namespace: "knative-serving"
spec:
mtls:
mode: PERMISSIVE
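
Save the manifest to a file (peer-auth.yaml is just an example name) and apply it:

kubectl apply -f peer-auth.yaml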

Verify Istio status:

istioctl verify-install
...
...
Checked 14 custom resource definitions
Checked 3 Istio Deployments
✔ Istio is installed and verified successfully

We also need cert-manager, which KServe uses to provision certificates for its webhooks:

helm repo add jetstack https://charts.jetstack.io
helm repo update
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.crds.yaml
helm install cert-manager --namespace cert-manager --version v1.14.4 jetstack/cert-manager --create-namespace
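
Optionally, wait for the cert-manager deployments to become available before continuing:

kubectl -n cert-manager wait deployment --all --for=condition=Available --timeout=300s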

Install KServe

And finally, we can install KServe itself:

kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.0/kserve.yaml
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.0/kserve-cluster-resources.yaml

Verify all the pods are running before proceeding any further.
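
A quick way to do that, assuming the default namespaces used by each component:

kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n knative-serving
kubectl get pods -n kserve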

Configure DNS

We are not quite out of the woods yet, as we still need to configure DNS. Get the EXTERNAL-IP of the Istio ingress gateway:

export INGRESS_NAME=istio-ingressgateway
export INGRESS_NS=istio-system
export INGRESS_HOST=$(kubectl -n "$INGRESS_NS" get service "$INGRESS_NAME" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n "$INGRESS_NS" get service "$INGRESS_NAME" -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
echo $INGRESS_HOST

Edit your DNS zone and add an ‘A’ record (typically a wildcard, e.g. *.knative.example.com) pointing to this IP address. Alternatively, you can get Kubernetes to handle this automatically by using external-dns. You’ll be pleased to know that we’ve also added OKE Workload Identity support to external-dns, so if you are using an enhanced OKE cluster, you no longer need to stash your key inside a secret.

Finally, patch the Knative ConfigMap:

# Replace knative.example.com with your domain suffix
kubectl patch configmap/config-domain \
--namespace knative-serving \
--type merge \
--patch '{"data":{"knative.example.com":""}}'

Let’s take this for a quick spin using the KServe “Getting Started” example. This is just to make sure we are in business.

Taking KServe for a spin

Begin by creating a namespace:

kubectl create namespace kserve-test

Create an InferenceService:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF

Give it a moment while the model is downloaded, then check the InferenceService status and wait for READY to show True:

kubectl get inferenceservices sklearn-iris -n kserve-test
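
If you prefer to block until the service is ready, you can also wait on its Ready condition:

kubectl wait --for=condition=Ready inferenceservice/sklearn-iris -n kserve-test --timeout=300s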

Prepare your inference input request:

cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF

And send the request:

SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -n kserve-test -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" -d @./iris-input.json | jq

You should be able to see something like:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    97  100    21  100    76   1500   5428 --:--:-- --:--:-- --:--:--  6928
{
  "predictions": [
    1,
    1
  ]
}

As in the KServe example, you should see 2 predictions returned.

Let’s now do a load test as in the example:

kubectl create -f https://raw.githubusercontent.com/kserve/kserve/release-0.11/docs/samples/v1beta1/sklearn/v1/perf.yaml -n kserve-test

And check the output:

kubectl logs -f load-testzrjlx-p7zk2
Requests [total, rate, throughput] 30000, 500.02, 500.00
Duration [total, attack, wait] 1m0s, 59.998s, 2.24ms
Latencies [min, mean, 50, 90, 95, 99, max] 2.116ms, 2.467ms, 2.383ms, 2.71ms, 2.78ms, 3.651ms, 136.809ms
Bytes In [total, mean] 630000, 21.00
Bytes Out [total, mean] 2460000, 82.00
Success [ratio] 100.00%
Status Codes [code:count] 200:30000
Error Set:
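
Once you’re done experimenting, clean up the test resources:

kubectl delete inferenceservice sklearn-iris -n kserve-test
kubectl delete namespace kserve-test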

Summary

In this post, we took KServe for a spin on OCI by deploying it on OKE. We then deployed a sample model with an InferenceService and ran a basic inference test successfully. In a future post, we’ll dive deeper and explore the benefits of rolling your own LLM on OKE.
