Serving AI Models as APIs on the Edge using k3s— Part 2

Sanjay M · Published in Sparque labs · Aug 5, 2023

Introduction

This is part 2 of a series where we show how to serve AI models as Inference APIs that can be invoked remotely or as part of an inference workflow.

Part 1 showed how to package and serve the gpt2 AI model as a container image.

Part 2 shows how to push your AI model container image to a container registry and deploy it onto k3s, a very lightweight Kubernetes distribution that runs on both x86 and ARM platforms.
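If you do not already have a k3s cluster, the upstream single-node install is a one-liner. The sketch below assumes a fresh Linux node; see the k3s documentation for the options you may want on your edge hardware.

# install a single-node k3s cluster (see https://docs.k3s.io for options)
curl -sfL https://get.k3s.io | sh -

# confirm the node is up
sudo k3s kubectl get nodes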

The source for this deployment is at:

https://github.com/sparquelabs/ai-serving/tree/main/cogs/textgen-gpt2

Steps

Publish your AI model container image to registry

We assume that you have set up a repository in a container registry; in this case we use AWS Elastic Container Registry (ECR). The following commands let you log in to ECR, tag your local image, and push it to ECR.

# log in to AWS ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <your-registry-name>.dkr.ecr.us-east-1.amazonaws.com

# tag the local image for ECR
docker tag textgen-gpt2:latest <your-registry-name>.dkr.ecr.us-east-1.amazonaws.com/textgen-gpt2:latest

# push the image to ECR
docker push <your-registry-name>.dkr.ecr.us-east-1.amazonaws.com/textgen-gpt2:latest
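
The commands above assume that the textgen-gpt2 repository already exists in ECR. If it does not, one way to create it (assuming the same region, us-east-1) is:

# create the ECR repository for the model image
aws ecr create-repository --repository-name textgen-gpt2 --region us-east-1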

Setting up k3s to read from AWS ECR

We need to configure our local k3s cluster so that it can pull images from AWS ECR. We do this with the following registries.yaml configuration:

# obtain an ECR token (the same command used for docker login above)
ECR_TOKEN=$(aws ecr get-login-password --region us-east-1)

# write the registry configuration for the containerd embedded in k3s
cat <<EOF > /tmp/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://782340374253.dkr.ecr.us-east-1.amazonaws.com:5000"
configs:
  782340374253.dkr.ecr.us-east-1.amazonaws.com:
    auth:
      username: AWS
      password: ${ECR_TOKEN}
EOF

# add this to k3s
sudo mv /tmp/registries.yaml /etc/rancher/k3s/registries.yaml

# restart k3s
sudo systemctl force-reload k3s
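
Note that the ECR token embedded in registries.yaml expires after 12 hours, so this file (and the k3s reload) needs to be refreshed periodically. As an optional sanity check that the containerd embedded in k3s can now authenticate with ECR, you can pull the image by hand with the crictl binary bundled with k3s, assuming the same image tag pushed earlier:

# optional: pull the model image through the containerd embedded in k3s
sudo k3s crictl pull <your-registry-name>.dkr.ecr.us-east-1.amazonaws.com/textgen-gpt2:latest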

Kubernetes Deployment for containerized AI model

Now we deploy the containerized gpt2 model to k3s using a Kubernetes Deployment.

You will notice that we deploy only 1 replica at this point. You can increase the replica count for more capacity and redundancy, but the benefit will depend on your available compute, since the model’s CPU requirements for inference are fairly heavy (see the resource request sketch after the manifest below).

You will also notice that we have deployed the model’s Service as a ClusterIP at this point. We will add an ingress at a later time.

Right now, we just want to deploy the model as a Deployment pod (or pods) and test it by invoking its prediction endpoint.

---
apiVersion: v1
kind: Service
metadata:
  name: textgen-gpt2
  labels:
    app: textgen-gpt2
spec:
  type: ClusterIP
  ports:
    - port: 5000
      targetPort: 5000
      protocol: TCP
  selector:
    app: textgen-gpt2
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: textgen-gpt2
  labels:
    app: textgen-gpt2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: textgen-gpt2
  template:
    metadata:
      labels:
        app: textgen-gpt2
    spec:
      containers:
        - name: textgen-gpt2
          image: <your-ecr-registry>:5000/textgen-gpt2
          imagePullPolicy: IfNotPresent # or Never
          ports:
            - containerPort: 5000
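
Since GPT-2 inference is fairly CPU-heavy, it can also help to set resource requests and limits on the container so the scheduler places it sensibly on your edge nodes. The snippet below is a minimal sketch; the values are illustrative assumptions, not measured requirements, and belong under the textgen-gpt2 container in the Deployment above.

# optional: add under the textgen-gpt2 container spec (values are illustrative)
resources:
  requests:
    cpu: "1"
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 2Gi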

Deploy model to k3s

Now we can apply this deployment manifest to k3s.

# deploy the model to k3s
kubectl apply -f textgen-gpt2-deploy.yaml

service/textgen-gpt2 created
deployment.apps/textgen-gpt2 created

# check if it is running
kubectl get pods

NAME                           READY   STATUS    RESTARTS   AGE
textgen-gpt2-f5bf546f7-t5rfh   1/1     Running   0          23m
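
If the pod does not reach the Running state (for example, it is stuck in ImagePullBackOff because the registry credentials are wrong or have expired), the usual first step is to look at the pod’s events and the container logs:

# inspect image pull and scheduling events for the pod (substitute your pod name)
kubectl describe pod textgen-gpt2-f5bf546f7-t5rfh

# inspect the model server's logs
kubectl logs deployment/textgen-gpt2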

kubectl get svc

NAME           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
textgen-gpt2   ClusterIP   10.43.33.9   <none>        5000/TCP   35m

You will notice that the model’s image has been pulled from the registry and the pod is available for serving. The model’s Service is also exposed on a ClusterIP for invocation. At this point, we are not using an Ingress yet.

Invoking the model’s Predictor endpoint

Now we will invoke the model’s Predictor. We will use kubectl’s port-forward utility to reach the model’s Predictor API endpoint. Later, we will create an Ingress so it can be invoked without port-forwarding.

# in terminal 1, port-forward the service
kubectl port-forward service/textgen-gpt2 5000:5000

# in terminal 2
# invoke the service
curl -s -X POST -H 'Content-Type: application/json' http://localhost:5000/predictions -d '{"input": {"prompt":"The sailor sailed into the "}}' | jq '.output'

"[{'generated_text': 'The sailor sailed into the vernal darkness and began to scream.\\n\\n\"My dear, what\\'s happening?\"\\n\\n\"There\\'s a fire burning in the water…\"\\n\\n\"I can\\'t see anything.\"\\n\\n\"Oh God'}]"

Summary

In this part, we have:

  • successfully deployed the AI model’s container to k3s, which can be operated as a cluster on edge nodes or anywhere else you need
  • successfully invoked the AI model’s service endpoint for a prediction
