Serving AI Models as APIs on the Edge using k3s — Part 2
Introduction
This is part 2 of a series where we show how to serve AI models as Inference APIs that can be invoked remotely or as part of an inference workflow.
Part 1 showed how to package and serve the AI model gpt2.
Part 2 shows how to push your AI model container image to a container registry and deploy it onto k3s, a very lightweight Kubernetes distribution that runs on both x86 and ARM platforms.
The source for this deployment is at:
https://github.com/sparquelabs/ai-serving/tree/main/cogs/textgen-gpt2
Steps
Publish your AI model container image to a registry
We assume that you have set up a repository in a container registry; in this case we use AWS Elastic Container Registry (ECR). The following commands log you in to ECR, tag your local image, and push it to ECR.
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <your-registry-name>.dkr.ecr.us-east-1.amazonaws.com
docker tag textgen-gpt2:latest <your-registry-name>.dkr.ecr.us-east-1.amazonaws.com/textgen-gpt2:latest
docker push <your-registry-name>.dkr.ecr.us-east-1.amazonaws.com/textgen-gpt2:latest
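If you want to confirm that the push succeeded, you can list the images in the repository (an optional check, assuming the repository is named textgen-gpt2 as in the tag above):
# optional: verify the image is now in ECR
aws ecr describe-images --repository-name textgen-gpt2 --region us-east-1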
Setting up k3s to read from AWS ECR
We need to set up our local k3s cluster so that it can pull images from AWS ECR. We do this by writing a registries.yaml file and installing it into the k3s configuration.
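The registries.yaml below references an ECR_TOKEN variable; one way to populate it (an assumption on our part; any method that yields an ECR login password will do) is to export it first:
# fetch a temporary ECR login password for use in the registries.yaml below
export ECR_TOKEN=$(aws ecr get-login-password --region us-east-1)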
cat <<EOF > /tmp/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://782340374253.dkr.ecr.us-east-1.amazonaws.com:5000"
configs:
  782340374253.dkr.ecr.us-east-1.amazonaws.com:
    auth:
      username: AWS
      password: ${ECR_TOKEN}
EOF
# add this to k3s
sudo mv /tmp/registries.yaml /etc/rancher/k3s/registries.yaml
# restart k3s
sudo systemctl force-reload k3s
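As a quick sanity check (optional), you can confirm the node comes back Ready after the reload:
# confirm the cluster is healthy after reloading k3s
kubectl get nodes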
Kubernetes Deployment for containerized AI model
Now we deploy the containerized AI model for gpt2 to k3s by using a Deployment.
You will notice that we deploy only 1 replica at this point. You can increase the replica count to add capacity and redundancy, but the benefit will depend on your compute capacity, since the model's CPU requirements for inference are fairly heavy.
You will also notice that we expose the model's Service as a ClusterIP at this point. We will add an Ingress later.
Right now, we just want to run the model as a Deployment and test it by invoking its prediction endpoint.
---
apiVersion: v1
kind: Service
metadata:
  name: textgen-gpt2
  labels:
    app: textgen-gpt2
spec:
  type: ClusterIP
  ports:
    - port: 5000
      targetPort: 5000
      protocol: TCP
  selector:
    app: textgen-gpt2
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: textgen-gpt2
  labels:
    app: textgen-gpt2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: textgen-gpt2
  template:
    metadata:
      labels:
        app: textgen-gpt2
    spec:
      containers:
        - name: textgen-gpt2
          image: <your-ecr-registry>:5000/textgen-gpt2
          imagePullPolicy: IfNotPresent # or Never
          ports:
            - containerPort: 5000
Deploy model to k3s
Now we can apply this model's deployment manifest to k3s.
# deploy the model to k3s
kubectl apply -f textgen-gpt2-deploy.yaml
service/textgen-gpt2 created
deployment.apps/textgen-gpt2 created
# check if it is running
kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
textgen-gpt2-f5bf546f7-t5rfh   1/1     Running   0          23m
kubectl get svc
NAME           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
textgen-gpt2   ClusterIP   10.43.33.9   <none>        5000/TCP   35m
You will notice that the model's image has been pulled from the registry and the pod is running and ready to serve.
The model's Service is also available on a ClusterIP for invocation.
At this point, we are not using an Ingress yet.
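If the pod is not Running and Ready, the model server's logs are the first place to look; and, as noted earlier, you can scale out once you have the CPU headroom. Both commands below are standard kubectl, and the replica count of 2 is just an example:
# tail the model server's logs
kubectl logs deploy/textgen-gpt2
# optionally add a replica if you have spare compute
kubectl scale deployment textgen-gpt2 --replicas=2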
Invoking the model's Predictor endpoint
Now we will invoke the model's Predictor. We will use kubectl's port-forward utility to reach the model's Predictor API endpoint. Later, we will create an Ingress so it can be invoked without port-forwarding.
# in terminal 1, port-forward the service
kubectl port-forward service/textgen-gpt2 5000:5000
# in terminal 2
# invoke the service
curl -s -X POST -H 'Content-Type: application/json' http://localhost:5000/predictions -d '{"input": {"prompt":"The sailor sailed into the "}}' | jq '.output'
"[{'generated_text': 'The sailor sailed into the vernal darkness and began to scream.\\n\\n\"My dear, what\\'s happening?\"\\n\\n\"There\\'s a fire burning in the water…\"\\n\\n\"I can\\'t see anything.\"\\n\\n\"Oh God'}]"
Summary
In this part, we have:
- successfully deployed the AI model's container to k3s, which can be operated as a cluster on edge nodes or anywhere else you need it
- successfully invoked the AI model's service endpoint for a prediction
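When you are done, you can tear everything down with the same manifest:
# clean up the service and deployment
kubectl delete -f textgen-gpt2-deploy.yaml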