Serving AI Models as APIs on the Edge using k3s— Part 2

Sanjay M · Published in Sparque labs · Aug 5, 2023

Introduction

This is part 2 of a series where we show how to serve AI models as Inference APIs that can be invoked remotely or as part of an inference workflow.

Part 1 showed how to package and serve the gpt2 AI model as a container image.

Part 2 shows how to push your AI model container image to a container registry and deploy it onto k3s, a very lightweight Kubernetes distribution that runs on both x86 and ARM platforms.
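If you do not already have a k3s cluster, the upstream single-node install is a one-liner. The sketch below assumes a fresh Linux node; see the k3s documentation for the options you may want on your edge hardware.

# install a single-node k3s cluster (see https://docs.k3s.io for options)
curl -sfL https://get.k3s.io | sh -

# confirm the node is up
sudo k3s kubectl get nodes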

The source for this deployment is at:

https://github.com/sparquelabs/ai-serving/tree/main/cogs/textgen-gpt2

Steps

Publish your AI model container image to registry

We assume that you have set up a repository in a container registry; in this case we use AWS Elastic Container Registry (ECR). The following commands let you log in to ECR, tag your local image, and push it to ECR.

# log in to AWS ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <your-registry-name>.dkr.ecr.us-east-1.amazonaws.com

# tag the local image for ECR
docker tag textgen-gpt2:latest <your-registry-name>.dkr.ecr.us-east-1.amazonaws.com/textgen-gpt2:latest

# push the image to ECR
docker push <your-registry-name>.dkr.ecr.us-east-1.amazonaws.com/textgen-gpt2:latest
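
The commands above assume that the textgen-gpt2 repository already exists in ECR. If it does not, one way to create it (assuming the same region, us-east-1) is:

# create the ECR repository for the model image
aws ecr create-repository --repository-name textgen-gpt2 --region us-east-1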

Setting up k3s to read from AWS ECR

We need to configure our local k3s cluster so that it can pull images from AWS ECR. We do this with the following registries.yaml configuration:

# obtain an ECR token (the same command used for docker login above)
ECR_TOKEN=$(aws ecr get-login-password --region us-east-1)

# write the registry configuration for the containerd embedded in k3s
cat <<EOF > /tmp/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://782340374253.dkr.ecr.us-east-1.amazonaws.com:5000"
configs:
  782340374253.dkr.ecr.us-east-1.amazonaws.com:
    auth:
      username: AWS
      password: ${ECR_TOKEN}
EOF

# add this to k3s
sudo mv /tmp/registries.yaml /etc/rancher/k3s/registries.yaml

# restart k3s
sudo systemctl force-reload k3s
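
Note that the ECR token embedded in registries.yaml expires after 12 hours, so this file (and the k3s reload) needs to be refreshed periodically. As an optional sanity check that the containerd embedded in k3s can now authenticate with ECR, you can pull the image by hand with the crictl binary bundled with k3s, assuming the same image tag pushed earlier:

# optional: pull the model image through the containerd embedded in k3s
sudo k3s crictl pull <your-registry-name>.dkr.ecr.us-east-1.amazonaws.com/textgen-gpt2:latest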

Kubernetes Deployment for containerized AI model

Now we deploy the containerized gpt2 model to k3s using a Kubernetes Deployment.

You will notice that we deploy only 1 replica at this point. You can increase the replica count for more capacity and redundancy, but the benefit will depend on your available compute, since the model’s CPU requirements for inference are fairly heavy (see the resource request sketch after the manifest below).

You will also notice that we have deployed the model’s Service as a ClusterIP at this point. We will add an ingress at a later time.

Right now, we just want to deploy the model as a Deployment pod (or pods) and test it by invoking its prediction endpoint.

---
apiVersion: v1
kind: Service
metadata:
  name: textgen-gpt2
  labels:
    app: textgen-gpt2
spec:
  type: ClusterIP
  ports:
    - port: 5000
      targetPort: 5000
      protocol: TCP
  selector:
    app: textgen-gpt2
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: textgen-gpt2
  labels:
    app: textgen-gpt2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: textgen-gpt2
  template:
    metadata:
      labels:
        app: textgen-gpt2
    spec:
      containers:
        - name: textgen-gpt2
          image: <your-ecr-registry>:5000/textgen-gpt2
          imagePullPolicy: IfNotPresent # or Never
          ports:
            - containerPort: 5000
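
Since GPT-2 inference is fairly CPU-heavy, it can also help to set resource requests and limits on the container so the scheduler places it sensibly on your edge nodes. The snippet below is a minimal sketch; the values are illustrative assumptions, not measured requirements, and belong under the textgen-gpt2 container in the Deployment above.

# optional: add under the textgen-gpt2 container spec (values are illustrative)
resources:
  requests:
    cpu: "1"
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 2Gi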

Deploy model to k3s

Now we can apply this deployment manifest to k3s.

# deploy the model to k3s
kubectl apply -f textgen-gpt2-deploy.yaml

service/textgen-gpt2 created
deployment.apps/textgen-gpt2 created

# check if it is running
kubectl get pods

NAME                           READY   STATUS    RESTARTS   AGE
textgen-gpt2-f5bf546f7-t5rfh   1/1     Running   0          23m
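
If the pod does not reach the Running state (for example, it is stuck in ImagePullBackOff because the registry credentials are wrong or have expired), the usual first step is to look at the pod’s events and the container logs:

# inspect image pull and scheduling events for the pod (substitute your pod name)
kubectl describe pod textgen-gpt2-f5bf546f7-t5rfh

# inspect the model server's logs
kubectl logs deployment/textgen-gpt2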

kubectl get svc

NAME           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
textgen-gpt2   ClusterIP   10.43.33.9   <none>        5000/TCP   35m

You will notice that the model’s image has been pulled from the registry and the pod is available for serving. The model’s Service is also exposed on a ClusterIP for invocation. At this point, we are not using an Ingress yet.

Invoking the model’s Predictor endpoint

Now we will invoke the model’s Predictor. We will use kubectl’s port-forward utility to reach the model’s Predictor API endpoint. Later, we will create an Ingress so it can be invoked without port-forwarding.

# in terminal 1, port-forward the service
kubectl port-forward service/textgen-gpt2 5000:5000

# in terminal 2
# invoke the service
curl -s -X POST -H 'Content-Type: application/json' http://localhost:5000/predictions -d '{"input": {"prompt":"The sailor sailed into the "}}' | jq '.output'

"[{'generated_text': 'The sailor sailed into the vernal darkness and began to scream.\\n\\n\"My dear, what\\'s happening?\"\\n\\n\"There\\'s a fire burning in the water…\"\\n\\n\"I can\\'t see anything.\"\\n\\n\"Oh God'}]"

Summary

In this part, we have:

  • successfully deployed the AI model’s container to k3s, which can be operated as a cluster on edge nodes or anywhere else you need
  • successfully invoked the AI model’s service endpoint for a prediction
