Build a custom Chatbot using Hugging Face Chat UI and Cosmos DB on Azure Kubernetes Service

Shahzeb Naveed
Mar 27, 2024


Chatbot Application using Azure Kubernetes, MongoDB, Hugging Face TGI and Chat UI

This article is a detailed guide on deploying a ChatGPT-style chat app, complete with backend storage and an LLM server, on Azure Kubernetes Service. For LLM inference, we’ll use the Hugging Face Text Generation Inference (TGI) engine (the same as in my previous article).

“Hugging Face Chatbot” — Adobe Firefly

Step 1: Set up backend database for storing conversations

For this, we’ll use a hosted version of MongoDB offered as part of Azure Cosmos DB. Search for ‘Azure Cosmos DB for MongoDB (vCore)’ and create a MongoDB resource with the default options. We’ll call it “chatmongo”. Then navigate to its Settings > Connection strings and copy the connection string, which will look something like this:

mongodb+srv://<user>:<password>@chatmongo.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000
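Before wiring this string into Chat UI, it can save debugging time to verify it from your own machine. Here is a minimal sanity check, assuming you have the mongosh shell installed (and have substituted your actual username and password):

# ping the Cosmos DB for MongoDB (vCore) cluster to confirm the credentials work
mongosh "mongodb+srv://<user>:<password>@chatmongo.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000" \
  --eval "db.runCommand({ ping: 1 })"

If the command prints ok: 1, the connection string and credentials are good.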

Step 2: Prepare a Docker image from the Chat UI repo

For the frontend, we’ll use Hugging Face Chat UI, which is also the core application behind the HuggingChat app.

Clone the repository:

git clone https://github.com/huggingface/chat-ui.git

Update the .env.local file and add your MongoDB connection string, HF token and model details as follows. Remember to replace <user> and <password> in the MongoDB connection string. The most important part is the “endpoints” entry, which tells Chat UI to send requests to a TGI server running at localhost. If you want to use a different model, update the model parameters accordingly as outlined here.

MONGODB_URL=YOUR_MONGODB_URL
HF_TOKEN=YOUR_HUGGING_FACE_TOKEN

MODELS=`[
  {
    "name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "description": "The latest MoE model from Mistral AI! 8x7B and outperforms Llama 2 70B in most benchmarks.",
    "logoUrl": "https://huggingface.co/datasets/huggingchat/models-logo/resolve/main/mistral-logo.png",
    "websiteUrl": "https://mistral.ai/news/mixtral-of-experts/",
    "modelUrl": "https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1",
    "preprompt": "",
    "chatPromptTemplate": "<s> {{#each messages}}{{#ifUser}}[INST]{{#if @first}}{{#if @root.preprompt}}{{@root.preprompt}}\n{{/if}}{{/if}} {{content}} [/INST]{{/ifUser}}{{#ifAssistant}} {{content}}</s> {{/ifAssistant}}{{/each}}",
    "parameters": {
      "temperature": 0.6,
      "top_p": 0.95,
      "repetition_penalty": 1.2,
      "top_k": 50,
      "truncate": 1024,
      "max_new_tokens": 1024,
      "stop": ["</s>"]
    },
    "promptExamples": [
      {
        "title": "Write an email from bullet list",
        "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
      },
      {
        "title": "Code a snake game",
        "prompt": "Code a basic snake game in python, give explanations for each step."
      },
      {
        "title": "Assist in a task",
        "prompt": "How do I make a delicious lemon cheesecake?"
      }
    ],
    "endpoints": [
      {
        "type": "tgi",
        "url": "http://localhost:80"
      }
    ]
  }
]`
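Optionally, you can sanity-check this configuration before building an image: Chat UI is a SvelteKit app, so it can be run locally with Node.js. A quick local smoke test (assuming Node and npm are installed; the dev server typically listens on http://localhost:5173):

cd chat-ui
npm install   # install dependencies
npm run dev   # start the dev server, then open the printed local URL

Note that the app will still need to reach your MongoDB instance and a TGI endpoint to be fully functional, but this is enough to catch typos in the MODELS JSON early.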

In Azure, search for Azure Container Registry and follow on-screen instructions to create a resource.

If you’re like me and have a trial subscription for Azure, you’ll most likely not be able to run az acr build. Instead, make the changes locally, build the Docker image on your machine, and push it to your Azure Container Registry.

# log into your Azure Container Registry
az acr login --name YOUR_ACR_NAME

# build docker image
docker buildx build --platform linux/amd64 -t YOUR_ACR_SERVER_NAME/chatui2 -f Dockerfile.local .

# push image to ACR
docker push YOUR_ACR_SERVER_NAME/chatui2:latest
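To confirm the image actually landed in the registry, you can list the repositories and tags (a quick check, using the same YOUR_ACR_NAME placeholder as above):

# list repositories in the registry
az acr repository list --name YOUR_ACR_NAME --output table

# list tags for the chatui2 repository
az acr repository show-tags --name YOUR_ACR_NAME --repository chatui2 --output table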

Step 3: Set up Azure Kubernetes Service

On Azure, I created an AKS cluster with a single-node system pool and a single-node user pool. (I also used the Standard_A2m_v2 SKU to avoid exceeding quota limits on my trial subscription.) You can adjust node counts and SKUs depending on your requirements.

export RESOURCE_GROUP=rgk8s
export CLUSTER_NAME=k8s2
export LOCATION=eastus

az group create --name=$RESOURCE_GROUP --location=$LOCATION
az aks create --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --node-count 1 --generate-ssh-keys --node-vm-size Standard_A2m_v2 --network-plugin azure
az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME --name userpool --node-count 1 --node-vm-size Standard_A2m_v2
az aks get-credentials --name $CLUSTER_NAME --resource-group $RESOURCE_GROUP
kubectl get nodes

Once the infrastructure is ready, we’ll create a Kubernetes secret to store our Azure Container Registry credentials:

ACR_NAME=YOUR_ACR_NAME
ACR_SERVER=YOUR_ACR_SERVER_NAME # ending with ****.azurecr.io
SERVICE_PRINCIPAL_NAME=chatsp # name it as you like

ACR_REGISTRY_ID=$(az acr show --name $ACR_NAME --query "id" --output tsv)
SP_PASSWORD=$(az ad sp create-for-rbac --name $SERVICE_PRINCIPAL_NAME --scopes $ACR_REGISTRY_ID --role acrpull --query "password" --output tsv)
USER_NAME=$(az ad sp list --display-name $SERVICE_PRINCIPAL_NAME --query "[].appId" --output tsv)


kubectl create secret docker-registry acrsecret4 \
--namespace default \
--docker-server=$ACR_SERVER \
--docker-username=$USER_NAME \
--docker-password=$SP_PASSWORD
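As an aside, if your subscription allows you to create role assignments, a simpler alternative to the service-principal secret is to attach the registry directly to the cluster, which grants the cluster’s kubelet identity AcrPull and removes the need for an imagePullSecret. This may fail on trial subscriptions with limited permissions, which is why the secret-based approach is shown above:

# alternative: attach the ACR to the AKS cluster (requires permission to create role assignments)
az aks update --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --attach-acr $ACR_NAME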

Create Azure File storage for model weights

To prevent Hugging Face TGI from re-downloading model weights from the Hub on every run, we’ll mount an Azure Files share into our Kubernetes deployment.

First, create a new file named storage_class.yaml to specify the StorageClass as follows:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: modeldata
provisioner: file.csi.azure.com # replace with "kubernetes.io/azure-file" if AKS version is less than 1.21
allowVolumeExpansion: true
mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=0
  - gid=0
  - mfsymlinks
  - cache=strict
  - actimeo=30
parameters:
  skuName: Standard_LRS

Then, specify a Persistent Volume Claim manifest persistent_volume_claim.yaml as follows:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: modeldata-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: modeldata
  resources:
    requests:
      storage: 100Gi

Now, use kubectl to create the StorageClass and PersistentVolumeClaim:

kubectl apply -f storage_class.yaml
kubectl apply -f persistent_volume_claim.yaml
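You can check that the claim was created before moving on. With dynamic provisioning against Azure Files it should reach the Bound state on its own, though depending on the binding mode it may wait until the first pod mounts it:

kubectl get storageclass modeldata
kubectl get pvc modeldata-claim   # STATUS should eventually show Bound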

We’ll then create a manifest file tgi_k8s.yaml that specifies a multi-container pod for our frontend chat app and the LLM inference server. Depending on your performance requirements, model selection and available quota, update the CPU/memory specifications.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llmchat
spec:
  selector:
    matchLabels:
      app: llmchat
  template:
    metadata:
      labels:
        app: llmchat
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      containers:
        - image: ghcr.io/huggingface/text-generation-inference:1.4
          name: tgi
          ports:
            - containerPort: 80
              name: http
          env:
            - name: MODEL_ID
              value: 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'
          resources:
            requests:
              memory: "10Gi"
              cpu: "2"
            limits:
              memory: "16Gi"
              cpu: "2"
          volumeMounts:
            - mountPath: /data
              name: volume
              readOnly: false
        - name: chatui
          image: YOUR_ACR_SERVER/chatui2:latest
          ports:
            - containerPort: 3000
              name: ui
          resources:
            requests:
              cpu: 0.5
              memory: "1Gi"
      volumes:
        - name: volume
          persistentVolumeClaim:
            claimName: modeldata-claim
      imagePullSecrets:
        - name: acrsecret4

Apply the manifest:

kubectl apply -f tgi_k8s.yaml
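The first start can take a while because TGI downloads the model weights into the mounted /data volume. A couple of commands I find useful for watching the rollout (the deployment and container names match the manifest above):

# watch the pod come up
kubectl get pods -l app=llmchat -w

# follow the TGI container logs until the server reports it is ready
kubectl logs deploy/llmchat -c tgi -f

# follow the Chat UI container logs
kubectl logs deploy/llmchat -c chatui -f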

We’ll also create corresponding services to interact with our containers. You don’t strictly need a service for the LLM server, since containers in the same pod can communicate with each other via `localhost`, but I still create one because I found it helpful for testing the LLM server with direct cURL commands:

curl LLM_SERVICE_IP_ADDRESS/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'

Eventually, I did run into quota limitation issues, as Kubernetes services themselves consume CPU, so I’ve commented the backend service out in the services.yaml file below.

# # Service for backend container
# apiVersion: v1
# kind: Service
# metadata:
#   name: backend-service
# spec:
#   selector:
#     app: llmchat
#   ports:
#     - protocol: TCP
#       port: 80
#       targetPort: 80
#   type: LoadBalancer
# ---
# Service for UI container
apiVersion: v1
kind: Service
metadata:
  name: ui-service
spec:
  selector:
    app: llmchat
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
  type: LoadBalancer

Apply the services manifest:

kubectl apply -f services.yaml

At this point, your AKS cluster is ready to serve the chat application on the provisioned external IP address, which you can copy and paste into your browser. External IP allocation can take a while; until it completes, the EXTERNAL-IP column will show <pending>.

kubectl get svc

Step 4: Create Nginx Ingress Controller

You can now access your chat app, but as soon as you interact with the LLM, you’ll run into an error that says “You don’t have access to this conversation. If someone gave you this link, ask them to use the share feature instead.” After hours of debugging and scratching my head, I realized that people on this open PR hinted at a possible TLS/SSL issue.

To solve this error, I created a self-signed SSL certificate and used an ingress controller to add another layer for TLS termination (something you would likely want in a production scenario anyway).

An ingress controller is software for Kubernetes services, offering reverse proxying, customizable traffic routing, and TLS termination. It uses Kubernetes ingress resources to configure rules and routes for individual services, enabling a single IP address to efficiently manage traffic across multiple services within a Kubernetes cluster.

Create an ingress manifest file ingress.yaml as follows:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
  namespace: default # Update the namespace if necessary
spec:
  ingressClassName: nginx
  defaultBackend:
    service:
      name: ui-service
      port:
        number: 80
  tls:
    - hosts:
        - my-service # Update with your desired hostname
      secretName: my-tls-secret # Update with your TLS secret name

Use the following bash commands to deploy your ingress:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install my-ingress-nginx ingress-nginx/ingress-nginx
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout tls.key -out tls.crt -subj "/CN=my-service"
kubectl create secret tls my-tls-secret --cert=tls.crt --key=tls.key
sleep 30 # delay to avoid 'no endpoints available for service'
kubectl apply -f ingress.yaml

Once done, run kubectl get svc to view a list of services:

Kubernetes Services

Copy the external IP of the ingress controller service that you just created. You can now access the chat app via the ingress. Remember to add “https://” so that requests are sent over HTTPS.
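If you want to confirm from the command line first, you can inspect the ingress and hit it with cURL; the -k flag skips certificate verification since the certificate is self-signed (replace INGRESS_IP with the external IP you just copied):

kubectl get ingress my-service        # the ADDRESS column should show the ingress IP
kubectl describe ingress my-service   # confirms the default backend is ui-service:80

curl -k https://INGRESS_IP/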

Hugging Face Chat UI Interface

I also needed to reset the ingress setup a few times to get everything right, for which I used the following set of commands:

kubectl delete deployment my-ingress-nginx-controller
kubectl delete ingress my-service
kubectl delete svc my-ingress-nginx-controller
helm uninstall my-ingress-nginx

Step 5 (optional): View the conversation data in MongoDB
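A quick way to peek at the stored chats is to query the cluster directly with mongosh. A minimal sketch, assuming Chat UI’s default database name chat-ui (configurable via MONGODB_DB_NAME) and its conversations collection, with YOUR_MONGODB_URL being the same connection string used in .env.local:

# count stored conversations and print one document
mongosh "YOUR_MONGODB_URL" --eval '
  const chatDb = db.getSiblingDB("chat-ui");
  print("collections:", chatDb.getCollectionNames());
  print("conversations:", chatDb.conversations.countDocuments());
  printjson(chatDb.conversations.findOne());
'

You can also browse the same data through the Data Explorer in the Azure portal or any MongoDB GUI client.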

Thanks for reading!

GitHub Repo: https://github.com/shah-zeb-naveed/azure-llm-deployments
