ChatGPT clone in 30 minutes on AWS Kubernetes
In this blog post I will show how Cluster.dev can streamline launching one of the Hugging Face LLMs, together with a chat interface, on AWS on top of a Kubernetes cluster, and make the setup production-ready.
Hugging Face TGI and Chat-UI
In addition to models, datasets, and Python libraries, Hugging Face also provides Docker containers for local inference, including projects like Text Generation Inference (a Docker container to serve models) and Chat-UI (a Docker image for interactive chatting with models, akin to ChatGPT’s interface).
While this is enough for local deployment and testing, deploying it quickly to Kubernetes would take considerable effort and a lot of configuration.
For that reason, we decided to simplify this process for users who want to deploy LLMs in their cloud accounts without struggling with intricate infrastructure development and management.
Kubernetes, Helm, Terraform, and Cluster.dev
Data scientists commonly utilize Python for testing, fine-tuning, and serving models. Yet, when it comes to production, DevOps teams need to integrate this into the infrastructure code. Notably, Kubernetes offers around 20% better GPU node costs compared to SageMaker, with more flexible scalability. Terraform is often employed for provisioning production infrastructures, coupled with Helm for deploying software to Kubernetes.
The Cluster.dev open-source framework is designed specifically for scenarios where you need to deploy complete infrastructures and software with minimal commands and documentation. Think of it as the Terraform and Helm equivalent of InstallShield (next->next->install), enabling the installation of any software on your cloud accounts. Further details are available at docs.cluster.dev.
Quick Start on EKS
While we demonstrate the workflow here using the Amazon AWS cloud and managed EKS, it can be adapted to any other cloud provider and Kubernetes version.
Prerequisites
- AWS cloud account credentials.
- An AWS quota increase requested for g5 or other desired GPU instance types.
- Cluster.dev and Terraform installed (a quick verification sketch follows this list).
- Select a Hugging Face model with .safetensors weights from Hub. Alternatively, you can upload the model to an S3 bucket; see the example in bootstrap.ipynb.
- Route53 DNS zone (optional).
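Before moving on, a quick sanity check of the prerequisites can save a failed apply later. Below is a minimal sketch using the AWS CLI; the quota code is assumed to be the one for "Running On-Demand G and VT instances", so verify it in the Service Quotas console for your account:
# Confirm credentials, CLI tooling, and the GPU instance quota
aws sts get-caller-identity
terraform version
cdev --help
aws service-quotas get-service-quota \
    --service-code ec2 --quota-code L-DB2E81BA \
    --query 'Quota.Value'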
Create an S3 Bucket for storing state files:
aws s3 mb s3://cdev-states
Clone the repository with the example:
git clone https://github.com/shalb/cdev-examples/
cd cdev-examples/aws/eks-model/cluster.dev/
Edit Configuration files
project.yaml — the primary configuration for the project, defining essential global variables such as organization, region, and state bucket name. It also sets global environment variables.
backend.yaml — configures the backend for Cluster.dev state (including Terraform state) and relies on the variables defined in project.yaml.
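Since the backend has to point at the state bucket created earlier, a quick consistency check can help; this sketch assumes the bucket name cdev-states from the aws s3 mb command above:
# Make sure the state bucket name used above appears in the project/backend configuration
grep -R "cdev-states" project.yaml backend.yaml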
stack-eks.yaml — outlines the AWS infrastructure configuration, including VPC, domains, and EKS (Kubernetes) settings. Detailed information can be found in the Stack documentation.
The most important part here is the configuration of your GPU nodes. Specify their capacity_type (ON_DEMAND, SPOT), instance types, and autoscaling settings (min/max/desired). Additionally, set disk size and node labels if required. The key settings to configure next are:
cluster_name: k8s-model # change this to your cluster name
domain: cluster.dev # if you keep this domain, the zone *.cluster_name.cluster.dev will be auto-delegated
eks_managed_node_groups:
  gpu-nodes:
    name: ondemand-gpu-nodes
    capacity_type: ON_DEMAND
    block_device_mappings:
      xvda:
        device_name: "/dev/xvda"
        ebs:
          volume_size: 120
          volume_type: "gp3"
          delete_on_termination: true
    instance_types:
      - "g5.xlarge"
    labels:
      gpu-type: "a10g"
    max_size: 1
    desired_size: 1
    min_size: 0
You can create additional node groups by adding similar blocks to this YAML. Refer to the complete list of available settings in the corresponding Terraform module.
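Once the cluster is deployed (see Deploy Stacks and Interacting with Kubernetes below), a quick way to confirm that the GPU node group registered with the label defined above is a sketch like this:
# Confirm the GPU nodes joined the cluster with the expected label
# (the nvidia.com/gpu resource appears only after the device plugin from the model stack is installed)
kubectl get nodes -l gpu-type=a10g
kubectl describe nodes -l gpu-type=a10g | grep -A2 "nvidia.com/gpu"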
stack-model.yaml — describes the HF model Stack, referencing the model stack template in the model-template folder, and includes the installation of the required Nvidia drivers.
The model stack primarily utilizes values from the huggingface-model Helm chart, which we have prepared and continue to develop. Check the default values.yaml for a comprehensive list of available chart options. Here are the main configurations you need to adjust:
chart:
  model:
    organization: "HuggingFaceH4"
    name: "zephyr-7b-beta"
  init:
    s3:
      enabled: false # if false, the model is cloned directly from the Hugging Face Git repository
      bucketURL: s3://k8s-model-zephyr/llm/deployment/zephyr-7b-beta # see ../bootstrap.ipynb on how to upload the model
  huggingface:
    args:
      - "--max-total-tokens"
      - "4048"
      #- --quantize
      #- "awq"
  replicaCount: 1
  persistence:
    accessModes:
      - ReadWriteOnce
    storageClassName: gp2
    storage: 100Gi
  resources:
    requests:
      cpu: "2"
      memory: "8Gi"
    limits:
      nvidia.com/gpu: 1
  chat:
    enabled: true
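Since the stack serves the model with Hugging Face TGI, which in this setup expects .safetensors weights (see the prerequisites), it is worth confirming that the chosen model actually ships such weights. A minimal sketch using the public Hugging Face Hub API, assuming curl and jq are installed:
# List the model's files on the Hub and look for .safetensors weights
curl -s https://huggingface.co/api/models/HuggingFaceH4/zephyr-7b-beta \
    | jq -r '.siblings[].rfilename' | grep safetensors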
Deploy Stacks
After finishing the setup, you can deploy everything with just one command:
cdev apply
The list of resources to be created:
Plan results:
+----------------------------+
| WILL BE DEPLOYED |
+----------------------------+
| cluster.route53 |
| cluster.vpc |
| cluster.eks |
| cluster.eks-addons |
| cluster.kubeconfig |
| cluster.outputs |
| model.nvidia-device-plugin |
| model.model |
| model.outputs |
+----------------------------+
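The Cluster.dev CLI also provides plan and destroy sub-commands (see docs.cluster.dev), which are handy for previewing changes and cleaning up later:
# Preview the changes without applying them
cdev plan
# Tear down everything created by the stacks when you no longer need it
cdev destroy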
The whole process takes around 30 minutes; check this video to get an idea:
Working with Infrastructure
Let’s consider some tasks that we can perform on top of this stack.
Interacting with Kubernetes
The kubeconfig file that you get after deploying the stack lets you authenticate to the cluster, check workloads, view logs, etc.:
# First we need to export KUBECONFIG to use kubectl
export KUBECONFIG=`pwd`/kubeconfig
# Then we can examine workloads deployed in the `default` namespace, since we have defined it in the stack-model.yaml
kubectl get pod
# To get logs from the model startup and check that the model loaded without errors
kubectl logs -f <output model pod name from kubectl get pod>
# To list services (should be model, chat and mongo if chat is enabled)
kubectl get svc
# Then you can port-forward the service to your host
kubectl port-forward svc/<model-output from above> 8080:8080
# Now you can chat with your model
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"Continue funny story: John decide to stick finger into outlet","parameters":{"max_new_tokens":1000}}' \
-H 'Content-Type: application/json'
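Beyond /generate, TGI also exposes a streaming endpoint that returns tokens as server-sent events; assuming the same port-forward as above, a quick sketch:
# Stream tokens from the model as they are generated (server-sent events)
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"Write a haiku about Kubernetes","parameters":{"max_new_tokens":100}}' \
    -H 'Content-Type: application/json'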
Changing node size and type
Consider a scenario where you have a large model and need to serve it with substantial instances, preferably using cost-effective spot instances. To achieve this, you simply need to change the type of the node group:
gpu-nodes:
  name: spot-gpu-nodes
  capacity_type: SPOT
  block_device_mappings:
    xvda:
      device_name: "/dev/xvda"
      ebs:
        volume_size: 120
        volume_type: "gp3"
        delete_on_termination: true
  instance_types:
    - "g5.12xlarge"
  labels:
    gpu-type: "a10g"
  max_size: 1
  desired_size: 1
  min_size: 0
After making changes, apply them by running cdev apply.
Keep in mind that spot instances might not always be available in your region. If the spot request cannot be fulfilled, check the AWS Console under EC2 -> Auto Scaling groups -> eks-spot-gpu-nodes -> Activity. If it keeps failing, switch back to ON_DEMAND or adjust instance_types in the manifest, then rerun cdev apply.
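Before switching to spot capacity, it can help to glance at recent spot prices for the chosen instance type. A sketch using the AWS CLI; the region and instance type are examples, adjust them to your stack:
# Show recent spot prices per availability zone for g5.12xlarge
aws ec2 describe-spot-price-history \
    --instance-types g5.12xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
    --query 'SpotPriceHistory[].[AvailabilityZone,SpotPrice]' \
    --output table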
Changing the model
To change the model, just edit its name and organization, then apply the changes by running cdev apply:
model:
  organization: "WizardLM"
  name: "WizardCoder-15B-V1.0"
Enabling Chat-UI
To activate Chat-UI, set chart.chat.enabled: true. This will provide a service that can be port-forwarded and accessed from the browser. For external access, add an ingress configuration, as demonstrated in the sample:
chat:
  enabled: true
  modelConfig:
    extraEnvVars:
      - name: PUBLIC_ORIGIN
        value: "http://localhost:8080"
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    hosts:
      - host: chat.k8s-model.cluster.dev
        paths:
          - path: /
            pathType: Prefix
    tls:
      - hosts:
          - chat.k8s-model.cluster.dev
        secretName: huggingface-model-chat
If you are using the cluster.dev domain with your project prefix (please make sure it is unique), the DNS zone will be configured automatically, and HTTPS certificates for the domain will also be issued automatically. To monitor the progress, use the command: kubectl describe certificaterequests.cert-manager.io
If you want to expose the API for your model, configure the Ingress in the corresponding model section.
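To verify that the Chat-UI ingress was created and its certificate issued, you can run the following; the cert-manager resources are assumed to be installed by the cluster.eks-addons unit from the plan above:
# Check the Chat-UI ingress and the state of its TLS certificate
kubectl get ingress
kubectl get certificates.cert-manager.io
kubectl describe certificaterequests.cert-manager.io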
Monitoring and Metrics
Instructions for setting up Prometheus and Grafana for monitoring can be found in bootstrap.ipynb. We are planning to release a new stack template with monitoring enabled through a single option.
In this Loom video you can see the configuration for Grafana:
Questions, Help, and Feature Requests
Feel free to start a discussion in our GitHub repository.
Summary
There are numerous ways to run HF models. This article describes a scenario for launching an LLM with chat on AWS, using Cluster.dev as an infrastructure installer. We believe it will be particularly useful for engineers well-versed in Kubernetes who want to swiftly deploy their own ChatGPT-like model for their organization using the IaC approach.
Thanks!
Volodymyr Tsap,
CTO Cluster.dev