ChatGPT clone in 30 minutes on AWS Kubernetes
In this blog post I will show how Cluster.dev can streamline launching one of the Hugging Face LLMs, together with a chat interface, on AWS on top of a Kubernetes cluster, and make the setup production-ready.
Hugging Face TGI and Chat-UI
In addition to models, datasets, and Python libraries, Hugging Face also provides Docker containers for local inference, including projects like Text Generation Inference (a Docker container to serve models) and Chat-UI (a Docker image for interactive chatting with models, akin to ChatGPT’s interface).
While this is enough for local deployment and testing, deploying it quickly to Kubernetes would take considerable effort and a lot of configuration.
For that reason, we decided to simplify this process for users who want to deploy LLMs in their cloud accounts without struggling with intricate infrastructure development and management.
Kubernetes, Helm, Terraform, and Cluster.dev
Data scientists commonly utilize Python for testing, fine-tuning, and serving models. Yet, when it comes to production, DevOps teams need to integrate this into the infrastructure code. Notably, Kubernetes offers around 20% better GPU node costs compared to SageMaker, with more flexible scalability. Terraform is often employed for provisioning production infrastructures, coupled with Helm for deploying software to Kubernetes.
The Cluster.dev open-source framework is designed specifically for scenarios where you need to deploy complete infrastructures and software with minimal commands and documentation. Think of it as the Terraform and Helm equivalent of InstallShield (next->next->install), enabling the installation of any software on your cloud accounts. Further details are available at docs.cluster.dev.
Quick Start on EKS
While we demonstrate the workflow here using the Amazon AWS cloud and managed EKS, it can be adapted to any other cloud provider and Kubernetes version.
Prerequisites
- AWS cloud account credentials.
- An AWS quota increase requested for g5 or other desired GPU instance types.
- Cluster.dev and Terraform installed (a quick verification sketch follows this list).
- Select a Hugging Face model with .safetensors weights from Hub. Alternatively, you can upload the model to an S3 bucket; see the example in bootstrap.ipynb.
- Route53 DNS zone (optional).
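Before moving on, a quick sanity check of the prerequisites can save a failed apply later. Below is a minimal sketch using the AWS CLI; the quota code is assumed to be the one for "Running On-Demand G and VT instances", so verify it in the Service Quotas console for your account:
# Confirm credentials, CLI tooling, and the GPU instance quota
aws sts get-caller-identity
terraform version
cdev --help
aws service-quotas get-service-quota \
    --service-code ec2 --quota-code L-DB2E81BA \
    --query 'Quota.Value'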
Create an S3 Bucket for storing state files:
aws s3 mb s3://cdev-states
Clone the repository with the example:
git clone https://github.com/shalb/cdev-examples/
cd cdev-examples/aws/eks-model/cluster.dev/
Edit Configuration files
project.yaml — the primary configuration for the project, defining essential global variables such as organization, region, and state bucket name. It also sets global environment variables.
backend.yaml — configures the backend for Cluster.dev state (including Terraform state) and relies on the variables defined in project.yaml.
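Since the backend has to point at the state bucket created earlier, a quick consistency check can help; this sketch assumes the bucket name cdev-states from the aws s3 mb command above:
# Make sure the state bucket name used above appears in the project/backend configuration
grep -R "cdev-states" project.yaml backend.yaml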
stack-eks.yaml — outlines the AWS infrastructure configuration, including VPC, domains, and EKS (Kubernetes) settings. Detailed information can be found in the Stack documentation.
The most important part here is the configuration of your GPU nodes. Specify their capacity_type (ON_DEMAND, SPOT), instance types, and autoscaling settings (min/max/desired). Additionally, set disk size and node labels if required. The key settings to configure next are:
cluster_name: k8s-model # change this to your cluster name
domain: cluster.dev # if you keep this domain, the zone *.cluster_name.cluster.dev will be auto-delegated
eks_managed_node_groups:
  gpu-nodes:
    name: ondemand-gpu-nodes
    capacity_type: ON_DEMAND
    block_device_mappings:
      xvda:
        device_name: "/dev/xvda"
        ebs:
          volume_size: 120
          volume_type: "gp3"
          delete_on_termination: true
    instance_types:
      - "g5.xlarge"
    labels:
      gpu-type: "a10g"
    max_size: 1
    desired_size: 1
    min_size: 0
You can create additional node groups by adding similar blocks to this YAML. Refer to the complete list of available settings in the corresponding Terraform module.
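Once the cluster is deployed (see Deploy Stacks and Interacting with Kubernetes below), a quick way to confirm that the GPU node group registered with the label defined above is a sketch like this:
# Confirm the GPU nodes joined the cluster with the expected label
# (the nvidia.com/gpu resource appears only after the device plugin from the model stack is installed)
kubectl get nodes -l gpu-type=a10g
kubectl describe nodes -l gpu-type=a10g | grep -A2 "nvidia.com/gpu"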
stack-model.yaml — describes the HF model Stack, referencing the model stack template in the model-template folder, and includes the installation of the required Nvidia drivers.
The model stack primarily utilizes values from the huggingface-model Helm chart, which we have prepared and continue to develop. Check the default values.yaml for a comprehensive list of available chart options. Here are the main configurations you need to adjust:
chart:
  model:
    organization: "HuggingFaceH4"
    name: "zephyr-7b-beta"
  init:
    s3:
      enabled: false # if false, the model is cloned directly from the Hugging Face Git repository
      bucketURL: s3://k8s-model-zephyr/llm/deployment/zephyr-7b-beta # see ../bootstrap.ipynb on how to upload the model
  huggingface:
    args:
      - "--max-total-tokens"
      - "4048"
      #- --quantize
      #- "awq"
  replicaCount: 1
  persistence:
    accessModes:
      - ReadWriteOnce
    storageClassName: gp2
    storage: 100Gi
  resources:
    requests:
      cpu: "2"
      memory: "8Gi"
    limits:
      nvidia.com/gpu: 1
  chat:
    enabled: true
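Since the stack serves the model with Hugging Face TGI, which in this setup expects .safetensors weights (see the prerequisites), it is worth confirming that the chosen model actually ships such weights. A minimal sketch using the public Hugging Face Hub API, assuming curl and jq are installed:
# List the model's files on the Hub and look for .safetensors weights
curl -s https://huggingface.co/api/models/HuggingFaceH4/zephyr-7b-beta \
    | jq -r '.siblings[].rfilename' | grep safetensors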
Deploy Stacks
After finishing the setup, you can deploy everything with just one command:
cdev apply
The list of resources to be created:
Plan results:
+----------------------------+
| WILL BE DEPLOYED |
+----------------------------+
| cluster.route53 |
| cluster.vpc |
| cluster.eks |
| cluster.eks-addons |
| cluster.kubeconfig |
| cluster.outputs |
| model.nvidia-device-plugin |
| model.model |
| model.outputs |
+----------------------------+
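The Cluster.dev CLI also provides plan and destroy sub-commands (see docs.cluster.dev), which are handy for previewing changes and cleaning up later:
# Preview the changes without applying them
cdev plan
# Tear down everything created by the stacks when you no longer need it
cdev destroy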
The whole process takes around 30 minutes; check this video to get an idea:
Working with Infrastructure
Let’s consider some tasks that we can perform on top of this stack.
Interacting with Kubernetes
The kubeconfig file that you get after deploying the stack lets you authenticate to the cluster, check workloads, view logs, etc.:
# First we need to export KUBECONFIG to use kubectl
export KUBECONFIG=`pwd`/kubeconfig
# Then we can examine workloads deployed in the `default` namespace, since we have defined it in the stack-model.yaml
kubectl get pod
# To get logs from the model startup and check that the model loaded without errors
kubectl logs -f <output model pod name from kubectl get pod>
# To list services (should be model, chat and mongo if chat is enabled)
kubectl get svc
# Then you can port-forward the service to your host
kubectl port-forward svc/<model-output from above> 8080:8080
# Now you can chat with your model
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"Continue funny story: John decide to stick finger into outlet","parameters":{"max_new_tokens":1000}}' \
-H 'Content-Type: application/json'
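Beyond /generate, TGI also exposes a streaming endpoint that returns tokens as server-sent events; assuming the same port-forward as above, a quick sketch:
# Stream tokens from the model as they are generated (server-sent events)
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"Write a haiku about Kubernetes","parameters":{"max_new_tokens":100}}' \
    -H 'Content-Type: application/json'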
Changing node size and type
Consider a scenario where you have a large model and need to serve it with substantial instances, preferably using cost-effective spot instances. To achieve this, you simply need to change the type of the node group:
gpu-nodes:
  name: spot-gpu-nodes
  capacity_type: SPOT
  block_device_mappings:
    xvda:
      device_name: "/dev/xvda"
      ebs:
        volume_size: 120
        volume_type: "gp3"
        delete_on_termination: true
  instance_types:
    - "g5.12xlarge"
  labels:
    gpu-type: "a10g"
  max_size: 1
  desired_size: 1
  min_size: 0
After making changes, apply them by running cdev apply.
Keep in mind that spot instances might not always be available in your region. If the spot request cannot be fulfilled, check the AWS Console under EC2 -> Auto Scaling groups -> eks-spot-gpu-nodes -> Activity. If it keeps failing, switch back to ON_DEMAND or adjust instance_types in the manifest, then rerun cdev apply.
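Before switching to spot capacity, it can help to glance at recent spot prices for the chosen instance type. A sketch using the AWS CLI; the region and instance type are examples, adjust them to your stack:
# Show recent spot prices per availability zone for g5.12xlarge
aws ec2 describe-spot-price-history \
    --instance-types g5.12xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
    --query 'SpotPriceHistory[].[AvailabilityZone,SpotPrice]' \
    --output table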
Changing the model
To change the model, just edit its name and organization, then apply the changes by running cdev apply:
model:
  organization: "WizardLM"
  name: "WizardCoder-15B-V1.0"
Enabling Chat-UI
To activate Chat-UI, set chart.chat.enabled: true. This will provide a service that can be port-forwarded and accessed from the browser. For external access, add an ingress configuration, as demonstrated in the sample:
chat:
  enabled: true
  modelConfig:
    extraEnvVars:
      - name: PUBLIC_ORIGIN
        value: "http://localhost:8080"
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    hosts:
      - host: chat.k8s-model.cluster.dev
        paths:
          - path: /
            pathType: Prefix
    tls:
      - hosts:
          - chat.k8s-model.cluster.dev
        secretName: huggingface-model-chat
If you are using the cluster.dev domain with your project prefix (please make sure it is unique), the DNS zone will be configured automatically, and HTTPS certificates for the domain will also be issued automatically. To monitor the progress, use the command: kubectl describe certificaterequests.cert-manager.io
If you want to expose the API for your model, configure the Ingress in the corresponding model section.
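To verify that the Chat-UI ingress was created and its certificate issued, you can run the following; the cert-manager resources are assumed to be installed by the cluster.eks-addons unit from the plan above:
# Check the Chat-UI ingress and the state of its TLS certificate
kubectl get ingress
kubectl get certificates.cert-manager.io
kubectl describe certificaterequests.cert-manager.io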
Monitoring and Metrics
Instructions for setting up Prometheus and Grafana for monitoring can be found in bootstrap.ipynb. We are planning to release a new stack template with monitoring enabled through a single option.
In this Loom video you can see the configuration for Grafana:
Questions, Help, and Feature Requests
Feel free to start a discussion in our GitHub repository.
Summary
There are numerous ways to run HF models. This article describes a scenario for launching an LLM with chat on AWS, using Cluster.dev as an infrastructure installer. We believe it will be particularly useful for engineers well-versed in Kubernetes who want to swiftly deploy their own ChatGPT-like model for their organization using the IaC approach.
Thanks!
Volodymyr Tsap,
CTO Cluster.dev