Host your own model for your customers from pennies per hour.

Autoscaling Llama Server in the Cloud from $0.08/hr

John Boero
Published in TeraSky
10 min read · Apr 19, 2024


Once you’ve trained a base model on your customer support content and database, how do you deploy it, autoscaled enough to support 100+ concurrent customers in support chat? Most private LLM offerings are targeted at single sessions or labs. This is how you can scale out your own chat service or API for your customers effectively and cheaply, with autoscaling to match demand, while keeping all that data away from third parties. I will start with GCP for cost savings but later create equivalent modules and Packer images for AWS and Azure. Special thanks to Grant Webb at Google London for assisting on this.

LLM models are read-only during inference which makes them perfect for stateless load balancing on affordable spot instances.

The quickly evolving world of private LLM services and models is exciting. While I’ve guided a few TeraSky customers building on-prem clusters for local inference and training, it’s time to update that for larger groups in the cloud. This is a follow-up to my earlier post Production Grade Llama. In this case I build cloud autoscaling LLM inference on a shoestring budget. You can use this module to deploy a cluster hosting your models whether you need a hundred instances at peak or zero overnight.

At this point it’s easy enough for me to run a fully performant local Llama server. Adding a pair of relatively old GV100 GPUs linked via NVLink to my even older dual Xeon workstation is highly cost effective for just me. A total capital investment of around $4,000 is more than adequate for a single inference session at a time. But what about a team or customers? What if I’m looking to provide the API or app to hundreds of customers? This is where cloud becomes the better option. The cloud offers plenty of modern options, from older and newer GPU devices to new custom inference hardware and proprietary specialized instance types. Many of these options are in high demand, pricey, and actually overkill for my lightweight specialized LLMs.

I run a pair of 5 year old workhorse Voltas (2 x Quadro GV100 32GB). The 16GB version in cloud is affordable.

Here I’ll show how older cloud instances with older GPUs can be cheaply auto-scaled in a cluster for as low as $0.08 per hour. Any time specialized hardware is used it needs to be supported by the kernel. Kubernetes clusters can be provisioned via GKE on GPU VMs, but the OS image and control plane can’t be customized. In this case we want to control the version of the Nvidia driver and CUDA stack, so a container won’t cut it unless the container host has matching support, which is uncommon in cloud-native Kubernetes solutions. We need to build a custom VM image (the GCP equivalent of an AMI) with the right kernel modules and CUDA stack so instances boot and start serving Llama immediately with the model of our choice. I’ll create a Packer template and some Terraform to deploy the autoscale group. This also builds the Llama.cpp server from source each time, which isn’t too efficient but gives us a consistently optimized platform for autoscale groups. Model choice needs to be flexible so we can try the flood of evolving models from the community and HuggingFace.

Keeping up with model releases on HuggingFace.

Strategy

The architecture for this considers a few important facts. Let’s assume we have a small model specialized in something like customer support trained on product documentation.

  1. Older platforms are fine. I don’t need the latest H100 or GB100 from Nvidia for language inference. The older Volta GPUs are great and the CPUs can actually be ancient, because CPU performance doesn’t matter if a model fits inside the GPU. When designing on-prem clusters you can pack 4–8 modern GPUs into an 8 year old rack server that costs $500 and inference will perform just as well as on a state of the art motherboard. The power supplies and cooling are what matter most. In the cloud, GCP offers a VM with ~5 year old Volta V100 cards on 8-9 year old Skylake generation Xeons for around $1.88/hour. This is the same card I run locally except with 16GB instead of 32GB like mine, which makes a big difference. They can be purchased on secondhand markets for $1,200–$1,500 today, or $2,000+ for the 32GB model. Even cheaper: if a model fits in a smaller card like the Nvidia P4, T4, or L4 series GPUs, these VMs can run for as little as $0.08/hr. Network bandwidth also isn’t very important as API endpoints will trickle at most about 50 tokens per second (~200B/s max), so there is no need for high end 10Gb networking. I’ll select Skylake CPUs with cheap T4 GPUs for their 16GB of GDDR6. If inference is too slow it’s simple and quick to switch to the V100 with 16GB of HBM2.
  2. Spot instances are ideal. Models are read-only and the server is basically stateless besides the random seed and cache. All state of a conversation or completion is held by the client. Servers cache parsed tokens for speed, but any server in the cluster can continue a completion via a TCP retry. Spot instances for our Skylake/V100 combo are actually priced around $0.88/hr per VM instead of $1.88/hr, which is great. If our service sits idle it will autoscale the cluster down to $0.88/hr, and if GCP kills our spot instance the autoscale group will just add another one. Any of the other GPU options can see their spot prices cut roughly in half. I couldn’t find a V100 instance in another cloud that cost less than about $3/hr, though AWS has some really impressive dedicated inference hardware that can run Llama very quickly if you need maximum performance and can afford it.
  3. Models in a bucket. The system image will use gcsfuse to mount a GCS bucket as a read-only local filesystem. It is possible to rename or move models around to select the current active one, or else pass that as configuration. This will be a bit slower than baking models into our image but allows swapping models without an image rebuild. Unfortunately Nvidia’s CUDA support doesn’t quite keep up with newer kernels or gcc versions, so I will fall back on Ubuntu 22.04 LTS by popular demand and build the Llama.cpp server from source as part of the image build. I was hoping kernel 6.9’s FUSE passthrough would speed up loads but it sounds like that won’t affect the gcsfuse filesystem.
  4. Autoscale metrics! A normal autoscale group will add/remove instances based on CPU or network usage. In this case even our older Skylake CPUs will be mostly idle as work is offloaded to the GPUs. In fact, the easiest way to detect a completion being performed is a CPU usage spike from idle to about 7%. We will need to set up customized rules monitoring either GPU or network usage for the autoscale group to add/remove instances, and also allow time for the services to start; see the Terraform sketch just after this list.
  5. Sticky sessions optional. Since the only state in the server is cache, sticky sessions aren’t required. A client can just take its API call to any node and it will be serviced. There is zero common back-end, just a read-only file shared in a bucket. Cache does significantly speed up re-parsing parts of a completion thread though, so it may be worth trying.
  6. Multiple GPUs unnecessary. Multiple GPUs on a VM can boost RAM and help with training. In this case we’re trying to scale out instead of up. Supporting multiple inference requests in parallel means redundantly copying the same model to multiple systems. If your GPU needs increase, all you need to do is change the instance type selection in Terraform for the existing image, but we shouldn’t be worried for now.
  7. Docker optional. Docker images are automated as part of Llama.cpp’s build pipelines. GCP also allows the option of launching a container as part of VM instance provisioning. This is a valid option but it also means new releases may change or break behavior in new instances. Llama.cpp uses commit hash for versioning which isn’t ideal. I will elect to build Llama.cpp into the AMI and enable it via a systemd service. This also means Llama server logs will go straight into journald and flow into cloud log monitoring by default. Simples.
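
Since item 4 calls for custom scaling signals, here is a minimal Terraform sketch of a custom-metric autoscaling policy. It assumes the Ops Agent on each VM exports a per-instance GPU utilization gauge as agent.googleapis.com/gpu/utilization, and it targets the managed instance group sketched later in the Terraform section; the names and thresholds here are illustrative, not the module's exact values.

resource "google_compute_region_autoscaler" "llama" {
  name   = "llama-autoscaler"
  region = var.region
  target = google_compute_region_instance_group_manager.llama.id

  autoscaling_policy {
    min_replicas    = 1
    max_replicas    = 100
    # Give new VMs time to load the driver and the model before judging them.
    cooldown_period = 300

    # Scale on GPU utilization reported by the Ops Agent instead of mostly-idle CPUs.
    metric {
      name   = "agent.googleapis.com/gpu/utilization"
      target = 60
      type   = "GAUGE"
    }
  }
}

If GPU metrics aren't collected on your instances, the built-in per-instance metric compute.googleapis.com/instance/network/received_bytes_count is a workable fallback signal.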

Packer

A production Llama server should probably use signed packaging which protects the binary, config, and systemd units from tampering. In this case I’ll just build an optimized binary from the latest default branch and throw config into the systemd unit. Nvidia has a marketplace image based on Ubuntu 22.04, but funnily enough the marketplace builder for Packer is far more restricted than the standard VM builder. Also the marketplace image requires a minimum of 8 vCPUs and 16GB RAM, which is kind of silly just for a small build VM. The build instance will need some kind of Nvidia hardware present or the Nvidia driver installer will fail in Ubuntu (this isn’t an issue in Fedora). Selecting a moderate instance, we can then use two heredoc provisioners to prep everything we need: first a file provisioner to drop the systemd unit into an accessible directory, then a shell provisioner to install everything.
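
The googlecompute source block that these provisioners attach to could look roughly like the following. The zone, machine type, and accelerator values are placeholders rather than exact settings, and var.zone is assumed to exist alongside the shared variables discussed later. GCP requires on_host_maintenance = "TERMINATE" whenever an accelerator is attached.

source "googlecompute" "llamacuda" {
  project_id          = var.project_id
  zone                = var.zone
  source_image_family = "ubuntu-2204-lts"
  machine_type        = "n1-standard-4"
  image_name          = "llamacuda-${formatdate("YYYYMMDDhhmmss", timestamp())}"
  ssh_username        = "ubuntu"

  # The Ubuntu driver install wants a GPU present, so attach one to the build VM too.
  accelerator_type    = "projects/${var.project_id}/zones/${var.zone}/acceleratorTypes/nvidia-tesla-t4"
  accelerator_count   = 1
  on_host_maintenance = "TERMINATE"
}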

provisioner "file" {
destination = "/tmp/llama.service"
content = <<EOF
[Unit]
Description=Llama.cpp server CUDA build.
After=syslog.target network.target local-fs.target remote-fs.target nss-lookup.target

[Service]
Type=simple
User=llama
#EnvironmentFile=/etc/sysconfig/llama
ExecStart=/usr/bin/llamaserver -m /mnt/${var.llama_model} -c ${var.llama_context_size} --host :: --port 80
ExecReload=/bin/kill -s HUP
Restart=never

[Install]
WantedBy=default.target
EOF
}

Note the runtime config is baked in as Packer variables, which may not be ideal. This includes the model file mounted from our bucket at /mnt/ and the runtime context size. The build creates a dedicated llama user, and the unit grants CAP_NET_BIND_SERVICE so that non-root user can bind port 80 and keep org-wide firewall rules and internal health checks simple. Hardening guidelines would instead use an unprivileged port like the default 8080. You may choose to rename or move models in your bucket to select the active model rather than rebuilding an image with new config. By default CUDA will balance workloads across all GPUs in the system, but we are only using one. If you want to limit GPU visibility, the CUDA_VISIBLE_DEVICES environment variable can be used to mask unwanted devices.
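
For example, uncommenting the EnvironmentFile line above and shipping a small file alongside it is one way to pin the server to a single device. This /etc/sysconfig/llama is a hypothetical sketch, not part of the build as written:

# /etc/sysconfig/llama (hypothetical): expose only the first GPU to llamaserver
CUDA_VISIBLE_DEVICES=0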

provisioner "shell" {
inline = [ <<EOF
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install

sudo apt update
sudo apt upgrade -y
sudo apt install -y nvidia-cuda-toolkit gcsfuse git make build-essential nvidia-driver-545
sudo modprobe nvidia

sudo add-apt-repository multiverse
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb [signed-by=/usr/share/keyrings/cloud.google.asc] https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo tee /usr/share/keyrings/cloud.google.asc


echo Adding models bucket to fstab.
echo "${var.modelbucket} /mnt gcsfuse allow_other" | sudo tee -a /etc/fstab

nvidia-smi || echo "Failed nvidia-smi.. continuing"

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j4 LLAMA_CUDA=1 LLAMA_FAST=1 CUDA_DOCKER_ARCH=all CUDA_VERSION=${var.cuda_version} server

sudo mv server /usr/bin/llamaserver
sudo mv /tmp/llama.service /usr/lib/systemd/system
sudo chown root:root /usr/bin/llamaserver /usr/lib/systemd/system/llama.service
sudo systemctl daemon-reload
sudo systemctl enable --now llama.service

wait
EOF
]
}

Simply supply your own credentials and run a Packer build in the packer/ directory of the module. We will use Terraform to create an instance template with our desired hardware for the autoscale instance groups.
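
To make that build step concrete: assuming application-default credentials are already configured for your project, from the packer/ directory shown in the tree below it is roughly the following (packer init is only needed if the template declares required_plugins).

gcloud auth application-default login    # or point GOOGLE_APPLICATION_CREDENTIALS at a key file
packer init .
packer build -var-file=llamacuda.pkrvars.hcl .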

Terraform

Those who have read my Packer book know a trick or two to optimize Packer/Terraform integration. For starters, there are Packer variables and Terraform variables. Modern Packer uses HCL which means that I can define a common variables file and symlink it into the Packer directory to use the exact same file and variables between Packer and Terraform. That saves some redundant code and reduces mistakes.

~/c/terraform-google-llama-autoscale (main)> tree
.
├── examples
│ └── t4.tf
├── LICENSE
├── main.tf
├── outputs.tf
├── packer
│ ├── llamacuda.pkrvars.hcl
│ ├── llamacuda_ubuntu.pkr.hcl
│ └── variables.pkr.hcl -> ../variables.tf
├── README.md
├── terraform.tfstate
├── terraform.tfstate.backup
└── variables.tf

3 directories, 11 files

Now with shared variables we can pull all the bare necessities like region, project, bucket, etc. into a common file for Terraform and Packer rather than wonder whether I used proj_id or project_id or whatever idiosyncrasies come up.
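
As a sketch, that shared variables.tf might contain something like the following, using the variable names already referenced in the Packer template (the defaults are illustrative):

variable "project_id" {
  type = string
}

variable "region" {
  type    = string
  default = "europe-west4"
}

variable "modelbucket" {
  description = "GCS bucket holding GGUF models, mounted read-only at /mnt."
  type        = string
}

variable "llama_model" {
  description = "Model filename inside the bucket."
  type        = string
}

variable "llama_context_size" {
  type    = number
  default = 4096
}

variable "cuda_version" {
  type    = string
  default = "12.2"
}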

Now all we need to do is create an Instance Template for the AMI we built with Packer and then create an auto-scaling Managed Instance Group with that template. The template allows us to choose what hardware to launch on. My preferred general model (amethyst-13b-mystral) takes 8.6GB at Q5 quantization. For proving this out, I’ll start with lower quality Q3 quantization for 6GB instead. This allows my models to use the cheaper 8GB P4 GPU instances at $0.08/hr per spot VM.
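
If you need to re-quantize a model to hit a memory target like this, llama.cpp ships a quantize tool. This is a rough sketch with hypothetical file names; the exact binary name and supported quant types depend on your llama.cpp checkout.

# File names are placeholders; run from the llama.cpp build directory.
./quantize amethyst-13b-f16.gguf amethyst-13b-Q3_K_M.gguf Q3_K_M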

There are public modules available for a quick autoscaler based on a VM instance template, but most don’t give quite what I want, namely instances with no external IPs. There is no reason we should need access or SSH from public networks, so let’s keep the instances internal and only open firewall ports for health checks.
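
A trimmed Terraform sketch of that shape is below. Resource names, machine type, and the image reference are illustrative and the published module may differ; the key detail is that network_interface has no access_config block, which is what keeps instances off public IPs.

resource "google_compute_instance_template" "llama" {
  name_prefix  = "llama-"
  machine_type = "n1-standard-4"

  disk {
    source_image = "llamacuda"   # image produced by the Packer build
    boot         = true
  }

  guest_accelerator {
    type  = "nvidia-tesla-t4"
    count = 1
  }

  scheduling {
    preemptible         = true   # spot pricing
    provisioning_model  = "SPOT"
    automatic_restart   = false
    on_host_maintenance = "TERMINATE"   # required when a GPU is attached
  }

  network_interface {
    network = "default"
    # No access_config block, so no external IP.
  }
}

resource "google_compute_region_instance_group_manager" "llama" {
  name               = "llama-mig"
  region             = var.region
  base_instance_name = "llama"

  version {
    instance_template = google_compute_instance_template.llama.id
  }

  named_port {
    name = "http"
    port = 80
  }
}

The autoscaler sketched back in the Strategy section points its target at this group manager.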

Most people will want to set up DNS and TLS certificates on their own. Given this I will start with a module that doesn’t use TLS (*gasp*) as a PoC and leave stubs for DNS/TLS. Production users of this module may fork it and adjust as necessary.

Generic frontline customer support provided by my own model in the cloud. Cost for response: $0.00035.

Conclusion

You don’t always need an expensive subscription to managed general purpose AI. If your business needs specialist models to fit just your use case, these models can often be trained and deployed minimally in an autoscaling cluster that meets demand whether you have 3 users or 3,000. You can use this for the standard UI or to embed the API in your application without a third party service recording your data and potentially sensitive customer information. If anybody is curious to deploy their own, please check out the module I’ve written as a PoC and feel free to customize it for your use case. Any questions, I’m happy to answer. The module code is available in this repo and will be published to the Terraform Registry if there is interest. https://github.com/jboero/terraform-google-llama-autoscale
