Running GPU-Accelerated LLM Workloads on EKS

Erik Krieg
Prodigy Engineering
9 min read · Dec 28, 2023


[Image: A surreal representation of a Language Model on Kubernetes with GPU nodes. Created by Author with DALL-E3]

In this article, we’ll take a look at how to run a GPU-accelerated open-source Large Language Model (LLM) inference workload using Elastic Kubernetes Service (EKS). That might be the most buzzwords I’ve ever unironically strung together in a single sentence. Wondering what inference means? It’s the fancy term for prompting a generative AI model for output, as opposed to training the model. The GPU-accelerated-workload-on-K8S part is what motivated me to write this, and showing you how to do it is really “the point” of the article.

The AI model we’ll be using to demonstrate this is Mistral AI’s 7 billion parameter model, and we’ll be serving it with Hugging Face’s text-generation-inference (TGI) server, which of course we’ll be running on our own EKS cluster.

A caveat: what we end up with by the end of the article will not be a production-ready LLM inference service. We’ll only get to the point where our EKS cluster can successfully run the model. Taking it further will be left to your capable hands :)

Overview of the tech we’ll be using:

  • Amazon EKS (Elastic Kubernetes Service)
  • Karpenter, for provisioning nodes
  • NVIDIA K8S Device Plugin
  • Hugging Face’s Text Generation Inference (TGI) server
  • Mistral AI’s Mistral 7B model

I won’t be covering how to create an EKS cluster or how to install Karpenter, but I will be showing how to provision a pool of GPU nodes that are capable of running the Mistral 7B model on the TGI server.

We’ll be using Persistent Volume Claims (PVCs), so if you wish to follow along, make sure these are available on your EKS cluster. You’ll likely want to do this via the Amazon EBS or EFS CSI drivers.
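If you’re not sure whether PVCs will bind on your cluster, a quick sanity check is to confirm that a CSI driver and a usable StorageClass exist (the exact names depend on how your cluster was set up):

kubectl get csidrivers
kubectl get storageclass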

Make GPUs accessible to pods on our cluster

Accelerating K8S workloads with GPUs is not just a matter of throwing some GPU nodes in the pool. There is some critical configuration at the node and cluster level. But before we get into that, which GPUs do we actually need to run Mistral 7B via our inference server?

Figure out our GPU requirements

Through trial and error (courtesy of some NotImplementedError errors), I realized that serving our Mistral model requires Flash Attention v2.

It was not immediately clear to me from the model’s Hugging Face page what Mistral’s GPU requirements were, but after digging deeper I found some helpful clues in the Mistral AI self-hosting documentation. Looking into Flash Attention v2, I learned that the following GPUs support this algorithm:

  • NVIDIA A10
  • NVIDIA A100
  • NVIDIA H100

Reviewing the accelerated computing instance types, I found that the G5 instances use NVIDIA A10G Tensor Core GPUs and have the cheapest hourly rate of the instance types that meet the GPU requirements.

Provision GPU nodes with Karpenter

As mentioned previously, this article is not going to cover installing Karpenter. Specifically, I was using Karpenter v0.31 at the time of writing. If you are using ≥ v0.32, you may wish to review Karpenter’s migration docs, as the APIs used below changed in that release.

Now, there are two resources we’ll need in order to provision our GPU nodes with Karpenter: an AWSNodeTemplate and a Provisioner.

And here are our Karpenter manifests (I’ll explain things after the snippet):

---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: gpu-template
spec:
  subnetSelector: { ... } # required, discovers tagged subnets to attach to instances
  securityGroupSelector: { ... } # required, discovers tagged security groups to attach to instances
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-a10g
spec:
  labels:
    provisioner: gpu-a10g
    zone: default
  requirements:
    - key: "karpenter.k8s.aws/instance-category"
      operator: In
      values: ["g"]
    - key: "karpenter.k8s.aws/instance-gpu-name"
      operator: In
      values: ["a10g"]
    - key: "karpenter.k8s.aws/instance-gpu-count"
      operator: Gt
      values: ["0"]
    - key: "karpenter.k8s.aws/instance-gpu-count"
      operator: Lt
      values: ["4"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "kubernetes.io/os"
      operator: In
      values: ["linux"]
  limits:
    resources:
      nvidia.com/gpu: "4"
  providerRef:
    name: gpu-template
  consolidation:
    enabled: true
  kubeletConfiguration:
    clusterDNS: ["10.0.1.100"]
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: "NoSchedule"

Instance selection

The requirements field determines the types of instances Karpenter can provision as part of our gpu-a10g node pool. Some important labels used in the requirements include:

  • instance-category
  • instance-gpu-name and instance-gpu-count
  • arch and os

With these labels we’ll be provisioning instances from the G5 family, which carry our desired GPU: the NVIDIA A10G.

I’ll get more into this later, but Karpenter will automatically select the appropriate EKS AMIs for our instances. In my case, I get an amazon-eks-gpu-node AMI that corresponds to my EKS version.
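If you want to sanity-check what Karpenter actually launched once a node appears, listing nodes with the standard instance-type label and Karpenter’s capacity-type label should show it:

kubectl get nodes -L node.kubernetes.io/instance-type,karpenter.sh/capacity-type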

Cost mitigation

These GPU-accelerated instances can be pretty expensive. Fortunately, we don’t need much GPU time to reach the objective of this article, but if we are not careful, we could still end up wasting money. I’m doing a few things here to help avoid an accidental money bonfire:

  • Limiting the provisioner to single-gpu nodes. These nodes are generally cheaper, and I only intend to run a single-pod workload.
  • Limiting the provisioner to only 4 GPUs across all nodes in the pool caps off the $/hour rate this provisioner could incur.
  • Enabling consolidation does several things, but importantly for us it removes our GPU nodes when no workloads are active. We’ll want to remove the inference server pod when we are not using it to take full advantage of this.
  • Preventing pods from unintentionally scheduling on our GPU provisioner by setting taints. Without this, unrelated pods could land on the GPU nodes and keep consolidation from removing them.
  • Using spot instances so that we’ll potentially be paying below the standard rate for our nodes. It also means these nodes might be terminated unexpectedly, which is okay for our demo workloads.

With this in place, what we need to ensure is that we remove the inference pod when we are done with it. Please don’t leave a ≥ $1/hour EC2 instance chilling on your cluster…

Kubelet Configuration

It’s recommended to explicitly set the clusterDNS field in the Provisioner configuration to match the actual service CIDR used by your Kubernetes cluster. This ensures that the kubelet uses the correct DNS IP address for resolving service names, which in turn ensures that the pods can correctly resolve service names. This may not matter for our demo, but we might as well do things right.
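If you’re not sure what your cluster’s DNS service IP is, one way to look it up on EKS (where CoreDNS is exposed through a service named kube-dns) is:

kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}'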

Expose GPUs to pods with NVIDIA K8S Device Plugin

At this point we should be able to schedule pods to our gpu-a10g node pool, but our pods won’t actually be able to access the GPUs until we install the NVIDIA K8S device plugin on our cluster. This is a daemonset that makes NVIDIA GPUs “visible” to K8S.

Prepare the GPU nodes

Before installing the NVIDIA device plugin, our nodes should have NVIDIA’s container toolkit and runtime installed and configured.

  • nvidia-container-toolkit
  • nvidia-container-runtime

Fortunately, our amazon-eks-gpu-node AMIs already have the toolkit and runtime set up. If you are unable to use this AMI for some reason, you could refer to NVIDIA’s container toolkit installation docs and either create a custom AMI or update the userData on the AWSNodeTemplate resource we defined previously.

Install the NVIDIA Device Plugin daemonset

There are a few ways to install this device plugin. I’m a fan of using Helm, and there is an official chart provided by NVIDIA. We can install the plugin chart like so:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.14.3 \
  --set nodeSelector.provisioner=gpu-a10g

An important detail to note is that we are setting nodeSelector to explicitly target just our GPU nodes. For more details on this particular helm chart, check out the project’s repository: https://github.com/NVIDIA/k8s-device-plugin
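Once the daemonset pods are running, the GPU nodes should start advertising an nvidia.com/gpu resource. A quick way to check, using the provisioner label we applied to these nodes:

kubectl describe nodes -l provisioner=gpu-a10g | grep "nvidia.com/gpu"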

A quick test

NVIDIA provides a container image that we can run on our cluster to confirm whether pods have access to GPUs. Here is a manifest for spinning up such a pod:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    provisioner: gpu-a10g
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

After the pod schedules, check the logs with kubectl logs pod/gpu-test-pod and make sure you see something like:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Amazing! Now we’re ready to start looking at deploying our text generation inference server.

Deploy the Text Generation Inference server

Since we just want to prove to ourselves that our cluster can perform GPU-accelerated inference, I’m going to keep the K8S config for our workload very simple.

K8S manifests

With that in mind, here is a pretty minimal set of manifests that we can apply to our cluster. This will create a pod that runs Hugging Face’s TGI server and uses Mistral 7B as the text-generation model. An explanation of the configuration follows the YAML snippet.

---
apiVersion: v1
kind: Pod
metadata:
  name: text-inference
  labels:
    app: text-inference
spec:
  containers:
    - name: text-generation-inference
      image: ghcr.io/huggingface/text-generation-inference:1.3
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          cpu: "4"
          memory: 4Gi
          nvidia.com/gpu: 1
      command:
        - "text-generation-launcher"
        - "--model-id"
        - "mistralai/Mistral-7B-v0.1"
        - "--num-shard"
        - "1"
      ports:
        - containerPort: 80
          name: http
      volumeMounts:
        - name: model
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: model
      persistentVolumeClaim:
        claimName: text-inference-model
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
  nodeSelector:
    provisioner: gpu-a10g
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  restartPolicy: Never
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: text-inference-model
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Service
metadata:
  name: text-inference
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: http
  selector:
    app: text-inference
  type: ClusterIP

Container resources

Perhaps the first interesting detail is that we use a new resource type: nvidia.com/gpu. This resource type is made available by NVIDIA’s k8s device plugin, which we set up earlier. You can find more about it in their readme.

Since our nodes are limited to a single GPU, we expect this pod to basically get a node to itself.

Container command

Here’s where we select the generative AI model that we will be prompting. TGI supports many other models, but the hardware and software requirements for different models can vary.

We also set the number of shards to 1, which should match the number of GPUs requested by our pod.

Pod volumes

There are two volumes being used by our pod: model, which is backed by a PersistentVolumeClaim and mounted at /data (where TGI stores the downloaded model), and shm, an in-memory emptyDir mounted at /dev/shm.

For the model volume I am using a persistent volume claim. This can reduce the start-up time for scheduling subsequent inference pods. However, I feel this is a premature optimization, because our 7B model can actually download pretty fast, and our inference service is just for demonstration and does not need to scale.

Service

Using a service is optional for our demo, since we’ll only use it to set up port-forwards, which can also be done directly with pods. But depending on your cluster network, you may want to go a step further and set up ingress to the service.

Applying the manifests

Now, let’s use kubectl apply or another preferred method to apply the manifests mentioned above.

From my experience, this pod will take several minutes to deploy. We’re waiting for a new node to be provisioned, then the image is several GiB in size and can take a few minutes to pull. Finally, our container also has to download the target model’s weights.

So now we wait.
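While we wait, we can keep an eye on progress; once the container starts, its logs show the model download and the server coming up:

kubectl get pods -w
kubectl logs -f pod/text-inference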

The pod is ready? Here’s a quick test to see if it works

Let’s side-step the question of ingress by just port-forwarding to our inference service: kubectl port-forward svc/text-inference 8080:80

Now, we can do a simple test with curl:

curl 127.0.0.1:8080/generate \
  -X POST -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Hello? Is there anybody in there?",
    "parameters": {"max_new_tokens": 20, "temperature": 0.5}
  }'

Did it work? Well, it works on my cluster :P

There is some variability based on temperature, but my output looks like:

{"generated_text":" Just nod if you can hear me. Is there anyone at home?\n\nI’m not"}

Probably still need to dial those request parameters in…

And there we have it (hopefully)! A GPU-accelerated open-source large language model inference server running on your very own cluster.

Bonus: Run a local chat client

I’ll throw in a quick little something extra, because while it is pretty cool having an open-source model running (inferring?) on our cluster, only using curl to access it feels a bit anti-climactic.

Hugging Face gives us options for some respectable starting points. I tried out the Gradio example and ended up with this:

import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")

max_tokens = 1024
system_prompt = "You are helpful AI."

def inference(message, history):
    prompt = f"System prompt: {system_prompt}\n Message: {message}."
    partial_message = ""
    for token in client.text_generation(prompt, max_new_tokens=int(max_tokens), stream=True, repetition_penalty=1.5):
        partial_message += token
        yield partial_message

gr.ChatInterface(
    inference,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Chat with me!", container=False, scale=7),
    description="Gradio UI consuming TGI endpoint with Mistral 7B model.",
    title="Gradio 🤝 TGI",
    examples=["Are tomatoes vegetables?"],
    retry_btn="Retry",
    undo_btn="Undo",
    clear_btn="Clear",
).queue().launch()
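Assuming you save the snippet as app.py, you can install the two dependencies and run it locally while the port-forward from earlier is still active (the button arguments like retry_btn apply to the Gradio versions current when this was written):

pip install gradio huggingface_hub
python app.py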

There are a lot of directions you can take from here. Hope you have fun!

And don’t forget to scale down your pods when you don’t need them 🤦‍♀
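For this demo, deleting the pods (and optionally the service and PVC) is enough to let Karpenter’s consolidation remove the GPU node:

kubectl delete pod text-inference gpu-test-pod --ignore-not-found
kubectl delete svc text-inference
kubectl delete pvc text-inference-model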

