Deploy GPU Node on AKS using Terraform

Michael Hannecke
Bluetuple.ai
9 min read · May 16, 2024

Introduction

Large Language Models are highly complex, demanding significant processing power and memory to serve requests with minimal latency.

Azure Kubernetes Service (AKS) offers a robust platform for deploying and managing containerised workloads. But what if your LLM projects require the extra oomph of a graphics processing unit (GPU)? This article guides you through integrating GPU-enabled nodes into your AKS cluster, unlocking the potential for faster training and smoother inference.

I will guide you through the setup using Terraform. By following the next steps, you will be able to deploy a GPU-enabled AKS cluster on Azure.

Using an “infrastructure as code” approach gives you a solid foundation for future, more complex scenarios.

Let’s dive in…

Why GPUs are Essential for Large Language Model (LLM) Inference

Large Language Models (LLMs) are revolutionising the way we interact with machines. However, running inference on these complex models — essentially using them to generate text, translate languages, or write different kinds of creative content — presents a significant computational challenge. Here’s why GPUs are an absolute must-have for LLM inference:

Parallel Processing Power:
CPUs excel at sequential tasks, but LLMs thrive on parallel processing. GPUs boast thousands of cores specifically designed to handle multiple computations simultaneously. This parallel architecture dramatically accelerates LLM inference compared to CPUs.

Memory Bandwidth:
LLMs often require massive amounts of data to function effectively. GPUs come equipped with high-bandwidth memory interfaces, allowing them to rapidly access the data needed for inference, significantly reducing processing time.

Reduced Latency:
When interacting with LLMs, low latency is crucial for a seamless user experience. GPUs minimise the time it takes for LLMs to process information and generate responses, leading to faster and more responsive interactions.

Efficiency and Cost-Effectiveness:
While CPUs can technically perform LLM inference, the process is much slower and requires more resources. GPUs offer a much more efficient solution, reducing the overall computational cost of running LLMs.

In essence, GPUs are tailor-made for the parallel processing and high memory demands of LLMs. Their ability to handle these tasks efficiently translates to faster inference times, lower latency, and ultimately, a more optimal experience when working with LLMs.

Our setup:
For our setup we will use an NVIDIA GPU. I will go with the Standard_NC4as_T4_v3 VM type, which comes with an NVIDIA Tesla T4 GPU, 4 vCPUs and 28 GB RAM, which is sufficient for initial tests.

You can use other VM types if required, but keep in mind that you might need to request additional quota to be able to deploy larger VM types.

Furthermore, not all GPU hardware is available in every Azure region, so check availability for your target region first.
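
If you already have the Azure CLI installed, a quick way to check both points is sketched below; the region and the SKU filter are just examples, adjust them to your target region and VM size:

# Check whether the T4 SKU is offered in your target region (and whether it is restricted for your subscription)
az vm list-skus --location swedencentral --size Standard_NC4as_T4 --output table

# Check your current vCPU quota/usage for the NC T4 family
az vm list-usage --location swedencentral --query "[?contains(name.value, 'NCAS')]" --output table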

GPU-enabled VMs incur additional costs and are not covered by the free tier, so be careful and check the pricing calculator upfront.

Furthermore, make ABSOLUTELY sure to destroy, or at least power down, all infrastructure when it is no longer needed…

Setting the Stage:

I recommend that you have a basic understanding of how to connect to a Kubernetes cluster with kubectl.

Microsoft offers a free training path; have a look if you’re new to the topic.

Furthermore, you should have a basic understanding of Terraform. Be careful to keep things separated so you do not interfere with other workloads you may have up and running.

Before we can start, make sure that your Azure CLI is configured, Terraform is installed on your local machine and the required credentials are set. Furthermore, you need to have kubectl installed as well.
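
A quick sanity check that the tooling is in place might look like this (the version numbers will differ on your machine):

az version
terraform version
kubectl version --client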

I’ve linked a detailed article on how to get started with Terraform on Azure; I recommend following those steps before continuing.

You should now have a valid service principal configured and the following environment variables set with your individual values (a minimal sketch of how to create the service principal and export the values follows the list):

ARM_TENANT_ID=
ARM_SUBSCRIPTION_ID=
ARM_CLIENT_ID=
ARM_CLIENT_SECRET=
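
If you still need to create the service principal, a minimal sketch is shown below. The name tf-aks-gpu is just an example, and the placeholder values have to be replaced with the output of the first command:

# Create a service principal with Contributor rights on your subscription
az ad sp create-for-rbac --name "tf-aks-gpu" --role Contributor --scopes "/subscriptions/<your-subscription-id>"

# Export the returned values for the azurerm provider
export ARM_TENANT_ID="<tenant from the output>"
export ARM_SUBSCRIPTION_ID="<your-subscription-id>"
export ARM_CLIENT_ID="<appId from the output>"
export ARM_CLIENT_SECRET="<password from the output>"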

Keep these values secret. Storing them as environment variables on your system still bears some security risk, as anyone who has (root) access to your local environment could read the credentials.

There are more sophisticated approaches, like storing them in a key vault, but that is out of scope for this article - just be careful...

Create a separate folder on your machine, cd into that folder and create the following files (for example with touch, as sketched after the list):

main.tf
kubernetes.tf
resourcegroup.tf
variables.tf
terraform.auto.tfvars
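
For example (the folder name aks-gpu is just a suggestion):

mkdir aks-gpu && cd aks-gpu
touch main.tf kubernetes.tf resourcegroup.tf variables.tf terraform.auto.tfvars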

The code for main.tf:

#main.tf

# azure provider
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~>3.103.1"
    }
  }

  required_version = ">=1.5.7"

  /*

  # remove comment block if you want to use remote state
  # provide your individual settings
  # if you're using remote state on azure
  # otherwise the state will be stored on your local machine
  # which would be ok for testing purposes

  backend "azurerm" {
    resource_group_name  = "your-remote-state-rg"
    storage_account_name = "your-remotestate-storage-account"
    container_name       = "tf-state"
    key                  = "sbx/k8s"
  }
  */

}

provider "azurerm" {
  features {
    api_management {
      purge_soft_delete_on_destroy = true
      recover_soft_deleted         = false
    }
  }
}

For the variables, fill terraform.auto.tfvars with your values:

#terraform.auto.tfvars

# general kubernetes cluster
k8s_resgroup     = "your-k8s-resource-group"
k8s_location     = "swedencentral"
k8s_cluster_name = "cluster01"

#user_node_pool
user_node_pool_enable_auto_scaling = false #true
user_node_pool_max_pods            = 30
user_node_pool_node_count          = 1
user_node_pool_node_labels         = { "NodePool" = "gpu" }
user_node_pool_node_taints         = ["env=gpu:NoSchedule"]
user_node_pool_name                = "gpunodepool"
user_node_pool_vm_size             = "Standard_NC4as_T4_v3"

Next, the definition of the required variables in variables.tf:

#variables.tf
variable "k8s_resgroup" {
  type        = string
  description = "standard resource group for all k8s services"
}

variable "k8s_location" {
  type        = string
  description = "Location for k8s"
}

variable "k8s_cluster_name" {
  type        = string
  description = "Name of cluster"
}


#### terraform nodepool variables

variable "user_node_pool_name" {
  description = "Specifies the name of the user node pool"
  default     = "agentpool"
  type        = string
}

variable "user_node_pool_enable_auto_scaling" {
  description = "(Optional) Whether to enable auto-scaler. Defaults to false."
  type        = bool
  default     = false
}

variable "user_node_pool_max_pods" {
  description = "(Optional) The maximum number of pods that can run on each agent. Changing this forces a new resource to be created."
  type        = number
  default     = 30
}

variable "user_node_pool_node_labels" {
  description = "(Optional) A map of Kubernetes labels which should be applied to nodes in this Node Pool. Changing this forces a new resource to be created."
  type        = map(any)
  default     = { "kubernetes.azure.com/scalesetpriority" = "spot" }
}

variable "user_node_pool_node_taints" {
  description = "(Optional) A list of Kubernetes taints which should be applied to nodes in the agent pool (e.g. key=value:NoSchedule). Changing this forces a new resource to be created."
  type        = list(string)
  default     = ["kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]
}

variable "user_node_pool_availability_zones" {
  description = "(Optional) A list of availability zones for the user node pool; referenced in gpu-node.tf. Defaults to no explicit zones."
  type        = list(string)
  default     = null
}

variable "user_node_pool_node_count" {
  description = "(Optional) The initial number of nodes which should exist within this Node Pool. Valid values are between 0 and 1000 and must be a value in the range min_count - max_count."
  type        = number
  default     = 2
}

variable "user_node_pool_vm_size" {
  description = "Specifies the vm size of the user node pool"
  default     = "Standard_B4ms"
  type        = string
}

Please be aware that, for simplicity, this file does not contain any configuration for remote state management, like a storage account or a management resource group. If you followed my initial post about setting up remote state (linked above), you should already have a resource group definition and a variables.tf — just edit them or add additional .tf files if you prefer to keep things separate.

Next, we’ll need a file to define the resource group for our Kubernetes cluster (resourcegroup.tf):

# Resource group for kubernetes

resource "azurerm_resource_group" "k8s_rg" {
  name     = var.k8s_resgroup
  location = var.k8s_location
}

As a last step before we can configure a GPU node for our LLM, we need to define a Kubernetes cluster.

Copy this code into the kubernetes.tf:

# Kubernetes cluster with default node pool
resource "azurerm_kubernetes_cluster" "k8s_cluster" {
  name                = var.k8s_cluster_name
  location            = var.k8s_location
  resource_group_name = var.k8s_resgroup
  dns_prefix          = "dev"

  depends_on = [azurerm_resource_group.k8s_rg]

  default_node_pool {
    name       = "nodepool"
    node_count = 1
    vm_size    = "Standard_B4ms" # small but efficient, use whatever size fits your needs, but be aware of the costs
  }

  identity {
    type = "SystemAssigned"
  }
}

We are now ready to deploy the initial cluster. You could add the GPU node configuration to the Terraform script above, but I like to keep things small, separated and tidy, so we will put the GPU node configuration in a dedicated file in the next step.

Adding some GPU Muscle

Next we configure a GPU-enabled cluster node. Copy the following source code into a file named gpu-node.tf:

#user node pool
resource "azurerm_kubernetes_cluster_node_pool" "user" {
  count                 = 1
  zones                 = var.user_node_pool_availability_zones
  vm_size               = var.user_node_pool_vm_size
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k8s_cluster.id
  max_pods              = var.user_node_pool_max_pods
  node_count            = var.user_node_pool_node_count
  node_labels           = var.user_node_pool_node_labels
  node_taints           = var.user_node_pool_node_taints
  mode                  = "User"
  name                  = var.user_node_pool_name

  depends_on = [azurerm_kubernetes_cluster.k8s_cluster]
}

Deploy everything

The stage is prepared now and we can start the first rehearsal:

alias tf=terraform

tf init
tf fmt
tf validate

If everything went smoothly, the above commands will initialise the (local) Terraform backend, format the files and validate the configuration. Check for typos if something pops up.

Well done, let’s deploy the cluster — this may take 10 to 15 minutes:

tf plan
tf apply

Verify Access

Once the Terraform run completes, you should have an AKS cluster up and running with two nodes: one system node and one GPU-enabled node.
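
You can double-check both node pools from the CLI; the output should list the system pool and the GPU pool (placeholders as before):

az aks nodepool list --resource-group <your kubernetes resource group> \
  --cluster-name <your cluster name> --output table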

Azure takes care of installing the required NVIDIA GPU drivers automatically, which comes in very handy. There are also options to install a dedicated driver version yourself, but that is out of scope for now.

Check Access and availability

When the cluster is up and running, you can request credentials for access with kubectl:

az aks get-credentials --resource-group <your kubernetes resource group> \
  --name <your cluster name>

This configures your local kubectl to connect to the cluster. Let’s check; your output should look similar to this:

kubectl get nodes

NAME                                  STATUS   ROLES   AGE    VERSION
aks-gpunodepool-12923324-vmss000001   Ready    agent   111s   v1.28.9
aks-nodepool-30526304-vmss000000      Ready    agent   9m2s   v1.28.9
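
If the driver and the NVIDIA device plugin are in place (which AKS should handle for GPU node pools), the GPU node also advertises an allocatable nvidia.com/gpu resource. A quick way to confirm this at the node level:

kubectl describe nodes | grep -i "nvidia.com/gpu"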

Checking GPU Availability

Now let’s check if we have access to the GPU. Create a file called CUDA-test.yaml and paste in the following code:

#CUDA-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: gpu-container
      image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
      command: ["/bin/bash", "-c", "--"]
      args: ["while true; do sleep 600; done;"]
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: "env"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"

This manifest deploys a simple pod on the GPU node (note the toleration for the taint env=gpu:NoSchedule and the GPU resource limit), so let kubectl do the work for us:

kubectl apply -f CUDA-test.yaml

Wait for the pod to be up and running.

Once kubectl get pods shows the pod as running, we can connect to it and check if the GPU is present:

kubectl get pods

NAME      READY   STATUS    RESTARTS   AGE
gpu-pod   1/1     Running   0          2m42s

Once the pod is up, connect to the pod’s console and execute nvidia-smi. You should see something like this:

kubectl exec -it gpu-pod -- /bin/bash

nvidia-smi


Thu May 16 14:27:16 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000001:00:00.0 Off |                  Off |
| N/A   31C    P8              11W /  70W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                                 Usage   |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Well done! GPU at your service!

Putting it All Together

Now that your cluster is GPU-powered, you can start deploying LLM projects to leverage this enhanced processing muscle. I will provide some follow-up articles later on.

Feel free to test further, but do not forget to delete the cluster before you leave!!

Just run ‘terraform destroy’ within the folder containing your Terraform scripts, confirm that you want to destroy everything, and Terraform will delete all resources. Wait for the command to finish to make sure the cluster is deleted!
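
If you want to keep the cluster definition around but stop paying for the compute for a while, stopping the cluster is an alternative sketch (it deallocates the nodes; placeholders as before):

az aks stop --resource-group <your kubernetes resource group> --name <your cluster name>

# Start it again later with:
az aks start --resource-group <your kubernetes resource group> --name <your cluster name>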

Final Words

By integrating GPUs into your AKS cluster, you unlock a powerful platform for tackling even the most demanding machine learning tasks. This article equips you with the knowledge to set up your environment and unleash the potential of GPUs for your ML projects. Remember, AKS offers flexibility and scalability, allowing you to tailor your cluster to your specific needs. So, buckle up and get ready to accelerate your machine learning journey!

I’ll add a link to the git repo asap.

If you have read it to this point, thank you! You are a hero (and a Nerd ❤)! I try to keep my readers up to date with “interesting happenings in the AI world,” so please 🔔 clap | follow
