Deploy GPU Node on GKE using Terraform

Michael Hannecke
Bluetuple.ai
Dec 23, 2023 · 7 min read
By the author (idea) and Dall-E-3 (color and brush)

Introduction

Testing applications with Large Language Models (LLMs) requires hardware with powerful GPUs. While LLMs such as Mistral-7B can, with some effort, be run and tested with acceptable performance on a MacBook (M1/M2, see also [here]), and some gaming PCs may also provide enough GPU performance for local testing on a smaller scale.

Ultimately, especially if you want to develop and test as close as possible to a potential production environment, you can set up a Kubernetes cluster with GPU capabilities.

In the following, I will describe how to configure a Kubernetes cluster on Google Cloud that has NVIDIA-L4 GPUs. This will allow you to test larger models.

For the setup, I chose a managed cluster, as most of the required management and monitoring functionality is provided and operated by GCP.

As the GPU-capable compute node, I have chosen a g2-standard-24 instance, as it offers the best price/performance ratio for my requirements. Other instance types can be used instead, depending on your needs.

An overview of which instance types Google offers with which GPUs can be found here:

GPU on GKE

However, Google does not offer every instance/GPU combination in every region/zone. Therefore, take a look at this list to identify a suitable zone near you:

GPU availability

And not to forget: GPUs don’t come for free, so you should estimate the costs in advance based on this list:

GPU Pricing on GKE

The configuration discussed in the following costs about €2.70 per hour.

I have chosen to deploy via Terraform, as I can tear down the entire environment with a single command and (re-)deploy it again at any time.

I strongly recommend making sure the cluster is removed once you have finished your evaluation, to avoid unpleasant surprises on your GCP bill. At this point, I also recommend configuring appropriate budget alerts…
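As a rough sketch of such a budget alert from the CLI (flag names as documented for `gcloud billing budgets create`; the billing account ID, name, and amount are placeholders, the currency must match your billing account, and you may first need to enable the billingbudgets.googleapis.com API):

# Placeholder billing account ID and amount; alerts fire at 50% and 90% of the budget
gcloud billing budgets create \
  --billing-account=ABCDEF-123456-ABCDEF \
  --display-name="gke-gpu-sandbox" \
  --budget-amount=200.00EUR \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9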

Prerequisites

The following prerequisites must be met to be able to rebuild the environment:

  1. A valid GCP project with the corresponding quotas for the chosen instance size. All NVIDIA-capable instances are outside the standard quotas, so a quota adjustment is required. A description of how to request a quota adjustment can be found on this page.
  2. The Terraform CLI is installed (documentation can be found here); a quick way to verify all required tools is shown right after this list.
  3. We need a service account that has the necessary permissions to make infrastructure changes. A description of how to do this can be found here: https://medium.com/bluetuple-ai/terraform-remote-state-on-gcp-d50e2f69b967
  4. For access to the Kubernetes cluster, kubectl is required. Information about the installation can be found on the official Kubernetes website.
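
A quick check that the required tools are available on your machine:

gcloud --version
terraform -version
kubectl version --client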

Let’s get started

First, we need to log in to GCP with sufficient permissions:

gcloud auth application-default login

Then, we need to set the project, target region, and zone as environment variables. When doing so, make sure that the desired instance/GPU combination has sufficient quotas in the selected region for the project.

We also set the cluster name as an environment variable:

export PROJECT_ID=<project_id>
export REGION=<region>
export ZONE=<zone>
export CLUSTER_NAME=<clustername>

gcloud config set project "$PROJECT_ID"
gcloud config set compute/region "$REGION"
gcloud config set compute/zone "$ZONE"

In the next step, we need to enable some required APIs for the project, if they are not already enabled:

gcloud services enable compute.googleapis.com container.googleapis.com
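
Since GPU quotas are zero by default in many projects, it can be worth double-checking the regional quota from the CLI before deploying. A small sketch, assuming the L4 quota metric is named NVIDIA_L4_GPUS:

# Show the L4 GPU quota (limit and current usage) for the selected region
gcloud compute regions describe "$REGION" | grep -i -B1 -A1 "nvidia_l4"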

Terraform

For better readability, I have omitted the remote backend for storing the Terraform state files from the Terraform code. A detailed description of how the remote state can be configured can be found in this article:

https://medium.com/bluetuple-ai/terraform-remote-state-on-gcp-d50e2f69b967
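
For reference, the remote backend itself is only a few lines. A minimal sketch, assuming an existing GCS bucket (the bucket name is a placeholder):

terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket" # placeholder: an existing GCS bucket
    prefix = "gke-gpu-cluster"           # prefix for the state files within the bucket
  }
}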

In general, the following configuration should be stored in a separate directory to avoid side effects with other deployments, especially with a `terraform destroy` command without additional parameters.

We will need the following files for our terraform configuration:

main.tf
variables.tf
provider.tf
terraform.tfvars
gke-cluster.tf
gke-gpu-np.tf

main.tf, variables.tf, provider.tf, and terraform.tfvars contain variable definitions and configurations that Terraform needs for the deployment. I will not bore you with the details; you can find the files in this GitHub repository.
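
For orientation: the cluster and node pool definitions below reference a handful of variables. A minimal variables.tf could look like the following sketch; the defaults for gpu_type and gpu_driver_version are assumptions based on the g2/L4 setup described here, not copied from the repository:

variable "project_id"   { type = string }
variable "region"       { type = string }
variable "zone"         { type = string }
variable "cluster_name" { type = string }

variable "gpu_type" {
  type    = string
  default = "nvidia-l4" # accelerator type attached to g2-standard machine types
}

variable "gpu_driver_version" {
  type    = string
  default = "DEFAULT" # let GKE install its default NVIDIA driver for the node version
}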

For the Kubernetes cluster, I am using two definition files, one for the cluster itself and one for the GPU node pool.

Of course, all resource definitions could also be placed in a single file, but with certain changes Terraform tends to delete the entire node pool, or even the cluster, and redeploy it, which can sometimes take a very long time.

I have had better experiences with separate definition files: changes to a node pool then have no impact on other pools or on the cluster itself. Ultimately, it also serves readability.
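
If a change really only affects the node pool, Terraform's -target flag can additionally limit the scope of a run (it is intended for exceptional cases, so use it sparingly):

# Only plan/apply changes for the GPU node pool resource
terraform apply -target=google_container_node_pool.gpu_pool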

The definition of the cluster itself:

resource "google_container_cluster" "gpu_cluster" {
name = var.cluster_name
location = var.region
node_locations = [var.zone]

deletion_protection = false
initial_node_count = 1

timeouts {
create = "20m"
update = "30m"
}

lifecycle {
ignore_changes = [master_auth, network]
}

}

The definition of the GPU-enabled node pool should look like this:

resource "google_container_node_pool" "gpu_pool" {
name = "gpu-pool"
location = var.region
cluster = google_container_cluster.gpu_cluster.name
node_count = 1


management {
auto_repair = "true"
auto_upgrade = "true"
}

node_config {
oauth_scopes = [
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/trace.append",
"https://www.googleapis.com/auth/service.management.readonly",
"https://www.googleapis.com/auth/servicecontrol",
]


labels = {
env = "sandbox"
project = var.project_id

}

guest_accelerator {
type = var.gpu_type
count = 2
gpu_driver_installation_config {
gpu_driver_version = var.gpu_driver_version
}

}

image_type = "cos_containerd"
machine_type = "g2-standard-24"
tags = ["gke-node", "sandbox", "${var.project_id}"]

disk_size_gb = "50"
disk_type = "pd-balanced"

shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}


}

timeouts {
create = "20m"
update = "30m"
}

}

Adjust the variables to your needs. Then the Terraform run can start:

alias tf=terraform
tf init
tf fmt
tf validate

If Terraform does not report any errors (perhaps a typo?), the cluster can be provisioned with:

tf plan
tf apply

From now on the bill starts ticking, so don’t forget to tear everything down once you’re finished.

The deployment takes about 10 minutes, time for a coffee.

Once the cluster has been fully deployed, we need to request the credentials for kubectl:

gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION

From now on we can communicate with the cluster as usual:

kubectl get node

kubectl get pod -A

You should see two nodes: one from the default node pool and one from our GPU node pool.
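
To confirm from the CLI that the GPU node actually advertises its accelerators, you can look at the node resources; GKE's device plugin exposes them under the nvidia.com/gpu resource name:

# Should show "nvidia.com/gpu: 2" in the Capacity/Allocatable sections of the GPU node
kubectl describe nodes | grep -i "nvidia.com/gpu"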

So far so good

Now we can deploy a small pod that uses GPUs. To do this, create a file called `cuda-pod.yaml` with the following content in the current folder:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1

Deploy the pod with:

kubectl apply -f cuda-pod.yaml
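
To follow the pod start-up until it reaches the `Running` state:

kubectl get pod gpu-pod -w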

Once `kubectl get pod` shows the pod in the `Running` state (this should only take a couple of seconds), we can open a shell in the pod:

kubectl exec -it gpu-pod -- /bin/bash

Inside the pod we can run `nvidia-smi`; the necessary drivers have already been installed during the deployment.

The output should look something like this:

In addition, the status can also be viewed in the GCP portal under:

Kubernetes Engine -> Clusters -> your cluster -> Nodes -> your GPU node -> Overview, and should look similar to this:

Resource summary from the GCP portal for our GPU node

Now you can run containerized LLMs or other machine learning tasks on the GPU. I’ll publish some articles on how to containerize LLMs later.

Don’t forget to delete the cluster when you no longer need it, otherwise you will incur significant costs.

The cluster can be deleted as usual:

terraform destroy

or

tf destroy

Keep in mind that the destroy command deletes all infrastructure whose definition files live in the same directory as the GKE cluster definition. Keep an eye on the output of the destroy command before you approve the execution by typing ‘yes’.
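
If you want to see beforehand exactly what would be removed, without deleting anything yet:

# Dry run: shows the resources that a subsequent destroy would delete
terraform plan -destroy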

Summary

Now you can deploy and test suitable LLMs such as Mistral-7B locally with the GPU-enabled cluster. I will publish a separate article about this.

The framework can be easily adapted to other requirements and used as a basis for your own deployments.

Of course, the same cluster can also be easily deployed with a single CLI command, but the Terraform approach is closer to DevOps principles and can be integrated into a CI/CD pipeline with only minor modifications.

Contact me on LinkedIn https://www.linkedin.com/in/michaelhannecke/

If you have read it to this point, thank you! You are a hero (and a Nerd ❤)! I try to keep my readers up to date with “interesting happenings in the AI world,” so please 🔔 clap | follow
