Binding GCP Accounts to GKE Service Accounts with Terraform

Johnny Ooi
The Telegraph Engineering
13 min read · May 14, 2021

Kubernetes uses service accounts to control who can access what within the cluster, but once a request leaves the cluster, it is made with a default account. In GKE this is normally the default Compute Engine account, which has extremely high-level access and could result in a lot of damage if your cluster is compromised.

In this article, I will set up a GKE cluster using a minimal-access service account and enable Workload Identity.

Workload Identity enables you to bind a Kubernetes service account to a service account in GCP. You can then control that account's GCP permissions from within GCP, with no RBAC/ABAC messing about needed (although you will still need to mess with RBAC/ABAC if you want to restrict that service account within Kubernetes, but that's a separate article).

What you will need for this tutorial:

  • A Google account
  • A Google Cloud account
  • Terraform on your local machine
  • kubectl on your local machine (can be installed as part of the Google Cloud SDK)
  • Google Cloud SDK on your local machine
  • A Google Cloud project set up
  • A service account with “Owner” permissions in your GCP project (the default Compute Engine account will normally work)
  • A credentials JSON file from that account — this can be generated using:
    gcloud iam service-accounts keys create credentials.json --iam-account={iam-account-email}

We will start by setting up our Terraform provider:

variable "project" {
  default = "REPLACE_ME"
}

variable "region" {
  default = "europe-west2"
}

variable "zone" {
  default = "europe-west2-a"
}

provider "google" {
  project     = var.project
  region      = var.region
  zone        = var.zone
  credentials = file("credentials.json")
}

We define three variables here that we can reuse later: the project, region and zone. Adjust these to match your own setup.

The provider block (provider "google" {..}) references those variables and also refers to the credentials.json file that will be used to create the resources in your account.

Next, we create the service account that we will bind to the cluster. This service account should have minimal permissions, as it will be the default account used by requests leaving the cluster, so only give it what is essential. You will notice I do not bind it to any roles.

resource "google_service_account" "cluster-serviceaccount" {
  account_id   = "cluster-serviceaccount"
  display_name = "Service Account For Terraform To Make GKE Cluster"
}
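
As an aside: with no roles at all, node logs and metrics may stop reaching Cloud Operations. Google's guidance for custom node service accounts is to grant just the logging and monitoring writer roles. This is an optional sketch of my own (the resource names are made up), not something the tutorial requires:

```hcl
# Optional: minimal roles so nodes can still ship logs and metrics.
# Skip these if you genuinely want the account to have no roles.
resource "google_project_iam_member" "cluster-sa-log-writer" {
  project = var.project
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.cluster-serviceaccount.email}"
}

resource "google_project_iam_member" "cluster-sa-metric-writer" {
  project = var.project
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.cluster-serviceaccount.email}"
}
```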

Now let’s define our cluster and node pool. This block can vary wildly depending on your circumstances, but I’ll use a Kubernetes 1.16 single-zone cluster, with an e2-medium node size and autoscaling enabled.

variable "cluster_version" {
  default = "1.16"
}

resource "google_container_cluster" "cluster" {
  name               = "tutorial"
  location           = var.zone
  min_master_version = var.cluster_version
  project            = var.project

  lifecycle {
    ignore_changes = [
      # Ignore changes to min_master_version as that gets changed
      # after deployment to the precise version Google has
      min_master_version,
    ]
  }

  # We can't create a cluster with no node pool defined, but
  # we want to only use separately managed node pools. So we
  # create the smallest possible default node pool and
  # immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  workload_identity_config {
    identity_namespace = "${var.project}.svc.id.goog"
  }
}

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "tutorial-cluster-node-pool"
  location   = var.zone
  project    = var.project
  cluster    = google_container_cluster.cluster.name
  node_count = 1
  version    = var.cluster_version

  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope and
    # permissions granted via IAM Roles.
    service_account = google_service_account.cluster-serviceaccount.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

  lifecycle {
    ignore_changes = [
      # Ignore changes to node_count, initial_node_count and version,
      # otherwise the node pool will be recreated if there is drift between
      # what Terraform expects and what it sees
      initial_node_count,
      node_count,
      version,
    ]
  }
}

Let’s go through a few things on the above block:

variable "cluster_version" {
  default = "1.16"
}

This defines a variable we will use to set the version of Kubernetes we want on the master and worker nodes.

resource "google_container_cluster" "cluster" {
  ...
  min_master_version = var.cluster_version
  ...
  lifecycle {
    ignore_changes = [
      min_master_version,
    ]
  }
  ...
}

The ignore_changes block here tells Terraform not to pay attention to changes in the min_master_version field. Even though we declared 1.16 as the version, GKE will put a specific 1.16 patch release onto the cluster. For example, the cluster might be created with version 1.16.9-gke.999, which is different to what Terraform expects, so if you were to run Terraform again, it would attempt to change the cluster version from 1.16.9-gke.999 to 1.16, cycling through the nodes again.

Next block to discuss:

resource "google_container_cluster" "cluster" {
  ...
  remove_default_node_pool = true
  initial_node_count       = 1
  ...
}

A GKE cluster must be created with a node pool. However, it is easier to manage node pools separately, so this block tells Terraform to delete the default node pool once the cluster has been created.

The final part of this block:

resource "google_container_cluster" "cluster" {
  ...
  workload_identity_config {
    identity_namespace = "${var.project}.svc.id.goog"
  }
}

This enables Workload Identity; the identity namespace must be of the format {project}.svc.id.goog.

Now let’s move onto the Node Pool definition:

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "tutorial-cluster-node-pool"
  location   = var.zone
  project    = var.project
  cluster    = google_container_cluster.cluster.name
  node_count = 1
  version    = var.cluster_version

  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope and
    # permissions granted via IAM Roles.
    service_account = google_service_account.cluster-serviceaccount.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

  lifecycle {
    ignore_changes = [
      # Ignore changes to node_count, initial_node_count and version,
      # otherwise the node pool will be recreated if there is drift between
      # what Terraform expects and what it sees
      initial_node_count,
      node_count,
      version,
    ]
  }
}

Let’s go over a couple of blocks again:

resource "google_container_node_pool" "primary_preemptible_nodes" {
  ...
  node_count = 1

  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
  ...
}

This sets up autoscaling with a starting node count of 1 and a maximum of 5. Unlike with EKS, you don’t need to deploy an “autoscaler” into the cluster; enabling this natively allows Kubernetes to scale nodes up or down. The downside is you don’t see as many messages compared to the deployed version, so it’s sometimes harder to debug why a pod isn’t triggering a scale-up.
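
When a pod sits in Pending and you can't tell why a scale-up isn't happening, the pod's events are the first place to look. A generic sketch (my-pending-pod is a placeholder name):

```shell
# Scheduler and autoscaler decisions are recorded as events on the pod itself.
kubectl describe pod my-pending-pod | sed -n '/^Events:/,$p'

# Recent warning events across the current namespace, oldest first.
kubectl get events --field-selector type=Warning \
  --sort-by=.metadata.creationTimestamp
```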

resource "google_container_node_pool" "primary_preemptible_nodes" {
  ...
  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope and
    # permissions granted via IAM Roles.
    service_account = google_service_account.cluster-serviceaccount.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
  ...
}

Here we define the node config: a pool of preemptible e2-medium nodes. We tie the nodes to the service account defined earlier and give them only the cloud-platform scope.

The metadata block is needed because if you don’t specify it, GKE applies disable-legacy-endpoints = "true" anyway. The node pool would then be recreated each time you run Terraform, as Terraform thinks it needs to apply the updated config to the pool.

resource "google_container_node_pool" "primary_preemptible_nodes" {
  ...
  lifecycle {
    ignore_changes = [
      # Ignore changes to node_count, initial_node_count and version,
      # otherwise the node pool will be recreated if there is drift between
      # what Terraform expects and what it sees
      initial_node_count,
      node_count,
      version,
    ]
  }
}

Similar to the version field on the cluster, we tell Terraform to ignore certain fields if they have changed.

  • version we ignore for the same reason as on the master node: the version deployed will be slightly different to the one we declared.
  • initial_node_count we ignore because if the node pool has scaled up, not ignoring this would cause Terraform to scale the nodes back down to the initial_node_count value, sending pods into Pending.
  • node_count we ignore for much the same reason: on a production system it will likely never match the initial value, due to scale-up.

With the basic skeleton set up, we can run Terraform to build the stack. Yes, we haven’t actually bound anything to service accounts yet, but that will come later.

Let’s Terraform the infrastructure:

terraform init
terraform plan -out tfplan
terraform apply tfplan

Creation of the cluster can take between 5 and 15 minutes.

Next, we need to fetch the cluster credentials and add them to our local kubeconfig:

gcloud beta container clusters get-credentials tutorial --zone {cluster-zone} --project {project}

or

gcloud beta container clusters get-credentials tutorial --region {cluster-region} --project {project}

You should get some output like this:

Fetching cluster endpoint and auth data.
kubeconfig entry generated for tutorial.

Now you should be able to run kubectl get pods --all-namespaces to see what's in your cluster (it should contain nothing other than the default system pods):

$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system event-exporter-gke-666b7ffbf7-lw79x 2/2 Running 0 13m
kube-system fluentd-gke-scaler-54796dcbf7-6xnsg 1/1 Running 0 13m
kube-system fluentd-gke-skmsq 2/2 Running 0 4m23s
kube-system gke-metadata-server-fsxj6 1/1 Running 0 9m29s
kube-system gke-metrics-agent-pfdbp 1/1 Running 0 9m29s
kube-system kube-dns-66d6b7c877-wk2nt 4/4 Running 0 13m
kube-system kube-dns-autoscaler-645f7d66cf-spz4c 1/1 Running 0 13m
kube-system kube-proxy-gke-tutorial-tutorial-cluster-node-po-b531f1ee-8kpj 1/1 Running 0 9m29s
kube-system l7-default-backend-678889f899-q6gsl 1/1 Running 0 13m
kube-system metrics-server-v0.3.6-64655c969-2lz6v 2/2 Running 3 13m
kube-system netd-7xttc 1/1 Running 0 9m29s
kube-system prometheus-to-sd-w9cwr 1/1 Running 0 9m29s
kube-system stackdriver-metadata-agent-cluster-level-566c4b7cf9-7wmhr 2/2 Running 0 4m23s

Now let’s do our first test. We’ll use gsutil to list the Cloud Storage buckets in our project.

kubectl run --rm -it test --image gcr.io/cloud-builders/gsutil ls

This runs a Docker image containing gsutil and removes the container when the command finishes.

The output should be something like this:

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
AccessDeniedException: 403 Caller does not have storage.buckets.list access to the Google Cloud project.
Session ended, resume using 'kubectl attach test-68bb69b777-5nzgt -c test -i -t' command when the pod is running
deployment.apps "test" deleted

As you can see, we get a 403: the default service account doesn’t have permission to access Google Storage.

Now let’s set up the service account we will use for binding:

resource "google_service_account" "workload-identity-user-sa" {
  account_id   = "workload-identity-tutorial"
  display_name = "Service Account For Workload Identity"
}

resource "google_project_iam_member" "storage-role" {
  role = "roles/storage.admin"
  # role = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

resource "google_project_iam_member" "workload_identity-role" {
  role   = "roles/iam.workloadIdentityUser"
  member = "serviceAccount:${var.project}.svc.id.goog[workload-identity-test/workload-identity-user]"
}

Again, let’s go through the blocks:

resource "google_service_account" "workload-identity-user-sa" {
  account_id   = "workload-identity-tutorial"
  display_name = "Service Account For Workload Identity"
}

This block defines the service account in GCP that we will be binding to.

resource "google_project_iam_member" "storage-role" {
  role = "roles/storage.admin"
  # role = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

This block assigns the Storage Admin role to the service account we just created. Think of it as adding the account to the Storage Admin group, rather than attaching a permission or role directly to the account.

resource "google_project_iam_member" "workload_identity-role" {
  role   = "roles/iam.workloadIdentityUser"
  member = "serviceAccount:${var.project}.svc.id.goog[workload-identity-test/workload-identity-user]"
}

This block adds the service account as a Workload Identity User. The member field looks a bit confusing: the ${var.project}.svc.id.goog part indicates that it is a Workload Identity namespace, and the part in [...] is the name of the Kubernetes service account that we want to allow to be bound to it. This membership, together with an annotation on the Kubernetes service account (described below), allows the service account in Kubernetes to essentially impersonate the service account in GCP, as you will see in the example.
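
To make the format concrete, here is a small shell sketch that assembles the member string from its parts; the three variables are placeholders for your own values:

```shell
# Assemble the Workload Identity member string piece by piece.
PROJECT="my-project"                    # your GCP project ID (placeholder)
K8S_NAMESPACE="workload-identity-test"  # Kubernetes namespace
K8S_SA="workload-identity-user"         # Kubernetes service account name

MEMBER="serviceAccount:${PROJECT}.svc.id.goog[${K8S_NAMESPACE}/${K8S_SA}]"
echo "${MEMBER}"
# serviceAccount:my-project.svc.id.goog[workload-identity-test/workload-identity-user]
```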

With the service account set up in Terraform, let’s run the Terraform apply steps again:

terraform plan -out tfplan
terraform apply tfplan

Assuming it didn’t error, we now have one half of the binding: the GCP service account. We now need to create the service account inside Kubernetes.

You’ll recall the piece of data in the [...]: workload-identity-test/workload-identity-user. This is the service account we need to create. Below is the YAML for creating the namespace and the service account. Save it into the file workload-identity-user.yaml:

apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: null
  name: workload-identity-test
spec: {}
status: {}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: workload-identity-tutorial@{project}.iam.gserviceaccount.com
  name: workload-identity-user
  namespace: workload-identity-test
The important thing to note is the annotation on the service account:

annotations:
  iam.gke.io/gcp-service-account: workload-identity-tutorial@{project}.iam.gserviceaccount.com

The annotation references the service account created by the Terraform block:

resource "google_service_account" "workload-identity-user-sa" {
  account_id   = "workload-identity-tutorial"
  display_name = "Service Account For Workload Identity"
}

So the Kubernetes service account references the GCP service account, and the GCP service account references the Kubernetes service account.

Important Note: If you do not do the double referencing (for example, you forget to include the annotation on the service account, or forget to put the referenced Kubernetes service account in the Workload Identity member block), then GKE will use the default service account specified on the node.
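
If things silently fall back to the node's default account, it helps to check both halves of the binding. A hedged sketch ({project} is a placeholder; the gcloud flags are standard, but the output format varies by SDK version):

```shell
# GCP side: the project-level IAM policy should list the Kubernetes
# service account as a member of roles/iam.workloadIdentityUser.
gcloud projects get-iam-policy {project} \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/iam.workloadIdentityUser" \
  --format="value(bindings.members)"

# Kubernetes side: the annotation should point at the GCP service account.
kubectl get serviceaccount workload-identity-user \
  -n workload-identity-test \
  -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'
```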

Now it’s time to put it to the test. If everything is set up correctly, run the previous test again:

kubectl run --rm -it test --image gcr.io/cloud-builders/gsutil ls

You should still get a 403 but with a different error message.

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
AccessDeniedException: 403 Primary: /namespaces/{project}.svc.id.goog with additional claims does not have storage.buckets.list access to the Google Cloud project.
Session ended, resume using 'kubectl attach test-68bb69b777-8ltvc -c test -i -t' command when the pod is running
deployment.apps "test" deleted

Let’s now create the namespace and service account, using the YAML file we saved earlier:

$ kubectl apply -f workload-identity-user.yaml
namespace/workload-identity-test created
serviceaccount/workload-identity-user created

So now let’s run the test again, but this time we specify the service account and also the namespace, since a service account is tied to the namespace it resides in. In this case, the namespace of our service account is workload-identity-test.

kubectl run -n workload-identity-test --rm --serviceaccount=workload-identity-user -it test --image gcr.io/cloud-builders/gsutil ls

If you’re running a later version of Kubernetes and/or kubectl, you may get this warning (and the flag will be ignored):

Flag --serviceaccount has been deprecated, has no effect and will be removed in 1.24. 

In this case, you need to use the --overrides flag instead:

kubectl run -it --rm -n workload-identity-test test --overrides='{ "apiVersion": "v1", "spec": { "serviceAccountName": "workload-identity-user" } }' --image gcr.io/cloud-builders/gsutil ls

The output will show the buckets you have:

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
gs://backups/
gs://snapshots/
Session ended, resume using 'kubectl attach test-66754998f-sp79b -c test -i -t' command when the pod is running
deployment.apps "test" deleted

Let’s now change the permissions on the GCP service account to prove it’s the one being used.

Change this block:

resource "google_project_iam_member" "storage-role" {
  role = "roles/storage.admin"
  # role = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

And change the active role like so:

resource "google_project_iam_member" "storage-role" {
  # role = "roles/storage.admin" ## <-- comment this out
  role = "roles/storage.objectAdmin" ## <-- uncomment this
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

Run the terraform actions again:

terraform plan -out tfplan
terraform apply tfplan

Allow a few minutes for the change to propagate, then run the test again (see earlier if you get the warning about the --serviceaccount flag):

kubectl run -n workload-identity-test --rm --serviceaccount=workload-identity-user -it test --image gcr.io/cloud-builders/gsutil ls

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
AccessDeniedException: 403 workload-identity-tutorial@{project}.iam.gserviceaccount.com does not have storage.buckets.list access to the Google Cloud project.
Session ended, resume using 'kubectl attach test-66754998f-k5dm5 -c test -i -t' command when the pod is running
deployment.apps "test" deleted

And there you have it: the service account in the cluster, workload-identity-test/workload-identity-user, is bound to the GCP service account workload-identity-tutorial@{project}.iam.gserviceaccount.com and carries its permissions.

If the service account in Kubernetes is compromised in some way, you just need to revoke the permissions on the GCP service account, and the Kubernetes service account will no longer be able to do anything in GCP.
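
One way to make that revocation a deliberate, reviewable change is to gate the role binding behind a variable. This is a sketch of my own (the variable name is made up) that would replace the storage-role block from earlier:

```hcl
# Hypothetical kill switch: set revoke_workload_identity = true and
# re-apply to strip the role from the GCP service account.
variable "revoke_workload_identity" {
  default = false
}

resource "google_project_iam_member" "storage-role" {
  count  = var.revoke_workload_identity ? 0 : 1
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}
```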

For reference, here’s the complete Terraform used for this tutorial. Replace what you need; you can move things around and separate them into other Terraform files if you wish, but I kept it in one file for simplicity.

variable "project" {
  default = "REPLACE_ME"
}

variable "region" {
  default = "europe-west2"
}

variable "zone" {
  default = "europe-west2-a"
}

provider "google" {
  project     = var.project
  region      = var.region
  zone        = var.zone
  credentials = file("credentials.json")
}

resource "google_service_account" "cluster-serviceaccount" {
  account_id   = "cluster-serviceaccount"
  display_name = "Service Account For Terraform To Make GKE Cluster"
}

variable "cluster_version" {
  default = "1.16"
}

resource "google_container_cluster" "cluster" {
  name               = "tutorial"
  location           = var.zone
  min_master_version = var.cluster_version
  project            = var.project

  lifecycle {
    ignore_changes = [
      # Ignore changes to min_master_version as that gets changed
      # after deployment to the precise version Google has
      min_master_version,
    ]
  }

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  workload_identity_config {
    identity_namespace = "${var.project}.svc.id.goog"
  }
}

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "tutorial-cluster-node-pool"
  location   = var.zone
  project    = var.project
  cluster    = google_container_cluster.cluster.name
  node_count = 1
  version    = var.cluster_version

  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope
    # and permissions granted via IAM Roles.
    service_account = google_service_account.cluster-serviceaccount.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

  lifecycle {
    ignore_changes = [
      # Ignore changes to node_count, initial_node_count and version,
      # otherwise the node pool will be recreated if there is drift between
      # what Terraform expects and what it sees
      initial_node_count,
      node_count,
      version,
    ]
  }
}

resource "google_service_account" "workload-identity-user-sa" {
  account_id   = "workload-identity-tutorial"
  display_name = "Service Account For Workload Identity"
}

resource "google_project_iam_member" "storage-role" {
  role = "roles/storage.admin"
  # role = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

resource "google_project_iam_member" "workload_identity-role" {
  role   = "roles/iam.workloadIdentityUser"
  member = "serviceAccount:${var.project}.svc.id.goog[workload-identity-test/workload-identity-user]"
}
