Making GitHub workflows to deploy to GKE with Terraform and Workload Identity Federation
In the first part of this tutorial, we created a Pub/Sub handler in Dart and deployed it to GKE. Let’s take that to the production level. We will:
- Create a Google Cloud project programmatically for each environment (prod, dev, test, etc.) because that’s the best isolation.
- Set up Terraform to create and upgrade Pub/Sub topics, GKE cluster, and other resources.
- Do all of that in GitHub workflows to set up a stage for each pull request.
- Run auto-tests on that stage.
- Use Workload Identity Federation to avoid any use of service account keys.
- Minimize the granted permissions.
The Architecture
A dedicated service account named project-creator creates a Google Cloud project and a deploy service account within it, which acts from that point on and creates all the services:
The services and their interaction were described in the first article, and we will recap them as we configure the deployment.
“Capitalizer” is the name of our hello-world workload from the first part: it takes input from a Pub/Sub subscription, turns it to upper-case, and writes the result to a Pub/Sub topic.
While the first part had many things specific to Dart, this second part is mostly agnostic to the workload's language because here we operate on a container.
The code for this tutorial is in this GitHub repository, same as the first part.
Roles and Permissions
An atomic unit of privileges in Google Cloud is a permission. The roles that you normally grant are predefined by Google and often contain excessive permissions. It's fine to use such roles in development, but for production, you should carefully construct your own roles from the minimal required set of permissions.
Here you can check which permissions most of the predefined roles contain.
For this setup, we will have 3 levels of roles, each one is progressively weaker.
My Project Creator
This role will be used to create projects programmatically within the organization.
An organization is an entity above your projects. It’s created when you use your domain name for Google Workspace or Cloud Identity. If you don’t have it yet, create one. This tutorial requires an organization.
On your organization’s level, create a role and set:
- Name: “My Project Creator”
- ID:
MyProjectCreator
Add the following permissions:
- resourcemanager.projects.create (from the built-in Project Creator role)
- billing.resourceAssociations.create (from the built-in Billing Account User role)
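If you prefer the terminal over the console, the same custom role can be created with gcloud. This is a sketch assuming $ORGANIZATION is set to your numeric organization ID:

```shell
# Create the custom role at the organization level with the two permissions above.
gcloud iam roles create MyProjectCreator \
  --organization=$ORGANIZATION \
  --title="My Project Creator" \
  --permissions=resourcemanager.projects.create,billing.resourceAssociations.create
```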
When a project is created using this role, the creator is granted the Owner role on the new project. This allows the creator to set up everything else. However, such a service account is too strong because it can create other projects. So we should stop using it as soon as possible.
Owner
We will be using the built-in role “Owner” for the service account named deploy. Terraform will be using it when creating and configuring all the services.
This role may seem too powerful as well, but that's because we need to be able to create an even weaker service account for the runner, and sadly, that permission comes with the ability to make yourself an owner, so there's no point in using any role weaker than Owner.
Runner Roles
At the bottom are the roles for the runner. They will be created on the project level by Terraform, because you should manage as much as possible declaratively.
In this project, we will have:
- MyPubSubConsumer, granted on the specific subscription.
- MyPubSubPublisher, granted on the specific topic.
They are more narrow than the built-in roles.
If your project has multiple runners, you need roles for each of them. This is the classic three-level role deployment:
Enabling the programmatic project creation
We need permanent environments: prod and dev. They need to be created only once. But we also want to set up a copy of the system for each pull request to run automated tests on it. For that, we need to create projects programmatically.
1. Select or create a master project
To create projects programmatically, a service account must be used. You will give it access to the organization for that.
A service account must have its home project, which you should create or choose. It could be any, but the following is important:
- A project gives its ID to service accounts like this:
account-name@project-id.iam.gserviceaccount.com
Make sure this naming is clear to anyone who will be reviewing the permissions a few years later.
- You should not delete this project.
- Any APIs that the service account should manage on other projects must be enabled on its home project (see this documentation bug and this StackOverflow question).
This is why you probably need a dedicated master project for your organization just to be the home for your service accounts that create other projects.
2. Create a service account to create projects
In your master project, create a service account named project-creator. Don’t give it any roles on the project.
3. Create the service account key
Create a JSON key for the service account and put it under /home/user/dart-pubsub-gke-demo/keys/project-creator.json
From here on, we assume you checked out the example repository to /home/user/dart-pubsub-gke-demo
We will not upload this key anywhere. This is only for your local experiments in a terminal.
4. Find your organization ID
On any page in Google Cloud console, click the dropdown where you normally switch projects. Your organization should be on that list among projects:
5. Find your billing account ID
From the services list in the hamburger menu, select “Billing” and find the billing account ID:
6. Assign the role to the service account
As of writing, Google Cloud console does not allow you to assign custom roles to service accounts on the organization level, so this should be done in a terminal. Temporarily sign in to gcloud as your normal Google account:
gcloud auth login
This will open your default browser. Confirm your login there. When complete, run this:
export MASTER_PROJECT=master-project-id
export ORGANIZATION=your-organization-id
gcloud organizations \
add-iam-policy-binding $ORGANIZATION \
--member="serviceAccount:project-creator@$MASTER_PROJECT.iam.gserviceaccount.com" \
--role="organizations/$ORGANIZATION/roles/MyProjectCreator"
Run this to sign out of gcloud with your personal account:
gcloud auth revoke --all
7. Enable the APIs in the master project
Find and enable the following services on this page:
https://console.cloud.google.com/apis/library
- Cloud Billing API
- Identity and Access Management (IAM) API
Trying the project creation in a terminal
Before writing a GitHub workflow, test the commands in a terminal.
We will first build the container image, and then set up all the infrastructure.
The image is the priority because this is the fastest way to detect possible compile-time errors in your workload. This is done in seconds while infrastructure takes minutes to set up.
1. Minimal preparation for build
Create a project:
export ORGANIZATION=your-organization-id
export BILLING_ACCOUNT=your-billing-account-id
export PROJECT=your-project-id
export REGION=us-central1
export ZONE=us-central1-c
gcloud auth activate-service-account --key-file=keys/project-creator.json
gcloud projects \
create $PROJECT \
--name=$PROJECT \
--organization=$ORGANIZATION
gcloud billing \
projects link $PROJECT \
--billing-account=$BILLING_ACCOUNT
Enable the services:
gcloud services enable artifactregistry.googleapis.com --project=$PROJECT
gcloud services enable cloudbuild.googleapis.com --project=$PROJECT
gcloud services enable cloudresourcemanager.googleapis.com --project=$PROJECT
gcloud services enable container.googleapis.com --project=$PROJECT
gcloud services enable iam.googleapis.com --project=$PROJECT
gcloud services enable pubsub.googleapis.com --project=$PROJECT
Create the image repository:
gcloud artifacts \
repositories create my-repository \
--repository-format=DOCKER \
--location=$REGION \
--project=$PROJECT
2. Automate the versioning
In the first article, we were manually setting a version for the container image. In production, we will be composing the version from these parts:
- The Dart application version from pubspec.yaml
- Date and time
- Commit hash
Run this:
APP_VERSION=$(grep '^version:' capitalizer/pubspec.yaml | awk '{print $2}'); \
TIMESTAMP=$(date -u +%Y%m%d-%H%M%S); \
COMMIT_HASH=$(git rev-parse HEAD); \
export VERSION="v$APP_VERSION-$TIMESTAMP-$COMMIT_HASH"
echo $VERSION
You will see something like this:
v1.2.3-20240308-112258-58c87415eba5091d4534c903f84d80d441ecc1d1
3. Build the image
Submit a build:
export REPOSITORY=my-repository
SUBSTITUTIONS=(
"_VERSION=$VERSION"
"_DART_VERSION=3.3.1"
"_REGION=$REGION"
"_REPOSITORY=$REPOSITORY"
); gcloud builds \
submit \
--project=$PROJECT \
--substitutions="$(echo $(IFS=,; echo "${SUBSTITUTIONS[*]}"))" \
--config=capitalizer/cloudbuild.yaml \
capitalizer
When this command finishes, the image will be in the artifact registry.
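A side note on the --substitutions value: it must be a single comma-separated string, while a bash array is easier to maintain. The nested $(IFS=,; echo ...) trick joins the array with commas; here is a minimal illustration with made-up values:

```shell
# Join a bash array into the comma-separated string that --substitutions expects.
SUBSTITUTIONS=(
  "_VERSION=v1.0.0"
  "_REGION=us-central1"
)
JOINED="$(echo $(IFS=,; echo "${SUBSTITUTIONS[*]}"))"
echo "$JOINED"   # _VERSION=v1.0.0,_REGION=us-central1
```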
4. Prepare the project for Terraform
Create a Cloud Storage bucket to store the state of Terraform deployment:
gcloud storage \
buckets create "gs://$PROJECT-tf-state" \
--location=$REGION \
--uniform-bucket-level-access \
--project=$PROJECT
Create a service account for Terraform:
gcloud iam \
service-accounts create "deploy" \
--project=$PROJECT
gcloud projects \
add-iam-policy-binding $PROJECT \
--member="serviceAccount:deploy@$PROJECT.iam.gserviceaccount.com" \
--role="roles/owner" \
--project=$PROJECT
Terraform
When the project is created and the required services are enabled, we can configure each service. Terraform does this in a declarative way, so as you expand your project, try to do as much of the work here as possible.
The Terraform configuration for the project is in /infrastructure/terraform directory. It has 3 modules, each depending on the previous one:
Module 1: basics
The basics module configures all simple resources:
- Pub/Sub topics and subscriptions:
resource "google_pubsub_topic" "input" {
name = "input"
message_retention_duration = "86400s"
}
resource "google_pubsub_subscription" "input-sub" {
name = "input-sub"
topic = google_pubsub_topic.input.name
ack_deadline_seconds = 600
}
resource "google_pubsub_topic" "output" {
name = "output"
message_retention_duration = "86400s"
}
resource "google_pubsub_subscription" "output-sub" {
name = "output-sub"
topic = google_pubsub_topic.output.name
ack_deadline_seconds = 600
}
- Custom roles:
resource "google_project_iam_custom_role" "MyPubSubConsumer" {
role_id = "MyPubSubConsumer"
title = "MyPubSubConsumer"
description = "A minimal role to consume messages. It's weaker than the built-in 'Pub/Sub Subscriber' by not allowing to create subscriptions. It's stronger by allowing to list the subscriptions because Google client library lists subscriptions before consuming."
permissions = [
"pubsub.subscriptions.consume",
"pubsub.subscriptions.get",
]
}
resource "google_project_iam_custom_role" "MyPubSubPublisher" {
role_id = "MyPubSubPublisher"
title = "MyPubSubPublisher"
description = "A minimal role to publish messages. It's stronger than the built-in 'Pub/Sub Publisher' by allowing to list the topics because Google client library lists topics before publishing."
permissions = [
"pubsub.topics.get",
"pubsub.topics.publish",
]
}
- Service account named capitalizer and its role bindings to the specific subscription and topic:
resource "google_service_account" "capitalizer" {
account_id = "capitalizer"
}
resource "google_pubsub_subscription_iam_member" "capitalizer_input-sub_MyPubSubConsumer" {
subscription = google_pubsub_subscription.input-sub.name
role = google_project_iam_custom_role.MyPubSubConsumer.name
member = google_service_account.capitalizer.member
}
resource "google_pubsub_topic_iam_member" "capitalizer_output_MyPubSubPublisher" {
topic = google_pubsub_topic.output.name
role = google_project_iam_custom_role.MyPubSubPublisher.name
member = google_service_account.capitalizer.member
}
This module also creates gke-minimal service account, which we didn't have in the previous article. Here is why.
In GKE, each node is authenticated to some service account. By default, nodes use the Compute Engine default service account, which by default has the Editor role on the project. This means that all software in the nodes can do anything to your project, which is unacceptable in production.
That’s why we create gke-minimal service account instead and assign it to the nodes. It has the least possible permissions plus the role to pull images from the registry. All of that is configured in this module:
resource "google_service_account" "gke-minimal" {
account_id = "gke-minimal"
description = "For GKE nodes. Uses the minimal permissions + ability to read images."
# https://cloud.google.com/kubernetes-engine/docs/how-to/hardening-your-cluster#use_least_privilege_sa
}
resource "google_project_iam_member" "gke-minimal_logWriter" {
project = var.PROJECT
role = "roles/logging.logWriter"
member = google_service_account.gke-minimal.member
}
resource "google_project_iam_member" "gke-minimal_monitoring_metricWriter" {
project = var.PROJECT
role = "roles/monitoring.metricWriter"
member = google_service_account.gke-minimal.member
}
resource "google_project_iam_member" "gke-minimal_monitoring_viewer" {
project = var.PROJECT
role = "roles/monitoring.viewer"
member = google_service_account.gke-minimal.member
}
resource "google_project_iam_member" "gke-minimal_stackdriver_resourceMetadata_writer" {
project = var.PROJECT
role = "roles/stackdriver.resourceMetadata.writer"
member = google_service_account.gke-minimal.member
}
resource "google_project_iam_member" "gke-minimal_autoscaling_metricsWriter" {
project = var.PROJECT
role = "roles/autoscaling.metricsWriter"
member = google_service_account.gke-minimal.member
}
resource "google_project_iam_member" "gke-minimal_artifactregistry_reader" {
project = var.PROJECT
role = "roles/artifactregistry.reader"
member = google_service_account.gke-minimal.member
}
We could go full nerd and pull out the specific permissions we actually need from these roles to build an even more restrictive custom role. However, the added benefit is too small compared to the potential debugging problems when we hit a missing permission for some container maintenance operation later, or if GKE changes its behavior in a minor way.
Do you have an experience narrowing this down? Let me know in the comments.
Module 2: gke
The module gke configures a GKE cluster, which takes about 10 minutes. It's extracted to a separate module so that its creation is not even started if anything breaks in basics. This lets the workflow fail 10 minutes earlier.
resource "google_container_cluster" "my-cluster" {
name = var.CLUSTER
location = var.ZONE
workload_identity_config {
workload_pool = "${var.PROJECT}.svc.id.goog"
}
initial_node_count = 1
remove_default_node_pool = true
deletion_protection = false
node_config {
service_account = var.SERVICE_ACCOUNT_EMAIL
}
}
resource "google_container_node_pool" "my-pool" {
name = "my-pool"
location = var.ZONE
cluster = google_container_cluster.my-cluster.name
node_count = 1
node_config {
machine_type = "e2-medium"
service_account = var.SERVICE_ACCOUNT_EMAIL
}
}
Module 3: deployment
The module deployment deploys the built image into the cluster. It’s extracted to a separate module because this is the only way to dynamically use the credentials of the cluster created in gke module.
resource "kubernetes_service_account" "capitalizer" {
metadata {
name = "capitalizer"
annotations = {
"iam.gke.io/gcp-service-account" = var.CAPITALIZER_EMAIL
}
}
}
resource "google_service_account_iam_member" "workload_identity_user" {
service_account_id = "projects/${var.PROJECT}/serviceAccounts/${var.CAPITALIZER_EMAIL}"
role = "roles/iam.workloadIdentityUser"
member = "serviceAccount:${var.PROJECT}.svc.id.goog[default/${kubernetes_service_account.capitalizer.metadata[0].name}]"
}
resource "kubernetes_deployment" "dart-pubsub-gke-demo" {
metadata {
name = var.DEPLOYMENT_NAME
}
spec {
replicas = 1
selector {
match_labels = {
app = var.DEPLOYMENT_NAME
}
}
template {
metadata {
labels = {
app = var.DEPLOYMENT_NAME
}
}
spec {
service_account_name = kubernetes_service_account.capitalizer.metadata[0].name
container {
name = "capitalizer"
image = "${var.REGION}-docker.pkg.dev/${var.PROJECT}/${var.REPOSITORY}/capitalizer:${var.VERSION}"
env {
name = "PROJECT"
value = var.PROJECT
}
}
}
}
}
}
Note that we don't store a secret in the cluster. Instead, we use Workload Identity Federation, which works like this.
Kubernetes has its own service accounts (KSA), not to be confused with Google Cloud service accounts. KSAs are used by pods to access Kubernetes API within a cluster. We create capitalizer service account here and assign it to the pod in the deployment. This means that everything that’s done in the pod happens on behalf of this service account.
We also link that KSA to the Google Cloud service account by granting the role roles/iam.workloadIdentityUser on the Google service account to the synthetic member with this long name:
serviceAccount:${var.PROJECT}.svc.id.goog[default/${kubernetes_service_account.capitalizer.metadata[0].name}]
Since all of this happens within Google Cloud, the cloud resources trust each other and handle all short-lived tokens for us so we don’t need to store service account keys as secrets.
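To verify the federation works, one option (an optional check, not part of the tutorial's code) is to start a throwaway pod under the capitalizer KSA and ask gcloud which identity it received; with the binding in place, it should report the capitalizer Google service account rather than a node account:

```shell
# Hypothetical verification: run a one-off pod as the "capitalizer" KSA
# and print the identity it receives from the metadata server.
kubectl run wif-test --rm -it --restart=Never \
  --image=google/cloud-sdk:slim \
  --overrides='{"spec":{"serviceAccountName":"capitalizer"}}' \
  -- gcloud auth list
```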
Deploying
Run this to create and deploy everything:
cd infrastructure/terraform
envsubst < backend.tf.template > backend.tf
terraform init
terraform apply \
-var="PROJECT=$PROJECT" \
-var="REGION=$REGION" \
-var="ZONE=$ZONE" \
-var="VERSION=$VERSION"
cd ../..
When these commands finish, everything will be up and running.
Verifying
Make kubectl use the GKE cluster:
gcloud container \
clusters get-credentials $CLUSTER --zone=$ZONE --project=$PROJECT
List the pods:
kubectl get pods
You should see something like this:
NAME READY STATUS RESTARTS AGE
capitalizer-799b786987-4zhzk 1/1 Running 0 7s
To view the stdout of the app:
kubectl logs $(kubectl get pods -o name | grep capitalizer | head -n 1)
It should show you the output of the Dart code:
Project ID: your-project-id
Looked up: Instance of '_SubscriptionImpl', Instance of '_TopicImpl'
Pulling.
Event: null
Idle.
Pulling.
Event: null
Idle.
Pulling.
Test
Run this:
$ cd capitalizer
$ GOOGLE_APPLICATION_CREDENTIALS='../keys/project-creator.json' dart test
00:00 +0: test/main_test.dart: Publish and read the result
Project ID: your-project-id
Purging the output subscription.
Event: Instance of '_PullEventImpl'
00:11 +1: All tests passed!
The GitHub Workflows
Time to make all of that run automatically.
Configure Workload Identity Federation on the master project
We have already been using Workload Identity Federation to make the app inside a Kubernetes pod access the cloud resources under capitalizer service account without storing its key.
Now we will use it again to authenticate a GitHub workflow to use project-creator service account.
Workload Identity Federation works like magic. GitHub assumes the burden of proving to Google that the request comes from GitHub, from a specific user in a specific repository. You use that verified info to allow the use of the service account. This way, a service account key is not stored on GitHub; it does not even need to be issued.
Here is the one-time setup. Define the GitHub users who should be allowed to run the CI/CD, and other constants. Note the nested quotes:
export GITHUB_USERS=(
"'username1'"
"'username2'"
)
export MASTER_PROJECT=master-project-id
export REPO=username1/dart-pubsub-gke-demo
Then find the number of your master project. In Google Cloud, projects have numbers in addition to string IDs. Unfortunately, we need both because Workload Identity Federation requires the number to authenticate, and the number can't be determined before you authenticate.
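The nested quotes in GITHUB_USERS are deliberate: each element carries its own single quotes so that joining the array with commas yields a list literal for the IAM condition expression. A quick, self-contained illustration:

```shell
# Each element is wrapped in single quotes; joining with commas produces
# the list literal used inside the policy condition.
GITHUB_USERS=(
  "'username1'"
  "'username2'"
)
GITHUB_USERS_STR=$(IFS=,; echo "${GITHUB_USERS[*]}")
echo "$GITHUB_USERS_STR"   # 'username1','username2'
```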
To get the number, sign in as your normal Google account and run:
gcloud projects describe $MASTER_PROJECT --format='value(projectNumber)'
Then export it:
export MASTER_PROJECT_NUMBER=123
Continue with your normal Google account:
gcloud iam \
workload-identity-pools create "github" \
--location="global" \
--project=$MASTER_PROJECT
MAPPING=(
'google.subject=assertion.sub'
'attribute.actor=assertion.actor'
'attribute.repository=assertion.repository'
);gcloud iam \
workload-identity-pools providers create-oidc "github" \
--attribute-mapping="$(echo $(IFS=,; echo "${MAPPING[*]}"))" \
--issuer-uri="https://token.actions.githubusercontent.com" \
--location="global" \
--workload-identity-pool="github" \
--project=$MASTER_PROJECT
GITHUB_USERS_STR=$(IFS=,; echo "${GITHUB_USERS[*]}") \
envsubst < infrastructure/project-creator-policy.json | \
gcloud iam \
service-accounts set-iam-policy \
project-creator@$MASTER_PROJECT.iam.gserviceaccount.com \
/dev/stdin
This will apply the following policy from project-creator-policy.json:
{
"bindings": [
{
"members": [
"principalSet://iam.googleapis.com/projects/${MASTER_PROJECT_NUMBER}/locations/global/workloadIdentityPools/github/attribute.repository/${REPO}"
],
"role": "roles/iam.workloadIdentityUser",
"condition": {
"title": "GitHub users whitelist",
"expression": "request.auth.claims.attribute.actor in [${GITHUB_USERS_STR}]"
}
}
]
}
To update the whitelist of users later, run the same command with the new list.
Point GitHub to your master project
Go to the secrets of your repository and create the following:
- BILLING_ACCOUNT
- MASTER_PROJECT
- MASTER_PROJECT_NUMBER
- ORGANIZATION
These are not extremely sensitive because no harm can be done with them alone; if leaked, they mostly pose phishing and resource enumeration risks. Still, the master project is so important for your organization that you want to draw as little attention to it as possible.
Test Workload Identity Federation
This example comes with a short workflow, workload_identity_federation_min, containing a bare-minimum test of Workload Identity Federation:
on:
- workflow_dispatch
jobs:
_:
permissions:
id-token: write
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: 'google-github-actions/auth@v2'
with:
workload_identity_provider: 'projects/${{ secrets.MASTER_PROJECT_NUMBER }}/locations/global/workloadIdentityPools/github/providers/github'
service_account: 'project-creator@${{ secrets.MASTER_PROJECT }}.iam.gserviceaccount.com'
- uses: 'google-github-actions/setup-gcloud@v2'
- run: gcloud info
- run: gcloud projects list > /dev/null
On GitHub, go to Actions, select the workflow, and run it:
When it finishes, it should be green. Open the run and check the output:
The output has a few steps that mention the project-creator service account. They don't yet prove authentication; they are just the attempted credentials. The true test happens in this step:
gcloud projects list > /dev/null
It lists all the projects that the service account has access to. This is the easiest way to see if the workflow was authenticated.
For project-creator, this command should list every project this service account has created that still exists, which may be sensitive. That's why we discard the output with > /dev/null
All we care about is that this command does not raise an error.
Test the conditions
To see that your repository name is actually tested by Google Cloud, update the policy to make the expected repository name different from your actual repository:
export REPO=invalid-repository
GITHUB_USERS_STR=$(IFS=,; echo "${GITHUB_USERS[*]}") \
envsubst < infrastructure/project-creator-policy.json | \
gcloud iam \
service-accounts set-iam-policy \
project-creator@$MASTER_PROJECT.iam.gserviceaccount.com \
/dev/stdin
Wait a few minutes for the changes to take effect, and then run the workflow again. You should see an error at this line:
Verify the username validation the same way. Revert the repository name back to normal and change the user whitelist so that the only wrong thing is your username. Run the workflow. You should see the same error.
Set up the deployment workflows
Add the following repository variables (not secrets), use the values you were supplying earlier in the command line:
- DART_VERSION (for example, 3.3.1)
- REGION (for example, us-central1)
- TERRAFORM_VERSION (for example, 1.7.5)
- ZONE (for example, us-central1-c)
We will be using the following three top-level workflows:
- stage_delete is the longest workflow. It will set up a transient copy of the system, run the tests, and then delete the project. It can run on all pull requests if you want to.
- stage is the same except that it will not delete the project when finished. You can use it to make a stage, run manual tests on it, and delete it when you are done.
- deploy_prod deploys the production project.
These top-level workflows are built from the following reusable pieces, each of which is a callable workflow. The top-level workflows differ in how they wire these pieces together.
Reusable workflow: generate_deployment_values
This workflow generates and returns important values that are used in subsequent steps: timestamp and the temporary project ID.
The timestamp is important because multiple values are derived from it: the temporary project ID and the container version. It has the format that we used when generating a container version earlier: 20240308-112258
The temporary project ID is composed of the timestamp and the short commit hash like this: p-20240327-055155-9b37754
Not all top-level workflows will be using the temporary project ID that we export here. For example, deploy_prod has a fixed project ID and will discard this parameter.
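As a sanity check on this format: Google Cloud project IDs are limited to 6-30 lowercase letters, digits, and hyphens, and the composed ID fits comfortably. A pure-shell sketch of the derivation (the commit hash is a stand-in for git rev-parse --short HEAD):

```shell
# Derive a transient project ID the same way the workflow does.
TIMESTAMP=$(date -u +%Y%m%d-%H%M%S)   # e.g. 20240327-055155
COMMIT_HASH=9b37754                   # stand-in for: git rev-parse --short HEAD
PROJECT="p-$TIMESTAMP-$COMMIT_HASH"
echo "$PROJECT"                       # e.g. p-20240327-055155-9b37754
echo "${#PROJECT}"                    # 25 characters, within the 30-character limit
```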
on:
workflow_call:
outputs:
project:
description: 'A temporary project ID'
value: ${{ jobs._.outputs.project }}
timestamp:
description: 'Timestamp in the format of YYYYMMDD-HHMMSS in UTC'
value: ${{ jobs._.outputs.timestamp }}
jobs:
_:
runs-on: ubuntu-latest
outputs:
project: ${{ steps.step1.outputs.project }}
timestamp: ${{ steps.step1.outputs.timestamp }}
steps:
- uses: actions/checkout@v4
- name: 'Generate the deployment values'
id: step1
run: |
export TIMESTAMP=$(date -u +%Y%m%d-%H%M%S)
export COMMIT_HASH=$(git rev-parse --short HEAD)
export PROJECT="p-$TIMESTAMP-$COMMIT_HASH"
echo "Timestamp: $TIMESTAMP"
echo "Project: $PROJECT"
echo "timestamp=$TIMESTAMP" >> $GITHUB_OUTPUT
echo "project=$PROJECT" >> $GITHUB_OUTPUT
Reusable workflow: maybe_create_project
The simplest idea of this workflow is the following. It accepts the project ID (a string) and tries to create it. It ignores the error if the project already exists, which will happen for persistent environments. Then it gets the project number by the ID and returns it. Remember, we need the project number later to authenticate with Workload Identity Federation using deploy service account.
However, this simple plan will not work when we revoke access to the production project from project-creator for security reasons. The workflow will just be unable to read the project number.
This means we should make it a bit trickier and add an optional input parameter with the project number. When this parameter is given, the workflow does nothing but return it.
This design supports all of these use cases:
- When you call this workflow for a transient stage and only pass the ID, it creates the project and returns its number.
- When you call this workflow for a persistent environment for the first time, it’s the same. It creates the project and returns its number.
- When you call this workflow for a persistent environment subsequently, the project already exists. The creation error is ignored, and the number is returned.
- When you call this workflow for a high-security persistent environment, you pass the project number as input, and the workflow does nothing but return this number.
Why bother calling this workflow in the latter case? For two reasons:
- It makes deploy_prod and stage workflows more similar. It's easier to always rely on the project number returned by this workflow than to keep track of what the source of truth for the project number is in each scenario.
- The alternative would be to modify deploy_prod after the initial deployment to skip the creation step.
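The branching described above can be sketched in plain shell (the gcloud calls are shown as comments; the project number 123456789 is a placeholder for the looked-up value):

```shell
# Sketch of maybe_create_project's decision logic.
# Inputs: project ID and an optional pre-known project number.
maybe_create_project() {
  project="$1"
  project_number="$2"
  if [ -n "$project_number" ]; then
    # High-security path: the number is given, nothing else to do.
    echo "$project_number"
    return 0
  fi
  # Otherwise create the project, ignoring "already exists" errors:
  #   gcloud projects create "$project" --name="$project" --organization=$ORGANIZATION || true
  # ...then look up and return its number:
  #   gcloud projects describe "$project" --format='value(projectNumber)'
  echo "123456789"   # placeholder for the looked-up number
}

maybe_create_project my-prod 987654                 # prints 987654
maybe_create_project p-20240327-055155-9b37754 ""   # prints 123456789
```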
Here is the workflow:
on:
workflow_call:
inputs:
project:
required: true
type: string
project_number:
description: 'If given, will be returned without attempting to create a project.'
required: false
type: string
outputs:
project_number:
description: 'The number of either the new project or the one passed as input.'
value: ${{ jobs._.outputs.project_number }}
secrets:
MASTER_PROJECT:
required: true
MASTER_PROJECT_NUMBER:
required: true
jobs:
_:
runs-on: ubuntu-latest
permissions:
id-token: write
outputs:
project_number: ${{ steps.export_project_number.outputs.value }}
steps:
- name: 'Export the environment variables'
run: |
echo "PROJECT=${{ inputs.project }}" >> $GITHUB_ENV
echo "PROJECT_NUMBER=${{ inputs.project_number }}" >> $GITHUB_ENV
- uses: 'google-github-actions/auth@v2'
if: env.PROJECT_NUMBER == ''
with:
workload_identity_provider: 'projects/${{ secrets.MASTER_PROJECT_NUMBER }}/locations/global/workloadIdentityPools/github/providers/github'
service_account: 'project-creator@${{ secrets.MASTER_PROJECT }}.iam.gserviceaccount.com'
- name: 'Create the Google Cloud Project if it does not exist'
if: env.PROJECT_NUMBER == ''
run: |
set +e # Continue on error.
gcloud projects \
create $PROJECT \
--name=$PROJECT \
--organization=${{ secrets.ORGANIZATION }}
true # Exits with zero code so that the step is considered successful.
- name: 'Get the project number'
if: env.PROJECT_NUMBER == ''
run: |
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT --format='value(projectNumber)')
echo "Project Number: $PROJECT_NUMBER"
echo "PROJECT_NUMBER=$PROJECT_NUMBER" >> $GITHUB_ENV
- name: 'Export the project number'
id: export_project_number
run: |
echo "value=$PROJECT_NUMBER" >> $GITHUB_OUTPUT
You can see that here and below we export the workflow inputs as environment variables. This is not required for most of them, as you can refer to the inputs directly in the workflow steps. However, using environment variables allows you to copy the step commands into a terminal for debugging.
Reusable workflow: maybe_configure_project
Here we will do everything that needs to be done before the less-privileged deploy service account can take over.
Why separate this from the previous workflow? For reliability. The previous workflow needs to terminate as soon as it can return the project number; otherwise, any error in its subsequent steps would prevent it from returning the number, which may be needed for clean-up procedures.
Note that we start by probing access to the project. If we don't have it, we assume that access was revoked from the project-creator service account and that the project is already configured.
on:
workflow_call:
inputs:
project:
required: true
type: string
project_number:
required: true
type: string
secrets:
BILLING_ACCOUNT:
required: true
MASTER_PROJECT:
required: true
MASTER_PROJECT_NUMBER:
required: true
jobs:
_:
runs-on: ubuntu-latest
permissions:
id-token: write
steps:
- uses: actions/checkout@v4
- name: 'Export the environment variables'
run: |
echo "PROJECT=${{ inputs.project }}" >> $GITHUB_ENV
echo "PROJECT_NUMBER=${{ inputs.project_number }}" >> $GITHUB_ENV
echo "REPO=${{ github.repository }}" >> $GITHUB_ENV
- uses: 'google-github-actions/auth@v2'
with:
workload_identity_provider: 'projects/${{ secrets.MASTER_PROJECT_NUMBER }}/locations/global/workloadIdentityPools/github/providers/github'
service_account: 'project-creator@${{ secrets.MASTER_PROJECT }}.iam.gserviceaccount.com'
- name: 'Try to access the project'
id: access
run: |
set +e
gcloud projects describe $PROJECT
[ $? -eq 0 ] && echo "ok=true" >> $GITHUB_OUTPUT || echo "ok=false" >> $GITHUB_OUTPUT
- name: 'Set the billing account'
if: steps.access.outputs.ok == 'true'
run: |
gcloud billing \
projects link $PROJECT \
--billing-account=${{ secrets.BILLING_ACCOUNT }}
- name: 'Enable the services'
if: steps.access.outputs.ok == 'true'
run: |
gcloud services enable artifactregistry.googleapis.com --project=$PROJECT
gcloud services enable cloudbuild.googleapis.com --project=$PROJECT
gcloud services enable cloudresourcemanager.googleapis.com --project=$PROJECT
gcloud services enable container.googleapis.com --project=$PROJECT
gcloud services enable iam.googleapis.com --project=$PROJECT
gcloud services enable pubsub.googleapis.com --project=$PROJECT
- name: 'Check if the service account for Terraform exists'
if: steps.access.outputs.ok == 'true'
id: probe_deploy_service_account
run: |
set +e
gcloud iam service-accounts list --project=$PROJECT | grep deploy@$PROJECT.iam.gserviceaccount.com
[ $? -eq 0 ] && echo "exists=true" >> $GITHUB_OUTPUT || echo "exists=false" >> $GITHUB_OUTPUT
- name: 'Create the service account for Terraform'
if: steps.probe_deploy_service_account.outputs.exists == 'false'
run: |
gcloud iam \
service-accounts create "deploy" \
--project=$PROJECT
gcloud projects \
add-iam-policy-binding $PROJECT \
--member="serviceAccount:deploy@$PROJECT.iam.gserviceaccount.com" \
--role="roles/owner" \
--project=$PROJECT
- name: 'Read the deployers list'
if: steps.access.outputs.ok == 'true'
run: |
input_file="infrastructure/deployers.txt"
output=""
while IFS= read -r line; do
trimmed=$(echo "$line" | xargs)
if [ -z "$output" ]; then
output="'$trimmed'"
else
output="$output,'$trimmed'"
fi
done < "$input_file"
echo "$output"
echo "GITHUB_USERS_STR=$output" >> $GITHUB_ENV
- name: 'Configure Workload Identity Federation for deployment'
if: steps.access.outputs.ok == 'true'
run: |
gcloud iam \
workload-identity-pools create "github" \
--location="global" \
--project=$PROJECT || true
MAPPING=(
'google.subject=assertion.sub'
'attribute.actor=assertion.actor'
'attribute.repository=assertion.repository'
);gcloud iam \
workload-identity-pools providers create-oidc "github" \
--attribute-mapping="$(echo $(IFS=,; echo "${MAPPING[*]}"))" \
--issuer-uri="https://token.actions.githubusercontent.com" \
--location="global" \
--workload-identity-pool="github" \
--project=$PROJECT || true
envsubst < infrastructure/deploy-policy.json | \
gcloud iam \
service-accounts set-iam-policy \
deploy@$PROJECT.iam.gserviceaccount.com \
/dev/stdin
Note that this workflow reads the file deployers.txt with the list of GitHub usernames, each on a separate line. These usernames are whitelisted to use the deploy service account. Edit this list and add your team members.
Warning: This list is only a template for new projects. When you change it, nothing happens with the existing projects. For them, you should change the whitelists manually.
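A side note on a shell idiom used in the Workload Identity Federation step above: the attribute mapping is kept as a bash array for readability and then joined into the comma-separated string that gcloud expects. A standalone sketch of the join:

```shell
# Join a bash array with commas — the idiom behind the
# --attribute-mapping flag in the workflow above.
MAPPING=(
  'google.subject=assertion.sub'
  'attribute.actor=assertion.actor'
  'attribute.repository=assertion.repository'
)
JOINED=$(IFS=,; echo "${MAPPING[*]}")
echo "$JOINED"
# → google.subject=assertion.sub,attribute.actor=assertion.actor,attribute.repository=assertion.repository
```

Setting `IFS` inside the `$( … )` subshell keeps the change from leaking into the rest of the script.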
Reusable workflow: deploy
Here we do all the less-privileged tasks that do not require project-creator and can be done with the deploy service account. This means they will run even in production, where the access for project-creator is revoked.
The key step is calling terraform apply to set up everything that can be done declaratively.
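One step in this workflow allocates a version string from the app’s pubspec.yaml version, a timestamp input, and the commit hash. A self-contained sketch of that logic, with fabricated inputs so the snippet runs anywhere:

```shell
# Build the image version string the same way the deploy workflow does,
# with faked inputs (the real workflow reads the repo's pubspec.yaml,
# a timestamp input, and `git rev-parse HEAD`).
printf 'name: capitalizer\nversion: 1.2.3\n' > /tmp/pubspec.yaml
APP_VERSION=$(grep '^version:' /tmp/pubspec.yaml | awk '{print $2}')
TIMESTAMP=20240101-120000   # normally a workflow input
COMMIT_HASH=abc1234         # normally the full commit hash
VERSION="v$APP_VERSION-$TIMESTAMP-$COMMIT_HASH"
echo "$VERSION"
# → v1.2.3-20240101-120000-abc1234
```

Including both the app version and the commit hash makes any deployed image traceable back to its exact source.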
on:
workflow_call:
inputs:
project:
required: true
type: string
project_number:
required: true
type: string
timestamp:
required: true
type: string
env:
TF_LOG: DEBUG
jobs:
_:
runs-on: ubuntu-latest
permissions:
id-token: write
steps:
- uses: actions/checkout@v4
- name: 'Export the environment variables'
run: |
echo "DART_VERSION=${{ vars.DART_VERSION }}" >> $GITHUB_ENV
echo "PROJECT=${{ inputs.project }}" >> $GITHUB_ENV
echo "PROJECT_NUMBER=${{ inputs.project_number }}" >> $GITHUB_ENV
echo "REGION=${{ vars.REGION }}" >> $GITHUB_ENV
echo "STATE_BUCKET=gs://${{ inputs.project }}-tf-state" >> $GITHUB_ENV
echo "TIMESTAMP=${{ inputs.timestamp }}" >> $GITHUB_ENV
echo "ZONE=${{ vars.ZONE }}" >> $GITHUB_ENV
- uses: 'google-github-actions/auth@v2'
with:
workload_identity_provider: 'projects/${{ env.PROJECT_NUMBER }}/locations/global/workloadIdentityPools/github/providers/github'
service_account: 'deploy@${{ env.PROJECT }}.iam.gserviceaccount.com'
- name: 'Wait until IAM policy takes effect'
uses: nick-fields/retry@7152eba30c6575329ac0576536151aca5a72780e
with:
timeout_minutes: 1
max_attempts: 5
retry_wait_seconds: 60
command: |
gcloud projects list
- name: 'Check if the state bucket for Terraform exists'
id: probe_bucket
run: |
set +e
gcloud storage buckets describe $STATE_BUCKET
[ $? -eq 0 ] && echo "exists=true" >> $GITHUB_OUTPUT || echo "exists=false" >> $GITHUB_OUTPUT
- name: 'Create the state bucket for Terraform'
if: steps.probe_bucket.outputs.exists == 'false'
run: |
gcloud storage \
buckets create $STATE_BUCKET \
--location=$REGION \
--uniform-bucket-level-access \
--project=$PROJECT
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ vars.TERRAFORM_VERSION }}
terraform_wrapper: false
- run: terraform fmt -check -recursive
continue-on-error: true
- name: 'Allocate a version'
run: |
export APP_VERSION=$(grep '^version:' capitalizer/pubspec.yaml | awk '{print $2}');
export COMMIT_HASH=$(git rev-parse HEAD)
export VERSION="v$APP_VERSION-$TIMESTAMP-$COMMIT_HASH"
echo "Version: $VERSION"
echo "VERSION=$VERSION" >> $GITHUB_ENV
- name: 'Check if the Artifact Registry repository exists'
id: probe_repository
run: |
set +e
gcloud artifacts repositories list --location=$REGION --project=$PROJECT | grep my-repository
[ $? -eq 0 ] && echo "exists=true" >> $GITHUB_OUTPUT || echo "exists=false" >> $GITHUB_OUTPUT
- name: 'Create the Artifact Registry repository'
if: steps.probe_repository.outputs.exists == 'false'
run: |
gcloud artifacts \
repositories create my-repository \
--repository-format=DOCKER \
--location=$REGION \
--project=$PROJECT
- name: 'Submit a build'
run: |
SUBSTITUTIONS=(
"_VERSION=$VERSION"
"_DART_VERSION=$DART_VERSION"
"_REGION=$REGION"
"_REPOSITORY=my-repository"
); gcloud builds \
submit \
--project=$PROJECT \
--substitutions="$(echo $(IFS=,; echo "${SUBSTITUTIONS[*]}"))" \
--config=capitalizer/cloudbuild.yaml \
capitalizer
- name: 'Terraform'
run: |
cd infrastructure/terraform
envsubst < backend.tf.template > backend.tf
terraform init
terraform apply \
-auto-approve \
-var="PROJECT=$PROJECT" \
-var="REGION=$REGION" \
-var="ZONE=$ZONE" \
-var="VERSION=$VERSION"Note that at the beginning we need to wait until the access policy takes effect. When you run commands that grant access, they are not effective immediately. This was not noticeable in a terminal, but GitHub workflows run fast, and this delay can break the workflow if we don’t wait.
For this, we use a GitHub action that retries commands. Since it’s a third-party action from an unknown publisher, we pin it to a specific commit hash to make sure we don’t pick up a future malicious version. Always do this for security. I wish GitHub shipped a built-in action for this purpose so we did not have to hardcode the version.
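If you prefer not to depend on a third-party action at all, the retry can be done in plain shell. A minimal sketch of an equivalent loop (`retry` and its parameters are our own names here, not part of any workflow):

```shell
# Retry a command up to N times with a fixed delay between attempts.
retry() {
  local max_attempts=$1 delay=$2; shift 2
  local i
  for i in $(seq 1 "$max_attempts"); do
    "$@" && return 0                               # success: stop retrying
    [ "$i" -lt "$max_attempts" ] && sleep "$delay" # wait before next attempt
  done
  return 1                                         # all attempts failed
}
# Example: `true` succeeds immediately; in the workflow the command
# would be `gcloud projects list`.
retry 5 1 true && echo "succeeded"
# → succeeded
```

The trade-off is more code in the workflow versus trust in (and pinning of) an external action.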
Reusable workflow: test
This one is trivial. The only notable thing is getting the container output regardless of the test result, which can help in debugging the issues.
on:
workflow_call:
inputs:
project:
required: true
type: string
project_number:
required: false
type: string
jobs:
_:
runs-on: ubuntu-latest
permissions:
id-token: write
steps:
- uses: actions/checkout@v4
- name: 'Export the environment variables'
run: |
echo "DART_VERSION=${{ vars.DART_VERSION }}" >> $GITHUB_ENV
echo "PROJECT=${{ inputs.project }}" >> $GITHUB_ENV
echo "PROJECT_NUMBER=${{ inputs.project_number }}" >> $GITHUB_ENV
- uses: 'google-github-actions/auth@v2'
with:
workload_identity_provider: 'projects/${{ env.PROJECT_NUMBER }}/locations/global/workloadIdentityPools/github/providers/github'
service_account: 'deploy@${{ env.PROJECT }}.iam.gserviceaccount.com'
- uses: 'google-github-actions/setup-gcloud@v2'
- uses: dart-lang/setup-dart@v1
with:
sdk: ${{ vars.DART_VERSION }}
- name: 'Test'
run: |
cd capitalizer
dart pub get
dart test
- uses: 'google-github-actions/get-gke-credentials@v2'
if: always()
with:
cluster_name: 'my-cluster'
location: ${{ vars.ZONE }}
project_id: ${{ env.PROJECT }}
- name: 'Container output'
if: always()
run: |
kubectl get pods
kubectl logs $(kubectl get pods -o name | grep dart-pubsub-gke-demo | head -n 1)
Reusable workflow: delete
This one is also trivial. It deletes the project. But first, it attempts terraform destroy. In this example, it’s not necessary when shutting down a project because a Google Cloud project will take down all the resources with it. However, a failure of terraform destroy may indicate configuration errors which can prevent the proper update of permanent environments, so it’s an important step.
Note that we use the project-creator service account again. This is for reliability: if there were any errors creating the deploy account, this step can still delete the project successfully. Additionally, this prevents accidental deletion of the production project because project-creator can’t access it.
on:
workflow_call:
inputs:
project:
required: true
type: string
secrets:
MASTER_PROJECT:
required: true
MASTER_PROJECT_NUMBER:
required: true
jobs:
_:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: 'Export the environment variables'
run: |
echo "PROJECT=${{ inputs.project }}" >> $GITHUB_ENV
echo "REGION=${{ vars.REGION }}" >> $GITHUB_ENV
echo "ZONE=${{ vars.ZONE }}" >> $GITHUB_ENV
- uses: 'google-github-actions/auth@v2'
with:
workload_identity_provider: 'projects/${{ secrets.MASTER_PROJECT_NUMBER }}/locations/global/workloadIdentityPools/github/providers/github'
service_account: 'project-creator@${{ secrets.MASTER_PROJECT }}.iam.gserviceaccount.com'
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ vars.TERRAFORM_VERSION }}
terraform_wrapper: false
- name: 'Terraform'
run: |
cd infrastructure/terraform
envsubst < backend.tf.template > backend.tf
terraform init
terraform destroy \
-auto-approve \
-var="PROJECT=$PROJECT" \
-var="REGION=$REGION" \
-var="ZONE=$ZONE" \
-var='VERSION=""'
- name: 'Delete the project'
if: always()
run: |
gcloud projects delete $PROJECT --quiet
Wiring the pieces together and deploying
1. Come up with a production project name.
2. On GitHub, go to Environments and create a new one named prod.
3. Add PROJECT environment variable with your production project ID:
4. Run deploy_prod workflow. Give it about 20 minutes to set everything up.
5. Find the project number in Google Cloud console or in the run output:
6. Add that number as PROJECT_NUMBER environment variable.
Subsequent runs will be faster:
Protecting the production environment
Revoke the access from project-creator
The production Google Cloud project is initially created by the first run of the deploy_prod workflow. Hence, the project-creator service account has the Owner role on it.
Anyone whitelisted to use this service account for the innocent purpose of creating transient stages could abuse that role to tamper with production.
Revoke this access in Google Cloud console.
Set the whitelist for deploy
When the production project is created, the whitelist of GitHub users who can use the deploy service account is taken from the file deployers.txt in the repository. You normally put all the developers who need to set up transient stages there. For production, this list should likely be shorter.
Sign in as your normal Google account and run the following for the production project:
export GITHUB_USERS=(
"'username1'"
"'username2'"
)
GITHUB_USERS_STR=$(IFS=,; echo "${GITHUB_USERS[*]}") \
envsubst < infrastructure/deploy-policy.json | \
gcloud iam \
service-accounts set-iam-policy \
deploy@$PROJECT.iam.gserviceaccount.com \
/dev/stdin
Delete all service account keys
Throughout this tutorial, we created a few keys for service accounts for experimenting in a terminal. Delete these keys in Google Cloud console. Deleting just the exported JSON keys is not enough.
Using Workload Identity Federation with Dart
It’s tricky and does not work out of the box.
Clients connect to Google Cloud using short-lived access tokens. These tokens can be produced in various ways with various scenarios of authentication.
The package googleapis_auth handles authentication to Google Cloud. It supports the following methods:
- Service account keys. If the environment variable GOOGLE_APPLICATION_CREDENTIALS is set, the SDK reads the key from the file it points to. The long-lived service account key is used to obtain a short-lived access token; the SDK does this for us. We used this scenario in the previous article when running the test from the command line and when running the app in Kubernetes with the key provided as a secret.
- Metadata server. If the variable is not set, the SDK attempts to connect to a server at the address metadata.google.internal. This lookup of a non-existent public domain name only works inside GKE nodes and other Google Cloud environments. This server, trusted by Google Cloud, provides the SDK with a token to access Google Cloud resources. We use it in this article’s setup for the app that runs in GKE.
However, when we run a test (or any other Dart app) directly from a GitHub workflow, the situation is different. The variable GOOGLE_APPLICATION_CREDENTIALS exists, but it points not to a service account key but to a different type of credential file, specific to Workload Identity Federation. The more elaborate SDKs for JavaScript, Python, Go, and other languages can use those files to obtain access tokens, but the Dart SDK cannot.
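You can tell the two credential types apart by the top-level type field in the JSON: a service account key has "type": "service_account", while a Workload Identity Federation credential configuration has "type": "external_account". A sketch with a fabricated, truncated file for illustration:

```shell
# Inspect which kind of credentials a GOOGLE_APPLICATION_CREDENTIALS
# file holds (the file below is fabricated for this example).
cat > /tmp/creds.json <<'EOF'
{"type": "external_account", "audience": "//iam.googleapis.com/projects/123/..."}
EOF
CRED_TYPE=$(sed -n 's/.*"type": *"\([^"]*\)".*/\1/p' /tmp/creds.json)
echo "$CRED_TYPE"
# → external_account
```

A quick check like this in a workflow step can save a round of confusing authentication errors.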
As a workaround, I created the package wif_workaround. It uses the gcloud command to obtain an access token from this type of credential file. This is why you see the prefix w. when creating the client in main_test.dart:
import 'package:wif_workaround/wif_workaround.dart' as w;
// ...
final pubsub = PubSub(
await w.clientViaApplicationDefaultCredentials(scopes: PubSub.SCOPES),
projectId,
);
And this is why we install gcloud in the test workflow shown earlier. In my experiment, this adds 16 seconds to the workflow run time.
If you want the Dart SDK for Google Cloud to handle such keys natively, upvote this issue because this is how the Dart team at Google prioritizes their work.
Further optimization
Run on pull requests
If you want, you can configure stage_delete workflow to run on each pull request. Here is how to do this.
However:
- The workflow is terribly inefficient at this point. It takes 27 minutes in my experiment, with GKE cluster creation and destruction taking most of that time.
- If you mess something up, you can accidentally leave a GKE cluster working, which will cost over 100 dollars per month per project. Mess up 10 pull requests, and you will have runaway bills.
That’s why I suggest sticking to manual runs for transient stages until you are familiar and comfortable with the CI/CD. By familiar and comfortable I mean not needing this tutorial.
A bit safer is to have a persistent project for testing and deploy all pull requests to it. Concurrent runs may break the project, so you may need some lock; GitHub Actions’ built-in concurrency setting can serve this purpose. Waiting should not be a problem because deploy_prod takes just over 3 minutes in my experiment.
Build faster
Syntax errors are the most common ones. Currently, to detect them in the workload, we create a whole project and do a lot of setup around it. While this gives the best isolation, an obvious optimization is to build images in a shared project.
This involves more complex permissions and requires attention to removing old images. Currently, we don’t care about old images because transient projects are deleted along with them.
Use a shared GKE cluster
We can cut a 27-minute run down to 3 minutes if we skip the cluster creation and deletion. This involves even more complex permissions.
We will do all of that and more in the final article in this series.
To make sure you get notified, follow my Telegram channel: ainkin_com
The importance of DevOps
What we did was a hello-world DevOps setup, and you can’t drop much from it. Yet it’s 20 times larger than our hello-world workload, see the line count:
And most of the production DevOps out there is much less clear, reliable, and automated than what we did here. This makes DevOps the single most underrated IT skill, in my opinion.
Never miss a story, follow my Telegram channel: ainkin_com
