Google Cloud Build is a service that provisions build agents to run Continuous Integration and Continuous Deployment (CI/CD) tasks. The image above shows the architecture of a CI/CD pipeline for deploying a Kubernetes application. This article summarizes all the components and actions involved in setting up this Cloud Build-based CI/CD pipeline. It provides a 1,000-ft view without going into the step-by-step details of any one specific task.
There are lots of articles about doing specific tasks, but I seldom see ones describing the big picture of what is required to set up an end-to-end solution. I believe that if you understand what you are building, it becomes easier to source individual parts…not the other way round.
Before we proceed: I’ll be covering intermediate to advanced concepts in Kubernetes, infrastructure security, and DNS, so I assume familiarity with these topics or that you can read up on them later. The entire infrastructure is provisioned with Terraform, and all source code is available on Github. Let’s dive in.
A note on the multi-cloud architecture
Notice that the system is grouped into 4 sections, each provisioned on a different cloud provider as follows: Github for source code repository; Google Cloud Platform for build agents and container registry; DigitalOcean for Kubernetes clusters; and Namecheap for DNS.
Having components split across different providers makes inter-operation a bit more challenging, but it also provides benefits: using the most cost-effective provider for each service, avoiding vendor lock-in, and gaining a deeper understanding of how the various pieces fit together. The last of these is my main reason for designing the architecture this way.
The Change Management Process
All CI/CD pipelines aim to solve one core problem: how to enable developers to incrementally update their software without breaking the users’ experience.
A common idea is to roll out changes first to a Staging environment, then promote that change to a Production environment. In my case, the triggers for moving changes from one environment to another are Git operations (like merging a Pull Request)…a principle known as GitOps.
The diagram below shows how our system will propagate change from commit to production.
- First, the Developer creates 3 Triggers on Cloud Build, one for each pipeline.
- Creating a Pull Request triggers the Development Pipeline which runs tests and reports a Commit Status to Github. Assuming all tests pass, the PR is approved and merged into master.
- Changing master triggers the Staging Pipeline, which builds a Docker image, pushes the image to Google Container Registry (GCR), and deploys the application to the Staging Cluster (by applying its K8s YAML files). Then developers can sanity-check the app running on Staging.
- If deemed correct, the master commit running on Staging can be tagged with a version like v1.1.2. This action triggers the Production Pipeline which applies the K8s files to the Production Cluster.
With that, let’s walk through each piece of implementing the pipeline.
Setting up the Github Repository
The repository contains 3 Cloud Build YAML files: app/dev.cloudbuild.yaml, app/staging.cloudbuild.yaml, and app/prod.cloudbuild.yaml. Each specifies the steps to be executed by the Development, Staging, and Production pipelines respectively.
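As a rough illustration of what such a pipeline file contains (the actual files in the repo may differ; the image name polite and the app directory are assumptions on my part), the staging pipeline’s build-and-push portion could look like this:

```yaml
# Hypothetical sketch of app/staging.cloudbuild.yaml: build the image,
# then push it to GCR. $PROJECT_ID and $SHORT_SHA are built-in
# Cloud Build substitutions; the "polite" image name is illustrative.
steps:
  # Build the Docker image from the app directory
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/polite:$SHORT_SHA', './app']
# Images listed here are pushed to the registry and recorded
# in the build results when the build succeeds
images:
  - 'gcr.io/$PROJECT_ID/polite:$SHORT_SHA'
```

The deploy step (applying the K8s YAML files) would follow these steps in the same file.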
Also, we commit 2 KUBECONFIG files: app/k8s/staging.kubeconfig.yaml and app/k8s/prod.kubeconfig.yaml, containing the connection parameters for the Staging and Production clusters respectively. Each file is used by the corresponding pipeline to deploy the application to its cluster. Importantly, the User Token is removed from the KUBECONFIG files and provided to each Cloud Build pipeline as an environment variable, so it is safe to commit these files to source control.
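For illustration, a token-less KUBECONFIG committed this way might look like the following sketch (the cluster endpoint, CA data, and names are placeholders, not the repo’s actual values):

```yaml
# Hypothetical app/k8s/staging.kubeconfig.yaml with credentials stripped.
apiVersion: v1
kind: Config
clusters:
  - name: staging
    cluster:
      server: https://<staging-cluster-endpoint>
      certificate-authority-data: <base64-encoded CA certificate>
contexts:
  - name: staging
    context:
      cluster: staging
      user: deployer
current-context: staging
users:
  - name: deployer
    # The token is deliberately absent; it is supplied to kubectl at
    # build time (e.g. via --token), so the file is safe to commit.
    user: {}
```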
Setting up Google Cloud Build
From the GCP Console, we connect Cloud Build to our Github repository by going through an OAuth flow that installs and authorizes the Cloud Build Github App to watch and report changes on our repository. Alternatively, we could mirror the Github repository to a Cloud Source Repository and have Cloud Build watch the mirror…but this is complicated and slow.
Then, we create 3 Triggers, each configured to run one of our 3 pipelines:
- Feature Branch Trigger: Watches Pull Requests, and runs the steps in app/dev.cloudbuild.yaml.
- Deploy to Staging Trigger: Watches the master branch and runs the steps in app/staging.cloudbuild.yaml.
- Deploy to Prod Trigger: Watches for Tags of the form v1.2.3 and runs the steps in app/prod.cloudbuild.yaml.
Also, you can configure a Trigger to expose certain substitution variables to the pipeline processes. So, we set the variable _DEPLOYER_TOKEN=<User Token> in the Deploy to Staging and Deploy to Prod triggers.
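A deploy step could then pass this variable to kubectl at build time, roughly like this (a sketch under assumed paths; the repo’s actual step may differ):

```yaml
# Hypothetical deploy step from a staging cloudbuild file.
# _DEPLOYER_TOKEN is the user-defined substitution set on the trigger;
# Cloud Build substitutes its value into the args before the step runs.
steps:
  - name: 'gcr.io/cloud-builders/kubectl'
    entrypoint: 'kubectl'  # bypass the builder's GKE-specific wrapper
    args:
      - '--kubeconfig=app/k8s/staging.kubeconfig.yaml'
      - '--token=$_DEPLOYER_TOKEN'
      - 'apply'
      - '-f'
      - 'app/k8s/'
```

Because the token never appears in the committed KUBECONFIG, it only exists in the trigger configuration.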
Lastly, Cloud Build comes with a default Service Account authorized to push images to the GCR registry of the same project. However, our DigitalOcean clusters lack the authorization to pull images from that registry. So, we create a dedicated GCP IAM Service Account with the Storage Object Viewer role on the GCP Project. Then we download a JSON Key for this Service Account…later this file will be used to create an ImagePullSecret in both the Staging and Production clusters for pulling images from the registry.
Setting up the Kubernetes Clusters
We set up 2 Kubernetes clusters for the Staging and Production environments respectively. Each application will be deployed to its own Namespace…virtually isolating it from other applications in that environment. All the resources describing the application — Deployment, Service, Ingress, etc. — will be deployed to that namespace.
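For example, the routing resource among these might be an Ingress along the following lines (a sketch with assumed names and namespace; the host is the production hostname from the DNS setup):

```yaml
# Hypothetical Ingress in the app's namespace, routing the public
# hostname to the application's Service via the Traefik ingress class.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: polite
  namespace: polite
  annotations:
    kubernetes.io/ingress.class: traefik
spec:
  rules:
    - host: polite.opsolute.com
      http:
        paths:
          - path: /
            backend:
              serviceName: polite
              servicePort: 80
```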
When a Deployment is created in a namespace, the Kubelet needs authorization to pull images from GCR. So, we create an ImagePullSecret from the GCP Service Account Key which we downloaded earlier, then add this secret as an imagePullSecret to the default Service Account in the cluster namespace. Consequently, every Deployment in that namespace can pull images from GCR.
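In manifest form, the result is roughly the following pair of objects (names and namespace are illustrative; in practice the Secret is usually created with kubectl create secret docker-registry from the key file rather than written by hand):

```yaml
# Hypothetical ImagePullSecret built from the GCP Service Account key.
# For GCR, the docker-registry username is the literal "_json_key" and
# the password is the contents of the downloaded JSON key file.
apiVersion: v1
kind: Secret
metadata:
  name: gcr-pull-secret
  namespace: polite
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded docker config containing the key>
---
# Attach the secret to the namespace's default Service Account so every
# Deployment in the namespace can pull from GCR without extra config.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: polite
imagePullSecrets:
  - name: gcr-pull-secret
```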
Finally, we run a Reverse Proxy in each cluster to load-balance incoming requests to the appropriate application service. Traefik is one such reverse proxy, with support for automatically provisioning Letsencrypt TLS Certificates. Importantly, Traefik runs as a LoadBalancer-type K8s service, so that the Cloud Provider (in this case DigitalOcean) provisions an actual load balancer with a public static IP address. You can then access the cluster — and consequently your application — through that IP address.
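The key detail is the Service type. A minimal sketch, with assumed labels and namespace:

```yaml
# Hypothetical Service exposing the Traefik pods. type: LoadBalancer
# prompts DigitalOcean to provision an external load balancer with a
# public IP; a ClusterIP or NodePort service would not.
apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: kube-system
spec:
  type: LoadBalancer
  selector:
    app: traefik
  ports:
    - name: http
      port: 80
    - name: https
      port: 443
```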
Setting up DNS
Accessing your services by IP address is impractical, so purchase a Domain Name from a provider like Namecheap — for this project, I purchased opsolute.com. DNS queries for domains with that suffix will be sent to Namecheap’s nameservers.
In order to redirect such requests to our DigitalOcean clusters, we add NS Records to the Namecheap account to delegate query resolution to the DigitalOcean nameservers. Then we create 2 A Records that map to the static IP addresses of our Staging and Production clusters respectively.
- Namecheap NS Records: opsolute.com. → DigitalOcean Nameservers
- DigitalOcean A Record: polite.opsolute.com. → 126.96.36.199 (Production Cluster Traefik LoadBalancer IP) → Application
- DigitalOcean A Record: staging.polite.opsolute.com. → 188.8.131.52 (Staging Cluster Traefik LoadBalancer IP) → Application
Testing the entire setup
Our sample application is a Go server called polite which returns a greeting on the / route and health status on the /health route. You can customize the greeting by setting the --greeting flag in the Deployment file. We’ll deploy different versions of our application to the Staging and Production environments with different messages. Follow the steps below to test the overall pipeline.
- Edit the Deployment file to set --greeting="Bonjour!". Make a Pull Request for this change and watch the Cloud Build Development Pipeline kick in to run tests and report a success status. Then approve and merge the PR.
- On merging to master, the Staging Pipeline should activate to build and deploy the application image to the Staging Cluster.
- From your browser visit the staging URL https://staging.polite.opsolute.com and you should see the greeting Bonjour!.
- Now tag the master branch like so: git tag v1.0.0 origin/master && git push --tags. This should trigger the Deploy to Prod Pipeline, which will deploy the image to the Production cluster. Visit the URL https://polite.opsolute.com and you should see the greeting Bonjour!.
We now have one version of our app running on both Staging and Production. Next, we introduce a change that makes our greeting Spanish, then deploy that change to Staging.
- Repeat steps 1 and 2 above but this time, set --greeting="Hola!".
- Visit https://staging.polite.opsolute.com and you should see Hola!. However, the Production URL https://polite.opsolute.com should still return Bonjour!.
And that’s it! Our pipeline can deploy different versions to Staging and Production.
Challenges I faced
I enjoyed working on this project, but it wasn’t without its fair share of frustrations. Some I’ll briefly describe below.
- Every time you deploy a K8s LoadBalancer service, DigitalOcean provisions a Load Balancer with a random public static IP address, so you need to update your DNS A Records to point to this new IP. This is very inconvenient. The solution would be to purchase what they call a Floating IP (a static IP with a lifecycle independent of your service) and specify it as your service’s loadBalancerIP…but this is sadly not supported.
- Cloud Build does not support triggering on Github Pull Merge branches. Whenever you make a Pull Request, Github simulates the resulting merge of the source and destination branches in a branch named refs/pull/ID/merge. Hence, you could test and build the result of merging your PR without actually merging it. Sadly, this is not supported by GCP Cloud Build.
- I tried using the Kubernetes Provider and the Helm Provider to manage Kubernetes resources in Terraform. The aim was to declaratively manage important cluster resources like Namespaces and the Traefik ingress controller as Infrastructure as Code. However, these providers had poor support for Kubernetes object specifications and had resource-dependency issues. So for now, it’s best to keep Kubernetes stuff outside Terraform until these providers become more mature.
Overall, I found Cloud Build great for quickly setting up CI/CD pipelines for Kubernetes applications.
References
- Cloud Build — Automated builds for continuous integration (Google Cloud)
- Terraform by HashiCorp
- eyeezzi/k8s-cloud-build-cicd — https://github.com/eyeezzi/k8s-cloud-build-cicd
- GitOps: what you need to know
- Running builds with GitHub Checks (Cloud Build Documentation, Google Cloud)
- Mirroring a GitHub repository (Cloud Source Repositories Documentation, Google Cloud)
- Automating builds with Cloud Build (Cloud Source Repositories Documentation, Google Cloud)
- Substituting variable values (Cloud Build Documentation, Google Cloud)
- Setting service account permissions (Cloud Build Documentation, Google Cloud)
- Creating and managing service account keys (Cloud IAM Documentation, Google Cloud)
- How to Create Kubernetes Clusters Using the Control Panel (DigitalOcean)
- Google Cloud Registry (GCR) with external Kubernetes
- Let’s Encrypt
- Traefik ACME documentation — https://docs.traefik.io/https/acme/
- How to Add Load Balancers to Kubernetes Clusters (DigitalOcean)
- How To Point to DigitalOcean Nameservers From Common Domain Registrars — https://www.digitalocean.com/community/tutorials/how-to-point-to-digitalocean-nameservers-from-common-domain-registrars
- How to set static IP for loadbalancer in Kubernetes? — https://www.digitalocean.com/community/questions/how-to-set-static-ip-for-loadbalancer-in-kubernetes