CloudRun Canary Releases with Terraform

Gloster Canary, image credit petguide.com

Intro

In this how-to post, I’ll illustrate an implementation of canary releases for the GCP CloudRun service, managed with terraform.

Implementation

The implementation is based on the CloudRun (CR) ability to split the traffic coming to a service between multiple versions of the service configuration, known as revisions (two, in the canary case).

From the terraform side, the google_cloud_run_service resource is the only way[3][4] to control a CR service. The traffic a service receives is controlled with a traffic block, so the terraform configuration of a service in its normal state looks like this:

Image 1. Normal state of a service
# for illustration purposes the rest of the code is omitted
traffic {
  percent         = 100
  latest_revision = true
}

autogenerate_revision_name = true

The canary state, with two revisions each serving a different share of the traffic, is represented by a config like this:

traffic {
  percent       = 90
  revision_name = live_revision_name # placeholder for the live revision's name
}

traffic {
  percent       = 10
  revision_name = canary_revision_name # placeholder for the canary revision's name
}

Simple as this static configuration may appear, its dynamic equivalent comes with complications that arise from the fact that the multiple revisions feature is not a first-class citizen in the google terraform provider:

  1. Autogenerated revision names can no longer be used
  2. The google_cloud_run_service resource operates a single revision per deployment: how does one manage and reference two at the same time?
  3. A mechanism to handle the two states is required:
    – Control the current state
    – Control traffic split percentage between revisions in the canary state

To address revision name-related problems, let’s take a look at how a revision name is set within the resource:

resource "google_cloud_run_service" "service" {
  name = my_service_name # placeholder for the service name
  template {
    metadata {
      name = revision_name # placeholder for the revision name
    }
  }
}

With the following considerations in mind:

  1. A revision name must be prefixed with the name of the service it belongs to, and
  2. There should be a distinction between live and canary versions, and
  3. A revision name can’t be reused to deploy a new application version, so it has to be unique,

revisions can be named following a pattern like this:

service_name + changing_part (commit, application version, etc.) + live/canary identifier, e.g.:

Image 2. New naming pattern
locals {
  # Cloud Run names allow only lowercase letters, digits and hyphens
  rev_name_live   = "my-service-name-${var.app_version_live}-live"
  rev_name_canary = "my-service-name-${var.app_version_canary}-canary"
}

resource "google_cloud_run_service" "service" {
  name = "my-service-name"
  template {
    metadata {
      name = local.rev_name_live
    }
  }
  traffic {
    percent       = 100
    revision_name = local.rev_name_live
  }
}
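The plan output later in the post shows a seven-character hash in the revision names, while the image tag carries the full commit hash. As a rough sketch of how such a changing part could be derived inside terraform (the commit_sha_live variable is a hypothetical input, not something the setup above defines):

```hcl
# Sketch only: commit_sha_live is a hypothetical input used for illustration.
variable "commit_sha_live" {
  description = "Full git commit SHA of the live application version"
  type        = string
}

locals {
  # First 7 characters of the SHA, lowercased to fit the revision naming rules.
  app_version_live = lower(substr(var.commit_sha_live, 0, 7))

  # "my-service-name" stands in for the real service name.
  rev_name_live = "my-service-name-${local.app_version_live}-live"
}
```

Passing the short version in directly, as the app_version_live variable does, works just as well; the point is only that the changing part stays short and unique per application version.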

With revision naming figured out, how can we actually manage and use multiple revisions at the same time? Knowing that:

  1. We need a new revision name for canary — it can be formed anytime as the naming pattern is now known, and
  2. Revision itself should physically exist — it needs to be deployed first,

We introduce a canary switch — e.g. a bool variable canary_enabled, that controls the state, and a dynamic traffic block that is conditionally created when the service is in the canary state:

variable "canary_enabled" {
  description = "Canary switch"
  type        = bool
}

locals {
  rev_name_live   = "my-service-name-${var.app_version_live}-live"
  rev_name_canary = "my-service-name-${var.app_version_canary}-canary"
}

resource "google_cloud_run_service" "service" {
  name = "my-service-name"

  template {
    metadata {
      # if canary is enabled, deploy a canary revision
      name = var.canary_enabled ? local.rev_name_canary : local.rev_name_live
    }
    spec {
      containers {
        # if canary is enabled, deploy a canary image
        image = var.canary_enabled ? var.canary_image_name : var.live_image_name
      }
    }
  }

  traffic {
    # live serves 100% by default. If canary is enabled, this traffic block controls canary
    percent = var.canary_enabled ? var.canary_percent : 100
    # revision is named live by default. When canary is enabled, a new revision named canary is deployed
    revision_name = var.canary_enabled ? local.rev_name_canary : local.rev_name_live
  }

  dynamic "traffic" {
    # if canary is enabled, add another traffic block
    for_each = var.canary_enabled ? ["canary"] : []
    content {
      # current live's traffic is now controlled here
      percent       = 100 - var.canary_percent
      revision_name = local.rev_name_live
    }
  }
}

With the switch off, the configuration has only one traffic block that controls the current live revision. Once switched on, and here’s the trick to manage two revisions at the same time, the current live traffic is controlled in the dynamic block, while the original traffic block manages the canary revision.

Let’s toggle the switch:

  # module.example.google_cloud_run_service.service will be updated in-place
  ~ resource "google_cloud_run_service" "service" {
        id   = "locations/us-west2/namespaces/project/services/example-cr-service"
        name = "example-cr-service"

      ~ template {
          ~ metadata {
              ~ name = "example-cr-service-9da0c6d-live" -> "example-cr-service-06dccff-canary"
            }

          ~ spec {
              ~ containers {
                  ~ image = "us.gcr.io/project/repo:9da0c6d1cddb26b913859e48e2ccdeba3c0ee596" -> "us.gcr.io/project/repo:06dccffbcd0fe15807feed60c51bd03dc12eef18"
                    # (2 unchanged attributes hidden)
                }
            }
        }

      ~ traffic {
          ~ percent       = 100 -> 10
          ~ revision_name = "example-cr-service-9da0c6d-live" -> "example-cr-service-06dccff-canary"
            # (1 unchanged attribute hidden)
        }
      + traffic {
          + percent       = 90
          + revision_name = "example-cr-service-9da0c6d-live"
        }
    }
Image 3. Rolling out a canary version

Note the additional canary_percent variable, which we can now use to control the traffic split between the two revisions:

  # module.example.google_cloud_run_service.service will be updated in-place
  ~ resource "google_cloud_run_service" "service" {
        id   = "locations/us-west2/namespaces/project/services/example-cr-service"
        name = "example-cr-service"

      ~ traffic {
          ~ percent = 10 -> 20
            # (2 unchanged attributes hidden)
        }
      ~ traffic {
          ~ percent = 90 -> 80
            # (2 unchanged attributes hidden)
        }
    }
Image 4. Shifting the traffic percentage between the versions
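Since the live share is computed as 100 minus canary_percent, a value outside the 0 to 100 range would produce an invalid split. A hedged sketch of a stricter variant of the canary_percent declaration, using a validation block (available in Terraform 0.13 and later); the rule itself is my assumption and not part of the setup described here:

```hcl
variable "canary_percent" {
  description = "Percent of traffic canary revision will get"
  type        = number

  # Assumed rule: accept whole numbers from 0 to 100 only.
  validation {
    condition = (
      var.canary_percent >= 0 &&
      var.canary_percent <= 100 &&
      floor(var.canary_percent) == var.canary_percent
    )
    error_message = "The canary_percent value must be a whole number between 0 and 100."
  }
}
```

With this in place a bad value is rejected at plan time, before any request is sent to the CR API.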

The original problems are now tackled: a canary revision can be conditionally deployed alongside the currently running live version to serve an adjustable portion of the service’s traffic, with both revisions controlled by the same terraform resource. A revision’s name contains a changing part, which allows new revision names to be generated.

Let’s now promote the canary version to live by updating the live_image_name variable and toggling the canary switch off:

  # module.example.google_cloud_run_service.service will be updated in-place
  ~ resource "google_cloud_run_service" "service" {
        id   = "locations/us-west2/namespaces/project/services/example-cr-service"
        name = "example-cr-service"

      ~ template {
          ~ metadata {
              ~ name = "example-cr-service-06dccff-canary" -> "example-cr-service-06dccff-live"
                # (3 unchanged attributes hidden)
            }
        }

      ~ traffic {
          ~ percent       = 20 -> 100
          ~ revision_name = "example-cr-service-06dccff-canary" -> "example-cr-service-06dccff-live"
            # (1 unchanged attribute hidden)
        }
      - traffic {
          - latest_revision = false -> null
          - percent         = 80 -> null
          - revision_name   = "example-cr-service-9da0c6d-live" -> null
        }
    }
Image 5. Promoting the canary version to live
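For reference, the promotion above could correspond to variable values roughly like the following. This is a sketch that assumes the variables are set through a tfvars file (the file name is hypothetical) and that app_version_live is updated together with the image, as the new revision name in the plan suggests; the values are borrowed from the plan output:

```hcl
# terraform.tfvars (hypothetical file name)

# the canary image becomes the live image
live_image_name  = "us.gcr.io/project/repo:06dccffbcd0fe15807feed60c51bd03dc12eef18"
app_version_live = "06dccff"

# back to the normal state: a single traffic block serving 100%
canary_enabled = false

# left untouched; not used for traffic while the switch is off
canary_image_name  = "us.gcr.io/project/repo:06dccffbcd0fe15807feed60c51bd03dc12eef18"
app_version_canary = "06dccff"
canary_percent     = 20
```

Because the image itself does not change, the plan only shows the revision being renamed from canary to live and the second traffic block being removed.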

Random postfix

variable "live_image_name" {
  description = "Live image name"
  type        = string
}

variable "max_scale" {
  description = "Cloud Run service max number of instances"
  type        = number
}

variable "canary_image_name" {
  description = "Canary image name"
  type        = string
}

variable "canary_enabled" {
  description = "Canary switch"
  type        = bool
}

variable "canary_percent" {
  description = "Percent of traffic canary revision will get"
  type        = number
}

variable "force_new_revision" {
  description = "Dummy variable to trigger a new revision name"
  type        = bool
}

resource "random_string" "rev_name_postfix_live" {
  # it is regenerated on changes to the following 'keepers' - properties of a service
  keepers = {
    image_name = var.live_image_name
    max_scale  = var.max_scale

    force_new_revision = var.force_new_revision
  }
  length  = 2
  special = false
  upper   = false
}

resource "random_string" "rev_name_postfix_canary" {
  keepers = {
    canary_enabled    = var.canary_enabled
    canary_image_name = var.canary_image_name
  }
  length  = 2
  special = false
  upper   = false
}

locals {
  rev_name_live   = "my-service-name-${var.app_version_live}-live-${random_string.rev_name_postfix_live.result}"
  rev_name_canary = "my-service-name-${var.app_version_canary}-canary-${random_string.rev_name_postfix_canary.result}"
  canary_percent  = var.canary_percent
}

Upon toggling the canary_enabled switch in the config file, we’ll see that the random part of the name is regenerated: the rev_name_postfix_canary resource is recreated in response to a change in its keepers, which then triggers the creation of a new CR revision:

  # module.example.google_cloud_run_service.service will be updated in-place
  ~ resource "google_cloud_run_service" "service" {
        id   = "locations/us-west2/namespaces/project/services/example-cr-service"
        name = "example-cr-service"
        # (4 unchanged attributes hidden)

      ~ template {
          ~ metadata {
              ~ name = "example-cr-service-8f09270-live-dw" -> (known after apply)
                # (3 unchanged attributes hidden)
            }
        }

      ~ traffic {
          ~ percent       = 100 -> 10
          ~ revision_name = "example-cr-service-8f09270-live-dw" -> (known after apply)
            # (1 unchanged attribute hidden)
        }
      + traffic {
          + percent       = 90
          + revision_name = "example-cr-service-8f09270-live-dw"
        }

        # (2 unchanged blocks hidden)
    }

  # module.example.random_string.rev_name_postfix_canary must be replaced
-/+ resource "random_string" "rev_name_postfix_canary" {
      ~ id      = "vw" -> (known after apply)
      ~ keepers = { # forces replacement
          ~ "canary_enabled" = "false" -> "true"
            # (1 unchanged element hidden)
        }
      ~ result  = "vw" -> (known after apply)
        # (9 unchanged attributes hidden)
    }
Image 6. Random part in a revision name

Additionally, we can add a special keeper, force_new_revision, that triggers the creation of a new revision even when there are no changes to the service properties, which may be useful in some cases.

Canary identifier

# canary identifier env var
dynamic "env" {
  for_each = var.canary_enabled ? { "CANARY" = "1" } : {}
  content {
    name  = env.key
    value = env.value
  }
}

      ~ spec {
          ~ containers {
                # (3 unchanged attributes hidden)

              + env {
                  + name  = "CANARY"
                  + value = "1"
                }
            }
        }
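The fragment above does not show where the dynamic block lives. As a sketch, assuming the flat canary_enabled variable declared earlier in the post, it nests inside template, spec and containers, which is where the plan output places the env change. The location value is borrowed from the plan output, and the metadata and traffic blocks are left out for brevity:

```hcl
# Sketch only: metadata and traffic blocks are omitted.
resource "google_cloud_run_service" "service" {
  name     = "example-cr-service"
  location = "us-west2"

  template {
    spec {
      containers {
        image = var.canary_enabled ? var.canary_image_name : var.live_image_name

        # canary identifier env var, present only while the canary switch is on
        dynamic "env" {
          for_each = var.canary_enabled ? { "CANARY" = "1" } : {}
          content {
            name  = env.key
            value = env.value
          }
        }
      }
    }
  }
}
```

The application can then check for the CANARY variable at runtime, for example to tag its logs or metrics as coming from the canary revision.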

Rollback

  # module.example.google_cloud_run_service.service will be updated in-place
  ~ resource "google_cloud_run_service" "service" {
        id   = "locations/us-west2/namespaces/project/services/example-cr-service"
        name = "example-cr-service"

      ~ template {
          ~ metadata {
              ~ name = "example-cr-service-06dccff-canary-rr" -> "example-cr-service-7b2b51c-live-8g"
            }
        }

      ~ traffic {
          ~ percent       = 10 -> 100
          ~ revision_name = "example-cr-service-06dccff-canary-zh" -> "example-cr-service-7b2b51c-live-8g"
            # (1 unchanged attribute hidden)
        }
      - traffic {
          - latest_revision = false -> null
          - percent         = 90 -> null
          - revision_name   = "example-cr-service-7b2b51c-live-8g" -> null
        }

        # (2 unchanged blocks hidden)
    }

  # module.example.random_string.rev_name_postfix_canary must be replaced
-/+ resource "random_string" "rev_name_postfix_canary" {
      ~ id      = "zh" -> (known after apply)
      ~ keepers = { # forces replacement
          ~ "canary_enabled" = "true" -> "false"
            # (1 unchanged element hidden)
        }
      ~ result  = "zh" -> (known after apply)
    }
Image 7. Canary revision has been rolled out
Image 8. The revision has been rolled back
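Reading the rollback plan above, the only keeper that changes is canary_enabled, and the live revision name stays the same. So, as far as I can tell from that output, a rollback comes down to a single change in the variable values (shown here as a tfvars-style fragment, which is an assumption about how the variables are set):

```hcl
# Rollback sketch: only the switch changes.
# live_image_name, app_version_live and the canary_* variables stay as they are.
canary_enabled = false
```

The existing live revision keeps its name, so no new live revision is created; it simply takes back 100% of the traffic.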

Conclusion

I hope the post gives you some practical knowledge of how a canary deployment can be implemented for a GCP CloudRun service fully managed with terraform.

Happy canarying.

Notes

  1. Check out the official guide. There are also tags used, that can be easily added to the implementation if needed https://cloud.google.com/architecture/implementing-cloud-run-canary-deployments-git-branches-cloud-build
  2. See this post on CloudRun Release Manager. I did not use it for a number of reasons, but it’s worth checking out and may fit your particular case https://medium.com/google-cloud/automatic-release-propagation-for-canary-releases-with-cloud-run-1ccc2ec74c7f
  3. There is also this v2 resource that does not seem to address the issue https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/cloud_run_v2_service
  4. It wouldn’t hurt to have a separate resource for revisions. Thumbs up https://github.com/hashicorp/terraform-provider-google/issues/10095
