How we revamped our GCP environment @Strise

Adrian Trzeciak 🇳🇴
Published in Strise
6 min read · Aug 1, 2022

We’ve been using Google Cloud Platform since day one — even when Strise was just a research project. Most of our GCP knowledge is based on documentation written by Google and learning by doing while delivering on our milestones.

From day zero, our GCP environment consisted of a single project, famously called ntnu-smartmedia. That project housed all of our IAM roles, firewall rules, VMs, clusters, databases, FTP servers and so on. As we scaled the number of engineers and interns, keeping control of all those IAM roles without any centralized tool for access control became a major pain point. It didn't help that nothing was set up using infrastructure as code, which led to a lot of point-and-click and to inconsistent attributes and settings across every resource that wasn't created programmatically. The larger we grew, the clearer the goal became:

Separate our Google Cloud environment into multiple projects with clear separation of privileges, using infrastructure as code.

When we kicked off the project, my estimate was about 2–3 months. However, a couple of major revelations surfaced while coding the infrastructure and making the necessary adjustments to applications and pipelines:

  • Our ntnu-smartmedia project ID had been hardcoded literally everywhere. Every service, every pipeline and every bash script had at least one hardcoded reference to that name.
  • All of our infrastructure was deployed on a legacy VPC network with no ability to set up VPC peering. This made moving clusters one by one impossible, since we rely heavily on internal networking and couldn't resort to external load balancers and ingresses.
  • We decided to upgrade from Helm 2 to Helm 3 in the process.
  • We decided to move our secrets to GCP Secret Manager behind a CMEK (customer-managed encryption key); a sketch of that setup follows below this list.
  • We store a lot of data in Cloud Storage. We didn't have any environment-specific buckets; we only created folders for each environment inside common buckets made for specific purposes.
  • Many of our services didn't support Workload Identity (or the identity had a hardcoded suffix referencing one particular ntnu-smartmedia service account inside the Helm chart). That service account had access to all buckets and environments, leaving us with nonexistent separation of duties 🚨
  • We decided to implement a VPN solution that supports Google SSO so we wouldn’t have to manage our VPN keys and accesses.
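To illustrate the Secret Manager point above, here is a minimal Terraform sketch of what a CMEK-backed secret can look like. All project IDs, names and locations are placeholders rather than our real values, and the KMS grant for the Secret Manager service agent is simplified.

```
# Placeholder projects, names and locations throughout -- illustrative only.
data "google_project" "secrets" {
  project_id = "strise-secrets" # hypothetical secrets project
}

resource "google_kms_key_ring" "secrets" {
  name     = "secrets-keyring"
  location = "europe-west1"
  project  = "strise-kms" # hypothetical KMS project
}

resource "google_kms_crypto_key" "secrets" {
  name     = "secret-manager-key"
  key_ring = google_kms_key_ring.secrets.id
}

# The Secret Manager service agent must be allowed to use the key.
resource "google_kms_crypto_key_iam_member" "secret_manager" {
  crypto_key_id = google_kms_crypto_key.secrets.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:service-${data.google_project.secrets.number}@gcp-sa-secretmanager.iam.gserviceaccount.com"
}

# The secret itself, encrypted with the customer-managed key (CMEK).
resource "google_secret_manager_secret" "api_key" {
  secret_id = "api-key" # hypothetical secret
  project   = data.google_project.secrets.project_id

  replication {
    user_managed {
      replicas {
        location = "europe-west1"
        customer_managed_encryption {
          kms_key_name = google_kms_crypto_key.secrets.id
        }
      }
    }
  }

  depends_on = [google_kms_crypto_key_iam_member.secret_manager]
}
```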

Organisation

We decided to split the setup into four projects: one for supporting applications, which we only run a single instance of, and one for each environment. In addition, we have separate projects hosting our KMS, our secrets and our Terraform factory.
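Conceptually, and with every name and ID below being a placeholder rather than our actual values, the project layout boils down to something like this in Terraform:

```
# Hypothetical environment names and IDs -- the real layout uses our own naming.
locals {
  projects = [
    "dev", "staging", "prod", # one project per environment (placeholder names)
    "supporting-apps",        # supporting applications we only run one of
    "kms", "secrets", "terraform-factory",
  ]
}

resource "google_project" "this" {
  for_each        = toset(local.projects)
  name            = "strise-${each.key}"
  project_id      = "strise-${each.key}"     # project IDs are globally unique -- placeholders
  folder_id       = "123456789012"           # placeholder folder
  billing_account = "AAAAAA-BBBBBB-CCCCCC"   # placeholder billing account
}
```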

Networking

Since we ended up splitting environments into different projects, Shared VPC was a no-brainer in the new setup. We were now entering something that was new to us: the concept of subnets 🤩 Each project gets its own subnet with its own IP range, firewall rules preventing the non-production environments from reaching production, and Private Google Access (spoiler alert!) enabled.
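Roughly, and with placeholder names, regions and ranges, the networking side looks something like this in Terraform:

```
# The host project owns the shared VPC; each environment project is attached to it.
resource "google_compute_shared_vpc_host_project" "host" {
  project = "strise-network" # placeholder host project
}

resource "google_compute_shared_vpc_service_project" "prod" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "strise-prod" # placeholder environment project
}

# One subnet per environment, with its own range and Private Google Access enabled.
resource "google_compute_subnetwork" "prod" {
  name                     = "prod-subnet"
  project                  = "strise-network"
  region                   = "europe-west1"
  network                  = "shared-vpc"    # placeholder network name
  ip_cidr_range            = "10.10.0.0/20"  # placeholder range
  private_ip_google_access = true
}

# Keep non-production ranges away from production workloads (tagged "prod" here).
resource "google_compute_firewall" "deny_nonprod_to_prod" {
  name      = "deny-nonprod-to-prod"
  project   = "strise-network"
  network   = "shared-vpc"
  direction = "INGRESS"
  priority  = 900

  deny {
    protocol = "all"
  }

  source_ranges = ["10.20.0.0/20", "10.30.0.0/20"] # placeholder non-prod ranges
  target_tags   = ["prod"]
}
```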

If the architecture gets more advanced in the future, we might consider having separate networks for different pieces of our solution, with network peering in between.

GKE

Google Kubernetes Engine is at the core of our business: we host everything from critical components of the Strise processing pipeline, the Elastic stack and our frontend application, to supporting services like nginx proxies and cron jobs.

So when creating the new setup, we decided to opt for private GKE clusters. What does that mean? Instead of having one external IP for each VM attached to our clusters, we limit our GKE exposure from approximately 200 public IP addresses down to 4 (one for each external ingress controller).

A private cluster is a type of VPC-native cluster that only depends on internal IP addresses. Nodes, Pods, and Services in a private cluster require unique subnet IP address ranges.

However, one of the greatest improvements of opting for private GKE clusters is that we can easily control which CIDR ranges have access to the control planes of our Kubernetes clusters. And yes, there is no way to reach them from outside our network.
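For illustration, here are the parts of a Terraform cluster definition that matter for this discussion. The cluster name, project, network and CIDR ranges are placeholders, and a real definition obviously needs node pools and plenty of other settings.

```
resource "google_container_cluster" "prod" {
  name       = "prod-cluster"  # placeholder
  project    = "strise-prod"   # placeholder
  location   = "europe-west1"
  network    = "shared-vpc"    # placeholder shared VPC
  subnetwork = "prod-subnet"   # placeholder subnet

  # Private clusters must be VPC-native (alias IPs); GKE can pick the ranges for us.
  ip_allocation_policy {}

  private_cluster_config {
    enable_private_nodes    = true             # nodes only get internal IPs
    enable_private_endpoint = true             # control plane reachable only internally
    master_ipv4_cidr_block  = "172.16.0.0/28"  # placeholder range
  }

  # Only these CIDR ranges can talk to the control plane.
  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "10.10.0.0/20" # placeholder: our internal/VPN range
      display_name = "internal"
    }
  }

  remove_default_node_pool = true
  initial_node_count       = 1
}
```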

Cloud Storage

For Cloud Storage, setting up uniform bucket-level access was a no-brainer. The reason is simple: it's far easier to control access at the bucket level than to manage access to individual files inside a complex folder structure. I even found some shady HTML files from our CTO's LinkedIn profile sitting publicly in one of our buckets, giving the world access to parts of the LinkedIn messaging history from his account.

“it is possible to see your chats from your LinkedIn profile”

In addition to that, we ended up creating one bucket for each workload that needs one, inside the respective environments, achieving separation of duties between environments and service accounts.

¡No more single service account manipulating data in all envs!
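As a hedged sketch (bucket, project and service account names below are made up), a single workload's bucket ends up looking roughly like this in Terraform:

```
# One bucket per workload per environment, with uniform bucket-level access -- no
# per-object ACLs, so nothing can quietly end up public on its own.
resource "google_storage_bucket" "exports" {
  name                        = "strise-prod-exports" # hypothetical bucket
  project                     = "strise-prod"         # placeholder project
  location                    = "EUROPE-WEST1"
  uniform_bucket_level_access = true

  labels = {
    environment = "prod"
    workload    = "exports"
  }
}

# Access is granted at the bucket level, to that workload's own service account only.
resource "google_storage_bucket_iam_member" "exports" {
  bucket = google_storage_bucket.exports.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:exports@strise-prod.iam.gserviceaccount.com" # hypothetical SA
}
```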

The grand plan

Since the legacy VPC network could not be peered with our new Shared VPC network, we identified two possible alternatives:

Use the conversion tool developed by Google and peer the converted network. This would allow us to move workloads one by one and verify them continuously.

The major red flag of this approach was that the tool hadn't been released to GA (general availability), meaning it could potentially lead to major downtime for our application.

Recreate the environment using another domain and perform the migration by adjusting the DNS entries.

The major disadvantage of this approach is that it costs more:

  • All pipelines have to be adjusted in order to deploy to both of our environments.
  • Storage Transfer Service would have to be used to duplicate data between buckets and environments (a rough sketch of such a job follows below this list).
  • Pub/Sub messages have to be republished to new topics within the new projects.
  • We’d have to stop the legacy environment for some time before bringing it back up on the other side, restoring our databases to the state right before the stop, since we want to avoid setting up Neo4j and Elasticsearch replication.
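As an example of the Storage Transfer Service point, a one-off copy job can be declared roughly like this in Terraform. The legacy and new bucket names and the project are placeholders, and the Storage Transfer service agent additionally needs read access on the source bucket and write access on the sink.

```
resource "google_storage_transfer_job" "copy_exports" {
  description = "One-off copy of a legacy bucket into the new environment"
  project     = "strise-prod" # placeholder destination project

  transfer_spec {
    gcs_data_source {
      bucket_name = "ntnu-smartmedia-exports" # hypothetical legacy bucket
    }
    gcs_data_sink {
      bucket_name = "strise-prod-exports" # hypothetical new bucket
    }
  }

  schedule {
    # Identical start and end dates make the job run exactly once.
    schedule_start_date {
      year  = 2022
      month = 7
      day   = 1
    }
    schedule_end_date {
      year  = 2022
      month = 7
      day   = 1
    }
  }
}
```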

Even though I consider myself a very risk-tolerant person, a failed network conversion without a backup solution would pretty much leave us with a product that was unavailable to our customers for an extended period of time. That could easily cost more than the time invested in duplicating the environment. For me (as a pretty recent addition to the Strise tech stack), it was also a great way to get to know the setup.

Prior to the migration, we had set up a Notion document containing all the tasks that needed to be performed, everything from backing up the old DNS entries to having a gin and tonic post-migration. If you’re interested in the details, feel free to reach out!

The results

We can spin up the whole environment (approximately 20 Cloud Storage buckets, one Kubernetes cluster with the necessary node pools, 20 service accounts, over 100 IAM bindings, and more) in about 10–15 minutes, while ensuring the setup has the correct settings, labels and networking, and the principle of least privilege applied to it.

When a developer needs to make a change to our infrastructure, e.g. add a new service that stores data in Cloud Storage, we have full control of the flow via GitHub. That person only needs to add two lines of code defining the namespace and the name of the bucket, and we ensure that a GCP IAM service account, a Workload Identity IAM binding and a Cloud Storage bucket deployed in the correct region with the necessary permissions all get provisioned.
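The exact module wiring is internal, but conceptually it is close to this sketch: one map entry per workload drives a service account, a Workload Identity binding and a bucket. The project, names, and the assumption that the Kubernetes service account shares the workload's name are all placeholders.

```
# What a developer effectively adds: a namespace and a bucket name per workload.
locals {
  workload_buckets = {
    exports = { namespace = "exports", bucket = "strise-prod-exports" } # hypothetical entry
  }
}

# One GCP service account per workload.
resource "google_service_account" "workload" {
  for_each     = local.workload_buckets
  project      = "strise-prod" # placeholder
  account_id   = each.key
  display_name = "Service account for ${each.key}"
}

# Workload Identity: the Kubernetes SA in that namespace may impersonate the GCP SA.
# We assume here that the Kubernetes SA is named after the workload.
resource "google_service_account_iam_member" "workload_identity" {
  for_each           = local.workload_buckets
  service_account_id = google_service_account.workload[each.key].name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:strise-prod.svc.id.goog[${each.value.namespace}/${each.key}]"
}

# The bucket itself, in the right region, with uniform access enforced.
resource "google_storage_bucket" "workload" {
  for_each                    = local.workload_buckets
  project                     = "strise-prod"
  name                        = each.value.bucket
  location                    = "EUROPE-WEST1"
  uniform_bucket_level_access = true
}

# Only this workload's service account can touch its bucket.
resource "google_storage_bucket_iam_member" "workload" {
  for_each = local.workload_buckets
  bucket   = google_storage_bucket.workload[each.key].name
  role     = "roles/storage.objectAdmin"
  member   = "serviceAccount:${google_service_account.workload[each.key].email}"
}
```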

Road ahead

  • Since we also upgraded our Linkerd version during the migration, we will most probably consider enabling multi-cluster functionality 🔗
  • We are considering Atlantis for Terraform automation 🏝
  • We will be revisiting our Jenkins pipelines and possibly testing out new CD tooling — ArgoCD? Keptn? 👻
  • We are working on becoming SOC2 compliant 🔒
  • We are considering tools for monitoring and alerting as code 🚨
