Cross-project, cross-VPC communication with GKE Multi-Cluster Services

Deniz Zoeteman
The Zeals Tech Blog
5 min read · Dec 22, 2022

For a migration at Zeals, we needed a way to reach GKE services across different VPCs in different projects. While Google’s Multi-cluster Services (MCS) seemed like the right solution for us, the official documentation only covers a single-project or shared-VPC architecture. As such, we had to figure out some of the steps ourselves, and others with help from Google Support. Hopefully this article brings some clarity on how to set this up and what the limitations of this approach are.

· Architecture overview
· Deploying MCS
· Testing time!
· Troubleshooting
· Limitations
· Conclusion

Architecture overview

For this example, we have two projects, project1 and project2, with their respective VPCs and GKE clusters. What we’d like to achieve is for the pods of service2 in project2 to communicate with service1 in project1. Note that this setup does not give us the full capabilities of MCS (see Limitations), but it is enough for what we need: namely service discovery.

For MCS, you need to set up a GKE Hub fleet. We’ll have to decide on a host project for this fleet; in this case, that’ll be project1.

MCS uses Traffic Director as its state store; this is where the backend pods are listed and how MCS knows whether they’re healthy.

Once we register the clusters with GKE Hub, GKE will automatically set up the MCS Importer in these clusters. This deployment is what actually keeps the service IPs and other imported state in sync.

Deploying MCS

Now, I will go over the steps to deploy MCS in this architecture. These combine the relevant steps from the shared-VPC documentation (as some are the same) with the steps we had to figure out ourselves.

Networking
To communicate between clusters across VPCs, we will need some kind of connection between these VPCs. The best solution in this case is VPC peering.

First set up the peering connection in project1:

gcloud compute networks peerings create mcs-peering \
--project project1 \
--network vpc1 \
--peer-project project2 \
--peer-network vpc2

And then set up the connection on the other side as well to activate it:

gcloud compute networks peerings create mcs-peering \
--project project2 \
--network vpc2 \
--peer-project project1 \
--peer-network vpc1

You might also need to change firewall rules to allow traffic between the VPCs, but I’ll leave that out of this article as it will be dependent on your setup.
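
As a rough sketch (the rule name is made up here, and you’ll need to fill in cluster2’s node and pod CIDR ranges yourself), an ingress rule on vpc1 could look like the following; you’d then mirror it on vpc2 for traffic in the other direction:

gcloud compute firewall-rules create allow-from-vpc2 \
--project project1 \
--network vpc1 \
--direction INGRESS \
--allow tcp,udp,icmp \
--source-ranges "<cluster2 node and pod CIDRs>"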

Enable APIs
We’ll need to enable a couple of APIs that are required for MCS to function. As project1 is the host of our GKE Hub fleet, the gkehub API only needs to be enabled there.

gcloud services enable gkehub.googleapis.com \
dns.googleapis.com \
trafficdirector.googleapis.com \
cloudresourcemanager.googleapis.com \
multiclusterservicediscovery.googleapis.com \
--project project1

gcloud services enable dns.googleapis.com \
trafficdirector.googleapis.com \
cloudresourcemanager.googleapis.com \
multiclusterservicediscovery.googleapis.com \
--project project2

Enable MCS
Let’s now enable MCS on the fleet host, project1:

gcloud container fleet multi-cluster-services enable --project project1
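
To confirm the feature came up, you can describe it on the fleet host; the output should show the feature as enabled:

gcloud container fleet multi-cluster-services describe --project project1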

Set up IAM policy bindings
For MCS to function, we need a couple of IAM policy bindings at the project level.

Since the GKE Hub fleet host is project1, we don’t need to configure bindings for GKE Hub and MCS there. However, for GKE Hub/MCS to access the cluster in project2, we’ll need to add some bindings:

gcloud projects add-iam-policy-binding project2 \
--member "serviceAccount:service-<project1 number>@gcp-sa-gkehub.iam.gserviceaccount.com" \
--role roles/gkehub.serviceAgent
gcloud projects add-iam-policy-binding project2 \
--member "serviceAccount:service-<project1 number>@gcp-sa-mcsd.iam.gserviceaccount.com" \
--role roles/multiclusterservicediscovery.serviceAgent

Next, give the MCS Importer deployment in both clusters access to view the network info across both projects (including, importantly, Traffic Director):

gcloud projects add-iam-policy-binding project1 \
--member "serviceAccount:project1.svc.id.goog[gke-mcs/gke-mcs-importer]" \
--role roles/compute.networkViewer
gcloud projects add-iam-policy-binding project1 \
--member "serviceAccount:project2.svc.id.goog[gke-mcs/gke-mcs-importer]" \
--role roles/compute.networkViewer

gcloud projects add-iam-policy-binding project2 \
--member "serviceAccount:project2.svc.id.goog[gke-mcs/gke-mcs-importer]" \
--role roles/compute.networkViewer
gcloud projects add-iam-policy-binding project2 \
--member "serviceAccount:project1.svc.id.goog[gke-mcs/gke-mcs-importer]" \
--role roles/compute.networkViewer

This is one of the steps that is not documented: since this is not a shared-VPC architecture, the MCS Importer in each cluster needs access to both projects.

Register the clusters with GKE Hub
Next, we register the clusters with GKE Hub. This will provision the MCS Importer, among other things. Start with the fleet host, and then continue with cluster2.

gcloud container fleet memberships register cluster1 \
--project project1 \
--gke-uri "https://container.googleapis.com/v1/projects/project1/locations/<project1 location>/clusters/cluster1" \
--enable-workload-identity
gcloud container fleet memberships register cluster2 \
--project project1 \
--gke-uri "https://container.googleapis.com/v1/projects/project2/locations/<project2 location>/clusters/cluster2" \
--enable-workload-identity
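
To double-check that both clusters ended up in the fleet, you can list the memberships on the fleet host:

gcloud container fleet memberships list --project project1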

Testing time!

That’s all for setting up MCS. Now we can actually try it out! We’ll need to create the following ServiceExport object in both clusters:

kind: ServiceExport
apiVersion: net.gke.io/v1
metadata:
  name: service1
  namespace: service1

In case the service1 namespace doesn’t exist yet in cluster2, make sure to create it first (no need to have the deployment there, though!).
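
For example (assuming your kubectl contexts are named cluster1 and cluster2, and the manifest above is saved as service1-export.yaml; both names are just placeholders for your own):

kubectl --context cluster2 create namespace service1
kubectl --context cluster1 apply -f service1-export.yaml
kubectl --context cluster2 apply -f service1-export.yaml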

Now, give it a couple of minutes, and you should see the service import appear on cluster2:

❯ kubectl get serviceimport --namespace service1 --context cluster2
NAME       TYPE           IP           AGE
service1   ClusterSetIP   ["<IP>"]     1m

At this point, you’re set! You can now reach service1 from service2 by using the domain service1.service1.svc.clusterset.local (the format is <service name>.<namespace>.svc.clusterset.local).
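
As a quick check, you could exec into one of service2’s pods and call that domain. This assumes service2 lives in a namespace also called service2, that its image ships curl, and that service1 listens on port 80; adjust to your setup:

kubectl --context cluster2 --namespace service2 exec -it deploy/service2 -- \
curl http://service1.service1.svc.clusterset.local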

Troubleshooting

If you’re not seeing the service imports show up, your best bet is to check the logs of the MCS Importer. This is an example we got when we hadn’t yet set up the IAM policy bindings correctly:

controller.go:187] Handler error: receiving ADS response over stream: permission denied: rpc error: code = PermissionDenied desc = Permission 'trafficdirector.networks.getConfigs' denied on resource '//trafficdirector.googleapis.com/projects/project1/networks/vpc1/nodes/<pool name>' (or it may not exist).
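
To pull the importer logs yourself, the deployment runs as gke-mcs-importer in the gke-mcs namespace (the same names used in the Workload Identity members above):

kubectl --context cluster2 --namespace gke-mcs logs deployment/gke-mcs-importer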

You can also check Traffic Director in both projects, to see if the backends show up there (in this case with Google’s MCS example):

Here, you can see 1 of 1 healthy in one of the zones, meaning we’ve got the backend pod showing up in Traffic Director; from here, the MCS Importer should take care of the rest.
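
If you prefer the CLI over the console, the Traffic Director configuration that MCS creates is made up of regular Compute Engine resources, so listing the global backend services in each project is a rough way to peek at it (the exact resource names are generated by MCS, so treat this only as a starting point):

gcloud compute backend-services list --global --project project1
gcloud compute backend-services list --global --project project2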

Limitations

There are a couple of limitations with MCS, especially for this setup, that are either spread across the official documentation or not mentioned at all:

MCS Importer availability
The MCS Importer runs, at least by default, with only a single replica. As the importer is what keeps backend IPs synced between clusters, there are scenarios where services could become unavailable if the MCS Importer pod has an issue. We’d like to see the option of running the importer with multiple replicas in the future, perhaps with leader election. This doesn’t seem to be possible yet, though documentation for the importer is very sparse (as it’s closed-source).
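
A quick way to see how the importer is currently deployed in your own clusters:

kubectl --context cluster1 --namespace gke-mcs get deployment gke-mcs-importer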

Cross-project service backends
Although MCS supports spreading the backend pods of a single service across clusters, this is not possible across projects. However, you can still communicate with services cross-project, as we’ve done in this article.

Conclusion

We’ve been able to successfully set up MCS with this approach. Due to the above limitations (and more, as noted in the documentation), we don’t consider MCS a particularly long-term solution, and we will be looking at other products such as Cilium Cluster Mesh in the future. For the moment, however, it is a quick and easy solution for our migration use case.
