Beauty of routing in GCP — how to achieve VPC transitivity
The absence of VPC transitivity in the realm of the public cloud is often a significant concern. However, with the right utilization of VPC Peering and effective network design, this becomes a non-issue. But what if you encounter a scenario where the network topology is already established and unchangeable, and certain services reachable only via VPC Peering are inaccessible to you? In GCP, numerous managed services rely on VPC Peering and PSC doesn’t have broad support (yes, I’m talking to you, AlloyDB), so it’s not entirely impossible to face this situation.
What are VPC, VPC Peering, and VPC transitivity?
VPC stands for Virtual Private Cloud, and it is a cloud concept that allows users to create and manage their isolated virtual networks within a public cloud environment. It enables organizations to have control over their networking resources, such as IP addresses, subnets, route tables, and network gateways, while keeping their cloud infrastructure isolated and secure from other users in the same cloud provider’s environment.
VPC Peering is a networking feature provided by cloud providers. It allows you to establish a direct private connection between two separate VPCs within the same cloud provider’s infrastructure. This connection enables the VPCs to communicate with each other as if they were part of the same network, even though they might belong to different accounts. VPC Peering is typically used to share resources or facilitate communication between VPCs owned by the same organization or within a multi-tier application architecture.
VPC transitivity refers to the ability to route traffic between two VPCs through a common, intermediary VPC. In other words, if VPC A is peered with VPC B and VPC B is peered with VPC C, transitive peering would allow VPC A to communicate with VPC C via VPC B, even though A and C don’t have a direct peering connection. However, VPC transitivity is not natively supported by all cloud providers. In AWS and GCP, for example, VPC peering is not transitive by default. Each VPC must have its own separate peering connection to communicate with other VPCs. Therefore, direct peering connections between all required VPCs are necessary to achieve full communication in non-transitive VPC environments.
A familiar example?
Let’s consider the following example:
As you can see, we have a direct VPC Peering between VPC Left and VPC Right, then another Peering between VPC Right and the Google Managed Tenant that hosts several managed services such as Cloud SQL and the control plane for GKE.
Due to the lack of VPC transitivity, communication between VPC Left and the services hosted on the Google Managed Tenant cannot happen. As mentioned earlier, there isn’t any magic GCP-native solution to such an issue, so having a correct Network Design becomes a must.
Perhaps the native routing capability of GCP can help?
Replacing the VPC Peering between VPC Left and VPC Right with a routing element can alleviate the situation.
In this scenario, when any service connected to VPC Left needs to access, for instance, Cloud SQL hosted on the Google Managed Tenant (which is in VPC Peering with VPC Right), the traffic will go through the router.
From a GCP routing standpoint, only a few simple things are needed:
- On VPC Left, two custom routes are required to reach the Google Managed Tenant and VPC Right, using the routing instance as the next hop;
- Similarly, on VPC Right, a custom route to reach VPC Left, using the routing instance as the next hop;
- Thanks to the Exchange custom routes feature of VPC Peering, we need to ensure the VPC Right peering instance imports and exports any defined custom routes;
- Lastly, for PSA (see next section), Exchange custom routes is needed on the servicenetworking-googleapis-com peering side (a gcloud sketch of these steps follows):
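To make this concrete, here is a minimal gcloud sketch of those steps. It assumes the routing VM (created in the next section) already exists; the network names, zone, and the 10.x ranges are placeholders I made up for illustration, while the 192.168.245.0/24 PSA range and the servicenetworking-googleapis-com peering name come from this example.

# Custom routes on VPC Left towards VPC Right and the Google Managed Tenant,
# using the routing VM as next hop (names and ranges are placeholders).
gcloud compute routes create to-vpc-right \
  --network=vpc-left \
  --destination-range=10.130.100.0/24 \
  --next-hop-instance=router-vm \
  --next-hop-instance-zone=europe-west1-b

gcloud compute routes create to-managed-tenant \
  --network=vpc-left \
  --destination-range=192.168.245.0/24 \
  --next-hop-instance=router-vm \
  --next-hop-instance-zone=europe-west1-b

# Return route on VPC Right towards VPC Left.
gcloud compute routes create to-vpc-left \
  --network=vpc-right \
  --destination-range=10.100.0.0/24 \
  --next-hop-instance=router-vm \
  --next-hop-instance-zone=europe-west1-b

# Import/export custom routes on VPC Right's PSA peering.
gcloud compute networks peerings update servicenetworking-googleapis-com \
  --network=vpc-right \
  --import-custom-routes \
  --export-custom-routes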
On the routing element, we can make this as complicated as we want. I chose the KISS approach, going down the Linux VM path with the following options defined (a gcloud sketch follows the startup script below):
- IP Forwarding allowed;
- NIC0 connected to VPC Left;
- NIC1 connected to VPC Right;
- gVNIC enabled, hoping for as low a latency as possible;
- A startup-script with the following content:
#!/bin/bash
# Enable IPv4 forwarding so the VM can route packets between NIC0 and NIC1.
sysctl -w net.ipv4.ip_forward=1
# Route the Google Managed Tenant (PSA) range out of NIC1, via VPC Right's subnet gateway.
ip route add 192.168.245.0/24 via 10.130.100.1
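For reference, a hedged sketch of how such a VM could be created with gcloud; the machine type, image, zone, and subnet names are illustrative assumptions, not the exact setup used here.

# Dual-NIC routing VM with IP forwarding and gVNIC (all names are placeholders).
gcloud compute instances create router-vm \
  --zone=europe-west1-b \
  --machine-type=n2-standard-4 \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --can-ip-forward \
  --network-interface=subnet=left-subnet,nic-type=GVNIC,no-address \
  --network-interface=subnet=right-subnet,nic-type=GVNIC,no-address \
  --metadata-from-file=startup-script=startup.sh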
The simplicity of this approach is, honestly, marvelous. No NAT-ing, no proxying, no complex routing layers, no stratified configurations. It’s just some routing on GCP, IP Forward enabled on the VM, and plain-and-simple routing in Linux. It’s lean yet powerful.
Using a dedicated router, fully integrated with the GCP API, may yield the ability to auto-discover such custom routes. This is also doable through a shell script invoking the gcloud CLI.
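As a rough sketch of that idea, something along these lines could run on the routing VM to look up the PSA ranges reserved on VPC Right and install matching Linux routes. The network name and the gateway address are assumptions carried over from the example above, and the VM would need gcloud plus suitable IAM permissions.

#!/bin/bash
# Illustrative only: discover the ranges reserved for Private Services Access
# on VPC Right and route each of them via NIC1's subnet gateway.
GATEWAY=10.130.100.1
for name in $(gcloud services vpc-peerings list \
    --network=vpc-right \
    --format="value(reservedPeeringRanges)" | tr ';' ' '); do
  cidr=$(gcloud compute addresses describe "$name" --global \
    --format="value(address,prefixLength)" | tr '\t' '/')
  ip route replace "$cidr" via "$GATEWAY"
done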
Re-usability
Thanks to Private Services Access (the Service Networking API), the same addressing is re-used by GCP across many managed services (Memorystore, Cloud SQL, Filestore, Vertex AI, etc.). This considerably reduces the amount of configuration required. GKE, unfortunately, doesn’t follow the same PSA approach. You could come up with some clever routing schema that defines which ranges are used by this scope, but given you’re here in the first place, perhaps it’s a bit late for a clever routing schema.
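A quick way to see which ranges are involved is to list the global addresses reserved for VPC peering; this is a generic check, not something specific to this setup.

# List the address ranges allocated for Private Services Access in the project.
gcloud compute addresses list --global \
  --filter="purpose=VPC_PEERING" \
  --format="table(name,address,prefixLength,network)"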
Scalability
I’m sure the sharpest readers will wonder how well a single GCE instance can handle all the traffic. Google Cloud limits outbound (egress) bandwidth using per-VM maximum egress rates, based on the machine type of the VM sending the packet and on whether the packet’s destination is reachable via routes within a VPC network or via routes outside of a VPC network. These upper boundaries are pre-defined and well-documented. Generally speaking, for anything that isn’t an E2 and has a low vCPU count, you’re limited to 10Gbps. This can be further increased through TIER_1 networking up to 100Gbps. The newer C3 instances, thanks to the IPU architecture, start at 23Gbps rather than 10Gbps and can go up to 200Gbps (see my other write-up about it).
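If a single instance needs more headroom, per-VM Tier_1 networking can be requested at creation time. The machine type below is only an illustration, since Tier_1 requires gVNIC and a large enough machine of a supported series; all names are placeholders.

# Larger routing VM with per-VM Tier_1 networking enabled.
gcloud compute instances create big-router-vm \
  --zone=europe-west1-b \
  --machine-type=n2-standard-32 \
  --can-ip-forward \
  --network-interface=subnet=left-subnet,nic-type=GVNIC,no-address \
  --network-interface=subnet=right-subnet,nic-type=GVNIC,no-address \
  --network-performance-configs=total-egress-bandwidth-tier=TIER_1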
From a pure latency standpoint, unless you’re running a DPDK router like VPP, you’re going to see an increase due to the interrupt-driven nature of Linux and of all other non-carrier-grade routers. Rather than going for a Cattle VM, a way to reduce latency and scale throughput is to have at least one instance per zone, and perhaps more than one where needed. This is enabled by iLB; see the next section.
Addressing the elephant in the room: SPoF
While this single routing instance is good for a PoC or a Medium post, how about something a bit more reliable?
How about Internal Network Load Balancers and Managed Instance Groups? With this approach, we gain true availability thanks to two persistent iLBs (one for VPC Left and another for VPC Right). This also solves performance scalability, together with the in-zone low-latency topic covered above. Just make sure to select the appropriate Session Affinity, like Client IP:
The GCE availability is taken care of by a MIG:
The only slightly janky aspect of this setup is the warning shown by one of the two iLBs, complaining that it forwards traffic only to instances whose NICs are in its network.
It’s a false warning because the routing instance is connected to both VPC networks; the internal iLB check is probably performed considering only the instance’s NIC0. Yet this is fully functional.
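For completeness, here is a hedged gcloud sketch of the VPC Left side of this design (the VPC Right side mirrors it). Every name, region, and range is a placeholder, and the instance template for the routing VMs is assumed to already exist.

# Regional MIG of routing VMs (instance template assumed to exist).
gcloud compute instance-groups managed create router-mig \
  --region=europe-west1 \
  --template=router-template \
  --size=2

# Health check, backend service, and internal forwarding rule on the VPC Left side.
gcloud compute health-checks create tcp router-hc \
  --region=europe-west1 \
  --port=22

gcloud compute backend-services create router-be-left \
  --load-balancing-scheme=INTERNAL \
  --protocol=TCP \
  --region=europe-west1 \
  --health-checks=router-hc \
  --health-checks-region=europe-west1 \
  --session-affinity=CLIENT_IP

gcloud compute backend-services add-backend router-be-left \
  --region=europe-west1 \
  --instance-group=router-mig \
  --instance-group-region=europe-west1

gcloud compute forwarding-rules create router-fr-left \
  --load-balancing-scheme=INTERNAL \
  --network=vpc-left \
  --subnet=left-subnet \
  --region=europe-west1 \
  --ip-protocol=TCP \
  --ports=ALL \
  --backend-service=router-be-left \
  --backend-service-region=europe-west1

# Point the custom route at the iLB instead of a single instance.
gcloud compute routes create to-managed-tenant-ilb \
  --network=vpc-left \
  --destination-range=192.168.245.0/24 \
  --next-hop-ilb=router-fr-left \
  --next-hop-ilb-region=europe-west1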
For comprehensive documentation, see the official GCP guide.
Considerations
To summarize, the lack of VPC transitivity in the public cloud can present difficulties in facilitating communication across multiple VPCs. Nevertheless, by employing a routing instance, as demonstrated in this case, it becomes possible to overcome this limitation. Although this approach may not be flawless, it offers a practical solution without disrupting the entire network infrastructure. It is worth noting that even the latest Network Connectivity Center VPC Spokes configuration does not facilitate the exchange of static and dynamic routes. I hope this discussion assists you in resolving networking challenges and highlights the remarkable aspects of network routing management in GCP, including its limitations and flexibility.