VPC design considerations for Google Cloud

Pavan kumar Bijjala
13 min read · Oct 6, 2022


Covering 201- to 301-level material, this is a quick recap of the fundamental network design considerations for Google Cloud's VPC.

This article reviews the following topics:

  • VPC networks.
  • Network Isolation & Constraints.
  • Network Connectivity (Peering & Sharing).
  • Best practices, covered briefly to keep this short.

PS: This assumes readers have a basic understanding of GCP's network elements such as VPCs, subnets, routes, FW rules, load balancers, and DNS.

VPCs

A Virtual Private Cloud (VPC) network is a virtual version of a physical network, implemented inside Google's production network using Andromeda. VPC networks are global resources, including their associated routes and firewall rules, and span regions; this gives applications private connectivity (without traversing the public internet) by default.

This is a unique selling point for GCP: it lets you host low-latency application workloads without replicating network resources per region. Note that resources like FW rules are eventually consistent across regions.

  • VPCs, even though global, are project-level resources in the GCP resource hierarchy, and network connectivity between GCP projects requires either VPC Peering or Shared VPC.
  • Traffic is controlled through FW rules, enforced at each VM endpoint (i.e., no single choke point) but defined at the VPC (not subnet) level. There is a hard limit on the number of FW rules.

This limit can be reached quickly in transit-hub deployments, where traffic transits a central hub network. Slicing or scaling out (multiple hosts) can help, as reviewed later.

Review the VPC resource limits as you design or size your network.

We will come back to network design after briefly reviewing subnetworks and routing.

Subnets

A subnet is a regional resource, which means it spans all zones in its region. In auto mode, one subnet is created in each region.

All subnets have a primary CIDR range, which is the range of internal IP addresses that define the subnet. A VM can have alias IP ranges from that primary range, or you can add a secondary range to the subnet and allocate alias IP ranges from it.

  • The primary IP address (for the VM's NIC) must be allocated from the subnet's primary CIDR range. Each subnet has exactly one primary IP range.
  • Subnet overlap checks are done across peered networks to ensure that primary and secondary ranges do not overlap (a quick sketch follows).
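To make the overlap constraint concrete, here is a minimal, standard-library Python sketch (with made-up example ranges) of the kind of check that has to pass before two networks can exchange subnet routes:

```python
import ipaddress

# Hypothetical primary + secondary ranges for two networks we intend to peer.
network_a_ranges = ["10.10.0.0/24", "10.20.0.0/20"]
network_b_ranges = ["10.10.0.128/25", "172.16.0.0/24"]

def find_overlaps(ranges_a, ranges_b):
    """Return every pair of CIDR ranges that overlap between the two networks."""
    overlaps = []
    for a in map(ipaddress.ip_network, ranges_a):
        for b in map(ipaddress.ip_network, ranges_b):
            if a.overlaps(b):
                overlaps.append((str(a), str(b)))
    return overlaps

print(find_overlaps(network_a_ranges, network_b_ranges))
# [('10.10.0.0/24', '10.10.0.128/25')] -> this peering would be rejected
```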

Note that for DNS lookups, GCE does not associate alias IP addresses on the primary interface (nic0) with the hostname, nor does it associate any IP addresses of secondary interfaces with the hostname. Those entries have to be added to DNS manually.

  • Cloud load balancers route traffic only to the primary interface (nic0).
An example multi-NIC VM, with legs in different subnets and example alias IPs allocated. A typical pattern for hosting network appliances.

A subnet's Private Google Access setting provides connectivity to Google services without the need for a public IP; it is off by default. VPC Flow Logs are also off by default.

Cloud NAT, Private Google Access, VPC Flow Logs, and alias IP ranges are all configured per subnet. It is therefore recommended to use additional subnets for fine-grained control of traffic (a subnet creation sketch follows).
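As a sketch of this per-subnet configuration, the snippet below uses the google-cloud-compute Python client to create a subnet with Private Google Access, VPC Flow Logs, and a secondary range enabled. Project, network, and range names are placeholders, and the exact client surface may vary by library version:

```python
from google.cloud import compute_v1

# Placeholder project/network/range values; adjust to your environment.
subnet = compute_v1.Subnetwork(
    name="app-subnet",
    ip_cidr_range="10.10.0.0/24",                              # primary range
    network="projects/my-project/global/networks/my-vpc",
    private_ip_google_access=True,                             # PGA on (off by default)
    log_config=compute_v1.SubnetworkLogConfig(enable=True),    # VPC Flow Logs on
    secondary_ip_ranges=[
        compute_v1.SubnetworkSecondaryRange(range_name="pods",
                                            ip_cidr_range="10.20.0.0/20"),
    ],
)

client = compute_v1.SubnetworksClient()
operation = client.insert(project="my-project", region="us-central1",
                          subnetwork_resource=subnet)
operation.result()  # block until the regional operation completes
```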

Routing

Routes define the paths that network traffic takes from a virtual machine (VM) instance to other destinations. A route consists of a single destination (CIDR) and a single next hop (the default route's next hop is the default internet gateway).

Available next-hop options in GCP.

Internally, each VM instance has a controller that is kept informed of all applicable routes from the network's routing table, including any updates made to it.

Every network gets a system-generated default route (0.0.0.0/0). To completely isolate your network from the internet, or to replace the default route with a custom route, you can delete it.

Routes still require firewall rules to allow traffic.
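For example, after deleting the system-generated default route you might recreate internet egress only for tagged instances. A minimal sketch with the google-cloud-compute client (project, network, and tag names are placeholders):

```python
from google.cloud import compute_v1

# Recreate a 0.0.0.0/0 route that applies only to instances tagged "internet-egress";
# untagged instances are left with no route to the internet.
route = compute_v1.Route(
    name="default-egress-tagged",
    network="projects/my-project/global/networks/my-vpc",
    dest_range="0.0.0.0/0",
    next_hop_gateway="projects/my-project/global/gateways/default-internet-gateway",
    priority=1000,
    tags=["internet-egress"],
)

operation = compute_v1.RoutesClient().insert(project="my-project", route_resource=route)
operation.result()  # block until the global operation completes
```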

Since subnet-to-subnet routes are provisioned automatically and cannot be overridden, you cannot insert a traffic-filtering appliance, such as a Palo Alto firewall, into the intra-VPC path.

There is also only one routing table per VPC network (by comparison, AWS has a main VPC routing table plus additional routing tables that can be defined at the subnet level). Each route (row) is defined at the VPC level.

As there is no need to create routes between regions or subnets, the number of subnets or size of subnets does not affect routing behavior.

Alias IP ranges are routable within the GCP virtual network without requiring additional routes. You do not have to add a route for every IP alias and you do not have to take route quotas into account.

You only require a Cloud Router when routes (static or dynamic) are to be exchanged with other networks, for example on-premises networks over Cloud VPN or Interconnect.

  • A Cloud Router can advertise routes from other regions too (see us-west2 in the example above).
  • It can also advertise routes on behalf of an external network (the yellow route in the example).

Network Isolation and Constraints

  • FW rules
  • FW policies
  • Private Google Access
  • Organizational Constraints
  • Service Controls

FW rules

FW rules are stateful, meaning that a rule defined for one direction always honors the return traffic. There are implicit rules and explicit rules.

Implicit rules can't be changed by the user; they are created when the VPC network is created, with a priority of 65535. The default implicit rules are:

  • Allow all Egress & Deny all Ingress

User-defined explicit rules have to take a lower number (65534 or less) so that they take priority. For example, the default FW rules defined in the ‘default’ network take a priority number one less than the implicit rules, as you can see in the table below.

Unless you choose to disable it, each new project starts with a default network.

Default network rules: all internal traffic within 10.128.0.0/9 (the aggregate auto-mode range) is allowed, plus SSH from any IP.

Most organizations won't be using the ‘default’ network, so the above FW rules won't be present in a user-created VPC anyway. Hence feel free to use priorities up to 65534.

It's suggested to block egress traffic for production workloads; otherwise it is allowed by default by the implicit rule.

There are some exceptions to the traffic that FW rules control; see the Appendix for details.

A FW rule's target defines where the rule is applied; a ‘network tag’ or a ‘service account’ of the respective VMs can be used as the target.

When designing or evaluating, keep in mind the following best practices:

  • Implement least-privilege principles. Block all traffic by default and only allow the specific traffic you need. This includes limiting the rule to just the protocols and ports you need.
  • Use hierarchical firewall policy rules to block traffic that should never be allowed at an organization or folder level.
  • For “allow” rules, restrict them to specific VMs by specifying the service account of the VMs (see the sketch after this list).
  • If you need to create rules based on IP addresses, try to minimize the number of rules. It’s easier to track one rule that allows traffic to a range of 16 VMs than it is to track 16 separate rules.
  • Turn on Firewall Rules Logging and use Firewall Insights to verify that firewall rules are being used in the intended way.
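To make the service-account targeting concrete, here is a minimal sketch of a narrowly scoped allow rule using the google-cloud-compute client. The project, network, and service account are placeholders; the 35.235.240.0/20 source range is Google's documented IAP TCP-forwarding range:

```python
from google.cloud import compute_v1

# Allow SSH only from Google's IAP TCP-forwarding range, and only to VMs running
# as a specific service account. Everything else stays denied by the implicit rule.
rule = compute_v1.Firewall(
    name="allow-ssh-from-iap",
    network="projects/my-project/global/networks/my-vpc",
    direction="INGRESS",
    priority=1000,
    allowed=[compute_v1.Allowed(I_p_protocol="tcp", ports=["22"])],
    source_ranges=["35.235.240.0/20"],
    target_service_accounts=["bastion@my-project.iam.gserviceaccount.com"],
)

operation = compute_v1.FirewallsClient().insert(project="my-project",
                                                firewall_resource=rule)
operation.result()
```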

Private Service Access (PSA)

Google and third parties (together known as service producers) can offer services with internal IP addresses that are hosted in their VPC network.

You first need to allocate a reserved internal IPv4 address range (see the detailed onboarding process) and then create a private connection (it shows up as a peering connection in the consumer's VPC console, so peering limits apply to private services access).

Using IPv6 address ranges with private services access is not supported.

Separately, each subnet has a flag called Private Google Access (PGA), off by default, that lets VM instances reach Google APIs and services over private routing. This is the recommended approach for reaching Google APIs, rather than PSA. It applies to instances that have internal IPs only; once an instance has an external IP, PGA is not used for it. GCP still allows you to assign an external IP even if the subnet has PGA enabled. Having said that:

  • Egress firewalls must permit traffic to the IP address ranges used by Google APIs or Services.
  • If you use the private.googleapis.com or the restricted.googleapis.com domain names, you'll need to create DNS records to direct traffic to the IP addresses associated with those domains.

Refer to Configure Private Google Access for on-premises hosts.
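Once the DNS records are in place, a quick standard-library sanity check can confirm that a Google API hostname resolves into the expected VIP range (199.36.153.8/30 for private.googleapis.com and 199.36.153.4/30 for restricted.googleapis.com are the documented ranges; the hostname below is just an example):

```python
import ipaddress
import socket

PRIVATE_VIPS = ipaddress.ip_network("199.36.153.8/30")      # private.googleapis.com
RESTRICTED_VIPS = ipaddress.ip_network("199.36.153.4/30")   # restricted.googleapis.com

def resolves_to(hostname: str, vip_range) -> bool:
    """True if every A record returned for hostname falls inside the given VIP range."""
    infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    addresses = {ipaddress.ip_address(info[4][0]) for info in infos}
    return bool(addresses) and all(addr in vip_range for addr in addresses)

# With a private zone mapping *.googleapis.com onto private.googleapis.com,
# this should print True when run from inside the VPC.
print(resolves_to("storage.googleapis.com", PRIVATE_VIPS))
```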

To summarize the private access options in each category:

You can use a Serverless VPC Access connector to let Cloud Run, App Engine standard, and Cloud Functions environments send packets to the internal IPv4 addresses of resources in a VPC network.

Network Policies

Firewall policies let you group several firewall rules so that you can update them all at once, across one or more projects. Rules are evaluated for each network interface (NIC) of a VM. Besides hierarchical firewall policies (attached to organizations and folders), there are network firewall policies of two types based on scope: global and regional.

Evaluation Order of Network Policies

Lower-level rules cannot override a rule from a higher place in the resource hierarchy. This lets you create exceptions for groups of VMs. Hierarchical firewall policy rules do not support targeting by instance tags.

A single policy can be associated with multiple nodes (organizations or folders), but only one firewall policy can be associated with a given node.

In Shared VPC scenarios, a VM interface connected to a host project network is governed by the hierarchical firewall policy rules of the host project, not the service project. With VPC Network Peering, by contrast, each network remains governed by its own place in the hierarchy.

Controlling east-west traffic

Resource Manager tags let you define sources and targets in network firewall policies and regional firewall policies. Resource Manager tags (referred to as tags) are different from network tags. Network tags are simple strings, not keys and values, and don’t offer any kind of access control.

Binding a tag to a resource attaches a tag value to a resource. Although a tag can have multiple values for a given key, you can bind only a single value per tag key to a resource. For example, you cannot bind both web-backend and mysql tag values to the same VM instance as they belong to the same tag-key vm-function.

Organizational Policies

An organization policy is a restriction or constraint that you can set on the use of a resource, as opposed to traffic, which FW rules and policies control. For example, you may want to restrict the use of public IPs to specific VMs only (or to none). The restriction is set on a resource hierarchy node, meaning you set it at the organization, folder, or project level.

Google provides a sample repository with a set of pre-defined constraint templates. Your organization can extend these or write custom policies for network services. It is recommended to use a policy validator in your CI/CD flow to validate your Infrastructure as Code or resource definitions (a constraint sketch follows).
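As an illustration, a commonly used constraint is constraints/compute.vmExternalIpAccess, which restricts which VMs may have external IPs. Below is a minimal sketch of such a policy expressed as a Python dict mirroring the list-policy structure (project, zone, and instance names are placeholders):

```python
import json

# Deny external IPs for every VM except one explicitly allow-listed instance.
vm_external_ip_policy = {
    "constraint": "constraints/compute.vmExternalIpAccess",
    "listPolicy": {
        "allowedValues": [
            "projects/my-project/zones/us-central1-a/instances/nat-bastion",
        ],
    },
}

# Render it for review or for feeding into your IaC / policy tooling.
print(json.dumps(vm_external_ip_policy, indent=2))
```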

Example Network Policies & constraints, for an Enterprise

Service Controls

VPC Service Controls brings isolation by creating perimeters that protect the resources and data of services that you explicitly specify.

  • Clients within a perimeter have private access to resources inside it, but do not have access to resources outside the perimeter.
  • Data exchange between clients and resources separated by perimeters is secured by using ingress and egress rules.

While IAM enables granular identity-based access control, VPC Service Controls enables broader context-based perimeter security, including controlling data egress across the perimeter. Context-aware access to resources is based on client attributes, such as identity type (service account or user), identity, device data, and network origin (IP address or VPC network).

Capabilities of Service Controls in GCP

Use VPC Service Controls in dry run mode to monitor requests to protected services without preventing access and to understand traffic requests to your projects. You can also create honeypot perimeters to identify unexpected or malicious attempts to probe accessible services.

Refer to the security benefits of VPC Service Controls for more background.

Network Sharing

Network Peering

Peering provides connectivity across VPCs within an organization (across projects) or across organizations, so that traffic stays within Google's network and never traverses the public internet.

VPC Peering offers the highest network throughput and lowest operational cost for connectivity between two networks, but comes with a few, distinct tradeoffs. First and most critically, each network must consist of a set of non-overlapping IP ranges. Second, routes are automatically created for subnet ranges within the peered networks and no custom route advertisements are possible. Last, the limits and quotas of peered networks are shared, restricting potentially many networks to the limits and quotas of a single network.

An alternative to VPC peering is multi-NIC instances, which do not share the limit and quota problems of VPC peering but have a high operational and maintenance cost: both the network routes and the instance must be configured to forward, route, or proxy traffic over a given connection. (A peering setup sketch follows the list below.)

  1. One VPC network can peer with multiple VPC networks.
  2. Peered VPC networks remain administratively separate.
  3. VPC Network Peering works with Compute Engine, GKE, and the App Engine flexible environment.
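A minimal sketch of configuring one side of a peering with the google-cloud-compute client (project and network names are placeholders, and the client surface may vary by version). Remember the point below about peering being unidirectional: the mirror-image call has to be made from the other network before the peering becomes active:

```python
from google.cloud import compute_v1

peering_request = compute_v1.NetworksAddPeeringRequest(
    network_peering=compute_v1.NetworkPeering(
        name="vpc-a-to-vpc-b",
        network="projects/project-b/global/networks/vpc-b",  # the peer network
        exchange_subnet_routes=True,
        export_custom_routes=False,
        import_custom_routes=False,
    )
)

client = compute_v1.NetworksClient()
operation = client.add_peering(
    project="project-a",
    network="vpc-a",
    networks_add_peering_request_resource=peering_request,
)
operation.result()
# Repeat from project-b / vpc-b pointing back at vpc-a to activate the peering.
```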

Restrictions of Network Peering

  • Subnet CIDR ranges can't overlap with the other VPC network's subnet CIDR ranges; non-overlapping ranges are required to exchange routes.
  • Network tags and service accounts from one network can't be referenced in the other.
  • Compute Engine internal DNS names can't be resolved from the peered network; use IP addresses instead.
  • Accessing an on-premises network through the peered network's VPN is only possible in the same region as that VPN router's region, e.g., us-west1 from network A (read https://cloud.google.com/vpc/docs/vpc-peering#on-premises_access_from_peer_network).
  • A VPC Network Peering configuration is unidirectional; it must be created from both sides to become active.
  • FW rules and routes are not exchanged automatically; they have to be configured explicitly.
  • The exception is dynamic routes, which can be learned from peered networks; the VPC's dynamic routing mode can be regional or global.
  • Only direct peering is supported; there is no transitive peering.
  • A hub network can still be used to reach on-premises networks (via network B, as shown below).
Networks A and C can each peer with hub network B to reach on-premises networks, but C is not reachable from A.
  • Peering is possible with Shared VPC networks.
  • Multi-NIC VMs can sit in peered networks, but isolation is maintained per interface, as in the scenario below.
VM1's nic1 can't connect to VM2's nic0, even though VM2's network is peered with the network of VM1's nic0.

Refer to the VPC Network Peering limits for details on the number of peering connections, forwarding rules, etc. that can be used per network.

Shared VPC

Centralize control with Shared VPC

Shared VPC can only be used within an organization. The shared network is created in a host project, and a service project is a project that gets attached to the host project (a setup sketch follows the list below).

  • A service project can only be attached to one host project.
  • A project can be either a host project or a service project, not both.
  • A VPC can only be shared from a host project to service projects.
  • If a project is made a host project, then any VPC created in that project becomes shared.
  • A host project can have more than one shared VPC.
  • Existing instances in a service project must be recreated to use the shared resources in the host project.
  • IP management is done centrally in the host project.
  • Interconnect setup is done centrally in one host project (no hub-and-spoke architecture needed).
  • The Shared VPC becomes a single security zone.
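A minimal sketch of the host/service wiring with the google-cloud-compute client (project IDs are placeholders; the IAM grants that let service-project users consume shared subnets are a separate step not shown, and the client surface may vary by version):

```python
from google.cloud import compute_v1

projects_client = compute_v1.ProjectsClient()

# 1) Turn the host project into a Shared VPC (XPN) host.
projects_client.enable_xpn_host(project="host-project").result()

# 2) Attach a service project to that host project.
attach_request = compute_v1.ProjectsEnableXpnResourceRequest(
    xpn_resource=compute_v1.XpnResourceId(id="service-project", type_="PROJECT")
)
projects_client.enable_xpn_resource(
    project="host-project",
    projects_enable_xpn_resource_request_resource=attach_request,
).result()
```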

GCP — Shared VPC vs VPC Peering

  • Subnets (and hence their FW rules and routes) can be shared with Shared VPC but not with peered VPCs.
  • VPC Peering can provide more isolation to workloads across projects (independent FW rules).
  • Shared VPC has a higher chance of IP exhaustion.
  • A maximum of 25 networks can be peered with a single VPC by default, so Shared VPC gives more flexibility in so-called hub-and-spoke designs.
  • An alternative is Private Service Access, i.e., exposing the service into your VPC over a private connection.
  • Shared VPC is only within an organization.
  • Tenant isolation comes by default (natively) with VPC Peering but not with Shared VPC.

In the example VPC topology shown below, we use Shared VPCs with a shared Interconnect project. This topology meets the above considerations while also sharing the Interconnect.

  • Dedicated Interconnect billing happens in the DI project, and the Interconnect is shareable with the Shared VPC host projects. VLAN attachments get billed against the host project. Egress traffic from VMs is billed to the service project.

Best practices in using VPCs

Using isolation (when designing multiple host projects) can also introduce the need for replication, as you decide where to place core services such as proxies, authentication, and directory services. Using a Shared Services VPC network can help to avoid this replication, and allow you to share these services with other VPC networks through VPC Network Peering, while at the same time centralizing administration and deployment.

VPC network security best practices:

I would recommend reading through the Best practices and reference architectures for VPC design guide for further details.

Appendix

GCP does not cap ingress or egress traffic outright; throughput depends on what the machine can handle and how much the network can take.

  • Select the right core count for your networking needs. Each core contributes roughly a 2 Gbit/s (Gbps) cap for peak performance, and each additional core raises the network cap, up to a theoretical optimum of around 16 Gbps per virtual machine (a rough calculation follows this list).
  • Put your instances in the right zones: light takes about 10 ms to travel ~3,000 km (roughly New York to Dallas), and in optical fiber it travels at only about two-thirds of that speed.
  • Use internal IPs over external IPs.
  • TCP window sizes on common GCP VM images are tuned for high-performance throughput; see the defaults.
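Using the per-core figures quoted above, a tiny back-of-the-envelope helper (illustrative only; actual caps vary by machine family, and newer machine types with Tier_1 networking go well beyond these numbers):

```python
def egress_cap_gbps(vcpus: int, per_core_gbps: float = 2.0, ceiling_gbps: float = 16.0) -> float:
    """Rough per-VM egress ceiling based on the per-core figures quoted above."""
    return min(vcpus * per_core_gbps, ceiling_gbps)

for cores in (2, 4, 8, 16):
    print(f"{cores} vCPUs -> {egress_cap_gbps(cores)} Gbps")
# 2 vCPUs -> 4.0 Gbps ... 16 vCPUs -> 16.0 Gbps (hits the per-VM ceiling)
```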

With respect to network traffic handling, below are the exceptions that FW rules cannot override. The following egress traffic is always blocked:

  • To TCP destination port 25 (SMTP)
  • Protocols other than TCP, UDP, ICMP, AH, ESP, SCTP, and GRE to external IP addresses
  • Certain GRE traffic
  • Traffic to Google APIs and services, unless Private Google Access is enabled or an external IP address is assigned.

Always allowed, in Ingress:

  • DHCP, DNS
  • NTP (Network Time)
  • Instance Metadata
  • The instance's loopback and alias IP addresses, and load balancer IP addresses.

Further Reading

Configure VMs for networking use cases: https://cloud.google.com/vpc/docs/special-configurations

Review this GKE scenario: What am I doing wrong with private GKEs peering and Cloud NAT?

Google Cloud also has Network Intelligence Center for discovering network topology and observing metrics.



Pavan kumar Bijjala

Architect @Accenture | Cloud as your next Enterprise | App modernization | Product Engineering