Network Architecture Design for Microservices on GCP

Raphael FRAYSSE
17 min read · Sep 2, 2019


This is our goal architecture design; please read the article to understand the journey :)

This blog article is participating in the Mercari Bold Challenge month (#6)

Hi everyone, this is Raphael from the Microservices Platform team at Mercari. To introduce us briefly, we are a post-IPO Japanese C2C (customer-to-customer) marketplace transitioning from a monolithic to a microservices architecture.

A few months ago, we were looking for content on how to design a network architecture for microservices on Google Cloud Platform (GCP), but we couldn't find much about it, so let's try to fill that gap!

This article shares our experience in thinking about, planning, and designing a network architecture for large-scale microservices on GCP.

We recommend being familiar with the following concepts to fully enjoy this article:

Table of contents:

  • Quick infrastructure introduction
  • Issues with our current environment
  • Defining the new architecture goals
  • What were the challenges in designing it?
  • Microservices network architecture design for GCP
  • Conclusion
  • What’s next?
  • Closing

We’re going to focus on how to design a microservices network architecture on GCP. Nonetheless, most of the concepts can be reused in other cloud providers as well.

(e.g. We also have a working design for microservices in Amazon Web Services (AWS) following the same concepts, using AWS Transit Gateway and AWS Shared VPC)

Quick infrastructure introduction

Mercari infrastructure in a few numbers:

  • 100+ microservices
  • 100+ VPCs
  • 2 main Google Kubernetes Engine (GKE) clusters (1 Production and 1 Development)
  • 5+ secondary GKE clusters
  • 2 countries (Japan and USA)
  • 200+ developers
  • 3k+ pods

We also manage our financial services subsidiary infrastructure, which implies enforcing strong security and compliance in our architecture.

Please check out our microservices platform presentation for more details!

Our microservices model

The microservices tenancy model in Mercari

Microservices are defined as tenants in our architecture. Each microservice has:

  • A GCP project
  • A GKE namespace

A few exceptions lack one or the other. Most microservices run on a shared central GKE cluster maintained by the microservices platform team. This cluster is located in one GCP project, with a default VPC using the default subnets. All microservices are natively routable thanks to the Kubernetes network model, which gives every pod a unique, routable IP address within the same network.
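
As a rough illustration of this tenancy model, here is what provisioning one tenant could look like in Terraform. This is a minimal sketch, not our actual modules: the microservice name, the folder and billing variables, and the provider wiring are all hypothetical.

```hcl
# Hypothetical tenant "echo-jp": one GCP project plus one namespace on the
# shared GKE cluster. Names and variables are placeholders.
resource "google_project" "echo_jp" {
  name            = "echo-jp"
  project_id      = "echo-jp-prod"
  folder_id       = var.microservices_folder_id
  billing_account = var.billing_account_id
}

resource "kubernetes_namespace" "echo_jp" {
  metadata {
    name = "echo-jp"
    labels = {
      team = "echo" # hypothetical team label
    }
  }
}
```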

Issues with our current environment

Cluster-internal traffic works fine: any of our microservices can potentially reach any other. The issues arise when external traffic is required. External traffic can take different forms:

  • Traffic destined for internal services in other VPCs
  • Traffic destined for GCP managed services
  • Traffic destined for external tenants (third-party, Internet…)
  • Traffic destined for our on-premises datacenter and AWS

Each case has its issues:

Traffic destined for internal services in other VPCs

Despite our efforts to keep microservices consistent by using a central GKE cluster, some teams inevitably created their own GKE clusters in different GCP projects, and thus in different VPCs.

The network was not a priority at that time, so each team kept the default VPC settings and default instance parameters, including external IPs. This led to IP overlap, since all default subnets use the same IP ranges, and to fully exposed GCE instances. Using VPC peering would make sense here, but since all traffic was routed to the GCE instances' public IPs, nobody cared much about it.

This led to an “everything owns a public IP and communicates publicly” standard, which is bad security-wise and cost-wise.

Traffic destined for GCP managed services

Most GCP services are consumed through APIs, which don't interact with VPCs but directly call endpoints using L7 authentication/authorization. These services work fine when called from GKE pods.

On the other hand, some GCP services need to be provisioned within a GCP project, sometimes even consuming internal IP addresses. A known limitation of GCP (for security reasons) is its strict packet source/destination checking, which drops packets that do not originate from a VPC internal subnet IP address. This is the case for GKE pods, as they run in a different network than the cluster nodes.

One GCP feature called Private Services Access is used by CloudSQL to provide database instances to a GCP project. The instances do not run in the customer VPC but in a GCP-managed VPC, which is automatically peered with the customer VPC. When a GKE pod wants to access a CloudSQL instance, its traffic is dropped at the VPC edge because the pod IP does not belong to the VPC network, so the pod cannot reach the database.
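
For reference, Private Services Access works by reserving an internal IP range in the consumer VPC and peering it with the service producer's VPC. A minimal Terraform sketch could look like the following; the network name and range size are illustrative, not our configuration.

```hcl
# The consumer VPC (name is illustrative).
data "google_compute_network" "vpc" {
  name = "default"
}

# Reserve an internal range for service producers such as CloudSQL.
resource "google_compute_global_address" "psa_range" {
  name          = "private-services-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 20
  network       = data.google_compute_network.vpc.id
}

# Peer the consumer VPC with the Google-managed VPC hosting the instances.
resource "google_service_networking_connection" "psa" {
  network                 = data.google_compute_network.vpc.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.psa_range.name]
}
```

Since the peering only exchanges routes for the VPC's own subnet ranges, traffic originating from pod IPs outside those ranges is dropped at the edge, which is the limitation described above.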

Cloud Memorystore, a managed Redis (cache) service, has the same issue. Our developers requested this service many times, which became one of the main reasons to redesign the network architecture.

Note: There is a hacky way to allow it, but we wouldn’t use it in production.

Traffic destined for external tenants (third-party, Internet…)

Any instance with a public IP address will use it for egress communication, which creates security risks since the instance exposes a globally reachable public IP. While this could be mitigated with fine-grained firewall rules, Kubernetes exposes so many services on so many ports (NodePorts in particular) that such rules are hard to enforce.

Moving our servers to a private network and disabling their public IP is a high priority to reduce the attack surface of our infrastructure.

Traffic destined for our on-premises datacenter and AWS

We still have most of our monolith on-premises, mainly our databases. The GKE clusters are home to the microservices, but these need to connect to the monolith as part of the migration. To do so, we use L7 proxies with certificates and private Border Gateway Protocol (BGP) peering between our routers and Google's. This is different from Cloud Interconnect, as we advertise our BGP ASN to Google routers and get a direct link to our rack through our datacenter partner.

While this works well technically, we cannot propagate GCP VPC routes to our on-premises routers and are forced to use public IPs from our public IPv4 block. We'd like an option to easily provide a dedicated Cloud Interconnect link to any microservice or other tenant running in our VPCs, with minimal cost and toil.

We also plan to build a multi-cloud infrastructure to let developers use AWS services alongside GCP securely, reliably, and with good performance, leveraging Cloud Interconnect and AWS Direct Connect to create a multi-cloud network hub.

Summary of issues

  • Cross-VPC Security
  • Cross-VPC Traffic Cost
  • Cross-VPC Link reliability
  • GCE Instances security
  • Inability for GKE pods to perform Cross-VPC connectivity
  • On-premises and multi-cloud connectivity
  • Lack of network resources management

It became obvious we couldn’t continue in this direction as we release more and more microservices every month. Therefore, we decided to design a scalable network architecture for our microservices.

When planning a new architecture, it is important to properly understand what you need and what you don’t.

Defining the new architecture goals

Using the issues we described above, we set the following goals for our new network architecture design:

  • Harden East-West security between GCP projects
  • Reduce East-West traffic cost
  • Make East-West traffic more reliable
  • Disable GCE instances public IPs and enforce internal traffic
  • Enable Cross-VPC connectivity for GKE pods
  • Enable production-grade on-premises and multi-cloud connectivity
  • Define a multi-tenancy network management design

What were the challenges in designing it?

  • Challenge 1: Multi-tenancy design
  • Challenge 2: Which network ownership model should we use to enforce IP address management?
  • Challenge 3: How big do we need to think?
  • Challenge 4: Private IPv4 addresses exhaustion
  • Challenge 5: Identifying edge cases
  • Challenge 6: Managing multiple regions in a Shared VPC
  • Challenge 7: Making GCE instances private only

Challenge 1: Multi-tenancy design

Our microservices are defined as tenants in our company, and we have built many automated tools around this model, with each team provisioning its GCP projects and resources through GitHub and Terraform.

Giving flexibility to developers while providing adequate guardrails is a core concept of our microservices platform.

Logically, the network should follow the same vision. In the current architecture, each microservice has a GCP project and a VPC. If some microservices wanted to access other microservices' resources outside the central GKE cluster, they would have to create VPC peerings, forming static groups of VPCs.

VPC peering has a hard limit of 25 VPCs in a peering group (VPCs peered with each other), which makes this option unusable as we already have over 100 microservices. VPC peering also requires that participating VPCs have no IP overlap, which would prevent all the microservices VPCs using the default subnet IP ranges from peering.

Solution 1: Use Shared VPC to enable multi-tenancy

GCP Shared VPC allows different GCP projects belonging to the same organization to share the same VPC network.
By sharing one VPC, participating GCP projects don't have to create any VPC peering to connect. On top of that, we can define VPC subnet permissions to create a multi-tenancy model by granting each GCP project its own VPC subnet. All VPC subnets are natively routed within the same VPC network, regardless of region.
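
In Terraform terms, wiring a tenant project into a Shared VPC boils down to enabling the host project and attaching the service project. This is a minimal sketch with hypothetical project IDs:

```hcl
# Enable the central host project (hypothetical ID) as the Shared VPC host.
resource "google_compute_shared_vpc_host_project" "host" {
  project = "network-host-prod"
}

# Attach a microservice (tenant) project as a Shared VPC service project.
resource "google_compute_shared_vpc_service_project" "tenant" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "echo-jp-prod" # hypothetical tenant project ID
}
```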

Moving to Shared VPC forces you to take a new look at the architecture, as it is a centralized model. It brings many challenges we didn't know about before, which we describe below.

tl;dr: We found out the pros outweighed the cons.

By nature, Shared VPC requires that all participating GCP projects have no IP overlap. This is one of the main challenges to solve when adopting such a solution, which leads us to the next one:

Challenge 2: Which network ownership model should we use to enforce IP address management?

Microservices network ownership model

Traditionally, enterprises have a dedicated network team responsible for network architecture design and infrastructure configuration (routing, firewalls…). When the company starts adopting microservices, the network team sees its workload increase linearly with the number of microservices. It would rapidly become a single point of failure (SPOF) if it relied on manual, reactive operations.

To avoid this bottleneck, it is essential to automate network-related processes and operations. Ideally, microservices owners would take on this responsibility, but in reality they often lack the skills to properly understand and manage network component configurations, firewall policies, routing, and so on.

Consequently, the network team needs to provide automated configuration that interfaces with the other automation tools used to provision the microservices infrastructure.

Entity network ownership model

When entities are separate, as in our case with Mercari JP, Merpay and Mercari US, it is usual to have one network team per entity. Each network team is responsible for its own entity and collaborates with the others in specific cases. Still, the network is such a common foundation layer for a business that it is typical to see one global network shared by entities of the same group, sometimes even with third parties. Complex network architectures with multiple datacenters, global network backbones, Multiprotocol Label Switching (MPLS), and Network Address Translation (NAT) everywhere have been the enterprise standard for decades.

However, in our experience, the more complicated the network ownership model, the more tedious the collaboration between stakeholders.

If you have an opportunity to simplify, don’t overlook it!

This statement is very useful when you are in doubt about a design or a feature, especially in networking. Don't use NAT unless you have no choice, don't use complex routing policies where simple ones get the job done, and so on.

Solution 2: Having a central network team with automation

In the end, we decided to have one network team that would manage the global network for all entities.

The reason lies in Conway’s law:

‘organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations’

Enforcing non-overlapping IP ranges is much harder when multiple teams manage IP assignment.

Having a central team is the best way to ensure the network standards are clear and respected. It brings cohesion to the overall network design and management. But as stated above, the central team mustn’t become a gatekeeper or a bottleneck for other teams. Providing developers flexibility is crucial.

A central team needs the proper automation and self-servicing capabilities to scale with microservices.

This is what we did by implementing a Terraform-based GitHub repository that automatically generates Shared VPC projects, attachments, and subnet provisioning with helper scripts. Any microservice team that needs a subnet for its GCP project can get one automatically by sending a Pull Request to this repository with the configuration generated by our helper script. IP assignment is not fully automated, but its overhead is low enough that the network engineers don't drown.
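
To give an idea of what such a Pull Request ends up containing, here is a hedged sketch of one tenant subnet plus the IAM binding that lets the tenant project use it. The project IDs, subnet name, CIDR and service account are illustrative, not our generated values.

```hcl
# One tenant subnet in the Shared VPC host project (all values illustrative).
resource "google_compute_subnetwork" "echo_jp" {
  project                  = "network-host-prod"
  network                  = "shared-vpc"
  name                     = "echo-jp-asia-northeast1"
  region                   = "asia-northeast1"
  ip_cidr_range            = "10.128.0.0/24"
  private_ip_google_access = true
}

# Allow the tenant project's service account to create resources in the subnet.
resource "google_compute_subnetwork_iam_member" "echo_jp_network_user" {
  project    = google_compute_subnetwork.echo_jp.project
  region     = google_compute_subnetwork.echo_jp.region
  subnetwork = google_compute_subnetwork.echo_jp.name
  role       = "roles/compute.networkUser"
  member     = "serviceAccount:terraform@echo-jp-prod.iam.gserviceaccount.com"
}
```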

Challenge 3: How big do we need to think?

When working on such an important design, one involving and impacting an entire group, it is easy to feel overwhelmed by information and scope.

What is the best way to get from blank-page syndrome to something you can deliver?

Solution 3.1: Define a “rough” capacity planning

This is an important part of the network architecture design as it should be one of the requirements for the design.

Scalability should not be compromised by the architecture; consequently, the design requires input from all stakeholders likely to consume the infrastructure.

By talking with several internal customers, we defined a rough capacity plan for the GCP services that would require private IPs.

Solution 3.2: Keep flexibility in the process

Don't be too rigid when designing the architecture: while it is easy to be conservative, there is no point if it takes you years to finish the design.

  • Identify as many two-way door decisions as possible while keeping the base of your architecture a high-quality decision.
  • Not every part of the design is set in stone, even less so these days.
  • Determine which decisions are one-way doors and which are two-way doors, and tackle the one-way doors first.
  • Using this mental model proved very useful and helped us deliver a high-quality design in a relatively short period (~6 weeks).

Below are some questions useful to ask:

  • How much will our GKE and GCE usage grow over the next 3 years? 5 years?
  • Is our design enabling future technologies such as serverless?
  • How much capacity do we need for Disaster Recovery?
  • Is our design future-proof? Can it evolve?
  • What would be the pain points in managing such a design?

With this information, we were able to design an IP address assignment plan for the entire group, including our US entity, and to produce a clear network architecture design.

Challenge 4: Private IPv4 addresses exhaustion

Private IPv4 addresses are a very scarce resource: there are only around 18 million of them, split across the three RFC 1918 ranges. Kubernetes adds new pressure on IPv4 address consumption by giving every pod a private IPv4 address.

While this didn't cause many issues in the past, since overlay networks were isolated, GCP made pods first-class network citizens by releasing Alias IP. Alias IP gives every pod in a Kubernetes cluster a private IPv4 address taken from the VPC CIDR block the cluster belongs to.

Below is a breakdown of Kubernetes IP address usage (for a 1,000-node GKE cluster with default settings):

  • GKE Nodes CIDR: /22 (1024 IPs)
  • Pods CIDR: /14 (262144 IPs), each node has a /24 (256 IPs) portion allocated
  • Services CIDR: /20 (4096 IPs)

A 1,000-node GKE cluster therefore requires around 267k IP addresses (1,024 + 262,144 + 4,096 = 267,264), which is ~1.5% of the total RFC 1918 IPv4 pool!

When you want to scale to several clusters, the numbers start to become worrying: 8 clusters eat up 12% of the RFC 1918 IPv4 pool. Add a simple failover-region disaster recovery setup on top of that and a quarter of the pool is gone!

Kubernetes is extremely "IPvore" (IP-devouring), so we had to find ways to make it use fewer IP addresses.

Solution 4: Using Flexible Pod CIDR

We partially solved the problem using Flexible Pod CIDR, trading pod density for IPv4 address savings. Sacrificing pod density matters because it effectively limits the total compute capacity of a GKE cluster. We carefully compared this limitation against our capacity planning to find a good balance between lost capacity and IPv4 address savings.

Reducing the per-node Pod CIDR from /24 (256 IPs, max 110 pods per node) to /26 (64 IPs, max 32 pods per node) has a huge impact. With the earlier example, we go from 267k IP addresses down to about 70k, a substantial 74% decrease! On the other hand, we theoretically drop from 110k to 32k maximum pods per cluster, about a 71% decrease in pod capacity. 32 pods per node felt like the sweet spot based on our GKE utilization and capacity planning, so we made the call, considering IPv4 address savings more important than maximum pod capacity. Your mileage may vary depending on your priorities.
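
In GKE, this trade-off boils down to one setting: the maximum number of pods per node, which determines the per-node pod range. A minimal Terraform sketch, with illustrative cluster, network and secondary range names, could look like this:

```hcl
# Capping pods per node at 32 makes GKE reserve a /26 per node instead of a /24.
# Cluster, network and secondary range names are illustrative.
resource "google_container_cluster" "main" {
  name                      = "main-prod"
  location                  = "asia-northeast1"
  network                   = "shared-vpc"
  subnetwork                = "gke-main-asia-northeast1"
  default_max_pods_per_node = 32
  remove_default_node_pool  = true
  initial_node_count        = 1

  ip_allocation_policy {
    cluster_secondary_range_name  = "gke-main-pods"     # pod (alias IP) range
    services_secondary_range_name = "gke-main-services" # ClusterIP range
  }
}
```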

Challenge 5: Identifying edge-cases

As there is no magical solution in this world, cloud providers also come with their share of technical limitations, which they try to keep out of sight until a given edge case is hit. This is the kind of relationship we had with the GCP teams during our exchanges, having to bend our design to ensure we could do what we planned with minimal friction. Identifying edge cases takes a lot of time, and it is easy to chase mirages.

Solution 5: Research limitations extensively but with moderation

We started by sifting through the GCP network documentation to identify the technical limitations of each product, which was sometimes painful as there are dependencies between products, especially with Shared VPC.

We were able to list a first set of technical limitations relevant at large scale (as of August 2019):

  • Max Shared VPC Service Project per Host Project: 100
  • Max number of subnets per project: 275
  • Max secondary IP ranges per subnet: 30
  • Max number of VM instances per VPC network: 15000
  • Max number of firewall rules per project: 500
  • Max number of Internal Load Balancers per VPC network: 50
  • Max nodes for GKE when using GCLB Ingress: 1000

These numbers look scary when aiming at a very large scale. 15k VMs means fifteen 1,000-node GKE clusters if Kubernetes were the only GCE resource we used in GCP. This is the limit we fear the most, yet we agreed to deal with it.

The other one is the maximum number of subnets per project, which means we can have at most 275 microservices, minus those reserved for disaster recovery. We agreed to use Shared VPC anyway, since only a few microservices will require a dedicated VPC subnet.

These limitations also confirmed our intention to have mirrored development and production network infrastructures, completely isolated from each other, both to avoid cross-environment violations and to prevent a single all-encompassing Shared VPC from reaching its limits twice as fast.

The important takeaway here is the ability to find a consensus between the edge cases, your capacity planning, and your risk assessment.

In our case, we decided to go with these limitations predicting that:

  1. These limits would be lifted in the future, requiring as few redesigns as possible
  2. We might not reach this scale (though obviously we want to!)
  3. We made many two-way door decisions, so it is a calculated risk

Challenge 6: Managing multiple regions in a Shared VPC

A Shared VPC network spans the globe by definition, so it seemed easy to create a multi-region network architecture. However, we had to choose the design that would best fulfil our architecture goals while solving the challenges mentioned above.

We defined 4 options for the multi-region Shared VPC design:

  • Option 1: 1 Global Shared VPC Host Project, 1 Shared VPC network per region peered with VPC peering
  • Option 2: 1 Global Shared VPC Host Project, 1 Global Shared VPC network
  • Option 3: 1 Shared VPC Host Project per region with VPC peering
  • Option 4: 1 Shared VPC Host Project per region without VPC peering

After weighing each option's pros and cons, we chose Option 2 (sketched below) for the following reasons:

  • It has the simplest management with a centralized Shared VPC Host Project for the entire group, referring to Solution 2.
  • It is the easiest way to implement the infrastructure logic in GitHub and Terraform
  • Interconnection between regions is straightforward and leverages Google Global VPC Network
  • It fulfils the architecture goals and matches our assumptions from Solution 5
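
As a rough picture of Option 2, the whole group shares one custom-mode, globally routed VPC network in a single host project, with subnets per tenant and per region. The sketch below uses hypothetical names and CIDRs, not our real configuration:

```hcl
# One global Shared VPC network in the central host project (IDs illustrative).
resource "google_compute_network" "shared" {
  project                 = "network-host-prod"
  name                    = "shared-vpc"
  auto_create_subnetworks = false    # custom mode: no default subnets, no IP overlap
  routing_mode            = "GLOBAL" # subnets in all regions are natively routed
}

# Tenant subnets live in different regions but in the same network.
resource "google_compute_subnetwork" "tenant_jp" {
  project       = google_compute_network.shared.project
  network       = google_compute_network.shared.id
  name          = "tenant-a-asia-northeast1"
  region        = "asia-northeast1"
  ip_cidr_range = "10.64.0.0/24"
}

resource "google_compute_subnetwork" "tenant_us" {
  project       = google_compute_network.shared.project
  network       = google_compute_network.shared.id
  name          = "tenant-b-us-west1"
  region        = "us-west1"
  ip_cidr_range = "10.64.1.0/24"
}
```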

Challenge 7: Making GCE instances private only

With Shared VPC, internal connectivity between all GCE instances within the VPC is straightforward and secure. This allows us to remove public IP addresses from all GCE instances, but when doing so, the instances lose Internet connectivity.

How can we provide Internet connectivity to VMs while ensuring scalability?

Solution 7: Use Cloud NAT in the Shared VPC

Cloud NAT is a managed NAT service provided by GCP. It focuses on outbound NAT for GCE instances to provide outbound Internet connectivity. It is deployed regionally, so we need to create at least one Cloud NAT gateway per region. Contrary to traditional NAT services, Cloud NAT is embedded into GCP's Software Defined Network (SDN): traffic does not flow through a NAT VM's network interface, and it is highly scalable. Each public IP defined in Cloud NAT provides at most 64k TCP ports and 64k UDP ports. The default allocation is 64 ports per GCE instance, which might not be enough for GKE nodes running many pods.

Therefore, we need to fine-tune the number of NAT IPs and the number of ports allocated per VM to find a good balance for GKE nodes.
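
As an example of that tuning, here is a minimal Cloud NAT sketch in Terraform; the project ID, names, and the 1,024 ports-per-VM value are illustrative assumptions, not our production numbers.

```hcl
# Regional Cloud Router + Cloud NAT in the Shared VPC host project.
resource "google_compute_router" "nat_router" {
  project = "network-host-prod"
  name    = "nat-router-asia-northeast1"
  region  = "asia-northeast1"
  network = "shared-vpc"
}

# A manually reserved external IP gives ~64k TCP and ~64k UDP ports to share.
resource "google_compute_address" "nat_ip" {
  project = "network-host-prod"
  name    = "nat-ip-asia-northeast1-1"
  region  = "asia-northeast1"
}

resource "google_compute_router_nat" "nat" {
  project                            = "network-host-prod"
  name                               = "nat-asia-northeast1"
  router                             = google_compute_router.nat_router.name
  region                             = google_compute_router.nat_router.region
  nat_ip_allocate_option             = "MANUAL_ONLY"
  nat_ips                            = [google_compute_address.nat_ip.self_link]
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
  min_ports_per_vm                   = 1024 # raised from the 64-port default
}
```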

For specific use cases, a security-sensitive project can also use a dedicated Cloud NAT gateway while participating in the Shared VPC, which is a good fit for our sensitive workloads.

Microservices network architecture design for GCP

Based on all the considerations explained above, we created a dedicated Shared VPC Host Project with one global VPC network. It acts as the hub for all network components: Cloud NAT, the global Cloud Router with Dedicated Interconnect, and Cloud Memorystore are provided there because of the technical limitations of Shared VPC.

The central network team owns this GCP project, so only network resources are managed within it.

Within the VPC network, each GCP project requesting a VPC subnet gets one, with its choice of region. The US entity has its subnets in 'us-west1', while the JP entity has its subnets in 'asia-northeast1' and 'asia-northeast2' for DR purposes. A microservice GCP project gets two subnets, in two different regions, to replicate compute resources. Since all tenants can communicate with each other in the Shared VPC, tenants requiring more security can use features such as VPC Service Controls (in the future) or global firewall rules restricting access to specific VPC subnets.

Our vision is to enable a flat network across entities, leveraging GCP's Andromeda SDN capabilities and enforcing security exclusively at L7, through Mutual TLS (mTLS) and GCP Identity and Access Management (IAM).

Conclusion

In this article, we explained the issues we faced with an unprepared microservices network infrastructure. After identifying the problems with the default network settings in GCP, we used them to set our new architecture goals.

During this journey, we encountered many challenges and proposed a solution for each: the network ownership model and how it affects architecture design and network operations, the importance of IP policy enforcement with Shared VPC, the difficulty of managing scarce IP addresses, identifying edge cases, thinking multi-region, and making our infrastructure private and more secure.

The result is an evolvable, multi-tenant, multi-regional, and secure microservices network architecture design in GCP.

What’s next?

There are still many unknowns around GCP network features. Google's roadmap is very ambitious, and some features could end up quite different from what they are today. We're looking forward to the removal of many of the limitations around the VPC network model, to support our growth and the longevity of this architecture. We have many improvements we'd like to start working on, such as a complete network self-service model.

This should allow developers to:

  • Freely choose subnets to consume
  • Define security rules by simply inputting microservices names in a configuration file
  • Automatically provision GCP services requiring network configuration

Closing

Thank you very much for reading this until the end! (As long as you made it, no matter the path taken!)

Did this article help you? If you enjoyed this post, I'd be very grateful if you'd help it spread by emailing it to a friend or sharing it on Twitter or LinkedIn. Please don't hesitate to DM me if you are interested in this work, and leave your comments below! It would be a pleasure to read them :)

We are also looking for talented members to join our team!

Finally, we'd like to thank the GCP TAM team from the Tokyo office for answering our questions.

Cheers,

Raphael

