Stop using Network Appliances in Google Cloud and start embracing native solutions

Published in Qodea Google Cloud Tech Blog · 12 min read · Jul 19, 2022

Part 1 — Firewalls…

Network Appliances in Google Cloud are an Anti-Pattern Meme

First, a confession: I really like networking, so much so that many of my colleagues begin to glaze over, smile, and nod when I get onto the subject. I have studied on-prem networking to a CCNA/CCNP standard and managed networks, including physical and virtual appliances, on-prem during my nearly 10 years in IT.

On the Google Cloud side, I have now been working with the platform for a little under 3 years, having held the Professional Cloud Network Engineer certification for most of that time, so it is fair to say I also really like Google Cloud networking.

So why, you might ask, do I have a problem when ideas and systems from the on-prem world get used in Google Cloud? Well, that is the topic of today’s post, so buckle in and enjoy the ride…

Firewalls, on-prem…

Firewalls are wonderful devices that keep bad actors out, right? In the on-prem world, absolutely. Take this network diagram (which looks remarkably like that of one of my former employers!):

Source: https://serverfault.com/questions/599579/campus-network-design-firewalls

In this diagram, the firewalls are located between the outside world (border routers) and the core switches to provide a boundary defence, in what often gets termed the “castle and moat” security model, aka perimeter security.

Source: https://medium.com/devopsiraptor/trust-is-a-vulnerability-the-zero-trust-security-model-8c3bbdcd76bf

I’m not going to go into why Perimeter Security is Dead as a security model, because my colleague Jonny O’Connell has already written this fantastic post exploring this very topic. Needless to say, I agree wholeheartedly with his views on the matter and have long considered identity to be the main threat vector facing organizations.

Google Cloud’s Firewall

So how does the firewall in Google Cloud work? Firstly, it is built into Andromeda, the software-defined networking layer, which means the firewall is completely distributed over the network with no single choke point (or single point of failure, if you share my cynicism). The way I personally visualize this is to think of the Google Cloud firewall as a control plane that manages a firewall for every single instance in your Virtual Private Cloud (VPC) network. Or, to quote Stephanie Wong in her post about firewalls:

“You can think of the GCP firewall rules as existing not only between your instances and other networks, but between individual instances within the same network.” — Stephanie Wong

This granularity actually gives a lot of freedom to do things such as have a very flat network with subnets that are far larger than is typically found on-prem. I think this is great as I want to treat IPs as cattle, not pets, and don’t want to care about how many I have left in a given subnet.

I want to treat IPs as cattle, not pets — Ali Grew (July 2022)

Creating firewall rules in Google Cloud is also really straightforward, as in this example:

Source: https://stephrwong.medium.com/protect-your-google-cloud-instances-with-firewall-rules-69cce960fba

Not to quote Stephanie’s post word for word, but a firewall rule is formed of 4 things:

Action — Allow/Deny and Ingress/Egress
Protocol — TCP, UDP, ICMP, IPIP, or a combination of these
Ports — For example, 22 above
Source or Destination — This can be:
— An IP range
— Network Tags
— Service Accounts
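To make the four parts concrete, here is a minimal Python sketch of how such a rule could be modelled and matched against a packet. The `FirewallRule` and `Packet` classes and the `matches` function are my own illustrative names, not Google Cloud types, and the sketch only covers the IP range form of the source (tags and service accounts are left out).

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass
class FirewallRule:
    """Illustrative model of the four parts of a rule (not a Google API type)."""
    action: str                 # "allow" or "deny"
    direction: str              # "ingress" or "egress"
    protocol: str               # e.g. "tcp", "udp", "icmp"
    ports: frozenset            # an empty set means "any port"
    source_ranges: tuple = ()   # CIDR strings such as "10.0.0.0/8"


@dataclass
class Packet:
    direction: str
    protocol: str
    port: int
    source_ip: str


def matches(rule: FirewallRule, packet: Packet) -> bool:
    """Return True when direction, protocol, port and source all match."""
    if rule.direction != packet.direction or rule.protocol != packet.protocol:
        return False
    if rule.ports and packet.port not in rule.ports:
        return False
    return any(ip_address(packet.source_ip) in ip_network(cidr)
               for cidr in rule.source_ranges)


# Hypothetical rule: allow SSH (TCP port 22) in from the 10.0.0.0/8 range.
allow_ssh = FirewallRule("allow", "ingress", "tcp", frozenset({22}), ("10.0.0.0/8",))
print(matches(allow_ssh, Packet("ingress", "tcp", 22, "10.1.2.3")))
print(matches(allow_ssh, Packet("ingress", "tcp", 443, "10.1.2.3")))
```

The first packet matches on all four parts; the second fails only on the port, which is enough for the rule not to apply.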

The first three things are pretty standard for anyone who has ever defined a firewall rule before but things start to get interesting with the source or destination, specifically when you can use Network Tags or Service Accounts. Why might you want to use these, and how does it help in specific scenarios?

Let’s take this diagram for example:

Source: https://cloud.google.com/vpc/docs/firewalls

To enable communication from VM 4 to VM 3, the tags of both servers are specified — so what, you may ask? I could have achieved the same result by specifying their IP addresses, and you would be quite right. There are two distinct advantages to this approach though:

I don’t want to care about IP addresses; they should be treated as cattle, not pets.
If VM 4 or VM 3 were actually managed instance groups (MIGs) with several VMs handling a given service and IPs allocated from a pool, the tag-based rule would keep working as instances come and go, without needing to be updated.
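The second point is easier to see in a small sketch. The Python below is purely illustrative (the `Instance` class, the `tag_rule_allows` function, and the instance names and IPs are all made up), but it shows the idea: a tag-based rule never looks at the IP address, so any number of MIG members with different IPs are covered by the same rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    ip: str
    tags: frozenset


def tag_rule_allows(source: Instance, target: Instance,
                    source_tag: str, target_tag: str) -> bool:
    """A tag-based rule only inspects tags; IP addresses play no part."""
    return source_tag in source.tags and target_tag in target.tags


# Hypothetical MIG members: same tag, different IPs drawn from a pool.
web_a = Instance("web-a", "10.0.0.11", frozenset({"web"}))
web_b = Instance("web-b", "10.0.0.57", frozenset({"web"}))
db = Instance("db-1", "10.0.1.5", frozenset({"db"}))

allowed = [tag_rule_allows(vm, db, "web", "db") for vm in (web_a, web_b)]
print(allowed)
```

Both web instances are allowed through to the database even though their IPs differ, and a new instance with the same tag would be covered without touching the rule.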

Being able to specify the service account works in a similar way to tags. Every service that calls an API in Google Cloud has to have an identity, Compute Instances being no exception as per the example below:

Source: https://cloud.google.com/vpc/docs/using-firewalls#serviceaccounts

Here the firewall rule can simply be defined (using the gcloud CLI) as:
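As a rough sketch of what such a command might look like, the Python below assembles the arguments for `gcloud compute firewall-rules create` and prints them for inspection rather than running anything. The rule name, network, port, and service account addresses are hypothetical placeholders of mine; only the gcloud flags themselves are real.

```python
import shlex


def firewall_rule_command(name, network, rule, source_sa, target_sa):
    """Build (but do not run) a gcloud command for a service-account-based rule."""
    return [
        "gcloud", "compute", "firewall-rules", "create", name,
        f"--network={network}",
        "--direction=INGRESS",
        "--action=ALLOW",
        f"--rules={rule}",
        f"--source-service-accounts={source_sa}",
        f"--target-service-accounts={target_sa}",
    ]


# All of the values below are placeholders, not taken from a real project.
command = firewall_rule_command(
    name="allow-web-to-db",
    network="my-network",
    rule="tcp:5432",
    source_sa="web-sa@my-project.iam.gserviceaccount.com",
    target_sa="db-sa@my-project.iam.gserviceaccount.com",
)
print(shlex.join(command))
```

Printing the command first makes it easy to review before pasting it into a shell; in practice I would still rather see a rule like this defined in Terraform.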

Firewall Insights

One very useful feature with Google Native firewalls is that you can gain “Firewall Insights” through the Network Intelligence Center tool. To quote the documentation:

With Firewall Insights metrics, you can perform the following tasks:
1. Verify that firewall rules are being used in the intended way.
2. Over specified time periods, verify that firewall rules allow or block their intended connections.
3. Perform live debugging of connections that are inadvertently dropped because of firewall rules.
4. Discover malicious attempts to access your network, in part by getting alerts about significant changes in the hit counts of firewall rules.
With insights, you can perform the following tasks:
1. Identify firewall misconfigurations.
2. Identify security attacks.
3. Optimize firewall rules and tighten security boundaries by identifying overly permissive allow rules and reviewing predictions about their future usage.

Now all of that sounds pretty useful to me in helping to identify potential attack surfaces, and the attacks themselves.

Firewall Logging

Firewall Logging is a powerful feature that, when turned on, gives deep insight into traffic. I will let you go and read the detail here, but in summary, you can see whether traffic was allowed or denied, where it came from, and where it was going (or attempting to go, if blocked). Once it is in Cloud Logging you can also do whatever you like with it, such as sinking it to BigQuery or Cloud Storage, or sending it out to a SIEM tool via Pub/Sub.
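To give a flavour of what you can do once the logs are exported, here is a small Python sketch that summarises a couple of entries. The entries are hypothetical and heavily trimmed, and I am assuming the `jsonPayload` layout with `disposition`, `connection`, and `rule_details` fields; check the firewall logging documentation for the full record format before relying on it.

```python
from collections import Counter

# Hypothetical, heavily trimmed log entries; real entries carry many more fields.
entries = [
    {"jsonPayload": {
        "disposition": "ALLOWED",
        "connection": {"src_ip": "10.0.0.11", "dest_ip": "10.0.1.5",
                       "dest_port": 5432, "protocol": 6},
        "rule_details": {"reference": "network:demo/firewall:allow-web-to-db"}}},
    {"jsonPayload": {
        "disposition": "DENIED",
        "connection": {"src_ip": "203.0.113.9", "dest_ip": "10.0.1.5",
                       "dest_port": 22, "protocol": 6},
        "rule_details": {"reference": "network:demo/firewall:deny-all-ingress"}}},
]


def summarise(log_entries):
    """Count dispositions and collect (source, destination, port) for denied flows."""
    counts = Counter()
    denied = []
    for entry in log_entries:
        payload = entry["jsonPayload"]
        counts[payload["disposition"]] += 1
        if payload["disposition"] == "DENIED":
            conn = payload["connection"]
            denied.append((conn["src_ip"], conn["dest_ip"], conn["dest_port"]))
    return counts, denied


counts, denied = summarise(entries)
print(dict(counts))
print(denied)
```

The same idea scales up naturally as a BigQuery query or a SIEM rule once the logs are sunk there.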

Firewall Appliances on Google Cloud

So after that brief tour of both on-prem firewalls and the Google Cloud firewall, we get to where these two worlds collide with virtual firewall appliances on Google Cloud and it ain’t pretty.

Firstly, I want to state that this section has been formed from my experiences and opinions deploying one of these solutions. I also have a colleague who deployed a different solution and ran into similar challenges. These challenges arose despite the appliances being from two different, highly prominent networking vendors, so I doubt the problems are down to a specific vendor’s implementation. Certainly, I found the vendor I worked with to be nothing but helpful in getting their solution to work, but the fact I had to contact them to troubleshoot issues doesn’t really bode well.

So without further ado, say hello to the HLD for an active/passive reference architecture from one of these vendors:

If you are looking at that diagram and thinking that is an awful lot of infrastructure, you would be right. Even if you ignore one of the two external TCP/UDP LBs you have:

An external TCP/UDP LB
An internal TCP/UDP LB
Cloud NAT
Five External IP Addresses (One for Cloud NAT and two each for the HA appliance pair, assuming external IPs for the management interfaces)
Four, yes four VPCs (as each NIC of the VM must reside on a different VPC)

For comparison let’s look at an equivalent Google Cloud-native architecture (assuming IPSec VPNs):

In this architecture, you have far less going on for the same functional benefit. It comprises:

A single VPC
An HA Cloud VPN configuration (which would also require a Cloud Router, which I unfortunately omitted)
Cloud NAT for outbound routing
Ingress using Cloud Armor (for WAF) and a Google Load Balancer (admittedly this assumes HTTP traffic but you could swap this for an External TCP/UDP LB which would still offer some DDoS protection)
Cloud IDS for intrusion detection.
Cloud Logging (with optional export to a SIEM tool like Splunk)

Anyway back to the appliance-based solution, there were some quirks in the configuration that were worth noting:

In a multi-NIC scenario, NIC0 has to be the externally facing NIC as all load balancers except the internal TCP/UDP LB route traffic to this NIC.
You generally require 1 vCPU per NIC up to a maximum of 8 NICs. The only exception to this is a single vCPU machine which can have 2 NICs.
In one of our environments, we required more NICs (3 separate internal VPCs) for a total of 6 VPCs. This required 6 vCPUs, which was no problem, but the software was only licensed to use 4 vCPUs, so despite our paying for the additional two cores they will never be used.
In order to route to a Cloud SQL instance with an internal IP address (which, for those who don’t know, is peered in from a Google-managed VPC), you have to manually (click-ops style) export the custom routes to the Google-managed VPC; otherwise it doesn’t receive the default route out via the internal LB to the appliances and beyond.
In order for the firewall to work correctly, Source NAT (SNAT) was disabled on the appliances, with IP Forwarding enabled on the GCE instances. This ensures that packets traversing the firewall keep their original source address. It does, however, also mean that the Google Cloud firewall rules have to be completely open, and by open I mean ingress and egress allow-all on any port or protocol, which feels all levels of wrong from a security perspective. Without the appropriate organization policy, it would be so easy for someone to stick an external IP on a compute instance, allowing hackers to breach your network.
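The NIC and licensing arithmetic from the list above can be sketched in a few lines of Python. The function names are my own, and the rule encoded here is just the one described in this post (1 vCPU per NIC, a maximum of 8 NICs, and a single-vCPU machine allowed 2 NICs), so treat it as an illustration rather than a statement of current Compute Engine limits.

```python
MAX_NICS = 8


def max_nics(vcpus: int) -> int:
    """NICs available for a given vCPU count, per the rule described in the post."""
    if vcpus < 1:
        raise ValueError("a VM needs at least one vCPU")
    # A 1 vCPU machine may still have 2 NICs; beyond that it is 1 NIC per vCPU.
    return min(max(vcpus, 2), MAX_NICS)


def min_vcpus_for(nics: int) -> int:
    """Smallest vCPU count that supports the requested number of NICs."""
    if not 1 <= nics <= MAX_NICS:
        raise ValueError(f"NIC count must be between 1 and {MAX_NICS}")
    return 1 if nics <= 2 else nics


def unlicensed_vcpus(nics: int, licensed_vcpus: int) -> int:
    """vCPUs you pay for only to get the NICs, which the software cannot use."""
    return max(min_vcpus_for(nics) - licensed_vcpus, 0)


# The environment from the post: 6 NICs, software licensed for 4 vCPUs.
print(min_vcpus_for(6))
print(unlicensed_vcpus(6, 4))
```

For that environment the sketch gives 6 vCPUs required and 2 of them sitting idle, matching what we saw.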

Source: https://memegenerator.net/img/instances/74155082.jpg

So after all those configuration challenges does the solution work? Well, not very well but let me expand:

Appliances make the network fundamentally less resilient

Google’s distributed network design and firewall implementation have no single point (or points) that traffic has to flow through. This fundamental design choice is to ensure maximum availability and prevent choke points. By adding an appliance you are going against this philosophy and forcing traffic through a predefined path.

Further to the above, we also found that in the event of an upgrade or an outage it took about 30 seconds to fail over between appliances, which could be inconvenient if it were to occur at a bad time.

We have to maintain the appliances

Who really wants to be up at silly o’clock upgrading firewalls? I have been there, got the T-shirt, and would much rather Google SREs worried about this.

They have to be managed separately from the rest of the infrastructure

As far as I am concerned all production infrastructure in Google Cloud should be managed using Terraform (using well-used and supported providers!) or KRM not by logging into appliances to make configuration changes. This is before I even touch on the declarative vs imperative management styles.

Source : https://www.reddit.com/r/ProgrammerHumor/comments/uulmq0/who_says_configuring_firewall_is_obsolete/

Troubleshooting has been painful

With high availability especially there will be a lot of moving parts breeding complexity. With complexity comes difficulty in troubleshooting which we have already found to be the case, with issue resolution becoming a cross-discipline (and team) endeavor where a single pane of glass would be most welcome.

Integration isn’t great with other Google native tooling

If you stay within the Google ecosystem of services, there is a reasonable (albeit not perfect) level of integration. If you were to create an open firewall rule, for example, it would be quickly highlighted (and potentially alerted on) within the Security Command Center. Want to use the Network Intelligence Center Google offers? Using an appliance will reduce the visibility and usefulness of that tool.
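To illustrate the kind of check that flags an open rule, here is a minimal Python sketch. It is not how Security Command Center actually works internally; it simply inspects a rule expressed with the field names used by the Compute Engine firewall resource (`direction`, `sourceRanges`, `destinationRanges`, `allowed`, `IPProtocol`), and the example rules are hypothetical.

```python
def is_fully_open(rule: dict) -> bool:
    """True when an allow rule covers every address and every protocol."""
    direction = rule.get("direction", "INGRESS")
    ranges_key = "sourceRanges" if direction == "INGRESS" else "destinationRanges"
    any_address = "0.0.0.0/0" in rule.get(ranges_key, [])
    any_protocol = any(entry.get("IPProtocol") == "all"
                       for entry in rule.get("allowed", []))
    return any_address and any_protocol


# Hypothetical rules: the first is the "completely open" kind an appliance forces on you.
open_ingress = {
    "name": "allow-all-ingress",
    "direction": "INGRESS",
    "sourceRanges": ["0.0.0.0/0"],
    "allowed": [{"IPProtocol": "all"}],
}
ssh_only = {
    "name": "allow-ssh-internal",
    "direction": "INGRESS",
    "sourceRanges": ["10.0.0.0/8"],
    "allowed": [{"IPProtocol": "tcp", "ports": ["22"]}],
}

print([r["name"] for r in (open_ingress, ssh_only) if is_fully_open(r)])
```

With an appliance in the path, rules like the first one become the norm, so a finding that should be a red flag turns into background noise.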

Source: https://memegenerator.net/instance/84719680/data-laughing-i-thought-i-heard-you-say-seamless-integration

Appliances are somewhat expensive

Google VPC firewalls are FREE! Yes, I appreciate VPCs have some associated costs but you would pay them even if you were using an appliance. With an appliance, you are going to have to pay for GCE compute cost (keeping in mind the 1vCPU to 1 NIC ratio) and whatever licensing cost there is. Anecdotally, I also believe appliances will take more time to manage, configure and maintain leading to higher indirect costs.

Is the more sophisticated functionality required?

Firewall appliances are often what gets referred to as “Next-Generation Firewalls” (NGFW). This generally means they contain additional functionality like IDS, IPS, and WAF. You might quite rightly point out that Google firewalls don’t have this advanced functionality, and you would be correct. However, the story doesn’t end there. Let’s take each of these in turn:

For IDS, Google now has its Cloud IDS service which is based on technology from Palo Alto. This works by packet mirroring from your VPC into a Google Managed VPC containing Google-managed Palo Alto appliances that inspect the traffic looking for threats. If threats are detected this is fed back into Cloud Logging from where it could be passed to a SIEM tool or alerted on.

Source: https://cloud.google.com/blog/products/identity-security/how-google-cloud-ids-helps-detect-advanced-network-threats

For WAF, Google has its Cloud Armor service, which provides DDoS and WAF protection at Google scale and is deployed in a highly distributed fashion on the Google Front End (GFE). I think Priyanka sums it up really nicely in one of her famous sketch notes:

Source: https://thecloudgirl.dev/CloudArmor.html

For IPS, there is admittedly no Google service, for the very simple reason that IPS requires the ability to block traffic, and to do that effectively it has to sit on the “data plane”. This, however, brings us back around to the resiliency problem I described above, with all traffic passing through one or more defined points. What I will say, though, is that for front-end attacks (which are likely the main vector) Cloud Armor would provide protection, and chances are Google would be seeing and reacting to an attack at least as fast as the appliance vendor.

Concluding Thoughts

When I set out to talk about this topic I wasn’t expecting to write a 2300-word essay, but here we are, and congratulations for getting this far. I hope in this post I have explained some of the benefits of Google’s native VPC firewall, some of the challenges with appliances that I experienced firsthand, and finally some of the alternatives to the additional functionality offered by NGFWs, like IDS, IPS, and WAF. I plan to make a little series about this, so if you would like to suggest which area I should cover next, please leave a comment. If you disagree with my opinions, I’m also really interested to hear your thoughts.

All that leaves me to do then is to say that until next time keep it Googley ;)
