Create a high availability NAT in Google Cloud Platform while you have a coffee
This article applies until Cloud NAT reaches GA. If you want to learn about Cloud NAT, the new fully managed NAT service currently in beta, please visit: https://medium.com/bluekiri/high-availability-nat-gateway-at-google-cloud-platform-with-cloud-nat-8a792b1c4cc4
When I started reading about Google Cloud Platform, I was surprised that I couldn't find anything about NAT gateways. For the last few years I have been working for an online travel agency, and we always have to ask our providers to open their firewalls in order to set up the integration with their systems. Besides, lately I have been using Amazon Web Services, which offers a NAT gateway as a managed service, with all of the maintenance performed by AWS.
NAT gateways usually have two main purposes:
- For security, you don’t expose the servers behind the NAT to the Internet.
- For maintenance, you only have to control a small number of IPs.
Eventually I found a section in GCP’s documentation about this concept (https://cloud.google.com/vpc/docs/special-configurations#multiple-natgateways) and I decided to automate it.
I have to say that I am a big fan of Deployment Manager (https://cloud.google.com/deployment-manager/docs/), the infrastructure-as-code tool for GCP. You can use it to create all the infrastructure related to a project and, when the project is finished, delete all of the associated resources with a single command. You can write your templates in Jinja or Python (the latter is recommended), but I’m going to use Jinja because the code is simpler to read. You can find the code in our Git repository (https://github.com/bluekiri/gcpnatha).
The configuration diagram is shown below:
First of all, we are going to create the configuration file:
imports:
- path: ./startup.sh
- path: nat_ha.jinja
- path: netw.jinja
- path: fw.jinja
- path: hc.jinja
- path: tmplvm.jinja
resources:
- name: nat_ha_setup
The configuration file imports the template files and the startup script that sets up the health check, and defines our parameters:
- vpcName: the name for the new custom VPC
- ipRange: the CIDR range for the subnetwork for the new VPC
- region: the desired region where we are going to create our resources
- zone[1–3]: the different zones inside the region to create our high availability solution
- sshAccess: the source range allowed to reach our servers over SSH. It is good practice to restrict it to our own facilities (use 0.0.0.0/0 only for testing)
- machineType: this matters not only for the price of the instance but also for throughput. You have to choose the machine type carefully because, according to the Compute Engine documentation: “Outbound or egress traffic from a virtual machine is subject to maximum network egress throughput caps. These caps are dependent on the number of vCPUs that a virtual machine instance has. Each core is subject to a 2 Gbits/second (Gbps) cap for peak performance. Each additional core increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine”.
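The quoted rule can be turned into a quick estimate when comparing machine types. This is a minimal Python sketch, assuming the cap grows linearly at 2 Gbps per vCPU until it reaches the 16 Gbps ceiling. The function name is hypothetical, and the real limits are whatever the current Compute Engine documentation says.

```python
def estimated_egress_cap_gbps(vcpus: int) -> int:
    """Rough egress cap from the quoted rule: 2 Gbps per vCPU, at most 16 Gbps."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return min(2 * vcpus, 16)


# Compare a few vCPU counts before choosing the machineType property.
for vcpus in (1, 2, 4, 8, 16):
    print(f"{vcpus} vCPU(s): ~{estimated_egress_cap_gbps(vcpus)} Gbps")
```

Under this assumption, adding cores beyond 8 vCPUs no longer raises the estimated cap, so a larger instance would only pay off for CPU or memory reasons.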
To keep you from getting bored with the code, I’m going to explain the main parts quickly (you can see them in more detail on GitHub):
- nat_ha.jinja: the main template file, where we define our custom properties (you could read properties from environment variables too). It calls the other templates
- netw.jinja: creates the VPC and subnetwork
- fw.jinja: creates the firewall rules that permit SSH to our solution, HTTP access to the health check from Google’s servers and communication between instances inside the same subnetwork
- hc.jinja: creates the health check
- tmplvm.jinja: reserves an external and an internal IP for the NAT servers, creates the instance template and the instance group (with autohealing) and finally creates the route to the NAT server.
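To give an idea of what the last step (the route to the NAT server) could look like, here is a minimal sketch written as a Deployment Manager Python template instead of Jinja. It is not the code from the repository. The resource names, the priority value and the choice of `nextHopIp` pointing at the reserved internal address are assumptions made for illustration. Only `vpcName` and the “no-ip” tag come from this article.

```python
def nat_route(name, network_name, address_name, priority=800):
    """Build one route that sends Internet-bound traffic to a NAT instance."""
    return {
        'name': name,
        'type': 'compute.v1.route',
        'properties': {
            # Deployment Manager resolves $(ref...) once the resource exists.
            'network': '$(ref.%s.selfLink)' % network_name,
            'destRange': '0.0.0.0/0',
            # Reserved internal address of the NAT server (assumed resource name).
            'nextHopIp': '$(ref.%s.address)' % address_name,
            # Same priority on every route (assumed), so no NAT is preferred.
            'priority': priority,
            # Only instances carrying this network tag use the route.
            'tags': ['no-ip'],
        },
    }


def GenerateConfig(context):
    """Entry point that Deployment Manager calls for a Python template."""
    vpc = context.properties['vpcName']
    resources = [
        nat_route('%s-nat-route-%d' % (vpc, i), vpc, '%s-nat-ip-int-%d' % (vpc, i))
        for i in (1, 2, 3)  # one route per zone
    ]
    return {'resources': resources}
```

Pointing the route at a reserved internal IP rather than at an instance name is a design choice that fits autohealing: when the instance group recreates a NAT server, the replacement can take over the same address and the route does not need to change.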
You can clone the GitHub project on a computer with the Google Cloud SDK installed and, after reviewing your custom properties, run the following command to create your infrastructure:
gcloud deployment-manager deployments create test-nat-ha --config nat_ha.yaml
To verify our solution, create a test instance without an external IP inside the created subnet, in the same zone as one of the NAT servers (use the network tag “no-ip”).
Connect to the test instance (you have to use a bastion instance or, in our case, you can jump from the NAT servers) and check the public IP:
In the output you can see that the public IP is one of the reserved public IPs of the NAT servers and that it changes between executions.
You can use the traceroute tool to see the route path:
The final verification is to open an SSH connection to a NAT server and kill the Python script to force the health check to fail. You will then see that the instance group removes the instance and creates a new one.
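To make it clearer why killing the script triggers autohealing, here is a minimal sketch of the kind of HTTP responder a health check can probe. It is not the script installed by startup.sh in the repository. The names `HealthHandler` and `make_server` and the port in the usage note are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answer every GET with 200 so the health check sees a healthy instance."""

    def do_GET(self):
        body = b"OK\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging; probes arrive every few seconds.
        pass


def make_server(port, host="0.0.0.0"):
    """Create (but do not start) the health-check HTTP server."""
    return HTTPServer((host, port), HealthHandler)
```

Calling, for example, `make_server(8080).serve_forever()` would start answering probes. Once the process is killed, the probes stop getting a 200 response, the health check marks the instance as unhealthy and the instance group recreates it.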
To end the test, we are going to remove all the resources with the following command:
gcloud deployment-manager deployments delete test-nat-ha
Have a nice cup of coffee…