AWS VPN High Availability

There are other approaches that use two servers per VPC (OpenVPN, Openswan/strongSwan, etc.) connecting to remote VPN servers. With those, high availability (HA) is implemented using a script that monitors and swaps routes in the route table. That might be cheaper to implement. You could also use Amazon’s Direct Connect. Inbound HA can be solved by using two or more servers, but Amazon does not provide an easy way to do outbound HA. The approach described here addresses outbound HA.

I used the AWS Virtual Private Gateway (VPG)/VPN functionality. Amazon charges approximately $0.05 per hour per VPN connection, so this is relatively cheap. Each Amazon VPN connection comes with two tunnels (you can use only one if you want), which Amazon uses to do HA on their side. But both tunnels have to terminate on the same customer machine, because both are tied to the same public IP.

Refer to the following diagram. For simplicity, I represented the two VPN tunnels with a single line. I have one public and one private network in each VPC; there could be more, depending on the number of availability zones etc. I also chose a hub-and-spoke model, where multiple VPCs connect to a central VPC. This might not be ideal, depending on how traffic flows. The arrows in the diagram indicate who initiates the VPN connection (not the direction in which traffic flows). Traffic always flows out through the Amazon VPNs.

Simple setup with 3 VPCs

I chose VyOS because it is free and easy to set up. I used version 1.1.3. If you have to use an older version, use 1.0.5 (which is available as an EC2 AMI). Version 1.1.0 (which is the latest AMI version) has a nasty bug (http://bugzilla.vyos.net/show_bug.cgi?id=358). See the instructions at the bottom on how to upgrade VyOS to 1.1.3.

To start with, set up two VyOS instances in the public subnet of each VPC. Each VyOS instance should have a public IP address. Pick the instance type based on your bandwidth requirements, since larger instances get more bandwidth. When I was testing this setup, I used t2.micro, but you might want something like m3.medium or better. VPG/VPN traffic is not throttled (according to Amazon). Replace the public IP addresses with Elastic IP (EIP) addresses: if a VyOS instance goes away for some reason, an EIP can be moved to its replacement, which makes the instance much less of a hassle to replace. Disable source/dest checks for all VyOS instances. You might also want to enable instance termination protection.

When creating VyOS instances, create a new security group and use it. Start with wide open permissions. Once the configuration is done, don’t forget to tighten the security group. Keep SSH (tcp/22) and NTP (udp/123) access. For BGP, IPsec etc., instead of opening individual ports, I allowed all traffic for the public IP addresses of the VPN tunnels. If you have network ACLs, don’t forget to tweak them.

Once the VyOS instances are up and the Elastic IP addresses are assigned, create the Virtual Private Gateway (VPG), Customer Gateway (CGW) and VPN connections. Attach the VPG to the VPC. I used dynamic routing for my purpose, with private BGP ASN 65000. The CGW addresses should be the VyOS EIPs, which are the other end of the tunnels. I used the above diagram as a reference to figure out which VPN connects to which VyOS EIP.

Select each VPN connection and download the configuration (Vendor: Vyatta). Vyatta configuration is compatible with VyOS. The downloaded configuration cannot be used as-is, though: the file contains duplicate information because it includes configuration for two tunnels.

Figure out the configuration that is common to all VyOS instances. Use a tool like Cluster SSH to apply it on all instances:
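As a rough sketch, the settings shared by every instance are usually the IKE and ESP groups and the IPsec interface binding. The group name AWS and the parameter values below are assumptions modeled on a typical Amazon-generated Vyatta file, not values taken from this setup; copy the real ones from the configuration you downloaded.

```
configure
# IKE (phase 1) settings, identical for every tunnel
set vpn ipsec ike-group AWS lifetime '28800'
set vpn ipsec ike-group AWS proposal 1 dh-group '2'
set vpn ipsec ike-group AWS proposal 1 encryption 'aes128'
set vpn ipsec ike-group AWS proposal 1 hash 'sha1'
set vpn ipsec ike-group AWS dead-peer-detection action 'restart'
set vpn ipsec ike-group AWS dead-peer-detection interval '15'
set vpn ipsec ike-group AWS dead-peer-detection timeout '30'
# ESP (phase 2) settings, identical for every tunnel
set vpn ipsec esp-group AWS compression 'disable'
set vpn ipsec esp-group AWS lifetime '3600'
set vpn ipsec esp-group AWS mode 'tunnel'
set vpn ipsec esp-group AWS pfs 'enable'
set vpn ipsec esp-group AWS proposal 1 encryption 'aes128'
set vpn ipsec esp-group AWS proposal 1 hash 'sha1'
# IPsec runs on the only interface the instance has
set vpn ipsec ipsec-interfaces interface 'eth0'
```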

When configuring the hub VyOS instances, you will need to repeat the following steps multiple times, depending on the number of spoke regions.

Next, configure the interfaces on each VyOS instance. It might be useful to update the description to something like ‘Oregon to Virginia Tunnel 1’.
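A minimal sketch of the two tunnel interfaces for one VPN connection might look like this. The interface names vti0/vti1, the MTU value and the angle-bracket placeholders are assumptions; the inside addresses come from the downloaded file.

```
# Tunnel 1: inside (169.254.x.x) address as given in the downloaded file
set interfaces vti vti0 address '<tunnel1-inside-cgw-ip>/30'
set interfaces vti vti0 description 'Oregon to Virginia Tunnel 1'
set interfaces vti vti0 mtu '1436'
# Tunnel 2
set interfaces vti vti1 address '<tunnel2-inside-cgw-ip>/30'
set interfaces vti vti1 description 'Oregon to Virginia Tunnel 2'
set interfaces vti vti1 mtu '1436'
```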

In the site-to-site section, local-address will be set to the EIP. VyOS will not like that, because it does not know anything about the EIP (the instance itself only has its private address). Change it to the local eth0 address and apply the site-to-site configuration:
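For one tunnel, the site-to-site section could be sketched as follows, assuming an IKE/ESP group named AWS and a tunnel interface named vti0 (both hypothetical names). The peer address and pre-shared secret are placeholders for the values in the downloaded file. Repeat the block for the second tunnel with its own peer address, secret and vti interface.

```
set vpn ipsec site-to-site peer <tunnel1-aws-public-ip> authentication mode 'pre-shared-secret'
set vpn ipsec site-to-site peer <tunnel1-aws-public-ip> authentication pre-shared-secret '<tunnel1-psk>'
set vpn ipsec site-to-site peer <tunnel1-aws-public-ip> description 'Oregon to Virginia Tunnel 1'
set vpn ipsec site-to-site peer <tunnel1-aws-public-ip> ike-group 'AWS'
# Private eth0 address of this instance, not the EIP
set vpn ipsec site-to-site peer <tunnel1-aws-public-ip> local-address '<eth0-private-ip>'
set vpn ipsec site-to-site peer <tunnel1-aws-public-ip> vti bind 'vti0'
set vpn ipsec site-to-site peer <tunnel1-aws-public-ip> vti esp-group 'AWS'
```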

Next configure the BGP protocol:
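A sketch of the BGP neighbor definitions for one VPN connection, using the local ASN 65000 mentioned earlier. The neighbor addresses are the Amazon-side inside tunnel addresses and the remote ASN is Amazon's; both are placeholders here, and the timer values are typical ones that should be checked against the downloaded file.

```
# Neighbor reached through tunnel 1
set protocols bgp 65000 neighbor <tunnel1-inside-aws-ip> remote-as '<amazon-asn>'
set protocols bgp 65000 neighbor <tunnel1-inside-aws-ip> soft-reconfiguration 'inbound'
set protocols bgp 65000 neighbor <tunnel1-inside-aws-ip> timers holdtime '30'
set protocols bgp 65000 neighbor <tunnel1-inside-aws-ip> timers keepalive '10'
# Neighbor reached through tunnel 2
set protocols bgp 65000 neighbor <tunnel2-inside-aws-ip> remote-as '<amazon-asn>'
set protocols bgp 65000 neighbor <tunnel2-inside-aws-ip> soft-reconfiguration 'inbound'
set protocols bgp 65000 neighbor <tunnel2-inside-aws-ip> timers holdtime '30'
set protocols bgp 65000 neighbor <tunnel2-inside-aws-ip> timers keepalive '10'
```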

In my setup, I also changed the NTP servers and the hostname:
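Something along these lines would do it. The hostname and server names are placeholders, since the actual values are not given here.

```
set system host-name '<hostname>'
# Drop the default NTP servers, then add the ones you want
delete system ntp server
set system ntp server '<ntp-server-1>'
set system ntp server '<ntp-server-2>'
```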

Finally, configure the routes/networks BGP will advertise to the other end. The spoke VyOS instances just need to advertise their VPC subnet. For BGP to advertise a route, the route must be in the routing table. Amazon instances only get a route for their own subnet and not for the entire VPC: if you check the output of show ip route, you will see a route for the VyOS subnet. So add a static route for the entire VPC. The following example assumes you have a 10.X.0.0/16 VPC:
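A one-line sketch of such a static route. The next hop is an assumption: it is taken to be the VPC router for the VyOS subnet, which AWS places at the first address after the subnet's network address.

```
# <vyos-subnet-gateway-ip> is the VPC router address of the VyOS subnet (subnet base + 1)
set protocols static route 10.X.0.0/16 next-hop '<vyos-subnet-gateway-ip>'
```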

Tell BGP about this subnet so that it starts advertising:
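With the local ASN 65000 used in this setup, the network statement would look roughly like this:

```
set protocols bgp 65000 network 10.X.0.0/16
```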

In my setup, I also used inbound and outbound policies to control which BGP routes are advertised and received. For the spoke VyOS instances, use something similar to the following, where 10.X.0.0/16 is the local spoke VPC subnet and 10.Y.0.0/16 is the hub VPC subnet.
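One way to express this is with prefix lists applied to each BGP neighbor. This is a sketch only: the list names SPOKE-OUT and SPOKE-IN are hypothetical, and the original policies may have been built differently (for example with route maps).

```
# Advertise only the local spoke VPC
set policy prefix-list SPOKE-OUT rule 10 action 'permit'
set policy prefix-list SPOKE-OUT rule 10 prefix '10.X.0.0/16'
# Accept only the hub VPC
set policy prefix-list SPOKE-IN rule 10 action 'permit'
set policy prefix-list SPOKE-IN rule 10 prefix '10.Y.0.0/16'
# Apply to both tunnel neighbors
set protocols bgp 65000 neighbor <tunnel1-inside-aws-ip> prefix-list export 'SPOKE-OUT'
set protocols bgp 65000 neighbor <tunnel1-inside-aws-ip> prefix-list import 'SPOKE-IN'
set protocols bgp 65000 neighbor <tunnel2-inside-aws-ip> prefix-list export 'SPOKE-OUT'
set protocols bgp 65000 neighbor <tunnel2-inside-aws-ip> prefix-list import 'SPOKE-IN'
```

Anything not matched by a permit rule is dropped by the prefix list's implicit deny.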

That is it for the spoke VyOS instances. Commit the changes and back up the configuration. Keep a copy of the configuration somewhere safe (not on the VyOS instances).
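In configuration mode, that sequence could look like the sketch below. The remote copy is an assumption about how the backup is taken; the user, password, host and path are placeholders.

```
commit
save
# Optional: also write a copy to another machine
save scp://<user>:<password>@<backup-host>/<path>/config.boot
exit
```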

From the backed up configuration file, it is better to remove sections that are specific to the VyOS instance. This way, the configuration can be merged easily when instances need to be replaced later.

You can refer to the VyOS documentation wiki, but here are some commands I found useful:
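As an illustration (not necessarily the same list), these standard VyOS operational-mode commands cover the tunnel, routing and BGP state touched by this setup. The neighbor address is a placeholder, and received-routes relies on soft-reconfiguration inbound being enabled for that neighbor.

```
# IPsec and IKE security associations
show vpn ipsec sa
show vpn ike sa
# Interfaces (including the vti tunnels) and the routing table
show interfaces
show ip route
# BGP session state and per-neighbor routes
show ip bgp summary
show ip bgp neighbors <tunnel1-inside-aws-ip> advertised-routes
show ip bgp neighbors <tunnel1-inside-aws-ip> received-routes
```

From configuration mode, prefix each command with run.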

Once the spoke VyOS instances are up and running, the VPC VPN connections in the hub region should light up. You should see that the tunnels are up and that 1 BGP route is received.

There are a couple of ways to configure the hub VyOS instances. You will have to apply all the spoke VPC configurations on the hub VyOS instances. The only differences are in the policy and route sections. You can redistribute the individual BGP routes to all spoke regions, but I chose to distribute one Class A subnet (10.0.0.0/8) instead:
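A sketch of the hub-side route and network statement, assuming every VPC sits inside 10.0.0.0/8 and that the next hop is the VPC router of the hub VyOS subnet (a placeholder here):

```
# BGP only advertises routes that exist in the routing table
set protocols static route 10.0.0.0/8 next-hop '<vyos-subnet-gateway-ip>'
set protocols bgp 65000 network 10.0.0.0/8
```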

As for policies, the OUT policy will be similar to the one above, except that it uses the Class A subnet.
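Written as a prefix list (one possible form; the name HUB-OUT is hypothetical), the hub OUT policy could be:

```
set policy prefix-list HUB-OUT rule 10 action 'permit'
set policy prefix-list HUB-OUT rule 10 prefix '10.0.0.0/8'
```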

For IN policies, each spoke region will be advertising a different Class B subnet, so create an IN policy for each spoke region:
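For two spokes this might look like the following sketch. The list names are hypothetical and the prefixes are placeholders for each spoke's /16 VPC range.

```
# Spoke 1
set policy prefix-list SPOKE1-IN rule 10 action 'permit'
set policy prefix-list SPOKE1-IN rule 10 prefix '<spoke1-vpc-cidr>'
# Spoke 2
set policy prefix-list SPOKE2-IN rule 10 action 'permit'
set policy prefix-list SPOKE2-IN rule 10 prefix '<spoke2-vpc-cidr>'
```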

Identify the pair of BGP neighbors for each spoke region (for the IN policy) and apply the inbound and outbound policies:
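A sketch for one spoke, assuming the policies are prefix lists. SPOKE1-IN and HUB-OUT are hypothetical names standing in for whatever you called your lists, and the neighbor addresses are the inside tunnel addresses of that spoke's VPN connection.

```
# Both tunnels of the spoke 1 VPN connection get the same pair of policies
set protocols bgp 65000 neighbor <spoke1-tunnel1-inside-aws-ip> prefix-list import 'SPOKE1-IN'
set protocols bgp 65000 neighbor <spoke1-tunnel1-inside-aws-ip> prefix-list export 'HUB-OUT'
set protocols bgp 65000 neighbor <spoke1-tunnel2-inside-aws-ip> prefix-list import 'SPOKE1-IN'
set protocols bgp 65000 neighbor <spoke1-tunnel2-inside-aws-ip> prefix-list export 'HUB-OUT'
```

Repeat for each remaining spoke with its own IN list.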

Commit, save, back up and exit.

At this point, all VPN tunnels in all VPCs should be green, and each should be receiving exactly 1 route. Modify all the VPC route tables and enable route propagation. After that, all instances should be able to reach each other irrespective of which VPC they are in.

If it is necessary to replace a VyOS instance:
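The exact steps are not given here. One plausible sequence, offered as an assumption, is to launch the new instance in the same subnet, disable source/dest checks, move the old instance's EIP to it, copy the backed-up configuration onto it, and merge that file into the running configuration. The file path below is hypothetical.

```
configure
# Merge the saved configuration into the running one
merge /home/vyos/config.boot.backup
# Check instance-specific values (e.g. local-address must match the new eth0 address)
commit
save
exit
```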

To upgrade a VyOS instance, SSH to the instance and run the command:
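On VyOS, an image upgrade is normally done from operational mode with add system image. The image location below is a placeholder, since the URL used for 1.1.3 is not given here.

```
# Download and install the new image alongside the running one
add system image <url-or-path-to-vyos-1.1.3-iso>
# Confirm the new image is listed as the default boot image
show system image
# Boot into it
reboot
```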

There are 4 tunnels from each spoke VPC to the hub and vice versa. If one VyOS box dies, traffic will start flowing through the other one. To test this, start a ping from an instance in the spoke1 VPC to an instance in the spoke2 VPC. While it is running, reboot the VyOS1 instance in the hub region. You should see minimal disruption. Once the VyOS1 box comes back up, reboot VyOS2 in the hub VPC; traffic should fail over appropriately.