Advanced OpenVPN routing with AWS

John Mancuso
RapidSOS Engineering
Nov 9, 2020 · 10 min read

How we utilized AWS Transit Gateway to implement routing features not supported by standard VPC Peering.

At RapidSOS, we implement “defense in depth” as a core information security principle. This means that securing critical assets requires implementing multiple layers of security rather than simply relying on a single strategy to prevent unauthorized access. One important layer in our stack is OpenVPN, which allows us to connect securely to our internal network inside our AWS VPC.

Figure 1. OpenVPN connection

With the default configuration of OpenVPN, when your endpoint (i.e., your laptop) connects to an asset (e.g., an EC2 instance or database), the source IP is the internal IP address of the OpenVPN server. This is Network Address Translation (NAT), and it behaves somewhat like a proxy server. For instance, in your EC2 logs you might see “10.1.1.1” (the internal IP of the OpenVPN server), which tells you nothing about the identity of the user accessing the resource. This loss of accountability is highly problematic, especially in security- and compliance-focused organizations such as our own.

To determine a user’s identity, we need the client endpoint’s source IP preserved in the logs, which means reaching the destination subnet directly instead of using the default NAT setting. OpenVPN refers to this as “routing mode”.
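The difference between the two modes can be sketched with a toy model (this is not OpenVPN code; the addresses come from the example above):

```python
VPN_SERVER_IP = "10.1.1.1"  # internal IP of the OpenVPN server, per the example above

def source_ip_seen_by_destination(client_ip: str, nat_enabled: bool) -> str:
    """In NAT mode the server rewrites the source to its own internal IP;
    in routing mode the client's VPN-assigned IP is preserved."""
    return VPN_SERVER_IP if nat_enabled else client_ip

print(source_ip_seen_by_destination("172.27.244.10", nat_enabled=True))   # 10.1.1.1
print(source_ip_seen_by_destination("172.27.244.10", nat_enabled=False))  # 172.27.244.10
```

In routing mode the destination logs the 172.x client address, which maps back to an individual user.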

Let’s have a high level look at the key components of our setup:

  1. Multiple AWS accounts: At RapidSOS, we follow AWS best practices and segregate our production and non-production workloads so blast radius and security perimeters are well contained. The VPN account sits at the center forming a “hub and spoke” type network.
  2. Redundant (Active/Active) OpenVPN Servers: In the VPN account, OpenVPN Servers are running in us-east-1 and us-west-2 AWS regions with a pair of weighted Route53 DNS records pointing to both servers. A health check can also be used to direct traffic exclusively to one or the other in case of failure.
  3. Laptop running OpenVPN client: Once the OpenVPN servers are running and you have a valid set of credentials, you need to install an OpenVPN client configuration profile and the OpenVPN Connect client on a laptop.
  4. A destination resource you are trying to connect to: The “production” account holds some resources that we are trying to hit over the VPN — e.g. EC2, DB instances, etc.
  5. A method of connecting the two AWS accounts / VPCs: Let’s focus on this last item. We have two AWS accounts, each with two VPCs, that need to fully communicate with each other (as shown in the diagram below).
Figure 2. Desired network topology connecting VPN to production.

The problem with VPC Peering

Before we dive into the multi-account scenario above, let’s start with a simple example where both resources live in the same account across two VPCs. What would VPC peering look like in this scenario?

Figure 3. Traffic is routed successfully to the destination but fails on the return path.

Our OpenVPN server is running in the VPN account, and to begin we are routing between two VPCs in the same account. Connectivity might look perfect when using NAT, but what happens when you switch OpenVPN to “routing mode”?

Now, when trying to reach your destination resources in production, the connection times out.

Why? Here’s what’s happening at a network level:

  1. Your endpoint (i.e., your laptop) initializes a “utun” network interface with a 172.x IP address allocated from the DHCP server running on the OpenVPN server.
  2. Your local route table (on macOS: “netstat -rn | grep utun”) directs traffic over this network interface to the VPN server in the VPC1 public subnet. Specifically, traffic hits the public network interface on the EC2 instance running OpenVPN.
  3. OpenVPN running in VPC1 sends traffic to VPC2 (10.2.x.x). Packets are delivered successfully because the route table in VPC1 sends the traffic over the VPC peering connection to the destination server.
  4. Now the destination server sends a response back to the 172.x.x.x IP address of the VPN client. The route table in VPC2 tells it to send that traffic across the peering connection to VPC1.
  5. If VPC2 does not have a route table entry for 172.x.x.x, the traffic is dropped right there. Assuming you added a route table entry for 172.x.x.x pointing at the peering connection, the traffic continues on.
  6. Here’s where it falls apart: VPC1 has just received traffic with a 172.x.x.x destination from the peering connection, but since that address is not within the VPC1 CIDR, the VPC won’t route it anywhere and the traffic is dropped.
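The drop in the last step can be sketched with Python’s `ipaddress` module (the VPC1 CIDR of 10.1.0.0/16 is an assumption based on the 10.1.x.x addresses above):

```python
import ipaddress

def vpc_forwards(vpc_cidr: str, dest_ip: str) -> bool:
    """A VPC only delivers traffic arriving over a peering connection if the
    destination falls inside the VPC's own CIDR; anything else is dropped."""
    return ipaddress.ip_address(dest_ip) in ipaddress.ip_network(vpc_cidr)

# Return traffic from VPC2 arrives in VPC1 addressed to the VPN client:
print(vpc_forwards("10.1.0.0/16", "10.1.1.1"))       # True: the OpenVPN server itself
print(vpc_forwards("10.1.0.0/16", "172.27.244.10"))  # False: dropped, as in step 6
```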

In other words, VPC2 is trying to send traffic to the OpenVPN server in VPC1, but the IP address it’s sending to is actually your laptop’s.

So why did this work before in NAT mode? Because the actual source of the traffic was the 10.1.x.x IP address of the OpenVPN server, which is within the routable IP range of VPC1.

This simple test shows OpenVPN routing over VPC peering is not a workable solution for our use case. And we haven’t even considered inter-region or cross-account peering yet.

AWS Transit Gateway to the rescue

A transit gateway (referred to as TGW) is a newer AWS offering that acts as a regional virtual router for traffic flowing between your VPCs and various networking connections (e.g., VPC, VPN, or Direct Connect). It’s really meant to consolidate all of your networking connections into a single hub, but the side benefit here is that routing to another VPC through a TGW operates at layer 3 of the OSI stack. This means we can use static routing rules to direct traffic entering a VPC even if that traffic has a destination IP outside of the VPC’s CIDR, instead of dropping that traffic like a VPC peering connection would.

What happens at the network level when we use a TGW?

  1. (Same) Your endpoint (i.e., your laptop) initializes a “utun” network interface with a 172.x IP address allocated from the DHCP server running on the OpenVPN server.
  2. (Same) Your local route table (on macOS: “netstat -rn | grep utun”) directs traffic over this network interface to the VPN server in the VPC1 public subnet. Specifically, traffic hits the public network interface on the OpenVPN server.
  3. OpenVPN running in VPC1 sends traffic to VPC2 (10.2.x.x), now using the TGW attachment, and reaches the destination server.
  4. Now the destination server sends a response back to the 172.x.x.x IP address of the VPN client. If VPC2 does not have a route table entry for 172.x.x.x, the traffic is dropped. Assuming you added a route table entry for 172.x.x.x pointing at the TGW attachment, the traffic continues on.
  5. The TGW attachment in VPC2 routes the traffic across the TGW per the TGW route table, so the attachment in VPC2 forwards it to the attachment in VPC1.
  6. Here’s where the magic happens: the traffic arrives in VPC1 on the TGW attachment with a 172.x.x.x destination. In the VPC peering scenario this is where the packets are dropped, but because the TGW attachment lives inside a subnet in VPC1, it can reference that subnet’s route table to route the traffic to the next hop. In this case, the next hop is the Elastic Network Interface (ENI) of the VPN server.
Figure 4. The route table entry in VPC1 which points the VPN IP range directly to the OpenVPN ENI.

Let’s start with a simple example of TGW in a single account, single region. (We will build on this example in the next section).

We have an OpenVPN EC2 instance running in VPC1 and our resources (EC2/Database instance) in VPC2. A TGW lies between the two VPCs and all traffic is sent back and forth across TGW attachments.

For setting up a TGW, please refer to the AWS documentation on how to create and modify one: https://docs.aws.amazon.com/vpc/latest/tgw/tgw-transit-gateways.html

Figure 5. Simple VPC connection using Transit Gateway.

Real world example

Now let’s move on to a more real-world scenario where we have a VPN in two regions for high availability and separate AWS accounts. Both east and west VPNs need to hit east and west production: basically a mesh (see image below).

A TGW exists within a single region but supports inter-region peering. This is why we need to create another TGW in the us-west-2 region in the VPN account. The plan is to share these TGWs with the production account to enable cross-account traffic across shared TGW attachments. We also need to peer the two TGWs to route traffic between them, enabling cross-region traffic for high availability across the TGW peering attachment. The CIDRs used by the OpenVPN servers in the two regions must also differ, because TGW doesn’t support routing between Amazon VPCs with overlapping CIDRs.
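The non-overlap requirement is easy to verify up front. The two client CIDRs below are hypothetical examples, not our actual allocations:

```python
import ipaddress

# Hypothetical per-region VPN client ranges; only the non-overlap matters.
east_clients = ipaddress.ip_network("10.30.0.0/20")
west_clients = ipaddress.ip_network("10.30.16.0/20")

# TGW won't route between VPCs with overlapping CIDRs, so check before peering:
print(east_clients.overlaps(west_clients))  # False: safe to peer
```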

The TGW in the VPN account is shared with the production account using AWS Resource Access Manager (RAM).

Figure 6. One TGW per-region, shared with other accounts/VPCs

Deep diving into the routes tables and TGW attachments

The diagram below shows the CIDRs of the VPN account, where the OpenVPN servers are running, and of the production account holding the running resources. Compared with the single-region scenario, we have moved VPC2 to a new account, i.e., the production account.

Figure 7. The IP CIDRs for each VPC.

The need for Transit Gateway route tables

Each subnet has an associated route table which controls how traffic is routed within the VPC. This is standard AWS VPC routing. Let’s take a look at an example VPC routing table from the VPN Account in us-east-1.

Figure 8. Example route table for us-east-1.
  • The first rule says: send this traffic to the local subnet.
  • 0.0.0.0/0 is internet traffic and goes out through our Internet Gateway (IGW).
  • The other ranges are sent to the Transit Gateway.
  • The last rule says: send the VPN DHCP range to the OpenVPN network interface.

Whereas a VPC route table is associated with a single VPC, a TGW route table is associated with one or more TGW attachments and forwards packets between those attachments. In a TGW route table the target for any route is a TGW attachment. To route traffic to a peered TGW in another region we need to create a TGW peering attachment and use that attachment as the target of a static route in the TGW route table.

This is how we want to set up our TGW route table in us-east-1:

Fig. 9. Destination CIDRs and their routing targets for a TGW route table.

An example will help us better understand how traffic flows over a TGW. Let’s go back to the TGW route table for us-east-1, and follow the journey of a packet with a destination in the 10.160.0.0/18 range:

The VPN server wants to send traffic to 10.160.0.0/18

  1. The VPC route table says that for a destination in 10.160.0.0/18, send the traffic to the TGW attachment.
  2. The TGW attachment sends the traffic to the east TGW.
  3. The TGW route table for the us-east-1 TGW says that for destinations in 10.160.0.0/18, send the traffic across the TGW peering connection to the TGW in us-west-2.
  4. The TGW route table for the us-west-2 TGW says that for destinations in 10.160.0.0/18, send the traffic to the TGW attachment in the production account in us-west-2.
  5. The TGW attachment in prod us-west-2 is in the same VPC as the destination server, and the VPC routes the traffic to its destination.

The return traffic follows each step in reverse until it gets back to the OpenVPN server.

Figure 10. Step by step traffic flow via TGW.

Security groups

Now that we are routing traffic from VPN clients, any AWS resource that needs to be available to users on the VPN needs new security group rules to allow traffic from the 172.27.x.x addresses. This can be a large number of changes, depending on what resources and security groups exist in your AWS account.

For PoC purposes, in our test production account, our security groups were all created with rules allowing traffic from 10.0.0.0/8 addresses to allow for the unrestricted flow of private network traffic within some of our VPCs. (In a more secure, real-world scenario, you would probably not want to open anything wider than a 10.x.0.0/16.) We carved out 10.30.0.0/20 to use for our VPN client addresses. Using the 10.30.0.0/20 range instead of the default 172.27.244.0/20 saved us the step of adding a ton of security group rules.

Figure 11a. Default VPN settings
Figure 11b. Updated VPN settings

Conclusion

When attempting to route traffic between VPCs in AWS where the source IP does not exist within the VPCs’ CIDRs, you will need a strategy that supports true “layer 3” routing. Transit Gateway proved a viable solution: it provides “local” access by placing a network attachment directly inside each VPC, which was a key differentiator from a traditional VPC peering strategy. This has implications not just for OpenVPN specifically; it can be a valuable tool for a wide range of proxying and IP forwarding strategies.

Additional Reading

  1. Reach OpenVPN clients directly from a private network
  2. Installation Guide for OpenVPN Connect Client on macOS
  3. Installation guide for OpenVPN Connect Client on Windows
  4. Sharing Your Resources — AWS Resource Access Manager
  5. Transit gateway attachments to a VPC — Amazon Virtual Private Cloud
  6. Transit gateway peering attachments — Amazon Virtual Private Cloud

Special Thanks
