Build a cheaper, more flexible VPN solution on AWS with our open-source OpenVPN Certificate Authority
Today we’re open-sourcing our in-house OpenVPN Certificate Authority and management platform. Built on AWS with serverless technologies, it has proven to be a reliable, easy-to-use, and secure platform. We hope this open-source release will be useful to others as well.
In this blog post I’ll explain the issues we ran into using a barebones OpenVPN setup, our rationale for building a custom CA, and how it works internally. This is a fairly comprehensive overview of our solution. If you’d like to skip ahead, the contents break down as follows:
- a. The Certificate Authority
- b. Key Generation, Storage, & Rotation
- c. Certificate Storage
- d. The Certificate Authority API
- e. User Directory
- f. User Authentication & Permissions
- g. Frontend Web Application
- h. The OpenVPN Server Instances
- i. The OpenVPN Helper application
It is rather common for companies to have a Site-to-Site VPN or Direct Connect from their on-premises infrastructure to AWS. Employees can also use these links to connect directly to AWS instances while on the corporate network. However, this setup adds an obstacle for remote employees: to reach AWS, they first need to get into the corporate network through another VPN.
At EmpathyBroker we have fully embraced the Cloud and no longer have any on-premises servers. Therefore, we decided to take a different approach to connecting to AWS. We got rid of our Site-to-Site VPN and set up a client VPN instead.
This greatly simplified our network design in terms of AWS, our offices, and the access controls. Most importantly, it removed the distinction between on-site and remote employees. Now, all employees are considered “remote”, even if they work at the office.
Employees no longer have to think about whether they are on the corporate network or not. It also makes it easier for administrators to give them access as there’s only a single centralized service they have to worry about.
2. OpenVPN as a Client VPN
OpenVPN is a great tool. It is based on a strong, proven cryptographic protocol (TLS, the same technology that secures the web), it is multi-platform, it makes onboarding easy for users, and it allows for granular access controls. It was a great choice at the time and has served us well over the years.
As the company grew in size, however, the maintenance of the OpenVPN servers and the Certificate Authority became an issue. More and more time was spent generating, signing and revoking certificates, not to mention updating the servers. This added upkeep can lead to mistakes and less-than-secure choices (generating private keys on behalf of users, long certificate expirations, etc.).
We knew we could do better. We looked for alternatives but couldn’t find anything that fit all of our needs, so we decided to improve the weakest link in the chain: the certificate signing process. If we could automate this in a secure way, we could greatly improve both security and ease-of-use, while decreasing Ops workload.
3. Design Decisions
So, what did we have in mind while we were building our own OpenVPN Certificate Authority? The following were the most important factors that influenced our design decisions. The solution had to…
- Be secure by design, following industry best practices.
- Be self-service. Employees should be able to use the system by themselves.
- Have centralized permission management, using the existing company user directory.
- Be auditable. All actions and events must be logged for further analysis.
- Have automation for all processes. Once set up, the system should require minimal maintenance from the Ops team.
- Be cheap. While cost was not the main factor, we kept price in mind while designing and implementing the system.
4. How Does Our Solution Work?
In essence, it works like a traditional OpenVPN setup with TLS certificates. But the heavy lifting of managing the private CA and signing both client and server certificates is handled by a serverless application, making the setup completely automated from an Operations point of view.
Users authenticate on a web application using their corporate credentials. They are presented with their active certificates as well as options to securely request new certificates and revoke active ones. Upon requesting a new certificate, the system automatically signs a new one and returns it to the web application. The application assembles the final configuration profile by attaching the private key, and downloads it for the user.
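To make the final step concrete, here is a minimal sketch of how a configuration profile with inline credentials could be assembled. It is written in Python for illustration only (the real application does this in the browser), and the function name, the directive choices, and the host and port in the usage note are assumptions, not the project's actual output. The OpenVPN static key is left out because the post does not say how it is embedded.

```python
def build_client_profile(remote_host: str, remote_port: int,
                         ca_pem: str, cert_pem: str, key_pem: str) -> str:
    """Return an OpenVPN client profile with the credentials inlined.

    Hypothetical helper: the directive set below is a plausible minimum,
    not the profile the project really generates.
    """
    lines = [
        "client",
        "dev tun",
        "proto udp",
        f"remote {remote_host} {remote_port}",
        "remote-cert-tls server",   # only accept certificates marked as server certs
        "<ca>", ca_pem.strip(), "</ca>",
        "<cert>", cert_pem.strip(), "</cert>",
        # The private key is attached locally and never sent to the CA.
        "<key>", key_pem.strip(), "</key>",
    ]
    return "\n".join(lines) + "\n"
```

For example, `build_client_profile("vpn.example.com", 1194, ca, cert, key)` would return text that can be saved as a `.ovpn` file, where `ca`, `cert` and `key` are PEM strings.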
The OpenVPN Server instances request a new Server certificate for themselves on boot and renew it periodically. They also verify the certificate of each connecting client with the CA, and obtain the user’s permissions in order to set up the VPN routes.
The system was designed to be secure and fully auditable, with every action generating an event that can be logged and acted upon. It also automatically takes care of housekeeping, including rotating the private keys of the CA periodically and removing old certificates as they expire.
5. Architecture Overview
Below is a bird’s-eye view of the system architecture. The following sections explain each of the components in more detail.
a. The Certificate Authority
The Certificate Authority is the “brains” of the system. For security reasons, it lives on an isolated AWS account. It consists of:
- Secrets Manager Secret containing the Private Key for the CA along with the CA Root Certificate and OpenVPN Static Key.
- Secrets Manager Secret containing the Google Service Account Key used for querying the Directory (this is the Google Cloud equivalent to an Access Key ID + Secret Key pair for an AWS IAM User)
- DynamoDB table that stores all certificates that have been signed by the CA and are still valid.
- Lambdas for the Client and Server APIs, Authentication, key rotation, event processing, etc.
- API Gateway to tie all the Lambdas together on a public API.
- S3 Bucket with the Web Frontend release code.
- CloudFront distribution with both the S3 Bucket and API Gateway as Origins. It acts as the main entry point to the system.
- SNS topic that receives all audit events of the system. It has Lambdas subscribed that process all the events, log them, and alert on suspicious activity.
b. Key Generation, Storage, & Rotation
The project supports two types of private keys: EC keys using the P-256 curve, and 2048-bit RSA keys. The 2048-bit size for RSA keys was chosen due to storage limitations of AWS Secrets Manager, which has a limit of 7 KB per secret value. Originally only EC keys were generated, but those are not supported on the mobile versions of OpenVPN, so the current default is to generate RSA keys instead.
Private keys for the CA are encoded in RFC 7517 format (JWK) and stored along with a self-signed CA root certificate, a 2048-bit OpenVPN static key and, in case the CA key has been rotated, the previous CA root certificate and a cross-signed certificate. A complete CA secret payload in JSON format, including the Private Key, the OpenVPN static key and 3 certificates, has a size of about 5.4 KB.
A Lambda function enables automatic periodic rotation of the private CA key. The Lambda fetches the current CA key, generates a new one, and cross-signs the new CA certificate with the previous private key. This allows both Clients and Servers to validate certificates signed with either the previous or the new CA Private Key.
c. Certificate Storage
All certificates signed by the CA are stored in a DynamoDB table. Each item holds the DER-encoded content of the certificate as a binary payload as well as metadata, including Serial Number, Key ID, Subject, Validity Period and Revocation Time. Private Keys are never stored, because the CA has no knowledge of them when signing a certificate.
DynamoDB will automatically expire certificates using the ValidUntil field as a TTL. It is also configured to report all changes to a Lambda function using DynamoDB Streams: all additions, updates and deletions will generate an event that can be audited.
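As an illustration of that last point, a stream-processing function could translate each DynamoDB Streams record into an audit event roughly like this. This is a sketch under assumptions: the action names and the `SerialNumber` key attribute are hypothetical, and the real Lambdas may work differently. It relies on two documented behaviours of DynamoDB Streams: records carry an `eventName` of `INSERT`, `MODIFY` or `REMOVE`, and deletions performed by TTL are attributed to the `dynamodb.amazonaws.com` service principal.

```python
from typing import Any, Dict, List

# Hypothetical audit action names; the project's real event names are not given in this post.
ACTIONS = {"INSERT": "certificate_added", "MODIFY": "certificate_updated"}


def stream_records_to_audit_events(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Map a DynamoDB Streams Lambda event to a list of audit events."""
    audit_events = []
    for record in event.get("Records", []):
        name = record.get("eventName")
        if name == "REMOVE":
            # TTL deletions are performed by the DynamoDB service itself.
            identity = record.get("userIdentity") or {}
            expired = identity.get("principalId") == "dynamodb.amazonaws.com"
            action = "certificate_expired" if expired else "certificate_deleted"
        elif name in ACTIONS:
            action = ACTIONS[name]
        else:
            continue  # ignore record types this sketch does not know about
        keys = record.get("dynamodb", {}).get("Keys", {})
        audit_events.append({"action": action, "keys": keys})
    return audit_events
```

Separating "expired" from "deleted" keeps routine housekeeping distinguishable from manual removals in the audit trail.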
The DynamoDB table is used for the Web Frontend, to keep track of Revoked certs, and as an audit log. For a certificate to be valid it has to be signed by the CA, exist on the DynamoDB table, and not be flagged as Revoked. This gives an assurance against mis-issued certificates similar to that of Certificate Transparency Logs for web PKI certificates, because even if a rogue certificate is signed with the CA key it will not be valid unless it is also added to DynamoDB.
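The three conditions can be expressed as a small predicate. The sketch below uses an in-memory dictionary in place of the DynamoDB table and takes the result of the signature check as an input; the type and field names are assumptions for illustration. It also compares the expiry time explicitly, since DynamoDB TTL deletion is not immediate and an expired item may linger in the table for a while.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StoredCertificate:
    """Hypothetical shape of a certificate record (field names are assumptions)."""
    serial_number: str
    valid_until: int                  # expiry as epoch seconds
    revoked_at: Optional[int] = None  # None while the certificate is not revoked


def is_certificate_accepted(serial_number: str, signature_ok: bool,
                            table: Dict[str, StoredCertificate], now: int) -> bool:
    """Accept only certificates that are signed, recorded, unrevoked and unexpired."""
    if not signature_ok:
        return False                  # not signed by the CA
    record = table.get(serial_number)
    if record is None:
        return False                  # signed, but never recorded: treat as rogue
    if record.revoked_at is not None:
        return False                  # explicitly revoked
    return now < record.valid_until   # guard against items TTL has not yet removed
```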
d. The Certificate Authority API
The only way to interact with the Certificate Authority is through an HTTP API provided by a set of Lambdas behind API Gateway. This API is divided into two parts: the Client API and the Server API.
The Client API is used by the Web Frontend. It allows end users to request new certificates and to query and revoke certificates they have been issued. It also allows Administrators to list all issued certificates as well as to revoke any of them. The Client API uses a Lambda Authorizer to verify a JWT token issued by Google that identifies the user making the request, and fetches extra user information and permissions from the company’s Google Directory.
The Server API is used by the OpenVPN Server endpoints. It signs server certificates, generates the OpenVPN configuration file for the server, receives user connection and disconnection events, validates client certificates, and pushes extra OpenVPN configuration to connecting clients. It uses IAM authentication to limit access to the Instance Role that the OpenVPN instances, which run on a different AWS account, operate with.
e. User Directory
One of the main design decisions taken when developing the VPN system was to use the company’s already implemented User Directory. It should be possible for administrators to centrally manage all users and their VPN permissions from a single place.
This project uses Google’s Directory system. It was developed with G Suite in mind, but it has also been tested to work with Google’s Cloud Identity product, including the free edition.
f. User Authentication & Permissions
Users are authenticated with Google Sign-In using the OpenID Connect protocol. This method is shared by all internal and external applications used by our employees. If users are already signed into their corporate email, the sign-in process is seamless thanks to SSO. Otherwise, they follow the same Sign-In process they’re already used to. Google Sign-In also provides many security benefits, like enforced 2FA, login challenges in case of suspicious activity, full audit logs, etc.
Google Sign-In provides a signed JWT, which is then passed to the APIs on every request as an HTTP header. The API validates this token and fetches user and permission information from the company’s user Directory for further verification and authorization.
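A rough sketch of the claim checks involved in validating such a token follows. It deliberately omits signature verification, which is essential in practice and should be done with a proper JWT library against Google's published keys. The function names are assumptions, and the real Lambda Authorizer may check more or different claims. The claims used here (`iss`, `aud`, `exp`, and `hd` for the hosted domain) are standard in Google ID tokens.

```python
import base64
import json
from typing import Any, Dict

GOOGLE_ISSUERS = {"accounts.google.com", "https://accounts.google.com"}


def decode_jwt_payload(token: str) -> Dict[str, Any]:
    """Decode the payload of a compact JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a compact JWT")
    payload_b64 = parts[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def claims_acceptable(claims: Dict[str, Any], client_id: str,
                      domain: str, now: float) -> bool:
    """Check issuer, audience, hosted domain and expiry of already-verified claims."""
    exp = claims.get("exp")
    return (
        claims.get("iss") in GOOGLE_ISSUERS
        and claims.get("aud") == client_id   # token was issued for this application
        and claims.get("hd") == domain       # user belongs to the company domain
        and isinstance(exp, (int, float))
        and now < exp                        # token has not expired
    )
```

Checking `aud` and `hd` matters as much as the signature: a perfectly valid Google token issued to another application or another organisation must still be rejected.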
Administrators can centrally manage permissions for users through Google Admin. Users can be limited to specific subnets, and users with sensitive access rights can additionally be restricted to connecting from specific MAC addresses.
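As a sketch of how such permissions might translate into OpenVPN configuration, the snippet below turns a list of allowed IPv4 CIDRs into `push "route ..."` directives and checks a reported MAC address against an allow-list. The function names and the way permissions are represented are assumptions, and how the server learns the client's MAC address is not covered here.

```python
import ipaddress
from typing import Iterable, List


def push_route_directives(allowed_cidrs: Iterable[str]) -> List[str]:
    """Build one OpenVPN push-route directive per allowed IPv4 subnet."""
    directives = []
    for cidr in allowed_cidrs:
        # Raises ValueError for malformed CIDRs or ones with host bits set.
        net = ipaddress.IPv4Network(cidr)
        directives.append(f'push "route {net.network_address} {net.netmask}"')
    return directives


def mac_allowed(reported_mac: str, allowed_macs: Iterable[str]) -> bool:
    """Return True if no MAC restriction applies or the reported MAC is listed."""
    allowed = {m.lower() for m in allowed_macs}
    return not allowed or reported_mac.lower() in allowed
```

For instance, `push_route_directives(["10.0.1.0/24"])` returns `['push "route 10.0.1.0 255.255.255.0"']`.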
g. Frontend Web Application
After signing in with their corporate credentials, users are presented with a list of certificates issued in their name. They have the option to request a new certificate or revoke their existing ones. If they already have 2 active (non-revoked) certificates when they request a new one, the older of the two is automatically revoked.
Administrators also see a toggle that shows the certificates for all users and allows them to revoke any certificate for any user.
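The two-certificate rule can be sketched as a small selection function. The tuple representation and the function name are assumptions for illustration; the actual rule is enforced by the CA's own code.

```python
from typing import List, Tuple


def serials_to_revoke(active: List[Tuple[str, int]], max_active: int = 2) -> List[str]:
    """Pick the certificates to revoke before a new one is issued.

    `active` holds (serial_number, issued_at) pairs for non-revoked certificates.
    After the new certificate is issued, at most `max_active` remain active.
    """
    newest_first = sorted(active, key=lambda cert: cert[1], reverse=True)
    # Keep the (max_active - 1) newest certificates; everything older is revoked.
    return [serial for serial, _ in newest_first[max_active - 1:]]
```

With two active certificates issued at different times, only the older one is returned. With a single active certificate, nothing is revoked.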
h. The OpenVPN Server Instances
These are the actual OpenVPN server instances that clients will connect to. These instances can be running on any AWS account, as long as the API Gateway has authorized the IAM Role for the instance to make calls to the Server API.
These instances run in an autoscaling group and are registered in Cloud Map for DNS discovery. The DNS record is aliased to a wildcard on the public DNS Hosted Zone to allow for quick updates and prevent DNS caches from interfering.
i. The OpenVPN Helper application
The OpenVPN Server instances need to communicate with the Server API to function. The Helper is a binary that is made available to these instances and takes care of signing the requests to the Server API and configuring the OpenVPN service itself.
One of the purposes of this binary is to obtain the OpenVPN Servers’ configuration file. This file includes a server certificate signed by the CA that the clients can trust, along with other pieces of configuration based on the environment. Once the configuration has been obtained and saved, the helper signals the OpenVPN process to reload its configuration.
Another purpose is handling all of the OpenVPN Servers’ events, like client connection and disconnection, certificate validation, etc. The Helper forwards information about each event to the Server API, which makes the actual decisions. It then communicates the result back to the OpenVPN Server process, so that access can be granted or denied.
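The shape of that exchange can be sketched as follows. OpenVPN passes context to its hook scripts through environment variables such as `script_type`, `common_name` and `trusted_ip`, and treats a non-zero exit status as a rejection. Everything else here is an assumption: the payload field names, the `{"allow": ...}` response format, and the function names are hypothetical, and the real Helper is a separate binary that also signs its requests.

```python
from typing import Any, Dict, Mapping


def build_server_event(env: Mapping[str, str]) -> Dict[str, Any]:
    """Collect the OpenVPN hook environment into a payload for the Server API."""
    return {
        "type": env.get("script_type", "unknown"),  # e.g. client-connect, tls-verify
        "common_name": env.get("common_name"),      # certificate CN of the client
        "client_ip": env.get("trusted_ip"),         # client's real IP address
    }


def exit_code_for(decision: Mapping[str, Any]) -> int:
    """Translate the API's decision into a hook exit status (0 grants access)."""
    # Fail closed: anything other than an explicit allow is a denial.
    return 0 if decision.get("allow") is True else 1
```

In a real hook, the payload would be built from `os.environ`, sent to the Server API, and the process would exit with `exit_code_for(response)`.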
The system was designed with security as a top priority. These are just some highlights of secure design decisions:
- Centralized management of users and permissions, integrated with the corporate user Directory.
- Isolation of the CA system on a separate AWS account, ensuring no one may access the Private Keys of the CA or any other part of the system.
- Periodic rotation of the CA keys (every month) and short validity for client and server Certificates (1 month), following industry best practices.
- Secure private key generation. Private Keys for clients are generated on the browser using the JS Crypto API. They are never logged or transmitted over the network and are discarded once the configuration profile has been downloaded. The same applies to server Private Keys that are generated on the OpenVPN server instance and never leave it.
- Upon connection to the OpenVPN server the system validates the user again, including the good standing of their corporate account and that the certificate used to connect is present in the database for auditing.
- Granular permissions, limiting user access to network resources by subnet CIDRs. Administrators may optionally restrict users with access to sensitive networks (like a production environment) to connections coming from a given physical MAC address.
- All user and system actions raise events so they can be properly logged and audited.
As previously mentioned, cost was not a priority but a welcome side effect. The following list shows the monthly cost of all AWS resources used, and assumes no free tier is in effect. All resources listed as $0.00 have a negligible cost which is not expected to ever reach $0.01 with normal usage.
- EC2 Instances (2x t3.nano Spot): $2.44
- EC2 Instances (2x 8GB EBS): $1.76
- Secrets Manager (2 secrets): $0.80
- Secrets Manager (API Calls): $0.00
- Route53 Hosted Zone: $0.50
- DynamoDB (On-Demand): $0.00
- DynamoDB (Storage + Streams): $0.00
- Lambda executions: $0.00
- API Gateway: $0.00
- S3 + CloudFront: $0.00
- SNS: $0.00
- Total: $5.50
Compare this with the recently released AWS Client VPN: 50 users connected 8 hours/day, 20 days/month, across 2 Availability Zones would cost $544 per month (plus Active Directory costs if you don’t already have an AD directory). That means rolling your own can be 99% cheaper than AWS Client VPN!
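The arithmetic behind that comparison can be reproduced with a short calculation. The unit prices and the 720-hour month below are assumptions chosen because they match the $544 figure (an hourly charge per subnet association plus an hourly charge per active client connection). Check current AWS pricing for your region before relying on them.

```python
# Assumed AWS Client VPN unit prices in USD; not taken from this post.
ASSOCIATION_PRICE_PER_HOUR = 0.10  # per subnet association, billed around the clock
CONNECTION_PRICE_PER_HOUR = 0.05   # per active client connection


def client_vpn_monthly_cost(users: int, hours_per_day: int, days_per_month: int,
                            subnets: int, hours_in_month: int = 720) -> float:
    """Estimate the monthly AWS Client VPN bill under the assumed prices."""
    associations = subnets * hours_in_month * ASSOCIATION_PRICE_PER_HOUR
    connections = users * hours_per_day * days_per_month * CONNECTION_PRICE_PER_HOUR
    return associations + connections


# Non-zero line items from the cost list above.
self_hosted = 2.44 + 1.76 + 0.80 + 0.50
managed = client_vpn_monthly_cost(users=50, hours_per_day=8, days_per_month=20, subnets=2)
savings = 1 - self_hosted / managed

print(f"self-hosted: ${self_hosted:.2f}, managed: ${managed:.2f}, savings: {savings:.0%}")
```

Note that the connection charge dominates under these assumptions, so the gap widens as the number of users grows.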
8. Is it production ready?
At EmpathyBroker we have been using this solution in production for the last 6 months with great results.
However, this first open-source release is missing some parts, which prevents it from running out of the box. These pieces are too specific to our internal architecture and processes to be useful in a public release.
Over the coming days and weeks we will work on replacing those pieces with alternatives that can be used by the public at large, as well as writing some much-needed documentation in the public repository. The end goal is for anyone to be able to deploy this VPN solution with ease.
9. Show me the code!
You can find the public code on our GitHub repository: https://github.com/empathybroker/aws-vpn
We’d love to get feedback from the community on our design and implementation. We also welcome contributions to our open-source projects.