Flexible and Secure SSH access to cloud infrastructure — Part I
A couple of years ago we reviewed the way we manage our teams’ SSH access. We were looking for three key areas of improvement: service-level granularity, the flexibility to grant time-bounded access, and good traceability for auditing purposes.
TL;DR we decided to move towards certificate-based accesses with BLESS (by Netflix OSS), an “SSH Certificate Authority that runs as an AWS Lambda function and is used to sign SSH certificates from public keys”. In this first article, we detail why we made this choice as well as alternate solutions we dismissed.
Initial setup: Bastion & SSH public key
In this setup, access to our cloud infrastructure went through an SSH tunnel via a Bastion, which acted as a jump host by forwarding the connection to the target machine located in our private network.
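As an illustration of the jump-host pattern, here is a minimal sketch that builds the OpenSSH command line for reaching a private machine through a bastion. The user, host names, and IP address are made up for the example, and `-J` (OpenSSH's ProxyJump option) is only one way to do it; the same result can be obtained with a `ProxyCommand` entry in the SSH client configuration.

```python
from typing import List


def jump_command(user: str, bastion: str, target: str) -> List[str]:
    """Build an ssh argv that reaches `target` through `bastion`.

    Relies on OpenSSH's -J (ProxyJump) option. All names are illustrative.
    """
    return ["ssh", "-J", f"{user}@{bastion}", f"{user}@{target}"]


if __name__ == "__main__":
    # Hypothetical hosts: a public bastion and a machine in the private network.
    print(" ".join(jump_command("alice", "bastion.example.com", "10.0.1.12")))
```

The client never talks to the private machine directly: the bastion only forwards the connection, which is why it is a natural place to centralize access control and logging.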
Within our private network, we have three kinds of services running on Debian OS images:
- Long-lasting services: our primary datastores and internal services are only accessible by the infrastructure feature team members and our on-duty team.
- Short-lasting services: all our other services have no record of previous interactions (Stateless) and can be replaced easily. For these services, developers’ personal public keys can be injected with Cloud-Config during deployment.
- EMR services: all our AWS-managed big data servers. These required us to share private keys with developers.
These three different ways of providing access to services led to key management discrepancies.
Access with a public key served us well. However, we were facing too many limitations:
- It’s hard to manage public keys at scale: they must be added to provide access and deleted to remove it. We managed the provisioning of the primary key pair, which we kept for ourselves, but feature teams were able to handle additional key provisioning within their services.
- If a member of a feature team left the company, they could keep a copy of these keys, so we had to be extra cautious to remove their access.
- There was no easy way to give and remove individual access to machines. If we wanted to change these permissions, this implied making changes in production. This required either a new deployment or a new Chef run (our configuration management tool).
- In a “least privilege” strategy, we needed to be able to grant time-bounded access to specific resources. Also, offboarding team members was complex.
As the team responsible for managing user accesses to our infrastructure, our requirements were:
- Secure access management by reducing the access scope with fine-grained permissions following the Principle of Least Privilege (PoLP),
- Simplify user management: as we don’t use an LDAP server, we had to find a way to manage user permissions with an RBAC (Role-Based Access Control) solution that would allow us to grant and revoke access quickly and easily,
- Improve traceability with easy access to logs for each connection from the bastion to the target service (who, when, from where and to which machine).
No candidate solution should introduce a regression in our current workflow, and all of our existing tools relying on the SSH protocol should be supported.
We also listed other requirements, such as:
- having a granular audit with all typed commands for traceability,
- support for MFA authentication,
- managing temporary access.
Main use cases
- Provide access to Alice for her team “foo” cluster and only this one
- Provide temporary access to Alice for the team “bar” cluster
- Offboard Alice
SSH accesses are very sensitive, so if the solution is not reliable and doesn’t work as expected, the risk of being locked out exists. Losing access to critical services such as our data stores is not an option. So we naturally looked at native SSH solutions, requiring only an SSH configuration on the machine. We studied two types of solutions:
- Packaged public-key solutions,
- SSH with certificate-authority (CA) solutions. To learn more about certificate-based access and its advantages, you can jump to the dedicated section.
Below is our benchmark of the solutions available in 2019; they might have evolved since then.
Packaged public-key solutions
Packaged public-key solutions essentially provide a centralized database for user permissions (most of the time in the Bastion) and rely on OpenSSH for access. We benchmarked the following solutions:
- Bastillion: it only offers SSH access through an emulated terminal in the web interface. In our case, this would be considered a regression.
- Wallix: it is not a free solution and seemed to be overkill for our use cases, as it offers a range of features that we don’t need (e.g. RDP session management).
- SSHportal: we dismissed it because its key management is painful. It requires generating SSH keys on the bastion and copying them manually to each machine, which doesn’t scale.
- Aker: we thought the underlying concept was interesting, but in the end it was not what we were looking for (access management is only done through the bastion, no JumpHost is possible). We also experienced several bugs during our tests.
- EC2 Instance Connect: having to install the ec2-instance-connect agent and having to manage permissions with IAM policies was a blocker for us.
- SSM Session Manager: it is not based on the SSH protocol.
SSH CA solutions
Using a packaged public-key solution would imply keeping a public key mechanism that struggles to scale up. Hence, we decided to have a look at certificate-based options:
- Teleport: It seems to be one of the main players among open-source SSH access solutions. Teleport uses its own user database, and also provides authentication with SSO. Access can be granted following Role-Based Access Control (RBAC). The solution seemed very good. However, we were not keen on having to install and use an agent instead of relying on the standard SSH mechanism. Finally, the biggest obstacle was the price: with a couple of thousand nodes, the bill would have exceeded $10,000.
- Vault from HashiCorp: Vault is a secret management solution ticking all the major boxes, with only one drawback in our context: it uses its own database for user management. As we already use AWS Secrets Manager, having a Vault service running and adding another database containing all permissions for each user was overkill. We wanted to keep a centralized place to manage user permissions with AWS Identity and Access Management.
We found this talk, from HashiCorp’s Senior enterprise architect Erik Rygg, really useful during our test.
- BLESS from Netflix OSS: BLESS is an SSH Certificate Authority that runs as an AWS Lambda function. It meets all our criteria plus it offers the ability to generate certificates based on IAM permissions. Leveraging a serverless service has cost advantages and overall it seemed like the best candidate during our benchmark.
Main candidates comparison
Among these candidates, Teleport required installing an agent on each machine, which was not acceptable for us. Vault and BLESS are based on basic OpenSSH configuration for certificates, which is exactly what we expected. However, Vault requires creating a new user database (acting as a CA) to sign certificates, while BLESS relies on IAM users from AWS.
BLESS was the best Certificate Authority considering our requirements. It must also be noted that we do not rely on any LDAP server, so the only options we have to manage users’ authentication are GSuite or IAM from AWS.
Given that BLESS can look up an IAM user’s group membership to define their permissions, it perfectly fits our environment. And since BLESS is open source, the only cost to consider is the additional Lambda processing.
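To illustrate the idea of deriving permissions from IAM group membership, here is a minimal sketch. It is not BLESS code or BLESS configuration: the `ssh-` group prefix and the function name are hypothetical conventions chosen for the example, and in practice the group list would be fetched from IAM rather than hard-coded.

```python
from typing import Iterable, List

# Hypothetical naming convention: an IAM group "ssh-foo" grants the principal "foo".
GROUP_PREFIX = "ssh-"


def principals_from_groups(iam_groups: Iterable[str]) -> List[str]:
    """Map IAM group names to SSH principals, ignoring unrelated groups."""
    return sorted(
        group[len(GROUP_PREFIX):]
        for group in iam_groups
        if group.startswith(GROUP_PREFIX) and len(group) > len(GROUP_PREFIX)
    )


if __name__ == "__main__":
    # Alice belongs to team "foo" and has been added to "bar" temporarily.
    print(principals_from_groups(["developers", "ssh-foo", "ssh-bar"]))
    # Once Alice is removed from every group, she has no principal left.
    print(principals_from_groups([]))
```

With such a mapping, granting, revoking, or offboarding becomes a group membership change in IAM, with no change on the servers themselves.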
SSH authorized_keys vs Certificates
We saw the potential of SSH certificate-based solutions during our benchmark. Below is an explanation of the key differences between public key and certificate-based access.
SSH with public key authentication is the basic way to SSH into servers.
With this approach, you generate a key pair (a public key and a private key), then add the public key to the /home/user/.ssh/authorized_keys file of the user you wish to grant permission to. And tada, you have access.
With a lot of servers, it can be painful to manage and it doesn’t scale well: How to easily add a public key? How to easily remove a public key? How to update a public key? What is the real permission scope for a user?
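To make these questions concrete, here is a minimal sketch of what granting and revoking access means with `authorized_keys`: editing a text file, one key per line, and repeating the operation on every server. The helper names, host names, and the sample key are hypothetical (a real public key is much longer).

```python
from typing import Dict


def add_key(authorized_keys: str, public_key: str) -> str:
    """Return the file content with `public_key` appended if it is missing."""
    lines = [line for line in authorized_keys.splitlines() if line.strip()]
    if public_key not in lines:
        lines.append(public_key)
    return "\n".join(lines) + "\n"


def remove_key(authorized_keys: str, public_key: str) -> str:
    """Return the file content without `public_key`."""
    lines = [
        line for line in authorized_keys.splitlines()
        if line.strip() and line != public_key
    ]
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    alice_key = "ssh-ed25519 AAAA...example alice@laptop"  # placeholder, not a real key
    # One authorized_keys content per server: every grant touches the whole fleet.
    fleet: Dict[str, str] = {"web-1": "", "web-2": "", "db-1": ""}
    fleet = {host: add_key(content, alice_key) for host, content in fleet.items()}
    # Offboarding means the reverse operation, again on every server.
    fleet = {host: remove_key(content, alice_key) for host, content in fleet.items()}
    print(fleet)
```

Nothing in the file records why a key is there or until when it should stay, which is what makes the permission scope hard to audit.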
SSH CA solutions are used by many tech companies operating at scale. I particularly recommend reading the Scalable and secure access with SSH post by Marlon Dutra on Facebook’s engineering blog, which succinctly explains how to implement a basic CA solution.
In brief, OpenSSH natively supports certificates to authenticate servers and clients. These certificates are signed by a Certificate Authority (CA). Using this system, you can authenticate to a target machine.
However, OpenSSH certificates do not follow the X.509 standard used in TLS/SSL; they use a simpler, SSH-specific format. A direct consequence is that the Certificate Authority keys themselves never expire: they are plain SSH key pairs.
A certificate holds metadata such as a validity period, principals, and a Key ID, which by themselves solve a lot of our problems:
- Validity allows us to grant time-bounded access.
- The principals are tags that define the access scope. If the target machine accepts one of the principals present in the certificate, then the user will have access to the machine.
- The Key ID is a “key identifier” that is logged by the server every time the certificate is used for authentication. It can be any value you want and thus can be used to give any context about the session.
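The way these three fields interact can be sketched with a simplified model. This is not OpenSSH code and it skips signature verification entirely; the class and function names are made up for the example. On a real server, the accepted principals are typically configured through sshd options such as `AuthorizedPrincipalsFile`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet


@dataclass(frozen=True)
class Certificate:
    """Simplified view of the metadata carried by an SSH certificate."""
    key_id: str                 # logged by the server on each authentication
    principals: FrozenSet[str]  # tags defining the access scope
    valid_after: datetime
    valid_before: datetime


def is_accepted(cert: Certificate, allowed_principals: FrozenSet[str], now: datetime) -> bool:
    """Accept if the certificate is within its validity window and shares
    at least one principal with those the target machine accepts."""
    in_window = cert.valid_after <= now < cert.valid_before
    return in_window and bool(cert.principals & allowed_principals)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    cert = Certificate(
        key_id="alice-temporary-access",  # free-form context about the session
        principals=frozenset({"foo"}),
        valid_after=now,
        valid_before=now + timedelta(hours=1),
    )
    print(is_accepted(cert, frozenset({"foo"}), now))                       # team "foo" machine
    print(is_accepted(cert, frozenset({"bar"}), now))                       # team "bar" machine
    print(is_accepted(cert, frozenset({"foo"}), now + timedelta(hours=2)))  # after expiry
```

Access expires on its own once the validity window closes, so revoking a temporary grant requires no action on the target machine.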
SSH with certificates is a good solution to solve problems induced by public key authentication at scale.
During this study, we were able to highlight that we needed to change our paradigm for SSH access management. We decided to use a native SSH-certificate solution as it solved all of our pain points.
Note: If you don’t need native SSH, new solutions exist such as Boundary.
The use of principals allows us to secure access management if it’s coupled with fine-grained tags. It simplifies user management as we can follow an RBAC relying on our existing IAM users database. Thanks to the Key ID field, we improve traceability with context on the initiated sessions.
In the second part, we will detail how we implemented this solution for thousands of servers and what Principals strategy we used.