Our transition to SSH certificates

Published in payu-engineering · 8 min read · Jun 9, 2021

In this blog post, I would like to share the story of how ZOOZ transitioned to using SSH certificate authentication for better security, usability, and management.

Password authentication and public-key authentication are the most common SSH authentication methods:

Password authentication is the default SSH authentication method and the least secure way to protect your servers from attack or identity theft. Its problems include insecure (easy-to-remember) passwords, passwords stored on disk, passwords that are never rotated, and various other pitfalls. Password authentication is also inconvenient for scripting.
SSH key authentication relies on asymmetric cryptography using a pair of public and private keys, in which the client holds the private key and the server holds the public key. It is more convenient for scripting and more resistant to dictionary attacks, but it still suffers from many of the disadvantages of password authentication.

SSH key authentication is the authentication method that we used at ZOOZ until recently. The flow of distributing the user’s public key was as follows:

Users generate a key pair for themselves.
Using a Terraform process, the public key is uploaded to an S3 bucket and placed in a folder that represents a role/group.
An operations engineer runs an Ansible playbook to give the engineers in a role/group access to a dynamically or statically calculated host inventory.

What are the problems with this method?

From an Ops perspective, there is a need to provision keys whenever a new employee begins working or to drop the key when someone leaves the organization. We manage a large number of servers and keys. The process of managing SSH authentication is mostly done manually and has several drawbacks:

Each new employee has to manually create an SSH key pair when joining a specific team.
Public key files are managed in an S3 bucket.
Keys never expire. Neither users nor hosts are forced to refresh their keys.
The employee offboarding process becomes a hassle.
There’s a risk of key exposure, key reuse, and theft of discarded keys.
Host verification never really happens…

It’s important to note that most, if not all, of these pitfalls can be overcome with extensive automation and management. We believe, however, that most companies will not take that route. And neither did we. So instead of fixing the issues with public-key authentication, we chose a better way.

What impressed us when reviewing possible solutions is that most of the shortcomings can be solved by using SSH certificate authentication with proper tools and a bit of automation (no extensive automation needed).

SSH certificates

SSH certificates are an enhancement to the public and private key authentication method.
The concept is to create a trusted server called a Certificate Authority (CA), which signs the public keys of hosts and users. The result of the signing process is a certificate. Then all that is left is to configure clients and hosts to trust any certificate signed by our CA.

The certificate itself is a data structure signed by our CA that contains specific information, such as the identity of the individual who wants to access the target host, the CA that signed the certificate, and the certificate’s validity period.

When using certificates, user keys are not stored in the server’s “authorized_keys” file, and host keys are not stored in the client’s “known_hosts” file. Instead, both the host and the user hold their own certificates, which are exchanged during the SSH handshake. By default, the SSH daemon allows a user to sign in only under a username listed in the principals field of the user certificate. Likewise, the SSH client expects to find the target hostname in the principals field of the host certificate.
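To make the default check concrete, here is a minimal Python sketch of the decision the SSH daemon makes for a user certificate. It is a simplified model, not OpenSSH's code: `UserCertificate`, `may_log_in`, and the fingerprint strings are hypothetical names, and a real daemon verifies the CA signature cryptographically rather than comparing fingerprints.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class UserCertificate:
    """Simplified stand-in for the fields of an SSH user certificate."""
    key_id: str              # who the certificate was issued to
    principals: frozenset    # usernames the holder may log in as
    ca_fingerprint: str      # identifies the CA that signed it
    valid_after: datetime
    valid_before: datetime


def may_log_in(cert, username, trusted_cas, now):
    """Apply the default checks: trusted CA, validity window, principal."""
    if cert.ca_fingerprint not in trusted_cas:
        return False
    if not (cert.valid_after <= now < cert.valid_before):
        return False
    return username in cert.principals


# Hypothetical example values.
issued = datetime(2021, 6, 9, 8, 0, tzinfo=timezone.utc)
cert = UserCertificate(
    key_id="alice@example.com",
    principals=frozenset({"alice"}),
    ca_fingerprint="SHA256:hypothetical-ca",
    valid_after=issued,
    valid_before=issued + timedelta(hours=16),
)
trusted = {"SHA256:hypothetical-ca"}
```

In this model, `may_log_in(cert, "alice", trusted, issued + timedelta(hours=1))` is true, while the same call with the username "root", or with a time after the certificate expires, is false. Nothing on the host has to be edited for access to lapse.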

So what are the advantages of SSH certificates? In summary:

Eliminates key management: creation, deletion, rotation, and distribution.
Eliminates operational constraints associated with hostnames.
Certificates are short-lived (ours are valid for one workday).

Step CA

We examined several tools for managing SSH certificates in terms of the features they provide, deployment complexity, security, management efficiency, and more. In this blog post, I will not discuss the considerations regarding each tool we reviewed. We believe that the most common tools can do the job.

“step-ca” is an open-source tool provided by Smallstep that lets you run your own private certificate authority for managing TLS and SSH certificates. “step-ca” is the server counterpart to the “step CLI” tool. The “step CLI” tool makes it easy for users and hosts to get certificates from “step-ca”.

Together, step-ca and the step CLI let us build our own certificate authority, and they have other benefits, such as:

The ability to get a host certificate automatically at startup through a simple automation step that runs in the EC2 user data.
Support for cloud instance identity documents (IIDs), which offers yet another security advantage.
Integration with our SSO provider (Okta).

This greatly simplifies and secures the authorization process around certificate creation and renewals.

Our implementation

We combined the open-source step-ca tool with an automation process we built for managing users and permissions on servers. This allowed us to have a central place where we could automatically manage SSH access to all our servers.

The client-side solution

The bootstrap for clients is very easy.

The client installs the step CLI tool on the laptop and runs the bootstrap commands. This needs to be done just once.

To renew a certificate, the client runs a simple command that will open a browser window in which the user can complete the authentication step through Okta. Once authenticated, the client will receive a certificate that is valid for one workday. Using the certificate, the client can now SSH into any server registered with our CA (provided that the user exists on that server).
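The two client steps can be sketched as a small Python wrapper around the step CLI. This is an illustration, not our actual tooling: the CA URL, fingerprint, provisioner name, and email address are placeholders, and the subcommands and flags (`step ca bootstrap`, `step ssh config`, `step ssh login`) are written as I understand them from the Smallstep documentation, so check them against your installed version.

```python
import subprocess

# Placeholder values; substitute the ones for your own CA.
CA_URL = "https://ca.example.internal"
CA_FINGERPRINT = "<root-certificate-fingerprint>"
PROVISIONER = "okta"


def bootstrap_commands(ca_url, fingerprint):
    """One-time setup: trust the CA, then configure SSH to use it."""
    return [
        ["step", "ca", "bootstrap", "--ca-url", ca_url, "--fingerprint", fingerprint],
        ["step", "ssh", "config"],
    ]


def login_command(email, provisioner):
    """Daily renewal: starts the SSO flow and stores a new certificate."""
    return ["step", "ssh", "login", email, "--provisioner", provisioner]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(bootstrap_commands(CA_URL, CA_FINGERPRINT))
    run([login_command("alice@example.com", PROVISIONER)])
```

The wrapper defaults to a dry run, so it only prints the commands until `dry_run=False` is passed.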

The host-side solution

When a server is launched, it is registered as a step-ca client by what we call the “CA-Init” script. The script is pulled from an S3 bucket and runs in the user data of the host. Implementing this is easy since most of our servers run in Auto Scaling groups (ASGs) with a launch configuration. For existing servers, we ran an Ansible playbook to execute the “CA-Init” script manually.

The “CA-Init” script does two main things:

It sets up a scheduled systemd service for bootstrapping.
It sets up another scheduled systemd service for user management.

Both services run on a schedule so that they pick up any changes we make to the service configuration. And of course, the services are monitored and will send an alert in case of any failure.

-The bootstrap service

This is the first scheduled systemd service, defined by the “CA-Init” script.

This timed service performs the initial bootstrap and registration of the host with the CA, and defines an additional systemd service that renews the host certificate once a week. The script is idempotent: if nothing has changed (for instance, the version and the CA fingerprint are the same), all tasks are skipped. Of course, it always remains possible to run the tasks anyway if needed.
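The idempotency check can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not our script: the state file, the `desired` dictionary keys, and the `force` flag are hypothetical, and `do_bootstrap` stands in for the real registration work.

```python
import json
import tempfile
from pathlib import Path


def needs_bootstrap(state_file, desired):
    """Return True when the recorded state differs from the desired state."""
    if not state_file.exists():
        return True
    return json.loads(state_file.read_text()) != desired


def bootstrap(state_file, desired, do_bootstrap, force=False):
    """Run do_bootstrap only on change (or when forced); return True if it ran."""
    if not force and not needs_bootstrap(state_file, desired):
        return False
    do_bootstrap(desired)
    # Record the state only after a successful run, so failures are retried.
    state_file.write_text(json.dumps(desired))
    return True


# Hypothetical values standing in for the tool version and CA fingerprint.
desired = {"step_version": "0.0.0-example", "ca_fingerprint": "<fingerprint>"}
changed = {**desired, "ca_fingerprint": "<new-fingerprint>"}
runs = []
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "bootstrap-state.json"
    first = bootstrap(state, desired, runs.append)    # no state yet: runs
    second = bootstrap(state, desired, runs.append)   # unchanged: skipped
    third = bootstrap(state, changed, runs.append)    # fingerprint changed: runs
    forced = bootstrap(state, changed, runs.append, force=True)
```

Because the timer calls the same entry point every time, a scheduled run with nothing to do is cheap, and a configuration change is picked up on the next tick.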

-The user management service

This is the second scheduled systemd service, defined by the “CA-Init” script. The service runs every half hour to update the users on the server. We manage which users will be created or deleted on our servers in a JSON file stored in our S3 bucket. The JSON file defines which teams can access which CIDR blocks.

The deletion process is based on a text file, also stored in the bucket, that contains the usernames to be deleted from the server (when deleting a user, the first action is to revoke the user’s authorization in Okta).

Each time the script runs, it first pulls updates from the S3 bucket, and if it identifies a change it creates or deletes the relevant users. This solution gives us the ability to grant or remove user access on all the servers within half an hour.
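The reconciliation step can be sketched as follows. The JSON layout (`teams`, `members`, `cidr_blocks`) is a hypothetical schema invented for this example, not the format of our actual file, and the sketch only computes the plan; creating and deleting the accounts is left out.

```python
import ipaddress


def desired_users(access_config, host_ip):
    """Collect members of every team whose CIDR blocks contain this host."""
    ip = ipaddress.ip_address(host_ip)
    users = set()
    for team in access_config["teams"]:
        if any(ip in ipaddress.ip_network(cidr) for cidr in team["cidr_blocks"]):
            users.update(team["members"])
    return users


def plan_changes(access_config, deleted_users, host_ip, existing_users):
    """Return (to_create, to_delete) for one host."""
    wanted = desired_users(access_config, host_ip) - set(deleted_users)
    to_create = wanted - set(existing_users)
    # Only remove accounts that are explicitly listed for deletion.
    to_delete = set(deleted_users) & set(existing_users)
    return to_create, to_delete


# Hypothetical access file and deletion list.
config = {
    "teams": [
        {"name": "payments", "members": ["alice", "bob"], "cidr_blocks": ["10.1.0.0/16"]},
        {"name": "data", "members": ["carol"], "cidr_blocks": ["10.2.0.0/16"]},
    ]
}
to_create, to_delete = plan_changes(
    config, deleted_users=["bob"], host_ip="10.1.4.7", existing_users=["bob", "dave"]
)
```

For a host at 10.1.4.7, the plan is to create alice and delete bob. The account dave is left alone because it is not on the deletion list.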

General diagram summarizing the bootstrap process of hosts with the CA:

With this complete solution, we now have a central place from which we can automatically manage SSH access to all our servers. The hassle involved with managing access keys has been reduced to zero!

Fallback

Since SSH is so critical in emergencies, we kept the old public-key process as a fallback. We also set up a Slack alert for any non-certificate SSH access to any of our instances.
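One way to detect non-certificate access is to inspect the SSH daemon's "Accepted" log lines. The sketch below is an assumption about how such a check could look, not a description of our alerting: the regular expression reflects the log format as I understand OpenSSH writes it (certificate logins carry a key type ending in `-CERT`), the sample lines are made up, and sending the result to Slack is left out.

```python
import re

# Assumed shape of OpenSSH's "Accepted ..." log line; verify against your own logs.
ACCEPTED = re.compile(
    r"Accepted (?P<method>\S+) for (?P<user>\S+) from (?P<ip>\S+) port \d+ ssh2"
    r"(?:: (?P<key_type>\S+))?"
)


def non_certificate_login(line):
    """Return (user, ip, method) for a login without a certificate, else None."""
    match = ACCEPTED.search(line)
    if match is None:
        return None
    key_type = match.group("key_type") or ""
    if match.group("method") == "publickey" and key_type.endswith("-CERT"):
        return None
    return match.group("user"), match.group("ip"), match.group("method")


# Made-up sample lines for illustration.
cert_line = (
    "sshd[1]: Accepted publickey for alice from 10.1.4.7 port 50000 ssh2: "
    "ED25519-CERT SHA256:abc ID alice@example.com (serial 1) CA ED25519 SHA256:def"
)
key_line = "sshd[2]: Accepted publickey for bob from 10.1.4.8 port 50001 ssh2: RSA SHA256:xyz"
password_line = "sshd[3]: Accepted password for eve from 10.1.4.9 port 50002 ssh2"
```

Here only `key_line` and `password_line` produce a result, which is what an alert would be raised on.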

Difficulties we experienced along the way

The main difficulties we experienced during the implementation of the solution are related to the difference between the operating systems of our servers:

Packages as prerequisites: Prerequisite packages (such as the AWS CLI) that are needed to implement our solution require the bootstrapping and user management scripts to be very dynamic. The scripts also need to check various dependencies, such as required packages and their version numbers, whether the package management utility needs to reinstall specific packages, and so forth.
Non-support for systemd: While deploying the solution, we found that some of our servers run old images that use SystemV init rather than systemd, which prevented us from registering them with the CA using our solution.

In retrospect, we would have saved a lot of time in preparing and deploying the solution if we had had a standard AMI with a patch management automation solution. This would have spared us the unnecessary struggle with these preconditions, as the packages that needed to be installed on our servers would have been managed dynamically and automatically.

Future improvements

Well, the solution works great! However, some things can be improved:

Add more agile authorization capabilities, such as granting permissions for a narrower range of CIDR blocks, the ability to delete specific users from specific servers, etc.
Integrate with AWS Systems Manager (SSM) and use it to move some of the execution of the process to a pull-based approach instead of push-based.
Build a standard AMI pipeline automation and use it to simplify the CA-Init script. We want the init processes to be fast and simple.

Summary

The transition to SSH certificate authentication has been challenging, but well worth it. It is so much more convenient to authenticate and manage clients using SSH certificates. And of course, we also gained additional layers of security and efficiency.

PayU is full of exciting technical challenges and our leadership encourages us to initiate and make an impact. I’m looking forward to the next challenge!
We will keep publishing our experiences and ideas, so stay tuned for our next blog post.