Why Zero Trust is More than an SSO Portal

George Wilson
Version 1
11 min read · Nov 29, 2023

Many companies have been adopting zero trust networks since the release of the Google BeyondCorp white papers in 2014. Indeed, in the United States the federal government is mandated to move towards a zero trust architecture. Whilst the exact definition of a zero trust network differs between providers, here the Microsoft whitepaper "Evolving Zero Trust" definition will be used, which states that zero trust networks "verify and secure every identity, validate device health, enforce least privilege, and capture and analyse telemetry to better understand and secure the digital environment". Identities in this case can often be split into three distinct but related parts: Human Identity, Device Identity and Workload Identity.

In many cases you may not be directly in control of all these identities, for example in the cloud, where you must rely on a cloud provider's definition of your machine's identity and where it resides.

Figure 1: A diagram illustrating the various components of a Zero Trust network. Source: https://www.microsoft.com/en-gb/security/business/zero-trust

There are many services, and many examples across the internet, of how to implement human identity verification and least privilege access within that space (often using OAuth combined with OIDC, or alternatively Active Directory). However, whilst some service accounts will be created for workloads, the same level of scrutiny is not applied to device identity (sometimes also referred to as an endpoint) or to workload identities, and fewer implementations still will validate device health (sometimes also referred to as a Device Posture Check). We're going to explore these often neglected aspects of zero trust environments in more detail here.

Attestation and the Trusted Platform Module

Attestation is defined in the Cambridge Dictionary as "a formal statement that you make and officially say is true"; in our case it's a matter of being confident the device or workload is who it claims to be. In many cases, these claims are made with chains of trust backed by trusted root certificates. This can be validated in the form of mutual TLS connections (mTLS) when forming a connection between devices, or through protocol- and request-specific methods such as JSON Web Tokens (JWTs).
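As a minimal sketch of the mTLS case, the snippet below uses only the Python standard library; the hostname and certificate file names are illustrative placeholders rather than part of any real deployment. The client presents its own certificate and key and will only trust a server whose chain leads back to the internal root CA.

```python
import http.client
import ssl

# Trust only the internal root CA and present this client's own certificate
# and key, so both ends of the connection are verified (mutual TLS).
context = ssl.create_default_context(cafile="internal-root-ca.pem")
context.load_cert_chain(certfile="client-cert.pem", keyfile="client-key.pem")

conn = http.client.HTTPSConnection("internal.example.com", context=context)
conn.request("GET", "/health")
response = conn.getresponse()
print(response.status, response.reason)
```

On the server side the equivalent context would set verify_mode to CERT_REQUIRED, so connections that don't present a valid client certificate are rejected during the handshake.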

The introduction of Trusted Platform Modules (TPMs) onto hardware over the last decade (Macs do not have TPMs but do contain a T2 chip with equivalent functionality) means that there is a secure place to store keys and secrets on dedicated tamper-proof hardware, even if the operating system is compromised.

As an aside, even outside of the zero trust world, modern VPNs can also take advantage of TPMs (although the detail of this won't be elaborated on in this blog post). This still has advantages compared to traditional methods: because the key cannot be stolen from the device, an attacker needs access to the compromised device at the time of the request in order to access the VPN.

For a zero trust use case, the TPM can be utilised to generate a certificate (based on Endorsement Keys and Attestation Keys) that can be used to authenticate a device in a way that means the key cannot be stolen and traffic cannot easily be faked or replayed by an attacker. A further advantage of utilising this certificate-based process is that revocation of certificates, when a device is known to be stolen or hacked, can be easily controlled through Certificate Revocation Lists. The mechanisms of this are the subject of conference talks in themselves, which we won't try to explain here; however, if you want to read up more on this there's a great presentation by Google that can be found here, and if you prefer to learn by code there's a repository here.
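To illustrate the revocation side, here is a rough sketch using the widely used Python cryptography package; the file names are placeholders, and in practice the CRL would be fetched from the distribution point named in the certificate rather than read from disk.

```python
from cryptography import x509

# Load the device certificate and the issuing CA's current revocation list.
with open("device-cert.pem", "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())
with open("issuing-ca.crl.pem", "rb") as f:
    crl = x509.load_pem_x509_crl(f.read())

# A device reported stolen or compromised should appear on the CRL.
revoked = crl.get_revoked_certificate_by_serial_number(cert.serial_number)
if revoked is not None:
    print(f"Certificate revoked on {revoked.revocation_date}")
else:
    print("Certificate is not on the revocation list")
```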

User Device Identity

User devices typically come in the form of laptops, mobiles and desktops, and are used to access company resources such as files, emails or websites. Parallels can be drawn between a device identity and a human identity: whilst each device can have a unique identifier associated with it (just like a human username), each device also has multiple attributes that should be considered during authentication. Examples of useful attributes include the operating system on the device, whether that operating system is patched to the latest version, and whether it's corporately managed or a bring your own device (BYOD). In addition to these properties, which are often relatively long lived, there can be more transient attributes such as the location of a mobile device, and for this reason it's important that, just like human identity, device identity is short lived (typically 24 hours or less).

Traditionally, with a VPN, the device is validated at the point of joining the network but not afterwards. Whilst this allows the desired device posture checks to take place when joining a network, in a zero trust network this is not enough, as the posture checks should differ between websites and, on occasion, even between parts of the same website. As an example, think of a SharePoint site where different subsites might have different risk profiles: a company news subsite might be available to all users on any device, but accessing a subsite that contains sensitive client information requires you to be on a corporately issued device. In a true zero trust network it's therefore important to evaluate the device on each request.
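The sketch below illustrates the idea of posture checks evaluated per resource on every request; the paths and attributes are invented for the example rather than taken from any particular product.

```python
from dataclasses import dataclass

@dataclass
class DevicePosture:
    corporately_managed: bool
    os_patched: bool

# Illustrative per-resource requirements: the news subsite is open to any
# device, while the client-data subsite needs a managed, patched device.
RESOURCE_POLICIES = {
    "/sites/company-news": lambda d: True,
    "/sites/client-data": lambda d: d.corporately_managed and d.os_patched,
}

def is_allowed(path: str, device: DevicePosture) -> bool:
    # Evaluated on every request, not just when the device joins the network.
    policy = RESOURCE_POLICIES.get(path, lambda d: False)  # deny by default
    return policy(device)

byod = DevicePosture(corporately_managed=False, os_patched=True)
print(is_allowed("/sites/company-news", byod))  # True
print(is_allowed("/sites/client-data", byod))   # False
```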

Therefore, as described above, a complicated dance utilising the device TPM is performed to create our short-lived certificate containing our device identifier, which can be sent to servers with any request. Whilst the certificate can be evaluated at each service's individual Layer 7 load balancer, an improvement at the enterprise level is to place a centralised identity aware proxy in front of all internal resources. Examples of services that provide an identity aware proxy include Google Identity-Aware Proxy, Cloudflare Zero Trust Platform, Oathkeeper and Teleport.

The identity aware proxy will perform TLS termination of the request, validate the device's certificate, validate the human's identity, and then add extra headers into the request to supply this information to end services (such as the device identifier from the certificate). For some backend services it may also do initial attribute verification for top-level corporate posture checks. In the absence of a device certificate, for example because a user is on BYOD, this will also be communicated in a header. It's of utmost importance that the proxy has high uptime, works fast (as it fronts all corporate resources) and is secure (because it terminates all HTTP requests to corporate resources and modifies them).
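From a backend service's point of view, the proxy's work surfaces as a small set of trusted headers. The header names below are hypothetical (each proxy product documents its own), but the pattern of reading them, and treating a missing device header as BYOD, looks roughly like this.

```python
# Hypothetical header names; a real identity aware proxy documents its own.
USER_HEADER = "X-Authenticated-User"
DEVICE_HEADER = "X-Device-Identifier"

def extract_identity(headers: dict) -> dict:
    """Read the identity context injected by the identity aware proxy."""
    device_id = headers.get(DEVICE_HEADER)  # absent for BYOD devices
    return {
        "user": headers.get(USER_HEADER),
        "device_id": device_id,
        "managed_device": device_id is not None,
    }

print(extract_identity({USER_HEADER: "jane.doe@example.com"}))
```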

The advantage of this centralised proxy is that there is a standardised base level of trust assigned to all devices (and humans) that get through it to the backend services, avoiding some of the risk that comes with many different API gateways. It's also useful to have a centralised, audited log of access control across the organisation in a single source. Whilst some might argue it's a single point of failure, it is no more of one than a VPN, and it is more secure by nature of the constant verification of each request.

An interesting side effect of the identity aware proxy is that there's a strong desire to make as much communication as possible happen over Layer 7 compatible protocols. For example, in this 2017 (and now deprecated) repository, Google decided to tunnel all SSH traffic over HTTPS so that it could be authenticated and authorised through the identity aware proxy. Other mechanisms are available for queue-based systems where this isn't possible, but they add significant complexity into the system. One good example of how a similar system was created for RabbitMQ can be found in this EU Commission research paper.

Once the identity aware proxy has added any additional headers into the request (for example, directly inserting the device identifier), it then creates an mTLS request to the intended destination service. It's of course vital in this case that the service only accepts requests from the central proxy. The service can still make more nuanced access control decisions, for example authenticating users' access to specific data items within the service. The image below shows the authentication flow for such requests.

Figure 2 — A Request to a Hosted Application through the Identity Proxy including the device posture check. Later checks are discussed in more detail in the workload attestation section further on.
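To make the "only accept requests from the central proxy" step concrete, below is a rough server-side sketch using Python's ssl module: it requires a client certificate issued by the internal CA and then checks that the validated certificate actually belongs to the proxy. The certificate paths and proxy name are illustrative.

```python
import socket
import ssl

# Require a client certificate issued by the internal CA, so only mTLS
# connections from the identity aware proxy are accepted.
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile="service-cert.pem", keyfile="service-key.pem")
context.load_verify_locations(cafile="internal-root-ca.pem")
context.verify_mode = ssl.CERT_REQUIRED

EXPECTED_PROXY_CN = "identity-aware-proxy.internal.example.com"  # illustrative

def handle_one_connection(listener: socket.socket) -> None:
    conn, _addr = listener.accept()
    with context.wrap_socket(conn, server_side=True) as tls_conn:
        # getpeercert() returns the already-validated client certificate.
        subject = dict(
            item for rdn in tls_conn.getpeercert()["subject"] for item in rdn
        )
        if subject.get("commonName") != EXPECTED_PROXY_CN:
            return  # drop anything that is not the central proxy
        # ...handle the proxied request here...
```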

Where possible the identity aware proxy should also front any SaaS services used by the company, in much the same way that an SSO service more traditionally fronts SaaS services.

There are many mobile device management (MDM) options for managing corporate devices, such as JAMF. The advantage of reusing such a system (if already in use) is that there is already an enterprise asset database that can be used as a data source for access requests; its agent will regularly update the central asset database, and this will hold more information than is available in the TPM certificate. However, the TPM still needs to be involved when generating a certificate that contains the device's JAMF identity, in case the operating system is compromised.

Workload Attestation

Workload attestation is even more involved than device identity, as you should attest both the workload and the device it runs on. Sometimes a workload might have multiple layers of device identities to validate; for example, in Kubernetes both the node and the pod attributes might need to be considered, and on premises this might also include an additional layer for the hardware. This blog post will not try to define what a workload is (it might be a full microservice including frontend, backend and database, or it might be each of those segments) as it depends entirely on a company's setup and security posture. However, it's important to note that it's not just production workloads that need identity; development workloads should also have an identity, but might not validate requests to the same bar as a production workload.

To understand workload attestation better, we're going to look specifically at one open-source implementation. The Secure Production Identity Framework For Everyone (SPIFFE) project defines a framework and set of standards for identifying and securing communications between application services. SPIRE (the SPIFFE Runtime Environment) can be seen as a production-ready implementation of the SPIFFE standard that runs on all major hosting platforms. Both projects are part of the Cloud Native Computing Foundation (CNCF), which also hosts the Prometheus and Kubernetes projects among many others.

The most important part of the SPIFFE standard is the SPIFFE ID, which is described as a string that "uniquely and specifically identifies a workload. SPIFFE IDs may also be assigned to intermediate systems that a workload runs on (such as a group of virtual machines)". They take the form of a URI such as spiffe://trust-domain/workload-identifier. The SPIFFE Verifiable Identity Document (SVID) contains a single SPIFFE ID (encoded in either an x509 certificate or a JWT), although in most cases an x509 certificate is preferred to a JWT to prevent replay attacks.
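Because a SPIFFE ID is just a URI, it can be pulled apart with standard tooling. The sketch below uses an invented ID in the example.org trust domain purely for illustration.

```python
from urllib.parse import urlparse

def parse_spiffe_id(spiffe_id: str) -> tuple[str, str]:
    """Split a SPIFFE ID into its trust domain and workload path."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe":
        raise ValueError(f"not a SPIFFE ID: {spiffe_id}")
    return parsed.netloc, parsed.path

print(parse_spiffe_id("spiffe://example.org/billing/payments"))
# ('example.org', '/billing/payments')
```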

It's at this point that workload identity and device identity become intertwined. SPIRE doesn't just hand out identities to any workload that requests one (that's not zero trust!). Instead, it maintains a registry of workload identities and the conditions under which they exist, and an agent attests those conditions through a variety of plugins. Conditions can include being on specific devices (SPIRE uses the term node, which shouldn't be confused with Kubernetes Nodes, although those can be attested too!), or being run in a specific container. There are also third-party plugins available that utilise the TPM to provide attestation, similar to how user device identities were generated and validated above. It can be considered in a similar way to MFA with humans (there is something you know, often a password, and something you have, such as a hardware security key): here the device will still have some sort of account credential, but the attestations can be used as something intrinsic to the device to provide an extra level of assurance.

For example, if running a simple Python application on an EC2 instance in AWS, SPIRE will use the EC2 Instance Identity Document to attest the VM's properties, including the AMI, region and availability zone, or related properties like the associated security group.

Figure 3 — Showing the successful attestation process leading to workload identity being granted. Source: https://spiffe.io/docs/latest/spire-about/spire-concepts/
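As a small aside on the EC2 example, the instance identity document is just JSON served by the instance metadata service, so it's easy to see the properties a node attestor has to work with. The sketch below assumes it is run from inside an EC2 instance with IMDSv1 enabled (IMDSv2 additionally requires a session token).

```python
import json
import urllib.request

# The EC2 instance metadata service serves the instance identity document;
# SPIRE's aws_iid node attestor consumes the signed form of this document.
IMDS_URL = "http://169.254.169.254/latest/dynamic/instance-identity/document"

with urllib.request.urlopen(IMDS_URL, timeout=2) as response:
    document = json.load(response)

# A few of the properties that can be turned into node selectors.
print(document["instanceId"], document["imageId"],
      document["region"], document["availabilityZone"])
```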

Once the node has been attested, the workload itself must be attested too. Properties that are specific to the workload include the process running the workload, the group and user that run the workload, and the orchestration system used by the workload.
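To make that concrete, a SPIRE registration entry pairs a SPIFFE ID with selectors describing those properties; a workload only receives the identity if every selector matches what the agent attests. The sketch below simply models such an entry as plain Python data: the IDs are invented, and the selector strings follow the general shape of SPIRE's unix and k8s workload attestors.

```python
# An illustrative registration entry (all values invented for the example).
registration_entry = {
    "spiffe_id": "spiffe://example.org/billing/payments",
    "parent_id": "spiffe://example.org/spire/agent/node-1",
    "selectors": [
        "unix:uid:1000",        # the user the workload process runs as
        "unix:gid:1000",        # the group the workload process runs as
        "k8s:ns:payments",      # the orchestrator's view: namespace
        "k8s:sa:payments-api",  # and service account
    ],
}

def matches(entry: dict, attested_selectors: set[str]) -> bool:
    # All of the entry's selectors must be present in the attested set.
    return set(entry["selectors"]).issubset(attested_selectors)
```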

When two workloads connect, the SPIFFE identifier of the workload making the request must be validated by the workload receiving the request.
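A rough sketch of that check, using the Python cryptography package and an invented SPIFFE ID, might look like the following; in practice the SPIFFE workload API and its client libraries handle SVID and trust bundle validation for you.

```python
from cryptography import x509

EXPECTED_CALLER = "spiffe://example.org/frontend"  # illustrative SPIFFE ID

def caller_is_authorised(peer_cert_pem: bytes) -> bool:
    """Check that the caller's x509 SVID carries the expected SPIFFE ID."""
    cert = x509.load_pem_x509_certificate(peer_cert_pem)
    san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
    # An x509 SVID encodes its SPIFFE ID as a URI Subject Alternative Name.
    uris = san.value.get_values_for_type(x509.UniformResourceIdentifier)
    return EXPECTED_CALLER in uris
```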

Combining Workload, Device and Human Identities

Once human, workload and device verification have been performed as part of an authentication request, these attributes should be verified together for additional security. For example, users might be trusted to view emails on their mobiles in their country of residence but not abroad (though for a sales team member in EMEA that might be extended to all of Europe). This already happens for basic properties in the central identity aware proxy for user requests; however, a second layer of validation needs to be done by the application itself, and for all workload requests.

There are many policy-based tools in the ecosystem. Several are based around XACML policies, such as the WSO2 Identity Server. However, one of the biggest players in the field is the Open Policy Agent (OPA), another project under the CNCF umbrella.
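As a sketch of how an application or proxy might ask OPA for a decision: the policy package name and input fields below are invented, but the /v1/data path is OPA's standard decision API.

```python
import json
import urllib.request

# Combine human, device and workload attributes into a single policy input.
decision_input = {
    "input": {
        "user": {"id": "jane.doe@example.com", "region": "EMEA"},
        "device": {"managed": True, "country": "FR"},
        "workload": {"spiffe_id": "spiffe://example.org/mail-frontend"},
        "action": "read_mail",
    }
}

request = urllib.request.Request(
    "http://localhost:8181/v1/data/zerotrust/authz/allow",
    data=json.dumps(decision_input).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)

print(result.get("result", False))  # True if the policy allows the request
```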

Once the identity aware proxy is in place and all the data is available to the policy decision point, the attributes received can be investigated further with artificial intelligence (in a similar way to how cloud services such as AWS GuardDuty work) to find unusual patterns in access requests, even successful ones. For example, if a user logs into their phone in Paris and onto a laptop in London within 5 minutes, it's likely that one of the devices is compromised.
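Even without machine learning, a simple "impossible travel" heuristic captures the Paris/London example. The sketch below flags two logins whose distance could not plausibly be covered in the time between them; the 900 km/h ceiling is an arbitrary illustrative threshold.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

@dataclass
class Login:
    lat: float
    lon: float
    time: datetime

def distance_km(a: Login, b: Login) -> float:
    # Great-circle distance between two logins (haversine formula).
    dlat, dlon = radians(b.lat - a.lat), radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def impossible_travel(a: Login, b: Login, max_speed_kmh: float = 900) -> bool:
    hours = abs((b.time - a.time).total_seconds()) / 3600
    return distance_km(a, b) > max_speed_kmh * max(hours, 1e-6)

paris = Login(48.86, 2.35, datetime(2023, 11, 29, 9, 0))
london = Login(51.51, -0.13, datetime(2023, 11, 29, 9, 5))
print(impossible_travel(paris, london))  # True: roughly 340 km in five minutes
```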

Conclusion

There are many independent but related parts required to implement a truly zero trust system, and it will likely need to be rolled out slowly across the organisation. It took an organisation with the technical ability and size of Google over 5 years to migrate to their initial version of BeyondCorp (based on their whitepapers), so don't expect instant conversion of all parts of the business. Initially focussing on the 80% use case (prioritised based on business value and data confidentiality) will still yield important security and efficiency results.

Assuming a single user identity system is in place across the organisation, logical next steps include setting up a device registry, complete with identifiers, across the organisation, and taking better advantage of the TPMs on modern devices as an initial step towards validating devices as part of the authentication process in future iterations.

Further steps include designing the authentication requirements across the organisation for a centralised identity proxy. There's no need to make the initial rules more stringent than the current ones, otherwise you risk hindering adoption. However, you need to ensure that developers can migrate to the proxy easily and that existing rules can be ported into the new system. Additionally, once development work begins, creating clear, concise documentation for integrators is vital at every stage.

About the Author
George Wilson is a Principal Consulting Engineer at Version 1.

I am an engineer, architect and open source developer specialising in Kubernetes, AWS & Azure DevOps, with experience in Python, PHP and Node.js.