Path Building vs Path Verifying: Implementation Showdown
Disclaimer: This post represents personal opinions and thoughts, and does not represent the views or positions of my employer, Google.
In my previous post, I talked about what the issue with Sectigo’s expired root was, from the perspective of the PKI graph, and talked a bit about what makes a good certificate verifier implementation. Unfortunately, despite browsers and commercial OSes mostly handling this issue, the sheer variety of open-source implementations means that there’s a number of not-so-good verifiers out there.
In this post, I’ll dig in a little deeper, looking at specific implementations, and talking about how their strategies either lead to this issue, or avoided this issue but will lead to other issues.
While this is not a comprehensive look at every third-party library, nor of every piece of functionality needed to cope with the broader Internet, it is meant to provide a “brief” overview of how different libraries handle this graph problem, and how different design decisions can lead to bad results.
For folks who are interested in fixing these libraries, I cannot stress enough, read RFC 4158. An implementation that considers even those basic strategies is vastly easier to maintain, predict, and secure, both in the short-term and the long-term.
I’ve mostly focused on source analysis, so I very well could have botched something, missing a global config variable, or didn’t look into that One Weird Function, so apologies if that’s the case.
OpenSSL’s design does not support path building in any meaningful sense. The bulk of OpenSSL’s path validation logic lives in the build_chain of x509_vfy.c. Despite improvements made during the 1.0.0 series to support nameConstraints, among others, and 1.1.0’s improvements causing it to actually recognize trust anchors, OpenSSL remains a bit of a case study for how not to verify certificates.
- The OpenSSL public and internal APIs assume there is only a singular issuer certificate.
- It applies the X.509v3 extensions and application policies (like algorithm checks) after chain building.
- It checks revocation after chain building.
The overall state machine/algorithm that OpenSSL uses is a bit of a hodge-podge of oddities that can cause issues. For example, CRLs are checked before verifying signatures, which can cause issues with CAs that use multiple public keys with the same name, such as for CA rollover, since they won’t verify the CRL correctly.
The core algorithm has effectively three modes: trusted-first (X509_V_FLAG_TRUSTED_FIRST, introduced 1.0.2, default 1.1.x+), random/server-preference (X509_V_FLAG_NO_ALT_CHAINS; introduced > 1.0.2b), and trusted-last (neither/legacy).
In all three modes, the verifier starts with the server certificate, and then continues to try to add certificates until it finds a chain that works. In trusted-first mode, it’ll search the trusted store first. In random/server-preference, it will search the untrusted store. I call this “random” because, as mentioned, the API only supports returning a single certificate for an issuer, so if the peer sends two certificates that match the issuer name in the TLS handshake, it’s “undefined” which you’ll get, at least by API contract. This process continues until a trusted certificate is found.
With trusted-last, which is the 1.0.2–1.1.x behavior, if it doesn’t find a path to a trusted root using the random/server-provided method, it works backwards to start cutting off untrusted certificates, trying to replace them with trusted certificates.
The above algorithm reflects the general anti-pattern for certificate verification, which is “append to win”, which is why OpenSSL continues to have issues.
For example, the trusted-first mode doesn’t consider the validity period or other attributes of the certificate, as those are verified after the chain has been returned. Using the examples from the previous post, imagine the following two chains: EE←B¹←A¹←Trust Anchor and EE←B¹←A²←C¹←Trust Anchor. For OpenSSL, “←Trust Anchor” just means the previous cert, either A¹ or C¹, is on the filesystem. If A¹ is expired, the trusted-first logic will cause validation to fail, because it will ignore the server-supplied A², which it could use to build the chain to C, while trusted-last works. Of course, trusted-last can have its own issues, as the Sectigo incident shows.
OpenSSL does check EKUs, by way of calling the X509_check_purpose, and it does seem to check this for the entire chain. It will check the commonName against nameConstraints, if no DNS name SANs are present, as well as the emailAddress attribute.
While LibreSSL is very similar to OpenSSL, given their shared ancestry, not all of the changes and refactoring that has happened in OpenSSL has migrated to LibreSSL. LibreSSL still keeps the bulk of the algorithm in X509_verify_cert, and behaves similar to an OpenSSL 1.0.2–1.1.x client (that is, trusted-last).
In light of the Sectigo expiration, a small tweak was made, such that if the only untrusted certificate provided is expired, it will instead search the trust store for an unexpired certificate, before adding the untrusted certificate.
While this tries to split the difference between trusted-first and trusted-last, the unfortunate shared ancestry with OpenSSL means that it suffers from the same post-chain verification hodge-podge, and so bad things still happen.
BoringSSL, like LibreSSL, shares a history with OpenSSL that diverged following Heartbleed. Like LibreSSL, there’s still a flow of patches and work in both directions, although the certificate verification largely remains untouched, as the plan is to move away from the X509* API entirely, as much as possible.
BoringSSL currently hasn’t adopted any changes like LibreSSL, although that could change in the future. For now, it’s trusted-last.
For all of my criticism for OpenSSL, GnuTLS remains some of the most obnoxiously obtuse certificate verification code I’ve read, and I’ve spent plenty of time tracing Windows assembly in WinDBG debugging Crypt32.dll. It reminds me a bit of FizzBuzz Enterprise Edition, on first reading.
My experience with GnuTLS is mostly through seeing research papers discovering cryptographic implementation issues. I was surprised, as in writing up this post, it looks like GnuTLS’s advisory page glosses right over these sorts of issues. Having spent time reading through the code, I’m not surprised Tavis was able to find low-hanging double-frees.
GnuTLS bashing side, despite its maddening complexity, GnuTLS boils down to a very simple function, even more basic than OpenSSL. According to both the documentation and past mailing list posts, the entry-point is gnutls_x509_trust_list_verify_crt / gnutls_x509_trust_list_verify_crt2. By default, the bulk of path building is handled via _gnutls_sort_clist, which orders the untrusted certificates into a sequential chain, to then have trusted certificates added.
While structurally, this would appear similar OpenSSL’s trusted-last implementation, the actual implementation goes trusted-first, in that after the list is sorted, certificates are swapped with known-trusted variants. When this issue first came to their attention, a fix similar to LibreSSL’s was adopted, by peeking in the trust store for an unexpired certificate and swapping things out. For GnuTLS, it appears that the bug was a logic bug in the mostly-identical duplicate function _gnutls_pkcs11_verify_crt_status function, which is used to search PKCS#11 tokens for trusted certificates, and evaluated the expiration of the wrong certificate.
Similar to OpenSSL, the policy checks come at the end of all of this, which can lead to all sorts of sad times when different paths have different policies attached, are revoked, and so on.
It does check the commonName and emailAddress against nameConstraints, as appropriate, and does make sure all of the EKUs chain.
I know I was rather critical of OpenSSL above, but in no uncertain terms, I would advise against using GnuTLS. Having spent time reading and tracing the code, it has all the hallmarks of boundless complexity masquerading as flexibility. The abstractions harm readability and debugability, while the actual implementation leaves much to be desired. Heartbleed was a result, in part, of not saying no to needless complexity, and GnuTLS has the similar “code smell” of supporting both everything and nothing at the same time. While OpenSSL got some discipline there, and forks like BoringSSL made their mark by actively rejecting and removing complexity, GnuTLS has the hallmarks of wanting/needing to provide a “stable” ABI for a long time, without the ability or willingness to say no or revisit past decisions.
Like GnuTLS, most of my experience with Botan is reading about how it once again is vulnerable to some new cryptographic attack that other libraries were immune from. However, since I was asked whether it was affected, I spent some time looking at the code.
Given all of the above remarks about libraries failing, consider my surprise when Botan actually did try to explore the entire certificate path graph for a valid chain, and incorporated the policy checks for each path as part of the path selection.
It doesn’t check EKUs as part of chain processing, as best as I can tell, nor does it seem to apply nameConstraints to the commonName, although it supports commonName fallback.
I wouldn’t recommend Botan, both due to issues like this and the fundamental architecture that lends itself to security vulnerabilities. But beyond the potential security issues, the actual implementation just seems very inefficient, throughout. Reading the code, it begins allocating memory to track every possible path, even before verifying these paths, bringing worst-case performance to the forefront. There’s likely low-hanging memory DoS fruit there. Similarly, during the certificate path processing for nameConstraints, PKIX::check_chain builds a forward path from the leaf to the root, checking extensions along the way. When it begins processing a nameConstraint extension, it recomputes and validates the nameConstraints for all of the remainder of the path, burning CPU to constantly recompute the same thing. RFC 4158 talks about how implementations can optimize this, but it feels fairly wasteful to leave this performance on the table. While I didn’t spend any time compiling it and getting it set up, it would be interesting to see how well it fares against test suites like BetterTLS, as there were things that just didn’t look right.
While I did look at other libraries, such as BearSSL and yaSSL, it’s not surprising that the IoT-heavy focus of these libraries means they’re intentionally limited in what they support. For IoT, where memory is at a premium, graph traversal means buffering things in memory, which is often anathema to these constrained environments.
To BearSSL’s credit, the limitations and expectations are clearly documented. Reading through the code, the X.509v3 verification is streamed as a linear series of certificates, using a trusted-first approach before processing the next certificate. While it’s possible to save these certificates, so that a consumer of the library could implement path building via the chain engine API, that’s not really fitting with the spirit of the library. It also doesn’t process nameConstraints, but that’s not surprising given the complexity.
If you are going to use an IoT library, or build an IoT device, then you need to make sure you fully control the client, the server, and all PKIs involved in issuing certificates. Which is to say, avoid using publicly trusted certificates entirely.
When Browsers Got It Wrong
While browsers were, by and large, completely unaffected by the Sectigo expiration, this doesn’t mean they’ve been immune from issues that share a similar root cause.
MacOS Expired Intermediates
Beginning with macOS 10.12, certificate verification started going through a modern verifier, trustd, which had previously been available on iOS. However, prior to that, the macOS verifier shared many of the same issues as described above. While it supported returning multiple issuer certificates, it didn’t do a full graph analysis, and instead tried to pick the “best”.
Unfortunately, sometimes the “best” isn’t good enough, particularly when an expired intermediate is installed in the Keychain, leading to a scenario similar to the OpenSSL trusted-first issue.
Considering that Chromium still runs on, supports, and is actively developed for macOS 10.10 and 10.11, it has a giant rat’s nest of complexity to try and work through the various bugs and quirks in the OS verification stack, accumulated over the years. While Chromium is in the process of replacing all of this with its own integrated verifier, it highlights that even the OS doesn’t always get things right.
NSS: A Tale of Three Verifiers
NSS, which is used by Mozilla, actually has three independent certificate verification stacks. “Legacy”, the verifier originally used by Firefox, “libpkix”, a ‘port’ of the Java APIs to C, via 7 layers of preprocessor macros, and “mozilla::pkix”, or “mozpkix”. Each of these verifiers has their own quirks and bugs, although some more intentional than the others.
“legacy” was implemented similar to OpenSSL or GnuTLS, by finding “the” issuer for each certificate along the way. However, the reason Firefox wasn’t all doom and gloom when using this verifier, despite the many bugs it did cause, was that under the hood, NSS actually gathered all issuers, and used a function nssCertificateArray_FindBestCertificate to try to find the certificate path most likely to succeed. A CA could use this logic to gently nudge NSS into finding a path that would work, by manipulating the fields in certificates to cause them to sort better or worse for legacy clients.
“libpkix” was a hand-translation of the concepts found in Java’s PathBuilder/PathVerifier classes, adopted to C, and with heavy use of macros. While Mozilla never switched to it, over concerns around performance and memory usage, except for verifying EV certificates, Chromium launched using this verifier from the get-go. Despite the incredible complexity in reading and maintaining the code, it was a more faithful implementation of RFC 4158 than any other library out there.
While Mozilla considered adopting “libpkix” for Firefox, they looked at the effort spent by the Chromium team maintaining and debugging that code, among many reasons, opted for a ground-up rewrite into a modern verifier, similar to Apple with trustd. This was initially known as “insanity::pkix”, and was later renamed to “mozilla::pkix”, and was lead by Brian Smith while he was still at Mozilla. Brian’s write-up of the “why” gives context about the issues with the first two verifiers, and the design work he started there has continued with his Rust implementation of a modern verifier, webpki.
Unless you take explicit action to use mozilla::pkix/mozpkix, you’ll likely end up with the legacy verifier. libpkix has been largely demoted to opt-in, but it may still be included by some distros, given that Firefox has moved on from it. The legacy verifier shares many of the faults and foibles of implementations discussed above, and so “use NSS” is not quite the answer as “use mozpkix.”
Android has been a complex, and sometimes unfortunate, story. The current implementation uses a recursive algorithm to traverse the graph, using all available certificates to try to find certificate paths that comply with local security policy. While this is still not robust enough to handle the Web at large without errors¹, been a massive improvement over where things were. In Android releases before Android N, Android used an algorithm similar to many of the libraries above: a trusted-last approach that tried to work backwards from the server-supplied chain to a trust anchor, although with its own set of embarrassing security issues.
Despite CryptoAPI being one of the most robust certificate validators, in part due to it supporting just about every scenario mentioned in RFC 5280 and RFC 4158, it’s not been without its own bugs. CryptoAPI is used throughout Windows, including IE/Edge, although it’s in the process of being phased out in Chromium.
While certain bugs are still so astounding that nearly twenty years later it’s still surprising that they even happened, and bugs since then have similarly shocked and amazed, the actual path building and handling has turned out to be incredibly robust for a variety of certificate graphs.
The one exception to this is when systems have Automatic Root Update disabled. When Windows is released, there’s a copy of the root store provided, traditionally as a resource within crypt32.dll. When Windows verifies a certificate chain, and it encounters an untrusted self-signed root, it will check to see if AuthRoot needs to be updated, pulling down a list of hashes of certificates and other attributes, and potentially fetching the trusted root, if needed. When AuthRoot is disabled, or if the user isn’t able to connect to Microsoft to update, however, this can all fall apart, with clients only able to build paths to expired or no longer valid roots. While this isn’t strictly a path building / path validation issue, it does reflect the necessity that having a robust verifier isn’t sufficient, and regular updates are still necessary.
Despite the length of this post, it has primarily focused on how different libraries build and validate certificate chains, in terms of simply discovering issuers. If the API doesn’t support multiple issuers, questions such as application provided untrusted certificates become irrelevant, because the extra certs aren’t going to help find a valid path. It also limits the value of trust store management, if any change to the trust store could cause otherwise valid certificates to break.
This is all to say that RFC 4158 is a starting point, and not the end goal. However, in order to address many of the security issues that can arise in certificate validation, it’s absolutely critical to support some basic path building via graph traversal. If an implementation supports a graph traversal algorithm, adding additional policies becomes significantly easier, which improves both the security and the robustness of client software. However, the current approaches used by many open-source libraries means that they are themselves some of the most dangerous code in the world.
There’s a number of libraries that, in the course of tweeting about the research process for this, people were interested in. Is library X bad? Is library Y good? While I answered some of these tweets, my hope is that the analysis and links provided here help people look at how they can evaluate themselves. The biggest signal of doom begins with looking at how issuer certificates are discovered. If only a single certificate can be returned, the implementation is going to have a bad time, so start there.
¹ Where the Wild Warnings Are: Root Causes of Chrome Certificate Errors, Section 10.3