Disclaimer: This post represents personal opinions and thoughts, and does not represent the views or positions of my employer, Google.
On the last Saturday in May, at 10:48 GMT, a time when most folks in the US were still sleeping, the self-signed AddTrust External CA Root certificate expired. In an ideal world, this would have been completely uneventful; an idle curiosity for those who embrace certificate numerology, but otherwise nothing of note.
Shortly before going to bed that morning, in the process of debugging an unrelated Roku issue (trouble casting), I noticed they were reporting some disruptions should be expected, and so I tweeted a few tweets about what I expected to happen:
At the time, my thinking was that most libraries were at least capable of handling this scenario, and that at best, it’d be a way to figure out who hadn’t upgraded to OpenSSL 1.1.x, as better handling for this scenario was one of the important fixes in that release. For example, judging from the most recent version of Roku’s open source packages, it looked like they were still using OpenSSL 1.0.2h, a version with plenty of known CVEs.¹
By the time I’d woken up, my Twitter notifications were filling with a steady stream of folks reporting broken sites, services, and devices, but it wasn’t entirely what I’d expected. While there were servers with out-of-date versions of OpenSSL, and device vendors who looked to be downright bad at updating OpenSSL², a number of broken products seemed to be using LibreSSL, including macOS’s distribution of cURL, or GnuTLS, such as LG Smart TVs and Debian’s apt. While I expected IoT-targeting libraries that I’d worked with in the past, like MatrixSSL and wolfSSL, to be as awful now as they were when I’d worked with them, I did not expect to see modern macOS and Debian falling over.
Andrew Ayer quickly put together a post that described the problem and possible solutions, but this post is going to take a more in-depth look at a few of the open-source libraries involved, why things went bad, why they’re still bad, and what can be done about it.
Understanding The Problem
To understand the problem, it’s first necessary to dispel a common misconception about certificates. When I talk to people who are responsible for configuring certificates on their servers, they often talk about the certificate chain: the singular set of certificates, from their server’s certificate to a Root owned by the CA they bought their certificate from, as if there is one Right and True Way to configure the server. Under this view, any problems that result are inherently the server’s fault, and blame should be placed on the server for being misconfigured.
It’s not unreasonable for people to have that view. After all, if you look at virtually every TLS/web server software’s configuration, you configure the chain. If you read any of the TLS RFCs prior to TLS 1.3, such as RFC 5246 (TLS 1.2), you’ll see language directing the TLS server to send the chain, an unbroken sequence of certificates to a root.³
Unfortunately, that’s not the case. There are many chains, with different chains needed by different clients, who have different root stores and different behaviors. The server operator isn’t at fault; the problem is actually quite complex.
This particular problem was caused by libraries that were not prepared to handle that complexity, even though it has existed since the very earliest days of SSL/TLS and the use of publicly trusted CAs. Entire RFCs have been written about how to handle it correctly; these libraries didn’t, and that unpreparedness leads to problems like what we saw.
Figure 7 from RFC 4158 best illustrates the hidden complexity involved here, called “simple” because it involves only 3 CAs, instead of the hundreds that are actually involved. The nodes A, B, and C all represent CA certificates. To keep it “simple”, assume that there’s one and only one key for each of these CAs.
For the single server certificate (“EE”), there are four potential paths to a trusted source: EE←B¹←A¹←Trust Anchor, EE←B¹←A²←C¹←Trust Anchor, EE←B²←C¹←Trust Anchor, and EE←B²←C²←A¹←Trust Anchor. A¹ and A² are used to indicate that while there’s a single logical CA, called A, there are two distinct certificates associated with it, each with their own properties, restrictions, and, relevant to this incident, expiration. A¹ is issued by our Trust Anchor, while A² is issued by C, a CA that itself has two certificates.
A good PKI implementation, one robust against problems like this, is one that is capable of finding and evaluating all four of those paths. For example, imagine that C¹ and C² both were expired/revoked/untrusted. This would mean that there’s only one valid certificate chain that ends in a trust anchor: EE←B¹←A¹←Trust Anchor. If the server sent the chain EE←B¹←A²←C¹, which includes the expired C¹ certificate, a robust library would know how to replace A² with A¹, which would lead to that trusted path, ensuring things Just Work.
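To make those four paths concrete, here’s a minimal sketch in Python (a toy model with string names, not a real verifier) that treats the Figure 7 graph as a set of (certificate, subject, issuer) records and enumerates every path from EE to the trust anchor with a depth-first search:

```python
# Toy model of RFC 4158's Figure 7: each certificate is a
# (cert_id, subject, issuer) tuple; path building is a depth-first
# search from the end-entity certificate toward the trust anchor.

CERTS = [
    ("EE", "EE", "B"),
    ("B1", "B", "A"), ("B2", "B", "C"),
    ("A1", "A", "TA"), ("A2", "A", "C"),
    ("C1", "C", "TA"), ("C2", "C", "A"),
]

TRUST_ANCHOR = "TA"

def find_all_paths():
    """Return every certificate path from EE to the trust anchor,
    avoiding loops by never revisiting a CA *name* already on the
    current path (RFC 4158's name-based loop detection)."""
    by_subject = {}
    for cert_id, subject, issuer in CERTS:
        by_subject.setdefault(subject, []).append((cert_id, issuer))

    paths = []

    def walk(issuer_name, visited, path):
        if issuer_name == TRUST_ANCHOR:
            paths.append(path)
            return
        if issuer_name in visited:
            return  # this CA name is already on the path: a loop
        for cand_id, cand_issuer in by_subject.get(issuer_name, []):
            walk(cand_issuer, visited | {issuer_name}, path + [cand_id])

    walk("B", {"EE"}, ["EE"])
    return paths
```

Running `find_all_paths()` yields exactly the four paths above; drop C¹ and C² from `CERTS` and the same search still finds EE←B¹←A¹←Trust Anchor, which is the robustness property described here.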
This is roughly what happened with the AddTrust expiration. AddTrust External CA Root was C¹, and USERTrust RSA Certification Authority was A². Clients that had trouble connecting were clients that didn’t know how to swap out A² with A¹, and thus couldn’t verify the path, and blew up spectacularly.
Without wanting to get too bogged down in details, it’s worth noting that the term “trust anchor” is not interchangeable with “self-signed/root certificate”. A trust anchor is, at its core, just a CA name (as in, a Distinguished Name) and a public key, although it can have many other optional attributes attached, such as an expiration or a purpose. It doesn’t have to be a root certificate, or even a certificate for that matter, although certificates are the easiest way to configure them. Looking at that Figure 7 graph again, it’s possible for any CA in that graph to be configured as a trust anchor. That is, if A is a trust anchor, then the client trusts A¹ and A², and if C is a trust anchor, the client trusts C¹ and C² equally⁴. Trust anchors don’t have to be self-signed — they’re just keys and names — and they affect how certificate paths are verified. This is relevant to the AddTrust expiration, because properly recognizing that USERTrust RSA Certification Authority was itself already configured as a trust anchor would have, in this specific case, also avoided the problem.
Visualizing the Problem
In practice, the PKI for publicly trusted CAs is far more complex. You can play around with an off-the-cuff visualization (source queries) I worked up to see this, which shows just the “current” Mozilla trust store. While my D3 skills are terrible, this hopefully makes it easy to visualize the inherent complexity and relationships, as any certificate with more than one arrow pointing away from it means there is more than one certificate path. Censys is also helpful, but as you can see, in some cases it doesn’t exhaustively display all possible certificate paths that a client might need or encounter. It does, however, show how just four nodes can be represented by 11 different certificate paths.
As messy as this is, it gets worse when considering more than just one trust store. Scott Helme has a decent enough write-up of the problem of legacy clients, and how different paths are needed. The above query is meant to give a small taste of that complexity, which continues to increase when considering older versions of Mozilla’s trust store, such as on poorly-maintained Linux distros, and the trust stores of other vendors, such as Apple and Microsoft.
A different way of thinking about the problem is that anytime a CA has two or more different issuers, the path graph gets complicated. Looking at the list of CAs that have this property, you see every major CA represented. This “problem” is an intentionally introduced feature of X.509v3, which added cross-certificates in order to better reflect how organizations express trust. It has been widely used since, but is not widely supported outside of browser and operating system verification stacks.
How to Avoid the Problem
The first step to avoiding the problem is to stop thinking about “the” certificate chain, and instead think of building and verifying potential certificate paths, each of which is a chain to a trust anchor.
Successful client implementations all have one thing in common: they treat the problem as a graph traversal problem, as described in RFC 4158. The graph is constructed from the nodes available to the client: the server’s certificate, the additional certificates it sent via TLS, certificates the client has available (both trusted and untrusted), and potentially online sources like those retrieved on-the-fly via the authorityInformationAccess extension. The goal is to build a valid path through this graph, from the end-entity certificate to a trust anchor, using a depth-first search. The order the server sends things might be used as an optimization hint, or it might be ignored, but in any event, as many paths as possible are tried until one of them works.
RFC 4158 covers this in far greater depth, including discussions about how to avoid cycles and optimize selection. While Golang’s verifier leaves a number of optimizations on the table, it still manages to do this path building in a little less than 400 lines of code, highlighting how straightforward a depth-first search is to implement.
It’s typically quite easy to spot a library that’s not capable of this: if the API assumes that there is a singular issuer certificate, it’s going to have a bad time. This can take different forms, but the most common anti-patterns are “get the certificate with this subject name” or “get the issuer for this certificate”. APIs designed around this assumption aren’t really capable of tackling the graph, and as a consequence, will fail in some form. An API should be capable of returning multiple certificates that match a given subject name, so that the path builder can consider all of them when building a certificate path.
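As a sketch of the difference (with hypothetical store functions, not any real library’s API), compare a lookup that returns “the” match against one that returns every candidate:

```python
# Hypothetical certificate store; certificates are modeled as plain
# dicts for the sketch, mirroring A's two certificates from Figure 7.

STORE = [
    {"id": "A1", "subject": "A", "issuer": "TA"},
    {"id": "A2", "subject": "A", "issuer": "C"},
]

# Anti-pattern: "get THE issuer" -- the first match wins, and the
# alternative certificate is silently discarded.
def get_issuer(subject):
    for cert in STORE:
        if cert["subject"] == subject:
            return cert
    return None

# Robust shape: return every candidate, and let the path builder
# decide which ones lead to a trusted path.
def get_issuers(subject):
    return [cert for cert in STORE if cert["subject"] == subject]
```

With `get_issuer`, whichever certificate happens to be stored first is the only one ever tried; `get_issuers` is what makes the graph traversal possible at all.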
More Ways to Go Wrong
Even if a library supports path building, doing some form of depth-first search over the PKI graph, the next most common mistake is treating path building and path verification as separable, independent steps. That is, the path builder finds “a chain” that is rooted in a trusted CA, and then completes. The completed chain is then handed to a path verifier, which asks “Does this chain meet all the caller’s/application’s requirements?” and returns a “Yes/No” answer. If the answer is “No”, you want the path builder to consider the other paths in the graph, to see if there are any “Yes” paths. Yet if path building and verification are separate steps, you’re bound to have a bad time.
A “No” can come for a number of reasons, many of which are touched on in RFC 4158. One of the certificates may be revoked; the certificate might be trusted, but not for the purpose the application wants (e.g. trusted for S/MIME e-mails, but not TLS); the certificate policies could be incompatible; the client might have restrictions applied; etc. In the case of the Sectigo expiration, “a certificate is expired” was the cause of the “No”, but the applications weren’t prepared to handle that. If your application supports path building, but still assumes there is “the chain” that is supplied to the verifier, bad things will happen.
Key Elements of a Successful Implementation
If you develop, or contribute to, a library that was affected by this issue, and want to make things more robust, what are the key properties an implementation should have in order to be prepared?
- Make sure your APIs return issuers, and not just a single issuer.
- When returning issuers, have a plan to sort them. You can do a simple sort, such as preferring trusted certificates first, or you could consider the strategies from RFC 4158. As RFC 4158 calls out, for every positive example, there’s likely a negative counter-example as well; the joy of engineering is finding the right balance for the use case.
- Treat the certificates from the server as TLS 1.3 describes: a collection of certificates that can be used to build out the graph, rather than an ordered linear chain, with the only guarantee being that the first certificate is the server’s certificate.
- Support some way of discovering additional links in the graph. This could be by allowing the calling application to provide a set of “not positively trusted” certificates, such as Mozilla’s intermediate preloading does, or it could mean supporting fetching authorityInformationAccess and allowing the CA to provide these additional certificates.
- Integrate any checks as part of path building, such that path verification is merely a part of path building. You don’t need to build every chain, and then try to verify every chain; verify-as-you-go is a fine strategy. However, it’s essential that if a chain doesn’t verify, path building continues and tries to exhaust all paths before returning.
- As with all graph algorithms, know your limits. Whether it’s the length/depth of the chain, the number of paths explored, the number of signatures verified, or the total time spent examining the graph, apply bounds to limit shenanigans.
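Putting several of these points together, here’s a hedged Python sketch (all names, fields, and limits are illustrative, not any real library’s API) of a path builder with verification folded in: the caller-supplied checks run on each candidate path, a failed path doesn’t end the search, and depth/path budgets bound the work:

```python
# Sketch: verify-as-you-go path building with explicit bounds.
# `issuers_for`, `is_anchor`, and `checks` are caller-supplied policy;
# certificates are modeled as dicts for the sketch.

def build_verified_path(leaf, issuers_for, is_anchor, checks,
                        max_depth=8, max_paths=64):
    """Depth-first search that returns the first path passing all
    checks, but keeps exploring alternatives when a path fails."""
    explored = 0

    def walk(path, visited):
        nonlocal explored
        if len(path) > max_depth or explored >= max_paths:
            return None  # bound the work an adversarial graph can cause
        top = path[-1]
        if is_anchor(top):
            explored += 1
            # Verification is part of building: a "No" here just means
            # we fall through and try the remaining paths.
            return path if checks(path) else None
        for cand in issuers_for(top):
            if cand["subject"] in visited:
                continue  # name-based loop detection
            found = walk(path + [cand], visited | {cand["subject"]})
            if found:
                return found
        return None

    return walk([leaf], {leaf["subject"]})

# Tiny demo: A1 is expired, so checks reject the first candidate path
# and the builder falls back to the path through A2.
STORE = [
    {"id": "A1", "subject": "A", "issuer": "TA", "expired": True},
    {"id": "A2", "subject": "A", "issuer": "TA", "expired": False},
]
LEAF = {"id": "EE", "subject": "EE", "issuer": "A", "expired": False}

def issuers_for(cert):
    return [c for c in STORE if c["subject"] == cert["issuer"]]
```

With A¹ expired, the builder’s first candidate path fails the checks, and the search quietly falls back to the path through A² rather than reporting an error — the behavior the affected libraries lacked.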
Built for the Internet
Once a basic path builder is implemented, as described above, it’s also necessary to think about a number of important API decisions that are relevant for TLS server authentication on the Internet. Some of these are documented, some are lessons hard learned, but all are important if you’re using the same CAs as browsers use, and hoping for the same security.
RFC 5280 requires (in the RFC 6919 sense) support for nameConstraints. However, support is somewhat loose; only the directoryName constraints need to be supported, and other name types can be ignored if the nameConstraints extension wasn’t marked critical. Unfortunately, older versions of OpenSSL and macOS didn’t support nameConstraints, and certificates correctly failed to work with those systems when the extension was marked critical. As a result, the browser members of the CA/Browser Forum decided to allow CAs to issue certificates without marking the extension critical. Systems that implemented nameConstraints were still protected, and constraining CAs was better than not constraining them, so it was worth the deviation.
If the client library doesn’t support nameConstraints, it’s exposed to risk from these CAs, and so it’s important to fix.
However, beyond just supporting nameConstraints over the subjectAltName, it’s important to support nameConstraints on the commonName, if the commonName is supported. If a client library supports falling back to commonName, and doesn’t enforce these constraints, the CA can bypass the nameConstraints entirely!
Tools like https://nameconstraints.bettertls.com/ provide useful test suites to test exactly these sorts of issues.
Extended Key Usages
Another common gotcha is not checking Extended Key Usage at all, or not checking it in a “browser-compatible” way. When X.509v3 was introduced, the belief was that Certificate Policies would be the main way that CAs were limited in what and how they issued, as that’s what it was designed for. Unfortunately for the IETF, rough consensus and running code had different ideas about how to restrict issuance, and that involved the use of EKUs. While controversial to some, the reality is that the majority of browsers require that the EKUs on a Subscriber certificate be a subset of the EKUs of the certificates in the certification path. For example, if an intermediate has an EKU which indicates S/MIME, it cannot be used to issue TLS certificates.
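A simplified model of that EKU chaining rule, treating an absent EKU extension as unconstrained (one common interpretation, not a universal one — real implementations differ on this point):

```python
# Sketch of browser-style "EKU chaining": the purposes a certificate
# can be used for are the intersection of the EKUs along the path.
# An intermediate whose EKUs indicate only S/MIME cannot yield a
# working TLS path. EKU sets here stand in for extension OIDs.

SERVER_AUTH = "id-kp-serverAuth"

def effective_ekus(path):
    """path is leaf-first; each element is a set of EKU OIDs, or None
    when the certificate has no EKU extension (unconstrained here)."""
    allowed = None  # None == unconstrained so far
    for ekus in path:
        if ekus is None:
            continue
        allowed = set(ekus) if allowed is None else allowed & set(ekus)
    return allowed

def usable_for_tls(path):
    allowed = effective_ekus(path)
    return allowed is None or SERVER_AUTH in allowed
```

In this model, a leaf asserting serverAuth under an e-mail-only intermediate intersects down to the empty set, so the path is rejected for TLS even though each certificate is individually well-formed.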
This important check could have prevented Flame’s MD5 collision from being useful, but more importantly, it’s the basis for many of the policy decisions in root stores. If one of the EKUs that a root store cares about is not present within an Intermediate CA, then in today’s world, it’s “out of sight, out of mind”.
It used to be that, beyond checking for id-kp-serverAuth, implementations also needed to allow intermediates to assert Netscape or Microsoft’s Server-Gated Cryptography EKUs. Why? Because Sectigo’s other cross-sign only supported those EKUs. Luckily, that cross-sign expired in June of 2019, and so existing libraries that do that mapping, in order to keep Sectigo certificates working, no longer need to do so.
Weak Crypto Handling
It’s 2020. Do you know where your validation library’s weak-crypto knobs are? Successfully disabling SHA-1 requires a robust certificate path builder, because it depends on the ability to consider alternative paths, such as choosing a SHA-256 cross-sign instead of the SHA-1 cross-sign. It’s also necessary to handle situations where one trust anchor is signed with SHA-1 by another trust anchor. If the path builder doesn’t know how to discover that the first certificate is actually a trust anchor itself, it’ll incorrectly reject it.
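As an illustration (the field names are made up for the sketch), a path filter for disabling SHA-1 has to distinguish signatures made *by* an anchor’s key, which are checked, from the signature *on* a certificate that is itself a trust anchor, which nobody verifies:

```python
# Sketch: with a working path builder, "disable SHA-1" becomes a
# filter over candidate paths rather than a hard failure. The
# signature ON a trust anchor is never verified, so a root that was
# cross-signed with SHA-1 must not disqualify the anchor itself.

def path_avoids_sha1(path, trust_anchor_names):
    """path is leaf-first; each element records its subject and the
    hash algorithm of the signature ON it (illustrative fields)."""
    for cert in path:
        if cert["subject"] in trust_anchor_names:
            return True  # reached an anchor; its own signature is ignored
        if cert["sig_hash"] == "sha1":
            return False  # a SHA-1 signature we would actually verify
    return True
```

A builder would apply this as one of its checks, rejecting the SHA-1 cross-signed path while still accepting the SHA-256 alternative — and still accepting a path that terminates at an anchor whose own certificate happens to carry a SHA-1 signature.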
Trust Store Design and Trust Anchor Restrictions
Every browser root store has some notion of additional attributes that can be associated with a trust anchor. For those that store their trust anchors as certificates, it’s rare that these attributes are actually carried in the certificates themselves; they’re often part of the trust store format itself. Users should be wary if a library stores trust anchors as bare certificates on disk, or in arrays, because that API design fundamentally prevents carrying these attributes through.
The most important restriction, particularly for trust stores shared among many applications, is the trust purpose of the CA. Some CAs are only trusted for e-mail, some only for document signing, and some only for the Web. A single undifferentiated trust store often exposes unnecessary risk, because these CAs aren’t managed the same way or with the same risks in mind.
A more popular recent restriction is limiting CAs from issuing new certificates, as measured by the notBefore date of the certificate. If a certificate’s notBefore is greater than some date, it’s not trusted. This is meant to gradually sunset trust in a CA. However, nothing prevents the CA from deceptively setting the notBefore to some date in the past, and so this restriction is often accompanied by enforcing some maximum lifetime for certificates, to put an upper bound on the sunset period and on how long the CA can backdate.
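A sketch of how those two restrictions compose (the dates and the lifetime cap here are illustrative of the technique, not a quote of any particular root program’s policy):

```python
from datetime import date

# Sketch: a notBefore-based distrust restriction, paired with a
# maximum-lifetime check that bounds how much backdating can help.
# The 398-day cap is illustrative, not any specific program's rule.

def trusted_under_sunset(not_before, not_after, distrust_after,
                         max_lifetime_days=398):
    if not_before > distrust_after:
        return False  # issued after the sunset date: never trusted
    if (not_after - not_before).days > max_lifetime_days:
        return False  # over-long cert: backdating can't extend trust
    return True
```

Without the lifetime cap, a CA could set notBefore to the day before the cutoff and notAfter years into the future; the cap turns the sunset date into a genuine upper bound on how long the CA’s certificates remain usable.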
Open Source Roundup
Now that you’ve got a better understanding of why/how things work, continue on to the Implementation Showdown to see how various open-source libraries stack up.
¹ Roku OS 9.2 is admittedly a bit old, having first been released on 2019–09–19, which was still (slightly) before it was EOLed, although premium support contracts are still available. Roku OS 9.3 is the current release, but no source is available. I’ve emailed them requesting the source for 9.3, in order to figure out whether they made it to OpenSSL 1.1.x, but have received no response. In theory, they could be on Premium Level Support.
² While I love my Ubiquiti Unifi gear, their history of GPL woes makes it painful to be certain. While I don’t know for sure that the UDM Pro issue reported was tied to old OpenSSL, one of the few products I could find the GPL Archive for, the EdgeRouter 8-XG Firmware v1.10.11, dated 2020–03–01, ships a version of OpenSSL that is over 4 years old: 1.0.1t. This doesn’t inspire much hope.
³ RFC 8446, the TLS 1.3 RFC, thankfully added language to address this, acknowledging that the only true invariant is that the server’s certificate comes first.
⁴ RFC 6818, Section 4 adds some useful clarifications here. If using a self-signed certificate as a trust anchor, it’s implementation-defined whether or not the policies and restrictions apply. For example, some implementations, like Android, ignore the expiration date on certificates used as Trust Anchors, while other implementations, like OpenSSL, enforce them. The trade-offs in these approaches are worthy of an entirely separate article, but the important part is understanding that a trust anchor doesn’t have to be self-signed in order to be a trust anchor. The signature, and potentially many other attributes of the certificate, are ignored.