Parsing Certificate Transparency Logs Like a Boss

Beaker knows what's up

This is part 1 of a series I’m doing on collecting, parsing, storing, and querying 250,000,000+ certificates from Certificate Transparency logs. You can find part two here!

While building our soon-to-be-released first product, phishfinder, I spent a large amount of time thinking about the anatomy of a phishing attack, and about the data sources that would let us detect evidence and artifacts of phishing campaigns as they are getting started, before they have time to do any real damage.

Among the sources we’ve integrated (and definitely one of the cooler sources that exists) is the Certificate Transparency Log (CTL), a project started by Ben Laurie and Adam Langley at Google. At a high level, a CTL is pretty much what it sounds like: an append-only, cryptographically verifiable log of certificates issued by certificate authorities (CAs), stored in a Merkle tree.
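To make the "cryptographically verifiable" part a bit more concrete, here is a minimal sketch of the Merkle Tree Hash as I read it from RFC 6962 section 2.1: leaves are hashed with a 0x00 prefix, interior nodes with a 0x01 prefix, and the split point is the largest power of two smaller than the number of leaves. The function name and the sample leaf values are mine, purely for illustration.

```python
import hashlib


def merkle_tree_hash(leaves):
    """Merkle Tree Hash over a list of byte strings (RFC 6962, section 2.1)."""
    n = len(leaves)
    if n == 0:
        # The hash of an empty tree is the hash of the empty string.
        return hashlib.sha256(b"").digest()
    if n == 1:
        # Leaf hashes are domain-separated with a 0x00 prefix.
        return hashlib.sha256(b"\x00" + leaves[0]).digest()
    # k is the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    left = merkle_tree_hash(leaves[:k])
    right = merkle_tree_hash(leaves[k:])
    # Interior nodes are domain-separated with a 0x01 prefix.
    return hashlib.sha256(b"\x01" + left + right).digest()


if __name__ == "__main__":
    # Hypothetical leaf contents; real leaves are MerkleTreeLeaf structures.
    root = merkle_tree_hash([b"cert-a", b"cert-b", b"cert-c"])
    tampered = merkle_tree_hash([b"cert-a", b"cert-X", b"cert-c"])
    print(root.hex())
    print("tampering changes the root:", root != tampered)
```

Changing any single leaf changes the root hash, which is what lets anyone holding a signed root notice if the log's history is rewritten.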

Show me what you got!

In order to figure out what scale we’re dealing with, let’s see how many certificates each “known log” on the CTL site knows about.
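A minimal sketch of how such a count could be done, assuming each log exposes the RFC 6962 get-sth endpoint and that the returned signed tree head carries a tree_size field. The helper names (fetch_json, count_certificates, report) are mine, and the log URLs are passed in by the caller rather than pulled from the known-logs list, since some of these logs may no longer respond.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def count_certificates(log_urls, fetch=fetch_json):
    """Return {log_url: tree_size} using each log's get-sth endpoint."""
    sizes = {}
    for log_url in log_urls:
        sth = fetch("https://{}/ct/v1/get-sth".format(log_url.rstrip("/")))
        sizes[log_url] = sth["tree_size"]
    return sizes


def report(sizes):
    """Format the per-log counts and the grand total as text."""
    lines = ["{} has {:,} certificates".format(url, size) for url, size in sizes.items()]
    lines.append("Total certs -> {:,}".format(sum(sizes.values())))
    return "\n".join(lines)


if __name__ == "__main__":
    # Example log URLs; substitute whichever logs you care about.
    print(report(count_certificates(["ct.googleapis.com/pilot", "ct.googleapis.com/aviator"])))
```

The fetch parameter is only there so the HTTP call can be swapped out for a stub when testing.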

Python 3 because, you know, 2017

Which outputs the following:

ct.googleapis.com/pilot has 92,224,404 certificates
ct.googleapis.com/aviator has 46,466,472 certificates
ct1.digicert-ct.com/log has 1,577,183 certificates
ct.googleapis.com/rocketeer has 89,391,361 certificates
ct.ws.symantec.com has 3,562,198 certificates
ctlog.api.venafi.com has 94,797 certificates
vega.ws.symantec.com has 200,401 certificates
ctserver.cnnic.cn has 5,081 certificates
ctlog.wosign.com has 1,387,492 certificates
ct.startssl.com has 293,374 certificates
ct.googleapis.com/skydiver has 1,249,079 certificates
ct.googleapis.com/icarus has 48,585,765 certificates
Total certs -> 285,037,607

Excellent! 285,037,607 certificates at time of writing. That’s not a *huge* amount of data, but it definitely takes a decent effort to store and query effectively. More on that in part 2.

The anatomy of a CTL

CTLs operate over HTTP, which is nice for us, since it’s trivial to use with today’s libraries. Unfortunately, the data structure for each result is a fairly opaque binary stream, which can be somewhat daunting to parse. Let’s look at what each result looks like:
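A minimal sketch of pulling a few raw entries, assuming the RFC 6962 get-entries endpoint with inclusive start and end indexes. The helper names are mine, and no real response data is reproduced here; the point is only the shape of what comes back.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def get_entries(log_url, start, end, fetch=fetch_json):
    """Return the list of raw entries between start and end (inclusive).

    Logs may return fewer entries than requested, so callers should check
    how many came back before asking for the next range.
    """
    url = "https://{}/ct/v1/get-entries?start={}&end={}".format(
        log_url.rstrip("/"), start, end
    )
    return fetch(url)["entries"]


if __name__ == "__main__":
    # Example log URL; it may no longer respond.
    for entry in get_entries("ct.googleapis.com/pilot", 0, 1):
        # Print only the keys and sizes; the values are long base64 strings.
        print({key: len(value) for key, value in entry.items()})
```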

So each entry has two keys, leaf_input and extra_data. Reading RFC 6962, we see that leaf_input is a base64-encoded MerkleTreeLeaf structure and extra_data is either a base64-encoded certificate chain or a base64-encoded PrecertChainEntry structure. Cool beans.
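To give a feel for what is inside leaf_input, here is a rough sketch that decodes it by hand with the standard library. The layout is my reading of the MerkleTreeLeaf and TimestampedEntry definitions in RFC 6962 (1-byte version, 1-byte leaf type, 8-byte timestamp in milliseconds, 2-byte entry type, then a 24-bit length-prefixed body, preceded by a 32-byte issuer key hash for precert entries). The function name and the returned dictionary keys are mine.

```python
import base64
import struct

# version (1 byte), leaf type (1 byte), timestamp in ms (8 bytes), entry type (2 bytes)
LEAF_HEADER = struct.Struct(">BBQH")


def parse_leaf_input(leaf_input_b64):
    """Decode the fixed header and the certificate body of a MerkleTreeLeaf."""
    raw = base64.b64decode(leaf_input_b64)
    version, leaf_type, timestamp_ms, entry_type = LEAF_HEADER.unpack_from(raw, 0)
    offset = LEAF_HEADER.size

    if entry_type == 0:
        kind = "x509_entry"
    elif entry_type == 1:
        kind = "precert_entry"
        offset += 32  # skip the SHA-256 hash of the issuer's public key
    else:
        raise ValueError("unknown entry type: {}".format(entry_type))

    # The body (a DER certificate or a TBSCertificate) has a 24-bit length prefix.
    length = int.from_bytes(raw[offset:offset + 3], "big")
    offset += 3
    body = raw[offset:offset + length]

    return {
        "version": version,
        "leaf_type": leaf_type,
        "timestamp_ms": timestamp_ms,
        "kind": kind,
        "body": body,
    }
```

All of that offset bookkeeping is exactly the sort of thing that gets tedious quickly once the structures nest.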

On PreCerts

It actually took me quite a while to grok what the hell a PreCert even is (go ahead, read section 3.1 of the RFC), or what its role really is. Apparently I wasn’t the only one, so I take solace in that. I’ll save you a good deal of googling and head scratching and distill it down to this:

A PreCert is a certificate that a CA issues before it issues the “real” certificate. It’s essentially a copy of the certificate that will be issued, but with a “poison” X.509 v3 extension that is marked as critical. No platform should validate it as legitimate: either the platform understands the OID and knows it’s looking at a PreCert, or it has no idea what to do with the critical extension and therefore refuses to validate.
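As a rough illustration of what the poison looks like on the wire, here is a crude heuristic that scans DER bytes for the poison extension. It assumes the OID given in RFC 6962 (1.3.6.1.4.1.11129.2.4.3) followed immediately by a DER BOOLEAN TRUE for the critical flag. This is a byte search, not real ASN.1 parsing, so treat it as a sketch; a proper X.509 library is the right tool for anything serious. The function names are mine.

```python
POISON_OID = "1.3.6.1.4.1.11129.2.4.3"  # precertificate poison OID from RFC 6962


def encode_oid(dotted):
    """DER-encode a dotted OID string (assumes the encoded body is under 128 bytes)."""
    arcs = [int(part) for part in dotted.split(".")]
    body = bytearray([40 * arcs[0] + arcs[1]])  # first two arcs share one byte
    for arc in arcs[2:]:
        # Base-128 encoding, most significant group first, high bit set on all but the last.
        chunk = [arc & 0x7F]
        arc >>= 7
        while arc:
            chunk.append((arc & 0x7F) | 0x80)
            arc >>= 7
        body.extend(reversed(chunk))
    return bytes([0x06, len(body)]) + bytes(body)


def looks_like_precert(der_bytes):
    """Heuristic: does the poison OID appear, followed by critical = TRUE?"""
    critical_true = b"\x01\x01\xff"  # DER BOOLEAN TRUE
    return encode_oid(POISON_OID) + critical_true in der_bytes
```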

The security hat in me disagrees with this strategy simply because X.509/ASN.1 parsing bugs are quite prevalent, and some implementations may be vulnerable to shenanigans that allow a PreCert to be validated as legitimate. I get why they did it, but nixing the idea entirely and only logging certificates that have truly been issued seems to make more sense.

Let’s parse some binary!

As a reverse-engineer, and someone who does CTFs on occasion, parsing binary structures and streams is no new task for me. Most people reach for the struct module, but years ago, while I was working for Philip Martin, he turned me on to the *excellent* Construct project, which greatly simplifies parsing declared structures (yay no more cursor arithmetic!). Here are the structures I ended up using for my parsing needs, along with some sample parsing using the pyOpenSSL module:

As you can see, Construct is a much cleaner way to define binary structures in Python :). It makes parsing each CTL entry quite a breeze, and gives a really nice interface for dealing with binary content.

Using this, it’s pretty straightforward to parse and store CTL entries, but more on that in part 2!