Retrieving, Storing and Querying 250M+ Certificates Like a Boss

Pictured Above: Google’s BigQuery infrastructure hard at work.

In part 1 we examined HOW to parse Certificate Transparency Log information using Python, but the real value comes from being able to retrieve, store, and search them cheaply and efficiently, which what I set out to do.

NOTE: If you’re coming from the pentester.io article, all you have to do is email me for access to this dataset — ryan@calidog.io !

Lets harvest some certificates!

According to the Certificate Transparency RFC, the api endpoint get-entries is what we’re looking for to pull down the entries from each certificate transparency log. Unfortunately for us it uses limit/offset pagination (in the form of the start and end parameters), and a lot of the running logs only let you pull down 64 at a time. Thankfully the Google CTLS, which make up the majority of the certificates, all use a max response size of 1024. It’s the small victories sometimes 😓.

Since this problem is both IO bound (retrieving the certs over the network) AND CPU bound (parsing the certs themselves), we’re going to need to leverage both concurrency and multi-processing to efficiently retrieve and store these certificates.

Enter the Axeman

With a logo this chique, you’d think this was written in ECMAScript 6

Since there didn’t seem to be anything out there that helped us retrieve and parse these certificates in a straight-forward way (sans some crusty-looking Google tools) I decided to spend some time writing a utility to fit our needs. The result is Axeman, a simple utility that uses python3 asyncio and the excellent aioprocessing library to download certificates, parse them, and store the results in multiple CSVs while moving the bounding to the network speed (and response times) of whatever machine it’s on.

Go check it out on GitHub if you’re interested in playing with it!

To the cloud!

After spinning up a 16-core, 32GB memory instance with a 750GB SSD (thank you Google’s free 300$ credit for new accounts!), I set Axeman loose, which downloaded all certificates in less than a day, and stored the results in /tmp/certificates/$CTL_DOMAIN/.

Now THAT’S being a noisy neighbor

Where to put all this data?

On my first iteration for this project I optioned Postgres to do the storage and querying functionality, and while I’m confident that with a proper schema and indexes that Postgres wouldn’t have any trouble with 250M entries (unlike my sloppy first pass which took ~20 min per query!), I decided to look around at some other solutions that had:

  • Cheap storage
  • Fast querying
  • Easy updates

There were a few players in the field, but from a cost perspective pretty much no matter where I looked all hosted DB instances (AWS RDS, Heroku Postgres, Google Cloud SQL) would be pretty cost-prohibitive and the overhead of upkeep was annoying to deal with. Thankfully since our data doesn’t really ever have to change, we’re able to be a bit more flexible with our storage and querying.

This is actually the sort of querying that’s perfectly suited for a map/reduce sort of worker fleet, using something like Spark or Pig backed by Hadoop. After scouting around the “big data” sections for the main providers (even though this problem is far from “big”), I found Google BigQuery, which seems to fit our needs very nicely.

The Ingest

Getting data into BigQuery is pretty trivial, thanks to the gsutil tool that Google puts out, so we create a new bucket to hold our certificates:

Now that we have a bucket in place we can use the excellent gsutil to transfer all of our certificates to Google storage (and later to BigQuery). Make sure to run gsutil configfirst to get your account set up!

Once you’re ready, release the kraken!

gsutil -o GSUtil:parallel_composite_upload_threshold=150M \
-m cp \
/tmp/certificates/* \
gs://all-certificates

The result should be the following in your bucket:

Next we switch over to the BigQuery console, and create a new dataset:

Time to get importing! One of the hurdles that I encountered was that there didn’t seem to be an easy way to say “please just import every folder recursively”, so we have to import every CTL manually. Lame, but doesn’t take long at all.

Next, we’ll create a table to store all of our data and import our first log.

Absolutely note the two options I’ve set at the bottom!

Since we need a schema every time we import, you can just click “Edit as text” and use the following:

After that, it’s simply rinse and repeat for the rest of the logs in our bucket (If you know a better way to do this, please let me know in the comments!). Make sure each job completes successfully (if there are warnings you can usually ignore them, just make sure you set a reasonable threshold for failures), and then you should be off to the races!

Now that’s a spicy dataset! 👨‍🍳

On to the fun part!

Now to bear the fruits of our labor and start using our fancy new powers to search for needles in 250M+ haystacks!

Recently there’s been a lot of talk about punycode addresses used for bad, so let’s look for all certificates that have been issued for punnycode addresses:

Which yields the following:

15.5 seconds to munge 272M+ rows? Yes please!

And ~15 seconds later, you have every punycode address that any of these CTLs have seen! From here you can easily parse them all out and look for homoglyph attacks.

How about another example, let’s find all certificates that have been publicly issued for any Coinbase property through these CTLs:

Which yields:

2 seconds to get this data is *ludicrous* speed

As you can see, the ability to run these sorts of analytics across this large dataset is an extremely powerful tool. It lets you notice trends and patterns you otherwise wouldn’t really see, and this is only scratching the surface of the cool stuff you can do with this!

A bit of a head-scratcher

While doing this research (and building a soon-to-be-released awesome additional tool to this project), I noticed something weird, the domain flowers-to-the-world.com kept cropping up everywhere (seriously, I felt like Jim Carey in “The Number 23”). I have no idea who is submitting it to the certificate transparency logs, but it’s submitted a huge amount of times to pretty much every CTL out there (91,246 times as of April 2017).

That’s 91,246 submissions spread among every CTL we watch 😱

The whois information says it’s owned by Google, so I’m wondering if this is part of a test suite that gets run (much like the image that’s been submitted to Imgur roughly a million times!). If you’re a Google engineer who can ask the folks running the CTLs or know more about it I’m curious to hear the explanation!

Lastly, an offer to the community

I’m a big fan of standing on the shoulders of giants, and avoiding repeating yourself when possible, so I’m happy to share this dataset with anyone who wants it. All you have to do is ping me at ryan@calidog.io. The bigquery dataset is updated daily around 6PM PST.

Happy hunting!

Like what you read? Give Ryan Sears a round of applause.

From a quick cheer to a standing ovation, clap to show how much you enjoyed this story.