Ranking the top subdomains using Rapid7’s Sonar FDNS dataset

Ryan Elkins
5 min read · Jul 11, 2020


Continuing with the Project Straylight accelerator series, I have been wanting to reorganize all of my wordlists and make them more data-driven. I was listening to the latest iteration of The Bug Hunters Methodology: Recon Edition today as Jason Haddix discussed his aggregated all.txt list, a massive file of words that he has compiled through his years of recon. His commentary inspired me to pursue the creation of these data-driven, custom wordlists.

First, let’s cut to the chase: here is a snapshot of the top 10 most common subdomains across the FDNS dataset, alongside the count of their occurrences. We will walk through how this was generated in detail and how to build on it.

This walkthrough depends on the ability to query Rapid7’s Sonar FDNS dataset, as described in my previous article, External IP Domain Reconnaissance and Attack Surface Visualization in Under 2 Minutes; we will build directly onto that setup. I thought, how hard would it be to generate a count of subdomain occurrences across the entire dataset? It would be fascinating to see and utilize the results! From a #bugbounty perspective, there is often greater value deeper in the lists, since it is trivial to find the most common keywords; but if you want a 5.4 GB CSV with over 231 million distinct, real subdomains across the .com FDNS space, then you have come to the right place.

Obstacles

As with many technical projects, the solution always seems much easier and more straightforward than the path to get there. I would like to share some of the obstacles I faced, the rationale behind my decisions, and the steps taken to maintain the integrity of this list.

Obstacle one: I wanted the data-driven metrics to be truly based on subdomains rather than on the primary domain names immediately prior to .com. Although primary domains would not rank high in the list (subdomains are more general and therefore more common), including them would still skew the accuracy of the overall counts. I went through all of the AWS Athena Presto documentation on how to do this after I exhausted options on StackOverflow and was unsuccessful with substring and charindex. I finally determined that “regexp_extract” was my best path forward. I wish so badly that I was a regex master, but I am far from it. Thankfully, I read Clint Gibler’s tl;dr sec newsletter this week, and he briefly mentioned the website regex101.com, which saved the day. About an hour later, I had the following query drafted, using a regex that only matches on the left-most subdomain.
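A minimal sketch of that query, submitted through boto3, assuming the fdns table and name column from the previous article’s Athena setup (the database and output-bucket names below are placeholders):

```python
# Sketch: count occurrences of the left-most DNS label across .com FDNS records.
# Assumes the "fdns" table and "name" column from the earlier Athena setup;
# the database and output bucket are placeholders.
import boto3

# The capture group grabs only the left-most label, while (?:[^.]+\.)+ requires
# at least one more label before .com, so bare primary domains (example.com)
# never match.
SUBDOMAIN_QUERY = r"""
SELECT regexp_extract(name, '^([^.]+)\.(?:[^.]+\.)+com$', 1) AS subdomain,
       count(*) AS occurrences
FROM fdns
WHERE regexp_like(name, '^([^.]+)\.(?:[^.]+\.)+com$')
GROUP BY 1
ORDER BY occurrences DESC
"""

athena = boto3.client("athena")
execution = athena.start_query_execution(
    QueryString=SUBDOMAIN_QUERY,
    QueryExecutionContext={"Database": "straylight"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
print(execution["QueryExecutionId"])
```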

In all seriousness, did I miss something obvious, or are regexes that parse only subdomains genuinely scarce on the web?

Obstacle two: You may have noticed that I only included .com and did not include the subdomains of all top-level domains (TLDs). I could easily expand the query to .org, .net, .edu, etc. by passing in a list of TLDs, but I limited the scope because some TLDs include second-level domains as part of the TLD hierarchy (i.e. co.uk, <state>.gov), which would compromise the accuracy of the results. In those situations, I would be capturing the primary domain name in the subdomain results, which is exactly what I wanted to avoid from the beginning and the reason the regex was needed in the first place.
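To illustrate the problem, here is a quick Python check of what happens if the same left-most-label regex is naively retargeted at .uk (the pattern mirrors the Athena regex above):

```python
# Why second-level TLDs break the approach: with .uk as the target, the
# left-most label of example.co.uk is the registered domain, not a subdomain.
import re

uk_pattern = re.compile(r"^([^.]+)\.(?:[^.]+\.)+uk$")

print(uk_pattern.match("example.co.uk").group(1))      # "example" -> a primary domain, wrongly counted
print(uk_pattern.match("www.example.co.uk").group(1))  # "www"     -> a genuine subdomain
```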

Moving Forward

After overcoming the obstacles, I began running the queries. Here are some of the key metrics.

  • There are 1,239,457,832 total records ending with a .com TLD.
  • There are 636,081,667 total records containing a subdomain with a .com TLD.
  • There are 231,540,589 unique subdomains with a .com TLD.
  • Each query scanned around 33 GB of data. AWS Athena costs $5 per TB of data scanned, so each query cost roughly 0.033 TB × $5 ≈ $0.17 to run.

The queries used to generate these metrics are sketched below.
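Assuming the same fdns table and name column as before, the three counting queries would look roughly like this; each can be submitted through athena.start_query_execution() just like the GROUP BY query above:

```python
# Sketch of the three metric queries; the fdns table and name column are the
# same assumptions as above.
SUBDOMAIN_REGEX = r"'^([^.]+)\.(?:[^.]+\.)+com$'"

# Total records ending in .com (primary domains included)
TOTAL_COM = "SELECT count(*) FROM fdns WHERE name LIKE '%.com'"

# Records that actually carry a subdomain
WITH_SUBDOMAIN = f"SELECT count(*) FROM fdns WHERE regexp_like(name, {SUBDOMAIN_REGEX})"

# Distinct left-most subdomain labels
UNIQUE_SUBDOMAINS = (
    f"SELECT count(DISTINCT regexp_extract(name, {SUBDOMAIN_REGEX}, 1)) "
    f"FROM fdns WHERE regexp_like(name, {SUBDOMAIN_REGEX})"
)
```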

Wordlist Generation

Each Athena query saves its results into an S3 bucket as a .csv file. I proceeded to generate the following full wordlists and moved them into a dedicated folder within the S3 bucket.

  • Unique FDNS Any Subdomain List — 231,540,589 records
  • Unique FDNS A Record Subdomain List
  • Unique FDNS CNAME Record Subdomain List
  • Unique FDNS MX Record Subdomain List

To make these lists more consumable, I generated wordlists of the top 1,000 entries and made them available within the Straylight GitHub repository.

For the future, I plan on automating this end-to-end process on a monthly basis to keep these wordlists maintained with the most relevant subdomains. I will make sure to capture the additional automation within GitHub when ready. To create these smaller wordlists, I did not want to keep re-running the same queries with a LIMIT constraint.

Instead, I renamed and combined all of the relevant results into a single S3 folder for further analysis.
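In boto3 terms, that rename-and-combine step looks something like the following sketch (the bucket name and the Athena execution-ID file names are placeholders):

```python
# Sketch: copy each Athena result object (named after its query execution ID)
# into one analysis folder under a readable name, then remove the original.
# Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET = "straylight-wordlists"  # placeholder bucket

RENAMES = {
    "athena-output/1a2b3c4d.csv": "analysis/fdns-any-subdomains.csv",
    "athena-output/5e6f7a8b.csv": "analysis/fdns-a-subdomains.csv",
    "athena-output/9c0d1e2f.csv": "analysis/fdns-cname-subdomains.csv",
    "athena-output/3a4b5c6d.csv": "analysis/fdns-mx-subdomains.csv",
}

for src, dst in RENAMES.items():
    s3.copy_object(Bucket=BUCKET, CopySource={"Bucket": BUCKET, "Key": src}, Key=dst)
    s3.delete_object(Bucket=BUCKET, Key=src)
```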


Querying the .csv files using a Jupyter notebook

For the next steps, I am going to utilize AWS SageMaker with a Jupyter notebook.

There are a lot of numeric entries within the wordlists, which I have not parsed out since they are valid subdomains, although I could make additional filtered versions of the wordlists to cover both cases.

The code utilized to generate the wordlists from the Athena query output is shown below.
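A minimal sketch of that notebook code, assuming pandas (with s3fs installed for s3:// paths) and the placeholder file names from the rename step above; the optional numeric filter mentioned earlier is included as a commented line:

```python
# Sketch: load each result CSV, sort by occurrence count, and write the top
# 1,000 subdomains as a plain wordlist. Reading each full CSV into memory is
# what makes this so memory intensive on the larger files.
import boto3
import pandas as pd  # s3:// paths require the s3fs package

BUCKET = "straylight-wordlists"  # placeholder bucket
RESULTS = {
    "all": "analysis/fdns-any-subdomains.csv",
    "a": "analysis/fdns-a-subdomains.csv",
    "cname": "analysis/fdns-cname-subdomains.csv",
    "mx": "analysis/fdns-mx-subdomains.csv",
}

s3 = boto3.client("s3")
for label, key in RESULTS.items():
    df = pd.read_csv(f"s3://{BUCKET}/{key}")  # columns: subdomain, occurrences (from the query above)
    # Optional: drop purely numeric labels for a numeric-free variant
    # df = df[~df["subdomain"].str.fullmatch(r"\d+")]
    top = df.sort_values("occurrences", ascending=False).head(1000)
    outfile = f"subdomains-top1000-{label}.txt"
    top["subdomain"].to_csv(outfile, index=False, header=False)
    s3.upload_file(outfile, BUCKET, f"wordlists/{outfile}")
```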

The code in its current state is very memory-intensive and kept crashing the Jupyter kernel. To get the larger datasets to complete, I temporarily increased the size of the virtual machine.

The wordlists containing the top 1,000 words for each category (All, MX, A, and CNAME records) are available in the BrevityInMotion/Straylight GitHub repository. Although the “All” wordlist seems like it would be the best, I like the preliminary entries within the CNAME wordlist, as it contains far fewer numeric subdomains.

I plan to continue to build on these initial lists using other data sources such as Common Crawl, and will continue to add data-driven wordlists along with walkthroughs of how they were generated.

Let me know if you have enjoyed or found value in this article!

Follow me on Twitter!
