Search the HTML across 25 billion websites for passive reconnaissance using Common Crawl

Jul 4, 2020

For this next accelerator as part of project straylight, we will walk through configuring and searching the publicly available Common Crawl dataset of websites. Common Crawl is a free dataset containing over 8 years of crawled data, including over 25 billion websites, trillions of links, and petabytes of data.

Why would we want to do this? It is a passive reconnaissance capability that can be used for searching the HTML of #bugbounty targets, retrieving historic versions of pages, identifying parameters, finding JavaScript references across the internet, spotting insecure script tags and code comments, tracking page changes over time, and many other analytic use cases. Just think how powerful Google search is. Now imagine being able to search the HTML of those pages. This is what you will be able to do by the end of this article.

A special thank you to the Common Crawl team for this amazing dataset! Many of these steps were adopted from, or modified based on, this great article by Sebastian Nagel.

Prerequisites

To begin, we are going to utilize the following resources:

Amazon Web Services (AWS) Athena: used to query the Common Crawl dataset.
Amazon SageMaker: provides the Jupyter notebook environment. You could also use a local install of Jupyter or Anaconda.
Amazon S3: stores the results from Athena queries.
Additional IAM roles and policies: allow communication between Athena, SageMaker, and S3.

The Jupyter notebook used in this article can be downloaded from https://github.com/brevityinmotion/straylight/blob/master/notebooks/tools-commoncrawl.ipynb. Additional notebooks in the same directory are referenced for common functions.

Walkthrough

The first step is to configure AWS Athena by executing the following three queries:
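The original queries are not reproduced here, so the following is only a sketch of what the three setup statements typically look like for the Common Crawl columnar index: create a database, create an external table over the public Parquet index, and load its partitions. The database and table name `ccindex` and the abridged column list are assumptions for illustration; consult the Common Crawl documentation for the authoritative schema. The statements are held as Python strings so they can be pasted into the Athena console (with the `ccindex` database selected, in the us-east-1 region where the dataset lives) or submitted from a notebook.

```python
# Sketch of the one-time Athena setup, expressed as Python strings.
# ASSUMPTIONS: the database/table name "ccindex" and this abridged column
# list are illustrative; the full schema is in the Common Crawl docs.

CREATE_DATABASE = "CREATE DATABASE IF NOT EXISTS ccindex"

CREATE_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
  url_surtkey                STRING,
  url                        STRING,
  url_host_name              STRING,
  url_host_registered_domain STRING,
  url_path                   STRING,
  url_query                  STRING,
  fetch_time                 TIMESTAMP,
  fetch_status               SMALLINT,
  content_mime_type          STRING,
  warc_filename              STRING,
  warc_record_offset         INT,
  warc_record_length         INT
)
PARTITIONED BY (crawl STRING, subset STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/'
""".strip()

# Registers the crawl/subset partitions; rerun it when new crawls are published.
REPAIR_TABLE = "MSCK REPAIR TABLE ccindex"

SETUP_QUERIES = [CREATE_DATABASE, CREATE_TABLE, REPAIR_TABLE]

if __name__ == "__main__":
    for statement in SETUP_QUERIES:
        print(statement)
        print("-" * 40)
```

The partition repair step matters because queries only see crawls whose partitions have been registered.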

Once this is complete, you will want to run the configuration.ipynb notebook to ensure that all dependencies are configured and installed for the notebook environment. There are some additional libraries that will load to support other walkthroughs in this series and may not be directly related to Common Crawl.

The next step is to configure the domain and then run the query within the tools-commoncrawl.ipynb notebook. You will want to add the specific domain name you want to query as well as the S3 bucket that your Athena results are stored in. If you are not sure what this is, open Athena within the AWS console and click Settings; the bucket path to use is defined in the ‘Query result location’ setting.

The code to prepare the query for Athena is:
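The notebook's own code is not reproduced here. As a minimal sketch, assuming the table is named `ccindex.ccindex` and exposes the `url_host_registered_domain` column from the Common Crawl columnar index, a query that captures a domain and all of its subdomains could be assembled like this. The function name `build_domain_query` is hypothetical.

```python
import re


def build_domain_query(domain, crawl=None):
    """Return Athena SQL listing WARC pointers for a domain and its subdomains.

    ASSUMPTION: table and column names follow the Common Crawl columnar
    index; this helper is illustrative, not the notebook's actual code.
    """
    # Allow only hostname characters so the value is safe to embed in SQL.
    if not re.fullmatch(r"[A-Za-z0-9.-]+", domain):
        raise ValueError(f"unexpected characters in domain: {domain!r}")
    if crawl is not None and not re.fullmatch(r"[A-Za-z0-9-]+", crawl):
        raise ValueError(f"unexpected characters in crawl: {crawl!r}")

    filters = [
        "subset = 'warc'",
        # Matching on the registered domain acts as a wildcard for subdomains.
        f"url_host_registered_domain = '{domain}'",
    ]
    if crawl is not None:
        # Restricting to one crawl reduces the amount of data Athena scans.
        filters.append(f"crawl = '{crawl}'")

    return (
        "SELECT url, fetch_time, warc_filename, "
        "warc_record_offset, warc_record_length\n"
        "FROM ccindex.ccindex\n"
        "WHERE " + "\n  AND ".join(filters)
    )


if __name__ == "__main__":
    print(build_domain_query("derbycon.com"))
```

Filtering on the partition columns (`subset`, and optionally `crawl`) keeps the scanned data, and therefore the Athena cost, down.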

The code to execute the query for Athena is:
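A minimal sketch of submitting the query follows, assuming a boto3 Athena client is passed in. The function name `start_athena_query` and the example bucket path are hypothetical; `start_query_execution` is the boto3 Athena call, and the returned execution id is what the monitoring step needs.

```python
def start_athena_query(athena_client, query, database, output_location):
    """Submit a query to Athena and return its execution id.

    athena_client is expected to be a boto3 Athena client, e.g.
    boto3.client("athena", region_name="us-east-1").
    output_location is the S3 path from the Athena
    'Query result location' setting.
    """
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Example usage (requires boto3 and AWS credentials; the bucket is a placeholder):
# import boto3
# athena = boto3.client("athena", region_name="us-east-1")
# execution_id = start_athena_query(
#     athena, "SELECT 1", "ccindex", "s3://your-athena-results-bucket/"
# )
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise with a stub when AWS is not available.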

Once we receive the execution id information for the query, we can begin monitoring for the completion of the query:

The retrieval of the data takes several steps to make the process smoother. The function first polls the Athena query every 5 seconds, for up to 5 minutes, until the query completes. Then, a pre-signed S3 URL is generated to provide URL-based authorization to the results. The return value of this function is the URL for downloading the query results in .csv format.
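The polling and pre-signing steps described above can be sketched as follows. This is not the notebook's code: the function name `wait_for_results_url` and the one-hour URL expiry are assumptions, while `get_query_execution` and `generate_presigned_url` are the boto3 calls for Athena and S3 respectively.

```python
import time


def wait_for_results_url(athena_client, s3_client, execution_id,
                         timeout_seconds=300, poll_seconds=5,
                         sleep=time.sleep):
    """Poll Athena until the query finishes, then pre-sign the result file."""
    waited = 0
    while True:
        execution = athena_client.get_query_execution(
            QueryExecutionId=execution_id
        )["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            reason = execution["Status"].get("StateChangeReason", "unknown")
            raise RuntimeError(f"query {state.lower()}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"query still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

    # OutputLocation looks like s3://bucket/prefix/<execution id>.csv
    output_location = execution["ResultConfiguration"]["OutputLocation"]
    bucket, _, key = output_location[len("s3://"):].partition("/")

    # ASSUMPTION: a one-hour expiry is an arbitrary illustrative choice.
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=3600,
    )
```

The returned URL can be handed to any HTTP client or CSV reader without further AWS credentials until it expires.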

Once the .csv results are downloaded and loaded into a dataframe, the output will look like:

For this instance, we used the domain ‘derbycon.com’ because it provides a broad range of historical dates, multiple subdomains, and a small number of URLs. The current query configuration treats it as a wildcard domain in order to capture all subdomains. Keep in mind that this is completely passive to the target domain, as the queries run against the dataset and not the domain environment.

This next step is where the primary magic happens, although it can take a while to complete. Note the count of URLs above: the 580 records take just over 1 minute to retrieve, so larger domains (e.g. thousands of records) will take much longer. A future enhancement would be to add better status updates to the notebook; currently the only progress indicators are the timestamps on the files being written to the filesystem and the hourglass in the Jupyter notebook browser tab, which disappears when the cell finishes. I am sure that there is also a better and more efficient way to do this.

The first part of the code was adapted from this article by Sebastian Nagel. The ArchiveIterator loop was customized to write the retrieved content to disk. It creates a directory structure mirroring the site, with the HTML files saved to their respective directories. If multiple versions of a file were cached through the years, each version is written separately using the offset as the filename (i.e. /path/<offset>.html). This also accounts for frameworks where the URLs do not have an explicit file extension. Reviewing multiple versions can reveal changes over time, possibly leading to vulnerability discovery if an update included security modifications, references to stale APIs, code comments, hard-coded data, etc.
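As a rough sketch of the retrieval step, and not the notebook's actual code, the helpers below compute the byte range for one WARC record, derive the /path/<offset>.html location, and write the response body to disk. The function names and the `tmp` base directory are assumptions; `row` is assumed to be a mapping with the columns returned by the Athena query. The third-party `warcio` package and a boto3 S3 client are needed only by `save_record`.

```python
import io
import os
from urllib.parse import urlparse


def byte_range(offset, length):
    """HTTP Range header value covering exactly one WARC record."""
    return f"bytes={offset}-{offset + length - 1}"


def output_path(base_dir, url, offset):
    """Map a URL and record offset to <base_dir>/<host>/<path>/<offset>.html."""
    parsed = urlparse(url)
    # Drop empty, '.' and '..' segments so a record cannot escape base_dir.
    segments = [s for s in parsed.path.split("/") if s not in ("", ".", "..")]
    return os.path.join(base_dir, parsed.netloc, *segments, f"{offset}.html")


def save_record(s3_client, row, base_dir="tmp"):
    """Fetch one record from the commoncrawl bucket and write its HTML to disk."""
    # Imported lazily so the pure helpers above work without warcio installed.
    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    offset = int(row["warc_record_offset"])
    length = int(row["warc_record_length"])
    response = s3_client.get_object(
        Bucket="commoncrawl",
        Key=row["warc_filename"],
        Range=byte_range(offset, length),
    )
    path = output_path(base_dir, row["url"], offset)
    for record in ArchiveIterator(io.BytesIO(response["Body"].read())):
        if record.rec_type == "response":
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as handle:
                handle.write(record.content_stream().read())
    return path
```

Using the offset as the filename keeps every cached version of a page, since each capture lives at a different offset in its WARC file.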

The output on the filesystem will look like:

If you are running locally, the next step may not be necessary, but if you are running this in AWS SageMaker, the files are written into the ‘notebooks/tmp’ directory of the notebook instance. We will want to pull the files from the instance for further analysis and processing. The following code will tar the files into a packaged download.

The file will reside in the same directory with the notebook and is named ‘domainoutput.tar.gz’.
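A minimal sketch of the packaging step, using only the Python standard library, is shown below. The function name `package_directory` is hypothetical; the archive name matches the ‘domainoutput.tar.gz’ file mentioned above.

```python
import os
import tarfile


def package_directory(source_dir, archive_path="domainoutput.tar.gz"):
    """Bundle source_dir (recursively) into a gzip-compressed tar archive."""
    # Store entries under the directory's own name rather than its full path.
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(source_dir, arcname=top_level)
    return archive_path


# Example usage inside the notebook, assuming the files were written to 'tmp':
# package_directory("tmp")
```

The resulting file can then be downloaded through the Jupyter file browser and extracted locally with `tar -xzf domainoutput.tar.gz`.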

Once you download and extract the file, you can begin working with the data. Suggestions include grepping for keywords, comparing differences between versions of the same page, extracting all comments, scripts, and references, and running the files through any other SAST analysis tools.
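As one illustration of the comment and script extraction suggested above, the sketch below uses Python's standard `html.parser` to pull HTML comments and script sources out of a saved page without rendering it. The class and function names are hypothetical, and flagging `http://` sources as insecure is just one simple heuristic.

```python
from html.parser import HTMLParser


class CommentAndScriptExtractor(HTMLParser):
    """Collect HTML comments and <script src=...> values from a page."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.script_sources = []

    def handle_comment(self, data):
        self.comments.append(data.strip())

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_sources.append(src)


def extract(html_text):
    """Return (comments, script_sources) found in html_text."""
    parser = CommentAndScriptExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.comments, parser.script_sources


def insecure_sources(script_sources):
    """Script URLs loaded over plain HTTP."""
    return [s for s in script_sources if s.lower().startswith("http://")]


if __name__ == "__main__":
    sample = (
        "<html><!-- TODO remove test key -->"
        '<script src="http://cdn.example.com/a.js"></script>'
        '<script src="/b.js"></script></html>'
    )
    comments, sources = extract(sample)
    print(comments)
    print(insecure_sources(sources))
```

Because the parser only reads text, nothing is fetched from the target, which keeps the analysis passive.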

How cool is this!? This is essentially a basic wget against historical cached html content.

Next Steps

Certainly there are many improvements that will be made as I iterate on ideas while working through any bugs encountered. Updated notebooks will be maintained in GitHub. Hopefully this will get you on the right path.

Lastly, once the initial configuration is completed in AWS Athena, the query utilized in the Jupyter notebook can be modified for more tailored queries. You can also run queries directly in the Athena console and then download the results from the interface and load them into Jupyter for records retrieval.

Here are some additional example queries:

The url_query field is particularly interesting, as some of the captured URLs include their query strings. You could leverage the example query to build brute-forcing wordlists of identified parameters, or search for parameters that may indicate a vulnerability, such as ‘cmd=’. A test query for the cmd= parameter returned 3,462 sites across 354 GB of indexes in 2 minutes and 7 seconds. Keep in mind that content discovered using these processes may be from prior versions and may not reflect production.
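The original example queries are not reproduced here. As a hedged sketch of the url_query idea, the helpers below build a parameter search against one crawl and tally the parameter names found in the returned query strings. The function names are hypothetical, the table name `ccindex.ccindex` is assumed, and the crawl identifier in the usage line is only an illustrative value.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl


def build_parameter_query(parameter, crawl):
    """Return Athena SQL finding URLs whose query string contains parameter=."""
    # Alphanumeric only: keeps the value safe to embed and avoids the
    # LIKE wildcards '%' and '_'.
    if not re.fullmatch(r"[A-Za-z0-9]+", parameter):
        raise ValueError(f"unexpected characters in parameter: {parameter!r}")
    if not re.fullmatch(r"[A-Za-z0-9-]+", crawl):
        raise ValueError(f"unexpected characters in crawl: {crawl!r}")
    return (
        "SELECT url, url_query, fetch_time\n"
        "FROM ccindex.ccindex\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND url_query LIKE '%{parameter}=%'"
    )


def count_parameter_names(url_queries):
    """Tally parameter names across url_query values, e.g. for a wordlist."""
    counts = Counter()
    for query in url_queries:
        for name, _value in parse_qsl(query or "", keep_blank_values=True):
            counts[name] += 1
    return counts


if __name__ == "__main__":
    print(build_parameter_query("cmd", "CC-MAIN-2020-24"))
    print(count_parameter_names(["cmd=ls&id=1", "id=2"]).most_common())
```

The LIKE filter also matches names that merely end in the search term, so the tally from `count_parameter_names` is a useful second pass over the results.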

Another consideration when working with the HTML files written to disk: if you open them in your browser, they will make callouts to load resources and page elements, and the process is no longer passive. To avoid this, work with the content in your favorite text editor rather than the browser.

Let me know if you have found value in this and feel free to reach out with any questions. You can follow me on Twitter.

Written by Ryan Elkins