External IP domain reconnaissance and attack surface visualization in under 2 minutes.

Project Straylight

Ryan Elkins
Jun 26, 2020

As part 1 of the Project Straylight accelerator series, I would like to introduce a repeatable process to visualize the external, global presence of a domain (and all of its corresponding subdomains). This article details the process to retrieve extensive raw DNS data, process it, and visualize it as consumable heat maps. The walkthrough begins from a selected target domain, although the same visualizations could be generated from any list of IP addresses.

Large companies could adopt this process for a continuous approach to monitoring their external attack surface. Obtaining comparable, actionable information often costs thousands of dollars in external consulting and subscription services. This configuration can run fully in a cloud environment such as AWS and was built for under $5 (provided the SageMaker instance is shut down when not in use). Each domain query costs only pennies to run and process.

The Jupyter notebooks containing the full code can be accessed at https://github.com/brevityinmotion/straylight.

We will utilize the publicly available Forward DNS (https://opendata.rapid7.com/sonar.fdns_v2/) dataset provided by Rapid7 (https://www.rapid7.com/research/project-sonar/).

Let’s begin by highlighting the sources whose knowledge, data, examples, and learnings were incorporated into this walkthrough:

- A special thank you to Evan Perotti for the awesome walkthrough for querying the project Sonar FDNS and the query code for the Lambda APIs (http://securityriskadvisors.com/blog/creating-a-project-sonar-fdns-api-with-aws/).

- Thank you to Rapid7 for the availability of this valuable dataset (https://www.rapid7.com/research/project-sonar/) and the walkthrough for querying the dataset (https://blog.rapid7.com/2018/10/16/how-to-conduct-dns-reconnaissance-for-02-using-rapid7-open-data-and-aws/)

Example use case:

It is important to understand and manage the external attack surface of a company. This is often a difficult challenge for large organizations and a key reconnaissance goal from a red team/bug bounty perspective. The value of reconnaissance methods increases when the number of interactions with the true target is limited. In this example, the information is generated completely passively using freely available public datasets: in other words, we let someone else collect the information and then generate the insights we need from it. We never touch or interact with the target domain.

For this demonstration, we are going to compare the corporate domains and datasets for the 5 largest cloud providers in the world (Amazon, Microsoft, Google, Alibaba, and IBM).

We are going to leverage cloud-based services to perform the analysis. Historically, it has been cost-prohibitive to purchase the amount of hardware, processing, and storage to search and process insights from terabytes and petabytes of data. By using cloud services, we can avoid the high capital costs and pay for only what we use.

Prerequisites:

The notebook code retrieves API keys from AWS Secrets Manager to avoid hard-coding them into the code. If you choose to follow the same route, store the secrets with the following names to avoid source code adjustments. It is important to note that the default IAM roles for SageMaker include access to Secrets Manager secrets named with the AmazonSageMaker-* prefix:

For the GoogleMaps API:

  • Secret name: AmazonSageMaker-gmaps
  • Secret Key: gmapsapi
  • Secret Value: <insert secret value>

For the MaxMind API:

  • Secret name: AmazonSageMaker-geoip
  • Secret Key: license_key
  • Secret Value: <insert secret value>
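For reference, a minimal sketch of retrieving these secrets from within the notebook could look like the following (the helper name and region are assumptions; the actual notebooks may structure this differently):

```python
import json
import boto3

def get_secret(secret_name, secret_key, region_name='us-east-1'):
    """Retrieve a single key from an AWS Secrets Manager secret."""
    client = boto3.client('secretsmanager', region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])[secret_key]

# Names match the secrets described above.
gmaps_api_key = get_secret('AmazonSageMaker-gmaps', 'gmapsapi')
maxmind_license_key = get_secret('AmazonSageMaker-geoip', 'license_key')
```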

In order to utilize the Jupyter notebooks, the Amazon Athena database and corresponding tables will need to be created by running the following queries within the Athena web console.

Amazon Athena queries
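As a rough sketch, the Sonar FDNS records are gzipped JSON containing timestamp, name, type, and value fields, so a table definition along the following lines (placeholder S3 paths, table name fdns assumed; the exact DDL used in the original walkthrough may differ) can be pasted into the Athena console or submitted via boto3:

```python
import boto3

# Placeholder values: adjust the database, results bucket, and dataset
# location to match your environment.
ATHENA_DATABASE = 'default'
ATHENA_OUTPUT = 's3://<your-athena-results-bucket>/queries/'

# Sonar FDNS records are gzipped JSON with timestamp, name, type, and value.
CREATE_FDNS_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS fdns (
  `timestamp` string,
  `name` string,
  `type` string,
  `value` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<your-fdns-bucket>/fdns/'
"""

athena = boto3.client('athena')
athena.start_query_execution(
    QueryString=CREATE_FDNS_TABLE,
    QueryExecutionContext={'Database': ATHENA_DATABASE},
    ResultConfiguration={'OutputLocation': ATHENA_OUTPUT},
)
```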

One caveat to displaying the embedded Google Maps inline with the notebook is that the widgetsnbextension and gmaps extensions must be enabled. Since SageMaker does not maintain persistence and enabling the extensions requires restarting Jupyter, this needs to be implemented via a Lifecycle configuration, which can be set up to run at startup of the notebook instance.
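As a rough sketch, such a Lifecycle configuration can be created with boto3 using a short OnStart shell script that enables both extensions (the configuration name below is hypothetical and the exact script may differ):

```python
import base64
import boto3

# Hypothetical OnStart script: enable the widget extensions so that
# gmaps figures render inline after Jupyter starts.
ON_START = """#!/bin/bash
sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate python3
jupyter nbextension enable --py --sys-prefix widgetsnbextension
jupyter nbextension enable --py --sys-prefix gmaps
source /home/ec2-user/anaconda3/bin/deactivate
EOF
"""

sagemaker = boto3.client('sagemaker')
sagemaker.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName='straylight-gmaps',  # hypothetical name
    OnStart=[{'Content': base64.b64encode(ON_START.encode('utf-8')).decode('utf-8')}],
)
```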

Once the Athena, SageMaker, and Secrets Manager environments are configured, the best option is to utilize and reference the Jupyter notebooks at https://github.com/brevityinmotion/straylight/tree/master/notebooks for the detailed and fully functional code. The remainder of this blog will highlight the key components of the process and the outcomes of the capability.

Querying the data

Using the Jupyter notebooks available in Github (https://github.com/brevityinmotion/straylight/tree/master/notebooks), we can begin retrieving and processing the data.

Notebook configuration

To begin, run the configuration.ipynb notebook (https://github.com/brevityinmotion/straylight/blob/master/notebooks/configuration.ipynb). It will:

  • Install the additional Python libraries (e.g. geoip, gmaps, wget). Not all of the libraries in the configuration notebook are required for this specific use case, but they will be needed for further accelerators in this series.
  • Add two folders to the notebook instance (tools and tmp).
  • Download the latest GeoLite2-City database, unzip it, move it to the tools directory, and clean up the unneeded files.
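As a sketch of that last step, the GeoLite2-City database can be pulled from MaxMind's download endpoint using the license key stored in Secrets Manager (URL format per MaxMind's GeoLite2 documentation; the directory names follow the folders created above):

```python
import os
import tarfile
import urllib.request

GEOLITE_URL = (
    'https://download.maxmind.com/app/geoip_download'
    '?edition_id=GeoLite2-City&license_key={key}&suffix=tar.gz'
)

def download_geolite2(license_key, dest_dir='tools', tmp_dir='tmp'):
    """Download the GeoLite2-City archive and extract the .mmdb into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    os.makedirs(tmp_dir, exist_ok=True)
    archive_path = os.path.join(tmp_dir, 'GeoLite2-City.tar.gz')
    urllib.request.urlretrieve(GEOLITE_URL.format(key=license_key), archive_path)
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            if member.name.endswith('GeoLite2-City.mmdb'):
                member.name = os.path.basename(member.name)  # drop the dated folder prefix
                tar.extract(member, path=dest_dir)
    os.remove(archive_path)  # clean up the downloaded archive

download_geolite2(maxmind_license_key)
```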

Querying the target

Now, let’s begin using the tools-r7sonar.ipynb (https://github.com/brevityinmotion/straylight/blob/master/notebooks/tools-r7sonar.ipynb) notebook.

The code to prepare the query is:
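The notebook contains the canonical version; a minimal sketch of building such a query for a target domain (table name fdns assumed, matching the Athena setup above) could be:

```python
def prepare_query(domain):
    """Build an Athena query for all Sonar FDNS records under the target domain."""
    return (
        "SELECT * FROM fdns "
        f"WHERE name LIKE '%.{domain}' OR name = '{domain}'"
    )

query = prepare_query('microsoft.com')
```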

The code to submit the query to the pre-defined Athena configuration is:
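In essence, this is a start_query_execution call against the database and output location configured earlier (placeholders again; the notebook's version may wrap this differently):

```python
import boto3

athena = boto3.client('athena')

def submit_query(query, database='default',
                 output='s3://<your-athena-results-bucket>/queries/'):
    """Kick off the Athena query and return its execution id."""
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': output},
    )
    return response['QueryExecutionId']

execution_id = submit_query(query)
```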

Once we receive the execution id information for the query, we can begin monitoring for the completion of the query:
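A minimal sketch of such a helper (polling every 5 seconds for up to 5 minutes, then returning a pre-signed URL to the CSV results, as described next) might look like:

```python
import time
import boto3
import pandas as pd

athena = boto3.client('athena')
s3 = boto3.client('s3')

def get_query_results_url(execution_id, timeout=300, poll_interval=5):
    """Wait for the Athena query to finish and return a pre-signed URL to its CSV results."""
    waited = 0
    while waited < timeout:
        state = athena.get_query_execution(QueryExecutionId=execution_id)
        status = state['QueryExecution']['Status']['State']
        if status == 'SUCCEEDED':
            # Athena writes results to s3://<bucket>/<prefix>/<execution_id>.csv
            output = state['QueryExecution']['ResultConfiguration']['OutputLocation']
            bucket, key = output.replace('s3://', '').split('/', 1)
            return s3.generate_presigned_url(
                'get_object', Params={'Bucket': bucket, 'Key': key}, ExpiresIn=3600
            )
        if status in ('FAILED', 'CANCELLED'):
            raise RuntimeError(f'Athena query {status}')
        time.sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError('Athena query did not complete within the timeout')

# The pre-signed URL can be read straight into pandas for processing.
download_url = get_query_results_url(execution_id)
df = pd.read_csv(download_url)
```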

The retrieval of the data takes several steps to make the process smoother. It first checks the Athena query status, waiting up to 5 minutes in 5-second intervals until the query completes. Then, a pre-signed S3 URL is generated to provide URL-based authorization to the results. The return value of this function is the URL for downloading the query results in .csv format.

Although it seems like a lot is going on, this all typically processes in under one minute. Using *.microsoft.com as an example, within about 45 seconds we have the full set of results (18,695 DNS records) loaded into a Pandas dataframe for processing.

Results from searching for *.microsoft.com DNS records in Project Sonar dataset

In the next step, let’s process these entries using the MaxMind GeoIP database. The configuration.ipynb notebook will have already downloaded the GeoLite2-City database locally for faster processing. The next function takes any IP addresses within the specified dataframe column and adds the latitude, longitude, country, and state attributes to each row.

The coolest part about this function is that it is written so that any column of IP addresses can be fed into it to output the mappings. Even though we have taken extensive steps to get to this point, you could reuse this function for geo-mapping any list of IP addresses.

The key point of this code snippet is that it doesn’t matter what the structure of the dataframe is, as long as you specify the dataframe column holding the IP addresses. In this case, it is called ‘value’, but it could be anything.

The following function is where the IP address to geolocation mapping magic happens.
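A sketch of that function, using the geoip2 reader against the GeoLite2-City database downloaded earlier (column and output names follow the description above), could be:

```python
import geoip2.database
import geoip2.errors

def map_geolocation(df, ip_column='value', db_path='tools/GeoLite2-City.mmdb'):
    """Add latitude, longitude, country, and state columns for each IP in ip_column."""
    reader = geoip2.database.Reader(db_path)
    latitudes, longitudes, countries, states = [], [], [], []
    for ip in df[ip_column]:
        try:
            record = reader.city(ip)
            latitudes.append(record.location.latitude)
            longitudes.append(record.location.longitude)
            countries.append(record.country.name)
            states.append(record.subdivisions.most_specific.name)
        except (geoip2.errors.AddressNotFoundError, ValueError):
            # Non-IP values (e.g. CNAME targets) and unmapped addresses are skipped.
            latitudes.append(None)
            longitudes.append(None)
            countries.append(None)
            states.append(None)
    reader.close()
    return df.assign(latitude=latitudes, longitude=longitudes,
                     country=countries, state=states)

df_geo = map_geolocation(df, ip_column='value')
```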

Mapping the 18K+ IP addresses takes only around 1 minute to process. The function returns a dataframe that looks like the following:

Now that we have all of the data and corresponding metadata in a dataframe, we can aggregate it and build insights on the data. In preparation for mapping, the prepare_location function in the previous code sample can be called to aggregate the records into latitude and longitude groupings for plotting.
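A sketch of that aggregation (grouping on latitude/longitude and counting the records in each group, which is what the heatmap weights need) could look like:

```python
def prepare_location(df):
    """Aggregate geolocated records into latitude/longitude groupings with counts."""
    return (
        df.dropna(subset=['latitude', 'longitude'])
          .groupby(['latitude', 'longitude'])
          .size()
          .reset_index(name='count')
    )

df_locations = prepare_location(df_geo)
```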

The resulting, normalized dataframe will look like:

With this data, we will pass the entire dataframe to a heatmap function:
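A minimal version of such a heatmap function using the gmaps library and the API key retrieved earlier could be:

```python
import gmaps

def plot_heatmap(df_locations, api_key):
    """Render an interactive Google Maps heatmap from aggregated lat/lon counts."""
    gmaps.configure(api_key=api_key)
    locations = df_locations[['latitude', 'longitude']]
    weights = df_locations['count']
    fig = gmaps.figure()
    fig.add_layer(gmaps.heatmap_layer(locations, weights=weights))
    return fig

plot_heatmap(df_locations, gmaps_api_key)
```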

The output produces the following interactive heatmap, embedded directly in the Jupyter notebook. Although it is a little difficult to see all of the lighter green areas in this article, when run directly, the map can be zoomed for more granular analysis.

Microsoft.com heatmap — 18,695 records processed

*.microsoft.com heatmap of public DNS

Once this base configuration is set up, we can run additional queries against any domain that we want from start to finish in about 2 minutes. Here are the comparisons between the other top 5 global cloud providers.

Amazon.com heatmap — 21,096 records processed

It is interesting that this distribution closely aligns with Microsoft’s, although Amazon.com shows a large presence in South Africa.

*.amazon.com heatmap of public DNS

Google.com heatmap — 54,508 records processed

Google’s results returned a much heavier presence in California with limited to no coverage in the Seattle region.

*.google.com heatmap of public DNS

Alibaba.com heatmap — 87,263 records processed

Alibaba, as expected, has a much greater presence in China.

*.alibaba.com heatmap of public DNS

IBM.com heatmap — 54,684 records processed

The mappings for IBM differ quite a bit from the others, indicating a much higher concentration of presence in the eastern United States.

*.ibm.com heatmap of public DNS

Conclusion

Although it is neat to have the heat maps, there is tremendous value in the raw data obtained through this process. An enterprise can utilize this for asset management, triage, visibility, attack surface reduction, and general awareness of its externally facing presence. It is also important to be aware that an adversary also has access to this information. This level of detail for tens of thousands of records can be collected, processed, and analyzed in under two minutes using this process.

As we progress through this accelerator series as part of Project Straylight, I am excited to share additional examples in the coming days and weeks, using both large datasets and much smaller use cases and tasks such as metrics generation and technology integrations.

Feel free to reach out with any questions on this. Enjoy!
