Automating the collection of information from company websites

CyberRaya
7 min read · Mar 12, 2024


Today we’ll talk about simple ways to automate the collection of contact information from target company websites: emails, phone numbers, cryptocurrency wallet addresses, social media links, and documents. In short, anything that can serve as an entry point into an investigation.

Most of the techniques described below can also be applied to gather information about sites related to a person of interest, but here we focus on corporate intelligence.

In this article, you will learn how to use the following tools:

  • Nuclei
  • MetaDetective
  • WayBackUrls
  • WayBackMachine Downloader
  • Netlas

If you don’t have Python, Go, and Ruby installed on your computer (or are not sure whether you do), you can use Gitpod to run the examples below.

#1 Getting a list of subdomains using Netlas

If you are sure that the company has just one website, you can skip the first two steps.

There are different ways to find company-related websites. The first is to search for subdomains.

Open Netlas Response Search in your browser and type the following into the search field:

host:*.lidl.com

Then click the Download results icon (on the left), select the CSV format, enter a file name and the number of results, and select the fields you are interested in (host is required; the other fields are optional).


Import the table into Google Sheets and remove duplicates (in the IP field): click Data -> Data cleanup -> Remove duplicates. You can also use Excel, Numbers, or similar tools.

You can also automate the search for subdomains using the Netlas Python Library. You can read more about this in the Netlas CookBook.
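As a starting point, here is a minimal sketch of such automation, assuming you have installed the library (pip install netlas) and generated an API key in your Netlas profile. The response structure used below (items -> data -> host) is an assumption; check the Netlas CookBook if the fields do not match your library version.

import netlas

# Minimal sketch: collect unique subdomains with the Netlas Python Library.
# YOUR_API_KEY, the example query and the items -> data -> host structure
# are assumptions; adjust them to your account and library version.
API_KEY = "YOUR_API_KEY"

conn = netlas.Netlas(api_key=API_KEY)
results = conn.query(query="host:*.lidl.com")

hosts = set()
for item in results.get("items", []):
    host = item.get("data", {}).get("host")
    if host:
        hosts.add(host)

# Save the unique subdomains for the next steps
with open("domains.txt", "w") as f:
    f.write("\n".join(sorted(hosts)))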

#2 Other ways to collect company-related sites using Netlas

You can also search for potentially related sites using other queries. Here are some examples.


Search by organization name in the Netlas Domain Whois search:

“GitHub, Inc.”

Search for sites that are served by company-owned mail servers in DNS Netlas search:

mx:*.parklogic.com

Search for sites that are served by company-owned name servers in DNS Netlas search:

ns:ns?.parklogic.com

There are other ways to find potentially related sites: searching by Google Analytics ID, by emails in Whois contacts, by contact information in SSL certificates, or by favicon. You can read more about them in the Netlas CookBook.

Please note that each example uses a different company’s site, as it is difficult to find a single company for which all of these methods work at once.

Similarly, you can search for related sites in other IP search engines: Shodan, Censys, Fofa, ZoomEye etc.

So far we only have a list of addresses. Now our task is to get as many links as possible to web pages that may contain information useful to the investigation.

#3 Getting a list of site URLs using WayBackUrls


Let’s start by collecting the URLs stored on archive.org and available through the Archive.org CDX API.

Put the list of domains (from the domain column of the Netlas CSV export) into a domains.txt file.

Install WayBackUrls:

go install github.com/tomnomnom/waybackurls@latest

Run WayBackUrls:

cat domains.txt | waybackurls > wayback_urls.txt
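If you prefer not to install the Go tooling, roughly the same data can be pulled directly from the Archive.org CDX API. Here is a rough Python sketch that does this for every domain in domains.txt; heavy use may be rate-limited, so add delays or retries for large lists.

import requests

# Rough sketch: query the Archive.org CDX API for every domain in domains.txt
# and write the archived URLs to wayback_urls.txt.
with open("domains.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

urls = set()
for domain in domains:
    resp = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={
            "url": f"{domain}/*",   # everything under the domain
            "output": "text",
            "fl": "original",       # return only the original URL
            "collapse": "urlkey",   # collapse repeated snapshots of the same URL
        },
        timeout=120,
    )
    urls.update(line for line in resp.text.splitlines() if line)

with open("wayback_urls.txt", "w") as f:
    f.write("\n".join(sorted(urls)))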

#4 Other possible ways of getting a list of site URLs

Unfortunately, it is often not possible to find all URLs available on a company’s website this way, because archive.org is missing a lot of pages and sites. There are a couple of other options, such as scraping search-engine results (e.g., DuckDuckGo) or brute-forcing paths with GoBuster.

Once you have collected links from different sources, make sure that only unique ones are listed.

Merge all files (don’t forget about domains.txt, collected in step #1):

cat wayback_urls.txt duckduckgo_urls.txt gobuster_urls.txt domains.txt > merged_urls.txt

Remove duplicate lines from merged_urls.txt:

sort merged_urls.txt | uniq > urls.txt

Now let’s automate the analysis of these URLs and try to automatically extract the most useful information for the investigation from them.

#5 Extract contact information with Juicy Info Nuclei Templates

Nuclei was originally created to scan websites for various vulnerabilities. It is one of the fastest web scanners in the world! But it can also be used to extract various kinds of information from websites using regular-expression patterns.

Install Nuclei:

go install -v github.com/projectdiscovery/nuclei/v3/cmd/nuclei@latest

Download “Juicyinfo” Nuclei Templates:

git clone https://github.com/cipher387/juicyinfo-nuclei-templates

Let’s try to extract emails from the html code of web pages using a list of links:

nuclei -t juicyinfo-nuclei-templates/juicy_info/email.yaml -l urls.txt

Only one email was found in the screenshot because I reduced the list of URLs to 200, so as not to spend too much time creating the example for this article. When scanning tens of thousands of pages, there are usually many more.

Juicy Info Nuclei templates can also extract:

  • social media links — facebook.yaml, github.yaml, gravatar.yaml, linkedin.yaml, telegram.yaml, twitter.yaml, youtube.yaml;
  • possible nicknames (handles) — nickname.yaml;
  • possible phone numbers — phonenumber.yaml;
  • any links — urls.yaml;
  • IP addresses — ipv4.yaml;
  • links to images — images.yaml;
  • cryptocurrency wallet addresses — bitcoin_address.yaml and the juicy_info_cryptocurrency folder.

If you want to use multiple templates, specify their paths separated by commas (for example, -t juicyinfo-nuclei-templates/juicy_info/email.yaml,juicyinfo-nuclei-templates/juicy_info/phonenumber.yaml).

#6 Extract links to documents with Juicy Info Nuclei Templates and download them

MS Office and PDF documents deserve a separate mention, as they often contain very important information about the company.

The pdf.yaml and officedocuments.yaml templates are used to find links to documents:

nuclei -t juicyinfo-nuclei-templates/juicy_info/pdf.yaml -l urls.txt

Unfortunately, links to documents within HTML code are written in different formats: localhost/file.pdf, //file.pdf, downloads/file.pdf, file.pdf, etc. It is hard to explain in a nutshell how to automate cleaning such a list (hint: use regular expressions).

So let’s just manually edit the URLs found by Nuclei into the https://targetsite.com/file.pdf (xlsx, docx, etc.) format and save them to files_urls.txt.
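If you would rather script that clean-up, here is a rough sketch of the idea using Python’s urllib.parse.urljoin. The input format (a tab-separated file of page URL and raw link) is an assumption; adapt it to however you actually save the Nuclei matches.

from urllib.parse import urljoin

# Rough sketch: resolve relative document links (//file.pdf, downloads/file.pdf,
# file.pdf, etc.) against the page they were found on.
# Assumed input: raw_document_links.txt with lines of "<page URL>\t<raw link>".
normalized = set()
with open("raw_document_links.txt") as f:
    for line in f:
        try:
            page_url, raw_link = line.rstrip("\n").split("\t", 1)
        except ValueError:
            continue  # skip lines that don't match the expected format
        normalized.add(urljoin(page_url, raw_link.strip()))

with open("files_urls.txt", "w") as f:
    f.write("\n".join(sorted(normalized)))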

Then download them using curl:

xargs -a files_urls.txt -I{} curl -# -O {}

As an example, I’ve added links to documents from several different sites to this file, in order to demonstrate the MetaDetective tool in the next section.
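If you prefer to stay in Python, here is a rough equivalent of the curl one-liner above: it downloads every URL from files_urls.txt into the current directory, keeps the remote file name, and skips anything that fails.

import os
import requests
from urllib.parse import urlparse

# Rough Python alternative to the xargs/curl one-liner.
with open("files_urls.txt") as f:
    file_urls = [line.strip() for line in f if line.strip()]

for url in file_urls:
    name = os.path.basename(urlparse(url).path) or "index.html"
    try:
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"Skipping {url}: {exc}")
        continue
    with open(name, "wb") as out:
        out.write(resp.content)
    print(f"Saved {name}")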

#7 Extract potentially useful information with MetaDetective

MetaDetective is a simple Python tool that analyzes the metadata of files in a directory and collects the most important information from them.

Install MetaDetective:

pip install MetaDetective

Install exiftool:

sudo apt install libimage-exiftool-perl

Run MetaDetective:

MetaDetective -d /workspace/company_information_gathering_automation/

This will display the names of the users who worked on the documents, as well as the software used.

The tool also allows you to analyze the metadata of individual files, including files hosted on other servers, and to scan URLs found in a website’s HTML code. In other words, you don’t have to download the files at all (unless, of course, you have other tasks for them); you can simply run MetaDetective against a URL:

python3 src/MetaDetective/MetaDetective.py --scraping --scan --url https://example.com/

#8 Extract contacts and links to documents from sites that are no longer available with Netlas

If your goal is to collect as much company data as possible, it’s also worth trying to find contact information on deleted domains and web pages. There are at least two ways to do this.

The first is to download old versions of the pages from archive.org.

Install wayback_machine_downloader:

gem install wayback_machine_downloader

And run it:

wayback_machine_downloader http://sector035.nl

You can now run Nuclei to scan local files:

nuclei -u /workspace/company_information_gathering_automation/websites  -t juicyinfo-nuclei

The second way is to download web page bodies from Netlas search results. Just select the http -> body field in the export options.

After that, you can extract data from this file using any regular-expression tool: grep, the Python re package, or scraping packages such as Beautiful Soup. You can read more about this in the Scraping section of the Netlas CookBook.
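For example, here is a minimal sketch that pulls email addresses out of a downloaded export with the Python re package. The file name is a placeholder; point it at whatever you exported, whether Netlas page bodies or files downloaded from archive.org.

import re

# Minimal sketch: extract email addresses from a text/HTML dump with a regex.
# netlas_bodies.txt is a placeholder name for your downloaded export.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

with open("netlas_bodies.txt", encoding="utf-8", errors="ignore") as f:
    text = f.read()

for email in sorted(set(EMAIL_RE.findall(text))):
    print(email)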

That’s all for today. These were the simplest methods of gathering information about a company. Perhaps in future articles I will cover automating document analysis with AI, finding hidden files on websites with GoBuster, and many other OSINT techniques in more detail.



Written by CyberRaya

Open Source Intelligence (OSINT), data visualizations, and data science. Sharing knowledge one data point at a time.
