Dark Web Scraping by OSINT - Scraping & Tools

Tushar Suryawanshi
7 min read · Jan 16, 2022


➢ Dark Web Scraping


Web Scraping/harvesting

  • It is used to extract data from websites.
  • It is an automated process, implemented using a bot or crawler.
  • Once the information is collected, it is exported into a form that is more useful to the user.

The web scraping process

1. Identify the targeted website

2. Collect all the URLs of the pages you want to extract data from

3. Request the URLs to get the HTML of each webpage

4. Use locators to find the data in HTML

5. Save data in a JSON, CSV file, or some other structured format
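As a quick illustration, here is a minimal sketch of steps 2–5 on an ordinary webpage, using requests and BeautifulSoup. The URLs and the CSS selector are hypothetical; real locators depend entirely on the target's HTML.

```python
# Minimal sketch of steps 2-5 (hypothetical URLs and selector).
import csv
import json

import requests
from bs4 import BeautifulSoup

urls = ["https://example.com/page1", "https://example.com/page2"]  # step 2

rows = []
for url in urls:
    html = requests.get(url, timeout=30).text        # step 3: fetch the HTML
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.select("h2.title"):             # step 4: locators
        rows.append({"url": url, "title": item.get_text(strip=True)})

# Step 5: save the data in a structured format (JSON and CSV shown).
with open("scraped.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

with open("scraped.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(rows)
```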

Which Data Can You Scrape?

With the help of dark web data mining via web scraping, you can scrape or extract the kinds of data mentioned below:

Brand counterfeiting, cryptocurrency transactions, illicit drug trafficking, censored social media information, illegal product data, data-leak detection, financial-fraud monitoring, hidden content, gambling, the Hidden Wiki, darknet blog data, stolen medical data, blog and forum content, phishing and scam detection, fraudulent activity, and hacking.

  • A crawler is a software program that traverses the World Wide Web by following hypertext links, retrieving the web pages a user requests over the standard HTTP protocol.
  • Crawlers have many uses in different applications and research areas, especially in search engines, which aim to keep their data up to date; search-engine crawlers create a copy of every page they visit for later processing.
  • Another common use of crawlers is web archiving, where crawlers collect and archive large groups of pages periodically for future use.
  • When designing a crawler, we must be aware of the characteristics of the crawled network. For the crawler to access the Tor network anonymously, proxy software should be used to provide a proxy connection over HTTP without caching any data about the current connection; this proxy connects the crawler to the Tor network (a minimal sketch follows).
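A minimal sketch of that proxying idea, assuming a local Tor client listening on its default SOCKS port (9050) and the requests[socks] extra installed; the onion address is hypothetical:

```python
# Route an HTTP request through the local Tor SOCKS proxy so the
# crawler can reach .onion services; socks5h resolves DNS via Tor.
import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_over_tor(url: str) -> str:
    # No caching layer: every call opens a fresh proxied connection.
    resp = requests.get(url, proxies=TOR_PROXIES, timeout=60)
    resp.raise_for_status()
    return resp.text

# Example (hypothetical onion address):
# html = fetch_over_tor("http://exampleonionaddress.onion/")
```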
Proposed System Architecture

  • Here is an example of a system developed using Scrapy (a Python scraping framework).
  • It connects to dark websites on the Tor network through Tor software integrated with a VPN, which ensures the security, confidentiality, and anonymity of the crawler against the target sites by rotating IP addresses.
  • Once the Tor proxy has been established, the crawler starts from the website's URL and handles the login interface with credentials.
  • The crawler is designed specifically for the website under study, according to its structure and the hyperlink structure among its pages; i.e., a different crawler design must be customized for each website to handle different interfaces and different HTML structures (a rough sketch follows below).
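A rough sketch of such a per-site spider, with a hypothetical onion address and selectors. It assumes an HTTP-to-SOCKS bridge such as Privoxy at 127.0.0.1:8118 in front of Tor (Scrapy speaks HTTP proxies rather than SOCKS), and it omits the login step:

```python
# Minimal Scrapy spider routed through a local HTTP->Tor proxy.
import scrapy

class OnionSpider(scrapy.Spider):
    name = "onion_spider"
    start_urls = ["http://exampleonionaddress.onion/"]  # hypothetical
    custom_settings = {"DOWNLOAD_DELAY": 2}  # hidden services are slow

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={"proxy": "http://127.0.0.1:8118"})

    def parse(self, response):
        # Locators depend entirely on the target site's HTML structure.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # Follow in-site hyperlinks through the same proxy.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(
                href, callback=self.parse,
                meta={"proxy": "http://127.0.0.1:8118"},
            )
```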

Challenges faced during web scraping

The ability to index a webpage may be blocked for several reasons:

  • The owner of the webpage has protected it with a password, which prevents crawlers from accessing it.
  • The number of times a page may be accessed can be limited, so the page might become unavailable before the crawler reaches it.
  • The robots.txt file of the website tells the crawler not to crawl the site, or certain parts of it.
  • The page is hidden or unlinked from any other page on the website or on other servers, so it is unreachable unless the full URL is already known.
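For the robots.txt case, Python's standard library can check whether a crawler is allowed to fetch a given path; a minimal sketch with a hypothetical site and user agent:

```python
# Check robots.txt before crawling (hypothetical URL and user agent).
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the rules

if rp.can_fetch("MyCrawler/1.0", "https://example.com/private/page"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```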

➢ OSINT Tools for the Dark Web


1) TORBOT

  • An OSINT tool for the dark web, developed in Python
  • The main goal of this tool is to gather information from the dark web with the help of data-mining algorithms; it also helps surface related data and produces a tree graph
  • Onion crawler (.onion)
  • Returns page titles and addresses with a short description of each website
  • Saves links to a database, collects emails from sites, and saves crawl information to a JSON file
  • Crawls custom domains; social media integration

2) DarkScrape

  • An OSINT tool for finding the available media links on Tor sites
  • Downloadable media; easily scrapes from a single URL or from files
  • Face-recognition methods

3) Fresh Onions

  • Finds hidden services from many clearnet sources
  • Optional full-text Elasticsearch support
  • Finds SSH fingerprints & email addresses across hidden services
  • Finds bitcoin addresses across hidden services
  • Shows the incoming & outgoing links to onion domains
  • Up-to-date live hidden service status
  • Port scanning
  • Searches for “interesting” URL paths; useful for 404 detection
  • Automatic language & Fuzzy clone detection
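To illustrate the kind of pattern matching such a service performs on crawled pages, here is a hedged sketch (not Fresh Onions' actual code) that spots .onion links, legacy Base58 Bitcoin addresses, and email addresses with regular expressions:

```python
# Spot .onion links (v2/v3), Bitcoin addresses, and emails in page text.
import re

ONION_RE = re.compile(r"\b[a-z2-7]{16}(?:[a-z2-7]{40})?\.onion\b")
BITCOIN_RE = re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_indicators(page_text: str) -> dict:
    # De-duplicate and sort so repeated crawls diff cleanly.
    return {
        "onions": sorted(set(ONION_RE.findall(page_text))),
        "bitcoin": sorted(set(BITCOIN_RE.findall(page_text))),
        "emails": sorted(set(EMAIL_RE.findall(page_text))),
    }
```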

4) Onioff

  • A simple and easy tool, written in Python, to examine deep web URLs

5) TorCrawl

  • TorCrawl not only crawls hidden services on Tor, it also helps extract the code of the services’ webpages

6) Photon

  • Photon is a fast crawler designed for OSINT; it works as a crawler & OSINT check tool
  • Using Photon, we can easily check different types of online resources & information about a target

This tool also has add-ons like,

a) dnsdumpster.com b) findsubdomains.com c) web.archive.org

The tool extracts the following data while crawling:

a) URLs with parameters (example.com/gallery.php?id=2) b) Emails, social media accounts c) Various types of files d) Secret keys e) JavaScript files & the endpoints present in them f) Subdomain information & DNS-related data
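A rough illustration (not Photon's actual code) of two of the extractions listed above, pulling parameterized URLs and email addresses out of raw HTML with regular expressions:

```python
# Extract URLs with query parameters and email addresses from HTML.
import re

PARAM_URL_RE = re.compile(r"https?://[^\s\"'<>]+\?[^\s\"'<>]+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_from_html(html: str) -> dict:
    return {
        "param_urls": sorted(set(PARAM_URL_RE.findall(html))),
        "emails": sorted(set(EMAIL_RE.findall(html))),
    }

# Example:
# extract_from_html('<a href="https://example.com/gallery.php?id=2">x</a>')
# -> {'param_urls': ['https://example.com/gallery.php?id=2'], 'emails': []}
```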

7) Hakrawler

  • A simple, fast web application crawler
  • This tool helps with the easy, quick discovery of endpoints in web applications.

It can be used to discover:

a) Forms b) Endpoints c) Subdomains d) JavaScript files

8) OSINT-SPY

  • A tool to search using OSINT
  • We can use it to perform OSINT scans on online resources and look up information on an email address, domain, IP address, or organization. It gathers information easily.
  • This tool is used by security researchers, penetration testers, and cybercrime investigators to find confidential information about targeted victims
  • It can return a full name, given name, gender, employment details, social profiles, and photos

9) Gasmask

  • OSINT Information Gathering Tool
  • Gasmask is an all-in-one OSINT tool. It is used by bug hunters, penetration testers, and other cybersecurity researchers to collect information from publicly available sources
  • This tool gathers information from Ask, censys.io, Bing, dnsdumpster, and VirusTotal
  • There are two modes in this tool: basic mode and non-Google mode

10) h8mail

  • A password-breach hunting and email OSINT tool
  • h8mail can search through billions of leaked credentials to discover passwords
  • It can be used to find plain-text passwords from massive data breaches using only a person’s email address. The default source is Scylla

11) Skiptracer

  • An OSINT scraping framework
  • For the initial attack vectors in recon, we normally need to pay to get data-mining results (e.g., Maltego).
  • Skiptracer is built to query and parse third-party services in an automated fashion, to increase productivity while conducting background research & investigation.
  • Using this tool we can gather license-plate OSINT data and retrieve vehicle information

12) Final Recon

  • All-In-One Web Reconnaissance OSINT Tool
  • This tool is mainly used for web reconnaissance. It is a fast and simple Python script.

This tool can extract the data such as,

a) Header Information b) WHOIS c) SSL Certificate Details d) Crawler
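A hedged sketch of how two of those items could be collected in Python (not FinalRecon's actual code; the host is hypothetical, and a WHOIS lookup would need an extra library such as python-whois):

```python
# Fetch HTTP response headers and SSL certificate details for a host.
import socket
import ssl

import requests

host = "example.com"  # hypothetical target

# a) Header information
resp = requests.head(f"https://{host}", timeout=30)
for name, value in resp.headers.items():
    print(f"{name}: {value}")

# c) SSL certificate details
ctx = ssl.create_default_context()
with ctx.wrap_socket(socket.socket(), server_hostname=host) as sock:
    sock.connect((host, 443))
    cert = sock.getpeercert()
print(cert["subject"], cert["notAfter"])
```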

➢ Conclusion

  • The Dark Web has proven a very useful and reliable tool in the hands of individuals wishing to engage in illegal, criminal, or terrorist activities, such as child trafficking and pornography, setting their sights on great economic or political benefits without being identified by government authorities and security agencies worldwide.
  • To this end, law enforcement agencies need to become more agile when dealing with criminality on the Dark Web, in particular on its hidden service markets, and need to invest in new training and technologies.
  • Current technological advancements and research efforts in the fields of Information Retrieval, Network Analysis, and Digital Forensics provide LEAs with numerous opportunities to overcome the limitations and restrictions that the anonymous nature of the Dark Web imposes, so as both to prevent criminals from taking advantage of the anonymity veil existing in several darknets and to suppress any illegal acts occurring in these networks.
  • ICT technologies have reached a substantial level of maturity and can reliably support LEAs; the tools they provide must therefore be exploited in day-to-day real-world investigations in the upcoming years.
  • At the same time, it is imperative that LEAs ensure the proper use of these technologies, to protect freedom of speech and the human rights of users who rely on Dark Web anonymity with intentions beneficial to society. In this context, it is clear that gathering OSINT from the Dark Web is an issue of vital significance for the ongoing effort to diminish the potential threats that imperil modern societies.
