THM Google Dorking

Dehni

Published in

Dehni’s Notes

4 min readOct 25, 2023

THM Google Dorking Module Notes

What are Crawlers?

Crawlers discover content through various means. One being by pure discovery, where a URL is visited by the crawler and information regarding the content type of the website is returned to the search engine.

Another method crawlers use to discover content is by following any and all URLs found from previously crawled websites. Much like a virus in the sense that it will want to traverse/spread to everything it can.

In the diagram above, “mywebsite.com” has been scraped as having the keywords as “Apple” “Banana” and “Pear”. These keywords are stored in a dictionary by the crawler, who then returns these to the search engine i.e. Google. Because of this persistence, Google now knows that the domain “mywebsite.com” has the keywords “Apple”, “Banana” and “Pear”. As only one website has been crawled, if a user was to search for “Apple”…“mywebsite.com” would appear. This would result in the same behaviour if the user was to search for “Banana”. As the indexed contents from the crawler report the domain as having “Banana”, it will be displayed to the user.

However, as we previously mentioned, crawlers attempt to traverse, termed as crawling, every URL and file that they can find! Say if “mywebsite.com” had the same keywords as before (“Apple”, “Banana” and “Pear”), but also had a URL to another website “anotherwebsite.com”, the crawler will then attempt to traverse everything on that URL (anotherwebsite.com) and retrieve the contents of everything within that domain respectively.

Imagine if a website had multiple external URL’s (as they often do!) That’ll require a lot of crawling to take place. There’s always the chance that another website might have similar information as of that another website crawled — right? So how does the “Search Engine” decide on the hierarchy of the domains that are displayed to the user?

Answers:
1- index
2- crawling
3- keywords

SEO

Search Engine Optimisation or SEO is a prevalent and lucrative topic in modern-day search engines. In fact, so much so, that entire businesses capitalise on improving a domains SEO “ranking”. At an abstract view, search engines will “prioritise” those domains that are easier to index. There are many factors in how “optimal” a domain is — resulting in something similar to a point-scoring system.

There is a lot of complexity in how the various search engines individually “point-score” or rank these domains — including vast algorithms.

Aside from the search engines who provide these “Crawlers”, website/web-server owners themselves ultimately stipulate what content “Crawlers” can scrape. Search engines will want to retrieve everything from a website — but there are a few cases where we wouldn’t want all of the contents of our website to be indexed!(like admin login page)

Robots.txt

This file is the first thing indexed by “Crawlers” when visiting a website.

This file must be served at the root directory — specified by the webserver itself.

The text file defines the permissions the “Crawler” has to the website. For example, what type of “Crawler” is allowed (I.e. You only want Google’s “Crawler” to index your site and not MSN’s). Moreover, Robots.txt can specify what files and directories that we do or don’t want to be indexed by the “Crawler”.