Content Discovery: Automated and Manual
I recently completed TryHackMe’s Content Discovery room as well as the third challenge of their Advent of Cyber 3 (2021) event, and here’s what I’ve learned!
What is Content Discovery?
Content, in the context of web applications, can be files, folders, pictures, paths, and website features. Content discovery is about finding the content that isn’t immediately presented to visitors and often wasn’t intended for public access. For example: backup files, administration panels, and login portals intended for employee use only.
There are multiple ways to discover content on a web application!
- Automated
- Manual
- OSINT (Open-Source Intelligence)
In this article I’ll discuss two of them: how to discover content manually and how to discover it with automation tools.
Manual Content Discovery
A great place to start discovering content is the robots.txt file. robots.txt is a plain-text document that tells search engine crawlers which parts of a website they are and are not allowed to crawl.
The robots.txt file of a website can be viewed like so:
https://example.com/robots.txt
Reviewing the contents of a robots.txt file will give you a great list of locations the website owners didn’t necessarily want discovered!
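If the robots.txt file is long, it can help to pull out just the paths named in its Disallow rules. Here’s a minimal sketch of that idea; the function name list_disallowed is my own invention for illustration, and it only handles simple Disallow lines rather than the full robots.txt rules.

```shell
# list_disallowed (hypothetical helper): read a robots.txt on stdin and
# print the unique paths named in its Disallow rules.
list_disallowed() {
    tr -d '\r' | awk '
        {
            sub(/#.*/, "")                    # drop trailing comments
            if (tolower($0) ~ /^[ \t]*disallow[ \t]*:/) {
                sub(/^[^:]*:[ \t]*/, "")      # remove the field name
                sub(/[ \t]+$/, "")            # trim trailing whitespace
                if ($0 != "") print           # an empty rule blocks nothing
            }
        }' | LC_ALL=C sort -u
}
```

You could feed it a live file with something like: curl -s https://example.com/robots.txt | list_disallowed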
Sometimes HTTP headers can reveal useful information like the web server software and even the programming language in use.
Running the following command against your target website will print the headers, because the -v flag turns on verbose output! (If you don’t have curl installed on your Linux distro, you can learn more about it and how to install it here.)
user@machine$ curl https://example.com -v
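Verbose output can be noisy, so here’s a small sketch that keeps only the two headers that most often reveal the web server software and the programming language. The function name interesting_headers and the choice of Server and X-Powered-By are my own assumptions; a site may expose other headers, or none of these at all.

```shell
# interesting_headers (hypothetical helper): read raw HTTP response headers
# on stdin and keep only the ones that tend to name the server software
# or the language or framework behind the site.
interesting_headers() {
    tr -d '\r' | grep -i -E '^(server|x-powered-by):'
}
```

One way to use it is with a headers-only request: curl -sI https://example.com | interesting_headers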
Automated Content Discovery
Automated discovery means using tools to discover content instead of doing it by hand. Automation lets you make thousands, or even millions, of requests to a web server far faster than you could manually. dirb is an excellent command-line tool for automating file and directory discovery! (It is often confused with the similarly named DirBuster, a separate GUI tool that does the same job.)
Here’s how it works:
dirb takes a wordlist containing the names of the files and directories you’d like to search for, then requests each one from the web server to check whether it exists on the website.
You can create your own wordlist that contains all the things you’d like to search for. For example, let’s say we create a .txt document titled wordlist with the following contents:
admin/
docs/
config/
If you provide dirb with the URL of the website and the full path of your wordlist, it will scan your target website for the folders listed in your wordlist! For example:
user@machine$ dirb https://example.com /home/documents/wordlist.txt
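To make the idea concrete, here’s a rough sketch of the same wordlist loop written with curl. This is not how the real tool is implemented, just the basic technique: request each entry and report the ones that don’t come back as 404. The function names status_of and scan_paths are hypothetical, and treating every non-404 answer as a hit is a simplifying assumption.

```shell
# status_of URL: print only the HTTP status code of one request.
status_of() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# scan_paths BASE_URL WORDLIST: request BASE_URL/<entry> for every line of
# WORDLIST and print "<status> <url>" for each entry that looks present.
scan_paths() {
    base=${1%/}                          # drop one trailing slash, if any
    while IFS= read -r entry || [ -n "$entry" ]; do
        [ -z "$entry" ] && continue      # skip blank lines
        code=$(status_of "$base/$entry")
        case "$code" in
            404|000) ;;                  # not found, or no response at all
            *) printf '%s %s\n' "$code" "$base/$entry" ;;
        esac
    done < "$2"
    return 0
}
```

With the wordlist from earlier, it could be run as: scan_paths https://example.com /home/documents/wordlist.txt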
You don’t always have to create your own wordlists! Here is a collection of open-source wordlists! There you can find a wordlist for default credentials as well as frequently used usernames and passwords!
This is what I’ve learned so far about Content Discovery! Thank you for reading, and I hope you were able to learn something too! You can get some hands-on practice by following TryHackMe’s Content Discovery room!