To combat this, I decided to write a bash script to extract the URLs from the src and href attributes, then pass them to cURL to perform a HEAD request. After curling the page headers, the script would grep for 404s. As you can imagine, this was tedious and far from perfect. Enter wget.
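For the curious, here is a rough sketch of what that tedious approach might look like (the function names are mine, not from the original script):

```shell
#!/usr/bin/env bash

# Pull URLs out of src="..." and href="..." attributes in an HTML file.
extract_urls() {
  grep -oE '(src|href)="[^"]+"' "$1" | sed -E 's/^(src|href)="([^"]+)"/\2/'
}

# HEAD-request a single URL with cURL and report if it comes back 404.
check_url() {
  local url="$1"
  if curl -sI "$url" | grep -q ' 404'; then
    echo "BROKEN: $url"
  fi
}
```

You would still have to feed it every page by hand, which is exactly why it didn't scale.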
In On A Horse Rides A Knight In Shining Armor
After some digging I found a much better solution: wget can spider a website. It looks like this:
wget --spider -o ~/example.com.log -e robots=off -w 1 -r -p http://www.example.com
What it does:
--spider, tells wget to perform a HEAD request, i.e. do not download the content.
-o ~/example.com.log, tells wget where to save the command output (headers and responses from each request).
-e robots=off, tells wget to ignore the robots.txt file. Info about robots.txt can be found at http://www.robotstxt.org.
-w 1, tells wget to wait 1 second between requests, lessening server load.
-r, tells wget to recursively follow links.
-p, tells wget to get all page requisites (i.e. images, CSS, JS, etc.) needed to display an HTML page. This is how we can find broken image links too.
http://www.example.com, URL to crawl.
Reading the logs
You may be looking at the log file and thinking to yourself “What the he🏒🏒 is this?” Grep to the rescue.
grep -B 2 ' 404 ' ~/example.com.log
This will show all of the 404 status codes and the 2 lines before each one. We want to see those lines so that we can view the broken link(s). The output should look something like this (unless you have no broken links):
--2017-08-11 11:02:42-- http://localhost:3000/static/0/image/banner.jpg
Reusing existing connection to localhost:3000.
HTTP request sent, awaiting response... 404 Not Found
Unfortunately this doesn’t show you where on the page the broken link is located, but at least it gives you the broken links.
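If you want just a clean list of the broken URLs rather than the raw grep output, you can post-process the same log. Here's a small sketch (broken_urls is a hypothetical helper name, and it assumes the "--date time--  URL" request-line format shown above):

```shell
#!/usr/bin/env bash

# Given a wget --spider log, print each URL that came back 404 by
# grabbing the "--date time--  URL" request line that sits two lines
# above each 404 response, then keeping only the URL field.
broken_urls() {
  grep -B 2 ' 404 ' "$1" | grep '^--' | awk '{print $3}' | sort -u
}

# Usage: broken_urls ~/example.com.log
```

From there you can hunt down each URL in your templates or CMS by hand.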
- Wget man page: https://linux.die.net/man/1/wget
- Grep man page: https://linux.die.net/man/1/grep
- Use Homebrew to install wget on Mac: https://brew.sh/
- Wget for Windows: https://eternallybored.org/misc/wget/ (please note I am not a Windows user and have not verified these packages; use at your own risk)
- Grep for windows: http://gnuwin32.sourceforge.net/packages/grep.htm
- Alternative ways to get wget and grep on Windows: the Chocolatey package manager, or Cygwin