Finding 404s in your site with wget

A few weeks ago I was tasked with redesigning/simplifying our company website. However, after reimplementing the first 30 or so pages, I began to run into invalid routes. 404s all over the place: PDFs not downloading, JavaScript files not loading, and so on.
To combat this I decided to write a bash script to extract the URLs from src and href attributes, then pass them to cURL to perform a HEAD request. After curling the page headers, the script would grep for 404s. As you can imagine, this was becoming tedious and was far from perfect. Enter wget.
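For context, the curl approach looked roughly like this. This is a minimal sketch, not the original script; the file name and base URL are placeholders of my own:

```shell
# Sketch of the curl-based approach: pull URLs out of src/href attributes,
# then HEAD-request each one and report 404s.
# PAGE and BASE are hypothetical placeholders.
PAGE="index.html"
BASE="http://www.example.com"

grep -oE '(src|href)="[^"]+"' "$PAGE" \
  | sed -E 's/^(src|href)="([^"]+)"$/\2/' \
  | while read -r url; do
      # Resolve relative URLs against the base (naive joining)
      case "$url" in
        http*) full="$url" ;;
        *)     full="$BASE/$url" ;;
      esac
      # -I sends a HEAD request; -w prints only the status code
      status=$(curl -s -o /dev/null -I -w '%{http_code}' "$full")
      [ "$status" = "404" ] && echo "404: $full"
    done
```

Running one HEAD request per URL, per page, is exactly why this got tedious.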
In On A Horse Rides A Knight In Shining Armor
After some digging I found a much better solution: with wget you can spider a website. It looks like this: wget --spider -o ~/example.com.log -e robots=off -w 1 -r -p http://www.example.com.
What it does:
- --spider, tells wget to perform a HEAD request, i.e. do not download the content.
- -o ~/example.com.log, tells wget where to save command output (headers and responses from the requests).
- -e robots=off, tells wget to ignore the robots.txt file. Info about robots.txt can be found at http://www.robotstxt.org.
- -w 1, tells wget to wait 1 second between requests, lessening server load.
- -r, tells wget to recursively follow links.
- -p, get all page requisites (i.e. images, CSS, JS, etc.) needed to display the HTML page. This is how we can find broken image links too.
- http://www.example.com, the URL to crawl.
Reading the logs
You may be looking at the log file and thinking to yourself “What the he🏒🏒 is this?” Grep to the rescue.
grep -B 2 ' 404 ' ~/example.com.log
This will show all of the 404 status codes and the 2 lines before each one. We want those lines so that we can see the broken link(s). The output should look something like this (unless you have no 404s):
--2017-08-11 11:02:42-- http://localhost:3000/static/0/image/banner.jpg
Reusing existing connection to localhost:3000.
HTTP request sent, awaiting response... 404 Not Found
Unfortunately this doesn’t show you where the broken link is located, but at least it gives you the broken links.
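If you just want a deduplicated list of the broken URLs, you can pipe the grep output through a second grep. A sketch, assuming the log format shown above (the request line two lines before each 404 starts with a timestamp followed by the URL):

```shell
# Extract only the broken URLs from the wget log:
# 1. grab each 404 line plus the 2 lines before it,
# 2. keep anything that looks like a URL,
# 3. deduplicate.
grep -B 2 ' 404 ' ~/example.com.log \
  | grep -oE 'https?://[^ ]+' \
  | sort -u
```

That list is what you hand back to whoever owns the content, or diff against on the next crawl.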
Further Resources
- Wget man page: https://linux.die.net/man/1/wget
- Grep man page: https://linux.die.net/man/1/grep
- Use Homebrew to install wget on macOS: https://brew.sh/
- Wget for Windows: https://eternallybored.org/misc/wget/ (please note I am not a Windows user and have not verified these packages; use at your own risk)
- Grep for Windows: http://gnuwin32.sourceforge.net/packages/grep.htm
- Alternative ways to get wget and grep on Windows: the Chocolatey package manager, or Cygwin
