Finding 404s in your site with wget

Austin Barrett
Aug 28, 2017

A few weeks ago I was tasked with redesigning/simplifying our company website. However, after reimplementing the first 30 or so pages, I began to run into invalid routes. 404s all over the place. PDFs not downloading, JavaScript files not loading, so on and so forth.

To combat this I decided to write a bash script to extract the URLs from src and href attributes, then pass them to cURL to perform a HEAD request. After curling the page header, the script would grep for 404s. As you can imagine, this was becoming tedious and was not perfect. Enter wget.
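
For reference, that first approach might look something like the sketch below. This is not the original script: the function names are made up, and it assumes the links are absolute URLs written with double quotes.

```shell
#!/usr/bin/env bash
# Sketch of the curl-based approach: pull src/href values out of a page,
# send a HEAD request to each one, and report the ones that return 404.
# All function names here are hypothetical.

# Print every src="..." or href="..." value found in HTML read from stdin.
extract_links() {
  grep -oE '(src|href)="[^"]*"' | sed -E 's/^(src|href)="//; s/"$//'
}

# Print the HTTP status code returned by a HEAD request to "$1".
head_status() {
  curl -s -o /dev/null -I -w '%{http_code}' "$1"
}

# Read URLs from stdin, one per line, and print those that answer 404.
report_404s() {
  local url
  while IFS= read -r url; do
    if [ "$(head_status "$url")" = "404" ]; then
      echo "404 $url"
    fi
  done
}
```

Something like `curl -s http://www.example.com/ | extract_links | report_404s` would check a single page. Relative links would still need to be resolved against the page URL, and every page has to be fed in by hand, which is a big part of why this got tedious.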

In On A Horse Rides A Knight In Shining Armor

After some digging I found a much better solution: wget can spider a website. The command looks like this:

wget --spider -o ~/example.com.log -e robots=off -w 1 -r -p http://www.example.com

What it does:

  • --spider, tells wget to check that each URL exists instead of saving its content; for most files that means a HEAD request only.
  • -o ~/example.com.log, tells wget where to save its log output (the requests it makes and the responses it gets).
  • -e robots=off, tells wget to ignore the robots.txt file. Info about robots.txt can be found at http://www.robotstxt.org .
  • -w 1, tells wget to wait 1 second between requests, lessening server load.
  • -r, tells wget to recursively follow links.
  • -p, tells wget to get all page requisites (i.e. images, CSS, JS, etc.) needed to display an HTML page. This is how we can find broken image links too.
  • http://www.example.com, URL to crawl.
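
If the short flags are hard to remember, the same crawl can be written with wget's long option names. The wrapper below is only a sketch: spider_site and the WGET override (handy for a dry run) are hypothetical names, not part of wget.

```shell
#!/usr/bin/env bash
# The same crawl as above, spelled out with wget's long option names.
# spider_site and the WGET variable are hypothetical; set WGET=echo to
# print the command instead of running it.

# Crawl the site at "$1" and write wget's log to "$2".
spider_site() {
  local url="$1" log="$2"
  "${WGET:-wget}" \
    --spider \
    --output-file="$log" \
    --execute robots=off \
    --wait=1 \
    --recursive \
    --page-requisites \
    "$url"
}
```

Calling `spider_site http://www.example.com ~/example.com.log` should behave like the one-liner, just easier to read back in six months.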

Reading the logs

You may be looking at the log file and thinking to yourself “What the he🏒🏒 is this?” Grep to the rescue.

grep -B 2 ' 404 ' ~/example.com.log

This will show every 404 status line along with the 2 lines before it. We want to see these lines so that we can view the broken link(s). The output should look something like this (unless you have no 404s):

--2017-08-11 11:02:42--  http://localhost:3000/static/0/image/banner.jpg
Reusing existing connection to localhost:3000.
HTTP request sent, awaiting response... 404 Not Found

Unfortunately this doesn’t show you where the broken link is located, but at least it gives you the broken links.
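
If you only want the broken URLs, without the surrounding log lines, a little awk can pull them out. This is a sketch that assumes the log looks like the sample above, where each request starts with a `--timestamp--  URL` line; broken_urls is a made-up name.

```shell
#!/usr/bin/env bash
# Print the URL of every request in a wget log that ended in a 404.
# Assumes request lines start with "--<timestamp>--" and end with the URL.
# broken_urls is a hypothetical helper name.
broken_urls() {
  awk '
    /^--[0-9]/ { url = $NF }   # request line: remember its URL (last field)
    / 404 /    { print url }   # 404 response: report the remembered URL
  ' "$1" | sort -u
}
```

Running `broken_urls ~/example.com.log` should give one broken URL per line, with duplicates removed.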

Further Resource
