The surprising subtleties of link checking

I recently needed to check the links in a dataset to find out whether they were active or how they were failing. Flippantly, I thought that this sounded pretty easy; after all, you could just script a GET on each link and record the result, and then you’d know, right? Right?

Wrong.

This is a story about what I learned in the surprisingly subtle and imprecise world of link checking.

First, it’s worth noting that I couldn’t simply sign up to a link-checking service — the dataset was “offline”, so it wasn’t as if I could have a link checker crawl my site and tell me the status of all the links. Maybe there’s a service out there that I could have used, but at the time I thought it would be quicker to just do it myself (I reckoned on 30ish minutes of scripting) than to try to find someone to do it for me.

I began with the following (in Python):

import requests

for row in records:
    resp = requests.get(row.get("url"))
    print(resp.status_code, row.get("url"))

This uses the brilliant requests library, and calls requests.get on every URL in the set of records (which were coming out of a spreadsheet), then prints the response status code alongside the URL.

What happened next was that the script fell over, as requests threw an exception: an SSLError. One of the sites had something wrong with its SSL certificate; that might be a problem for that site, but it doesn’t make the URL wrong. We’d better stop verifying certificates or this will happen a lot. But not all SSLErrors are necessarily caused by a certificate failing to verify, so we’d better still catch whatever SSL errors remain:

for row in records:
    try:
        resp = requests.get(row.get("url"), verify=False)
        print(resp.status_code, row.get("url"))
    except requests.exceptions.SSLError:
        print("SSL Error", row.get("url"))

That should be fine now, so I ran it again.

It seemed very slow. Should it be that slow? I’d better kill the script and have a look.

Some sites are not responding and I haven’t set a timeout; we’d better do that:

for row in records:
    try:
        resp = requests.get(row.get("url"), timeout=5, verify=False)
        print(resp.status_code, row.get("url"))
    except requests.exceptions.SSLError:
        print("SSL Error", row.get("url"))
    except requests.exceptions.Timeout:
        print("Timeout", row.get("url"))

We’ve applied a short (5 second) timeout, and a catch for any URL that doesn’t respond in that time. This is great, because it means the script should complete in a reasonable time, but it also introduces our first bit of uncertainty into the process — what if a URL does resolve, but for some reason it takes 6 seconds to respond? Does a URL have to be responsive to be considered valid, and if not, how long should you have to wait to find out if it resolves?

Setting those questions aside for the moment, let’s push on and get our script to complete. Over and over I ran it, each time getting a little further before encountering a new error, which I added to the list of except clauses. I won’t bore you with the details, but in the end our exception handling looks like this:

for row in records:
    try:
        resp = requests.get(row.get("url"), timeout=5, verify=False)
        print(resp.status_code, row.get("url"))
    except requests.exceptions.SSLError:
        print("SSL Error", row.get("url"))
    except requests.exceptions.Timeout:
        print("Timeout", row.get("url"))
    except requests.exceptions.ConnectionError:
        print("Connection Error", row.get("url"))
    except requests.exceptions.InvalidSchema:
        print("Invalid Schema", row.get("url"))
    except requests.exceptions.MissingSchema:
        print("Missing Schema", row.get("url"))
    except requests.exceptions.TooManyRedirects:
        print("Too Many Redirects", row.get("url"))

There are a few new things to unpack here:

  • ConnectionError is some non-specific issue with connecting to the URL. It acts as a bit of a catch-all for where a more-specific error doesn’t exist
  • InvalidSchema means our URLs are actually wrong, like someone misspelled http as htp or somesuch
  • MissingSchema means our URLs are totally missing the http(s) part
  • TooManyRedirects means that each time requests asked for the site it got a response with a 3xx status code which redirected us to another URL which also offered a 3xx status code and so on and so on until requests decided that it was being given the run around and gave up.
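
Rather than just printing each result, at this scale it helps to tally the outcomes so they can be analysed afterwards. Here’s a minimal sketch of that idea (the use of Counter and the outcome labels are my own, not part of the original script):

from collections import Counter

import requests

outcomes = Counter()

for row in records:
    url = row.get("url")
    try:
        resp = requests.get(url, timeout=5, verify=False)
        outcomes[resp.status_code] += 1
    except requests.exceptions.SSLError:
        outcomes["SSL Error"] += 1
    except requests.exceptions.RequestException as e:
        # RequestException is the base class for all of the errors above,
        # so anything else is recorded under its exception name
        outcomes[type(e).__name__] += 1

print(outcomes.most_common())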

If we dig into the data of our responses we see some pretty interesting and difficult-to-handle issues emerge.

Certainly most of our URLs resolve with something helpful like a 200 or a 404, and those with errors like Invalid or Missing Schemas are easy for us to trace to our own bad data. It’s the Connection Errors and the Too Many Redirects that give us pause.

I wondered why I was seeing such strange errors. I’d been prepared for a lot of 404s, and perhaps a selection of 500s, but I hadn’t expected so many other outcomes. There were the redirects, of course, but also 403 Forbidden responses and timeouts throughout.

I started to hit the failing URLs manually by placing them into my browser bar to discover something unexpected: many of the URLs resolved to pages reasonably quickly and effortlessly. No errors, certainly no infinite redirects. There were a few timeouts, and it was clear some of the pages had very long load times for whatever reason.

What’s going on here?

After batting it about a bit, experimenting with curl on the command line, and racking my brain, I hypothesised a number of possible explanations:

Intermittent Network Issues — there’s an intermittent problem somewhere in the network between us and the site. In our dataset we’re looking at some URLs that are hosted by organisations based in the developing world, so network connections can vary in speed and quality.

Active Discrimination — The site is discriminating against machine-to-machine connections. This is more common than you’d think, with sites either trying to tune their content to the user based on HTTP headers, cookies, sessions, etc. Some sites explicitly try to block robots and other non-human actors from accessing their content, for example to prevent text scraping.

Temporary Technical Issues — The site might be experiencing a temporary or protracted technical problem.

The URL is actually broken — In some cases, when I put the URL into the browser it really didn’t connect, so we mustn’t forget that the URL might still be wrong!

By this point I had learned that link checking is not the straightforward “does it resolve or doesn’t it?” task that I’d naïvely assumed. Instead, there are a number of pitfalls and uncertainties that you need to account for. I’m also in no doubt that the cases I found (looking at around 10,000 URLs) are only a subset of the challenges out there.

Based on this experience, I came to understand a few things about link checking:

Link checking is not a one-time activity — just because something doesn’t resolve today doesn’t mean it won’t resolve tomorrow. You should try repeatedly over a period of time to access a URL before giving up on it.
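
For example, something along these lines would do it — a sketch only, with the number of attempts and the delay between them chosen arbitrarily (in practice you might spread the re-checks over days rather than minutes):

import time

import requests

def is_probably_dead(url, attempts=3, delay=60):
    # Re-check a URL several times, spaced out, before declaring it dead
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=5, verify=False)
            if resp.status_code < 400:
                return False
        except requests.exceptions.RequestException:
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return True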

Timeouts are arbitrary constraints — be clear about what you’re checking for. If you find URLs that time out, perhaps try them again with progressively longer timeouts until you reach a point beyond which you can reasonably call them “broken”. How long that should be is entirely subjective.
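
One way to do that (again just a sketch; the ladder of timeouts here is arbitrary) is to retry timed-out URLs with longer and longer timeouts and record the point at which they finally respond, if they ever do:

import requests

def time_to_respond(url, timeouts=(5, 15, 60)):
    # Try progressively longer timeouts; return the first one that succeeds,
    # or None if the URL never responds (or fails for some other reason)
    for t in timeouts:
        try:
            requests.get(url, timeout=t, verify=False)
            return t
        except requests.exceptions.Timeout:
            continue
        except requests.exceptions.RequestException:
            return None
    return None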

Mimic a real user — provide a user agent string and other browser-like HTTP headers to trick the site into thinking you are a human. You could go so far as to have your script spin up a headless browser to send requests. Sites behave differently for different users and different browsers, and especially for robots. Your conclusions as a robot are not valid if you are link checking for humans.
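
In its simplest form that just means sending the sort of headers a desktop browser would send. The values below are ones I’ve made up to look like a typical browser session, not anything the original script used:

import requests

# Header values imitating a typical desktop browser; purely illustrative
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}

def check(url):
    # Same request as before, but dressed up to look like it came from a browser
    return requests.get(url, headers=BROWSER_HEADERS, timeout=5, verify=False)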

Validate and clean your URLs — your dataset should be nice and clean. If you are accepting user input, validate it at the form. If you are bulk ingesting data from spreadsheets or other sources, pass it through some kind of validation step first.
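
A very lightweight check, using only the standard library, might look like this (the helper name and what counts as “checkable” are my own choices, and would depend on your data):

from urllib.parse import urlparse

def looks_like_a_checkable_url(candidate):
    # A URL we can usefully check needs at least an http(s) scheme and a host;
    # this catches the Missing Schema and misspelled-scheme cases up front
    parsed = urlparse((candidate or "").strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

clean_rows = [row for row in records if looks_like_a_checkable_url(row.get("url"))]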

What was great about this exercise — other than learning some interesting things — is that we took another step forward in having clean data in one of our client datasets, which can’t be bad!

Richard is Founder and Senior Partner at Cottage Labs, a software development consultancy specialising in all aspects of the data lifecycle. He’s occasionally on Twitter at @richard_d_jones
