How Much of a Problem are Broken Links in SE Research?
A while ago, I stumbled over this Academia.SE answer, which shows that papers in different academic fields suffer from link rot (that is, Web links that are no longer available). Although the numbers obviously differ between fields and individual studies, a dead-link rate of about 30% to 40% seems to be par for the course for older published scholarly articles. Given that I am a notorious hyperlinker in my own papers, this made me wonder how prevalent the issue is in software engineering research.
Collecting some data
First I needed to collect some data. To my surprise, there seems to be no real way to download complete conference proceedings, even with institutional access. I therefore decided to sample just a few papers from past editions of ICSE, arguably the most important conference in software engineering. I randomly sampled 40 full research papers from each of the 2017, 2014, 2011, and 2008 editions, and downloaded the PDFs from the ACM Digital Library to my computer.
Although this is a fairly small sample (160 papers overall), it was still clear that some automation would be necessary to figure out how many of these papers contain dead links. Browsing the relevant literature (e.g., this article) quickly made clear that reliably extracting and testing links from double-column PDFs isn’t as trivial as I had originally hoped. After some reading and testing, I built a small Ruby toolchain that:
- Converted each PDF to XML using pdftohtml (this got rid of the annoying-to-parse double-column layout of ICSE papers)
- Extracted all plain text from the resulting XML
- Searched the plain text for http and https URI patterns using the Ruby standard library facilities
- Sent a HEAD request to each URI, following redirects as necessary
- Recorded the resulting response code
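The extraction and checking steps above can be sketched along these lines. This is a minimal sketch, not the actual toolchain: the helper names and the redirect limit are my own, and error handling is reduced to recording the exception class.

```ruby
require "uri"
require "net/http"

# Pull http/https links out of the extracted plain text, using the
# URI helpers from Ruby's standard library.
def extract_links(text)
  URI.extract(text, %w[http https])
end

# Send a HEAD request and follow redirects up to a small limit,
# returning the final HTTP response code as a string.
def check_link(url, redirect_limit = 5)
  return "too many redirects" if redirect_limit.zero?

  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end
  if response.is_a?(Net::HTTPRedirection) && response["location"]
    check_link(URI.join(url, response["location"]).to_s, redirect_limit - 1)
  else
    response.code
  end
rescue StandardError => e
  e.class.name # non-HTTP failures, e.g. expired TLS certificates
end
```

Note that `URI.extract` is exactly the kind of single-line-oriented matcher that trips over links broken across lines in a PDF, which is one source of the false positives discussed below.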
This process got rid of some issues I initially had, but it still produced a large number of false positives, because multi-line links were not discovered correctly. Additionally, some papers contain links that aren’t actually supposed to be followed (e.g., example namespaces in XML sample listings). For simplicity, I decided to clean up these false positives manually as best I could (which is to say my data is certainly not free of errors, but I feel ok about the data quality).
So what about those links?
The figure below illustrates how many links I ended up finding in total. It’s obvious that link usage has taken off quite a bit since ICSE’08: in 2008, the average ICSE paper had fewer than two links, while in 2017 we are up to 6.25 hyperlinks per paper. You may think that this is mainly due to DOIs picking up steam, but only a very small part of my data set consists of DOIs (in green in the figure); most DOIs in papers don’t actually match the http[s] URI pattern I was searching for.
Incidentally, I noticed that in 2008 most hyperlinks were contained in the papers’ reference sections (green in the figure below). Nowadays, people are much happier to just link to stuff in footnotes in the main body of the text (in red). 2017 is missing here because I had some trouble automatically parsing this information out of the 2017 PDFs.
Dead or Alive?
This leads to the question of how many of those links are still alive. I filtered out the few DOI links mentioned above, which leads to the distribution below.
A status of OK means that, after transitively following redirects, my tooling at some point received an HTTP response code of 200. Not OK groups everything else: primarily 404 errors (broken links), but also 500 errors (server errors) and, in a few cases, 403 authorization issues. In some cases my tooling also quit due to expired TLS certificates and other issues that were not directly HTTP-related.
It’s interesting to observe that even this year’s papers already have about 5% of links that are not available. For links in 2008 papers this goes up to 25%. This is not nearly as bad as reported for some other disciplines, but still: a quarter of all links seems like a whole lot of unreachable data to me. Note that this analysis in no way checks for content drift, i.e., whether a page still contains the same information that the author referenced when writing the paper (read here for more information on this). Content drift is another can of worms, which I can’t tackle in this blog post.
What troubled me during this analysis was that the links that go stale seem to be predominantly the ones we are most likely to actually care about. For instance, while links to well-known tools (e.g., Java, Amazon Web Services, or GitHub) tend to be reliable, it’s the links to data sets, prototypes, additional analyses, and similar artefacts that seem to be disproportionately prone to becoming unavailable (presumably when the first author of the paper graduates).
To put a number on that, I classified the links in my data set into ones that look like they are hosted at a university (i.e., ones that contain “.ac.”, “.edu.”, or “uni”) and ones that are not. The result of this slicing can be seen below. The effect is noticeable, although not quite as bad as I feared.
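That slicing boils down to a substring check on the host. A sketch, with two caveats: I loosened “.edu.” to “.edu” so that hosts ending in .edu also match, and the exact patterns the toolchain used are an assumption on my part.

```ruby
require "uri"

# Rough heuristic for "university-hosted": the host contains ".ac.",
# ".edu", or "uni". Note that "uni" also catches false friends such
# as "unix.org", so this is only a coarse slicing.
def university_hosted?(url)
  host = URI.parse(url).host.to_s
  host.include?(".ac.") || host.include?(".edu") || host.include?("uni")
end
```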
To answer my initial question of how much of a problem broken links are in ICSE research: the situation seems to be better than what studies in other fields have reported, but there are still a whole lot of broken links in SE research. And, unfortunately, links to data and artefacts seem to be particularly likely to break. We should really get into the habit of using external artefact repositories, such as FigShare or Zenodo.
If you are interested in the data, or the tool chain that I used, this stuff is all available on GitHub (sans the paper PDFs, for obvious copyright reasons). If you have any more questions or comments, please either ask here or tweet at me at @xLeitix.