PDF might be a bit clunky, but here’s why we shouldn’t be taking it for granted

We all know very well what can happen when we try to access a blog post, article or other web content from several years ago. It’s a hit-and-miss situation: even if it is still up, the page may have moved, links may no longer work, and content hosted elsewhere, such as images or videos, might be missing. The old text will likely be formatted completely differently and surrounded by dynamic content, such as current ads and related articles, displayed out of context.

Fortunately, this does not happen very often in the scientific world, as academic articles are normally distributed as PDF files. As we all know, PDF’s main strength is that it preserves the appearance of the original document, not only across different operating systems and screen sizes, but also over several decades: it is backward compatible. But there is a further benefit, at least in the way PDF is normally used. PDF is tangible and self-contained: once you’ve got the PDF, you’ve got the whole article. And normally it stays unchanged.

Because of this, I would argue that PDF has been a blessing for the archivability of the scientific Web.

For several years now, members of the scientific community have been pushing to wean authors and publishers away from PDF and towards better-structured formats based on XML and HTML. The Force 11 (previously “Beyond the PDF”) series of conferences is an example of such an initiative. There are some very good reasons for this: the added versatility makes it easier to extract data from articles, add inline comments for peer review, and so on.

However, in defining the formats for future scholarly communication, we need to be careful what we wish for. The push for a more versatile, up-to-date format is legitimate, but in pursuing it we must not forgo the advantages PDF is currently giving us.