Scholarly publishing is stuck in 1999

I’ve spent the majority of my career to date working on products in the scholarly publishing and information industry. Lately I can’t shake the frustrating feeling that, technologically, it just isn’t evolving.

The digitisation of scholarly publishing from the late 1990s transformed the industry and the working lives of the world’s students and researchers. Scaling constraints on publishing, distribution and readership were swept away. An explosion of growth and consolidation followed, and the volume of research written and read soared ever upward.

Two decades of dizzying technological change have passed. But research publishing seems stuck with the tools it adopted when it first went online. The majority of content is authored as MS Office documents and submitted and reviewed via dated enterprise software systems which even the largest and best-resourced publishers have struggled to update. A micro-industry of typesetting vendors bashes the material into XML files based on complex proprietary DTDs. Distribution platforms spend small fortunes wrestling this into serviceable HTML, while many readers still regard the venerable PDF as a gold standard of permanence and portability.

This landscape has taken on an air of immutability: almost impossible to substantially change, and rarely even questioned. Yet consider: thirty years ago nobody had heard of a ‘PDF’. None of the great scientists and thinkers of the past ever wrote a paper in Microsoft Word, or had it typeset into XML. Processes now regarded almost as laws of nature date back no further than the millennium.

Innovators fiddle around the edges with recommenders or annotators or supplemental media, but the fundamental mechanics of writing, reviewing and reading are barely affected. I see little in my own career that wasn’t possible fifteen years ago: an endless exercise in developing and redeveloping the same things. It’s kind of depressing.

Reading a recent article in Nature about the ‘reproducibility’ of scientific experiments, I was struck by how basic some of the problems discussed seem. Not knowing how to raise a query or report an error in a published paper. Not being able to deal speedily with the issue. A lack of incentives and in some cases financial barriers to correction. Not being able to request or obtain the data that supports the findings. Surely this can’t be difficult to resolve?

Software development has an excellent and well-worn solution to these kinds of problems. Anyone can raise an issue with an open source software project hosted on GitHub or equivalent and suggest a fix. The project authors can discuss, accept or reject the issue in a completely public, auditable way. Changes can be made instantly, and all previous versions and the differences between them are preserved for all to see (I’m always bemused when it is suggested that science needs to flirt with the technological dark matter of blockchain to achieve that). Anyone can check out the project, run it themselves — and report an issue if the protocol (the ‘readme’) doesn’t actually work. There’s also an ecosystem of kudos, reputation and career reward built around contributions and public profiles. Why doesn’t research publishing look more like this?
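The versioning half of that workflow needs nothing more exotic than git itself. A minimal sketch, assuming git is installed; the one-file ‘paper’ repository, its contents and the author identity are all invented for illustration:

```shell
# Hypothetical example: a paper kept under plain git version control.
mkdir -p paper
git -C paper init -q
git -C paper config user.email "author@example.org"   # placeholder identity
git -C paper config user.name "Author"

printf 'Effect observed in 40 of 100 trials.\n' > paper/paper.md
git -C paper add paper.md
git -C paper commit -qm "v1: initial submission"

printf 'Effect observed in 38 of 100 trials (corrected).\n' > paper/paper.md
git -C paper commit -qam "v2: correct trial count"

git -C paper log --oneline      # every version, in order
git -C paper diff HEAD~1 HEAD   # the exact change between versions
```

Every revision, and the precise difference between any two of them, is kept automatically — no blockchain required.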

I’d suggest part of the reason is that the 1990s-era tech chain just doesn’t support it. Content which has to go on a round trip through Word documents, submission systems, typesetters and platforms isn’t easy to change, version or discuss. To substantially improve things, perhaps we should do what’s become oddly unthinkable: ditch it all and start again.

For example, those readme files on GitHub are written in Markdown, a simple text-based format which could be learned by an eight-year-old, and which is fairly easy to extend with more specialised vocabularies for things like maths and citations and code. Open source tools like Pandoc can convert this into clean, semantically pure HTML in an instant — or into a PDF, an ePUB, or virtually any other open format. There’s little need to use XML schemas to capture even complex text, and thus little need for expensive proprietary software to render it readable.
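To make that concrete, here is a sketch of what a fragment of a paper might look like in extended Markdown. The maths (`$…$`) and citation (`@key`) syntax shown is Pandoc’s; the title, reference key and URL are invented:

````markdown
# The Effect of X on Y

Our model assumes a linear relationship, $y = \beta x + \epsilon$,
following the approach of @smith2019.

```python
# analysis code can sit inline as a fenced block
fit = regress(x, y)
```

Raw measurements are in the [data repository](https://example.org/data).
````

From this single source file, `pandoc paper.md -o paper.html` produces HTML; swapping the output extension for `.epub` or `.pdf` produces those formats instead, with Pandoc inferring the target from the filename.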

If I were to start a fantasy publishing company this is the approach I would take. Simple modern standards, stitched together with well-supported open source software. No MS Word, no typesetters, no complex proprietary formats or display platforms. Systems for submission and review would look a lot like GitHub. Online publication as HTML, ePUB or PDF would happen at the robotic push of a button. Post-publication review or discussion or correction would come free out of the box. It would all be a lot faster, and a lot cheaper. Maybe I’d start with book publishing, which has less ceremony and more ability to proceed with a coalition of the willing. Increasingly, I wonder why someone doesn’t.