A Crisis in Discoverability and how we can move towards fixing it

Ashok Giri
PageMajik
Jul 13, 2018

Because no single central repository collects information about scholarly papers from every discipline, it is hard to estimate exactly how many journals and papers are published each year. A conservative estimate comes from Lutz Bornmann and Ruediger Mutz, whose 2014 paper “Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references” tracks all material (papers, books, datasets, and even websites) cited between 1980 and 2012. Plotting the data, they found that scientific output grows by 8–9% every year, which means total output doubles roughly every nine years. (The dip in recent years can plausibly be chalked up to more recent papers simply not having had enough time to be cited.)

Growth of the annual number of cited references from 1650 to 2012 (Bornmann and Mutz, 2014)

Admittedly, this is an imperfect measure because it ignores all those sources that were never cited, as well as those simply no longer cited. Still, there is at least a prima facie case that there is a dramatic increase in the amount of research currently created.
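As a quick sanity check on the doubling claim, here is a minimal sketch of the arithmetic. It assumes constant compound growth, which is my simplification rather than anything the paper prescribes, and the function name `doubling_time` is purely illustrative.

```python
import math


def doubling_time(annual_growth_rate: float) -> float:
    """Years for output to double at a constant compound annual growth rate.

    Solves (1 + r) ** t == 2 for t, i.e. t = ln(2) / ln(1 + r).
    """
    return math.log(2) / math.log(1 + annual_growth_rate)


# The 8-9% range quoted from Bornmann and Mutz.
for rate in (0.08, 0.09):
    print(f"{rate:.0%} growth -> doubles in {doubling_time(rate):.1f} years")
```

At 8% this gives about 9.0 years and at 9% about 8.0 years, so “every nine years” corresponds to the lower end of the quoted range.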

And even this might be understating the actual amount of potentially valuable work produced. One academic estimates that 10,000 papers get written within his discipline every year, competing for around 2,000 spaces. Those whose papers are rejected don’t just give up, but keep trying to publish in other reputable sources, leading to a backlog which pushes rejection rates up to 94%. Since it seems quite plausible that a substantial chunk of those unpublished papers might actually be valuable and only missed out because of a lack of space, he advocates for “creating a lot more journal space (maybe 3 times as much as we have now) for the additional papers to be published”.

Nor does this take into account the Open Access movement and the trend of sharing results directly on social media and the web, where the lack of traditional gatekeepers will almost certainly increase how much content gets produced.

What these discussions mean for publishers is that there is going to be an increasing need for efficiently sifting through large quantities of research output, because if relevant work can be located, then it is immaterial how much more unrelated material is added. In other words, discoverability is going to become an increasingly pressing issue.

I speculate that two kinds of tech changes will be necessary if we are going to deal with this issue. The first is an increasingly fine-grained tagging of content that will permit researchers to conduct incredibly precise searches for the topic they’re interested in. This might mean, for example, that instead of settling for a handful of keywords along with the title and author information, books will have to offer chapter-level tagging, so that the metadata is both richer and more precise.
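To make the idea of chapter-level tagging concrete, here is a rough sketch of what such a record and a search over it might look like. The class names, field names, and the sample book are all hypothetical, invented for illustration; this is not an existing metadata standard or any publisher's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Chapter:
    title: str
    tags: set[str] = field(default_factory=set)  # chapter-level tags


@dataclass
class Book:
    title: str
    author: str
    keywords: set[str] = field(default_factory=set)  # traditional book-level keywords
    chapters: list[Chapter] = field(default_factory=list)


def find_chapters(books, required_tags):
    """Return (book title, chapter title) pairs whose tags include every required tag."""
    wanted = {t.lower() for t in required_tags}
    return [
        (book.title, chapter.title)
        for book in books
        for chapter in book.chapters
        if wanted <= {t.lower() for t in chapter.tags}
    ]


# Made-up sample data: the book-level keywords never mention coral at all.
library = [
    Book(
        title="Oceans in Transition",
        author="A. Example",
        keywords={"oceanography", "climate"},
        chapters=[
            Chapter("Currents and Heat", {"thermohaline circulation", "heat transport"}),
            Chapter("Reefs Under Stress", {"coral bleaching", "heat stress"}),
        ],
    )
]

print(find_chapters(library, {"coral bleaching"}))
```

A search on the book-level keywords alone would miss the reef chapter entirely; the chapter-level tags are what make it findable.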

But as the metadata requirements get more demanding, the traditional manual generation of relevant metadata will become increasingly onerous. This will call for machine learning approaches that rapidly scan content and generate the relevant kinds of metadata, which can then simply be approved by a human counterpart. This isn’t going to be a simple requirement, because different kinds of data (photos, paragraphs, etc.) will call for quite different technical approaches, with some involving the clever manipulation of language rules, and others looking to image identification techniques. And different academic fields might require very different metadata, which means tech will have to pay close attention to the variety of demands instead of simply producing a generic, high-level solution.
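The suggest-then-approve workflow can be sketched in a few lines. To keep the example self-contained, I have swapped in a plain word-frequency count as a stand-in for a real machine learning model, so treat the scoring as a placeholder; the names `TagSuggestion`, `suggest_tags`, and `approve`, the stopword list, and the sample text are all hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Tiny illustrative stopword list; a real system would need a far larger one.
STOPWORDS = {"when", "that", "this", "with", "from", "have", "which", "their"}


@dataclass
class TagSuggestion:
    tag: str
    count: int
    approved: bool = False  # only a human reviewer flips this


def suggest_tags(text, limit=3):
    """Suggest the most frequent non-stopword terms longer than three letters."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 3 and w not in STOPWORDS)
    return [TagSuggestion(tag, n) for tag, n in counts.most_common(limit)]


def approve(suggestions, accepted):
    """Record the reviewer's decision and return the approved tags in order."""
    for s in suggestions:
        s.approved = s.tag in accepted
    return [s.tag for s in suggestions if s.approved]


paragraph = (
    "Coral reefs bleach when oceans warm. "
    "Warm oceans stress coral, and stressed coral expels algae."
)
suggestions = suggest_tags(paragraph)
print([s.tag for s in suggestions])          # machine-generated candidates
print(approve(suggestions, {"coral", "oceans"}))  # what the human signs off on
```

The point of the design is that nothing becomes metadata until a person has accepted it. A field-specific model (or an image classifier for photos) could replace `suggest_tags` without changing the approval step.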

The increase in scholarly output might seem intimidating, but I prefer to look at it more optimistically since it suggests that we have the good fortune to be living in a time where we are producing more knowledge than we know how to handle. With some clever technical fixes, we should be able to harness this increase in productivity across the board, and effortlessly navigate through these changing times.
