Handing Science Over to the Machines

And other consequences of rewarding quantity, not quality

Mark Humphries
The Spike

--

Credit: Pixabay/holdentrils

Name me the exceptional things you do once every five days. Not the basics of your ablutions, your plumbing, and stuffing your cake-hole with, um, cake. Not the daily cycle of home-work-home-TV. But the extra-ordinary things, the things that take that little extra effort. Watching a film at the cinema; playing sport; entertaining friends? Bathing the elephant; crocheting a teddy bear; continuing construction of your Lego cerebellum granule cell layer?

How about publishing a scientific paper? A scientific paper is a singular work of deep scholarship, analysis, or experiment; publishing one is an exceptional thing by anyone’s standards. A recent report in Nature has revealed a select group of scientists who published one paper every five days for at least a year. One paper every five days.

That means each of these scientists was listed among the authors on at least 72 papers in a single calendar year. And these were not papers with the thousand-plus author lists of particle physics, not the output of those gigantic teams of atom-smashers at CERN. These were papers about research into the effects of genes within large populations, or tracking human health over our lifespan, or neurology. Research that some would consider everyday, run-of-the-mill science.

There are many possible reactions to this hyper-productive cadre. Some would feel the sting of inadequacy, contemplating how they sweated and bled on the torturous path to build one paper, and wondering how they took such a long and winding road when there exist scientists who can apparently produce papers effortlessly. Others, making the same comparison with their own experience, would instead feel awe, placing such evidently extraordinary scientific talents on a pedestal.

The sensible reaction is to find it absurd. At that prodigious rate, the scientists who authored more than 72 papers a year are unlikely even to have read them all. Let alone to have contributed directly to the work inside, writing a few words or providing some key idea. Let alone to have done some meaningful science in their construction.

At this point, you may be permitted to think “what the fork!?” The mere existence of papers whose own authors haven’t read them is a big red flag; and the exponential increase in scientific output shows no signs of slowing down. Rough estimates claim somewhere between one and two million scientific papers are published every year. Take the upper estimate, and that’s 5479 every single day.

Which raises the question: who are these papers for? Who is that paper-every-five-days output for, if even its own authors have not read it?

Possibly, no one.

There is a nihilistic version of this answer, which runs as follows. Papers exist as the productive unit of science. Measuring science by its papers reduces it to a time-and-motion study of counting the number of papers and proclaiming brilliance, to the weigh-in of a fishing competition where the heaviest stack of papers is proclaimed the winner. Potted biographies of famous scientists routinely say “Professor X has authored over 300 scientific articles”. And you wonder: really? (And also: how did he fit all that in with Magneto wiping his hard drive every five minutes?). The counting is what counts. Not the content, not the ideas, just the mere existence.

The less nihilistic version of “no one” is that papers need not have an audience. Indeed, reading many of them, it becomes plain that the authors forget someone else is supposed to read these things, as they dive into their personal patois, piling noun upon noun upon noun until your brain is begging for an adverb, an adjective, hell even a pronoun would do.

Rather, papers need not have an audience right now, but one day. For there are two types of scientific paper. There are those that give a shove to knowledge: they prove something right (we should all go in this direction); they prove something wrong (stop going in this direction); or they tell of a new method everyone should know about and use immediately. The kind of papers that make the mainstream news. (More accurately, papers dressed up to look like these kinds of shoves to knowledge make the news). These are the papers meant to be read right now. But they are few in number compared to the total outpouring of papers.

Most papers are archival knowledge. Scientists find something out, make a record of it, and publish it as a paper. Then future scientists can build on it, use it, when there is a reason to. When it becomes apparent, for example, that knowing how a tiny piece of bacterial DNA fends off a virus is the basis for a technique that edits genes in literally anything’s DNA (CRISPR there).

The archival knowledge idea might be less nihilistic, but it just pushes the problem into the future. Yes, we’d like to know what we already know; we’d like to not repeat the same ideas and experiments. But there is too much archival knowledge to imbibe. It’s like asking for a shot of single-malt whiskey, and getting a firehose of Jack Daniels in your face.

Take dopamine. I like dopamine, and I’d like to know more about it. PubMed tells me that last year alone, in 2017, there were 4586 papers published on dopamine. If I narrow that down to specifically dopamine and reward? 495 papers last year. And apparently five papers on the topic have already been published in the future, in 2019 (ah, academic publishing there, still so wedded to pre-digital tradition that papers which already exist have to be assigned to a monthly “issue” of the journal in which they will formally appear. Despite that journal not having published on dead-tree pulp for a decade. Or ever.)
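(If you fancy checking a number like that yourself, here is a minimal sketch that asks NCBI’s public E-utilities service, the machinery behind PubMed search, for a hit count. The query and year are just my example above, and the count you get back will have drifted as PubMed’s index has grown.)

```python
# Count PubMed papers matching a query in a given year, via the
# NCBI E-utilities "esearch" endpoint. Illustrative sketch only.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

params = {
    "db": "pubmed",
    "term": "dopamine AND reward",
    "datetype": "pdat",   # filter on publication date
    "mindate": "2017",
    "maxdate": "2017",
    "retmode": "json",
    "retmax": 0,          # we only want the total count, not the article IDs
}

response = requests.get(ESEARCH, params=params, timeout=30)
count = int(response.json()["esearchresult"]["count"])
print(f"PubMed papers on 'dopamine AND reward' in 2017: {count}")
```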

I don’t know about you, but the prospect of reading 495 papers on dopamine and reward does not fill my heart with joy. Mainly a dull ache. And explicit tests of dopamine’s role in reward are just a tiny part of that story. Other things we’d like to know include how the neuron itself behaves; what its inputs are; where its outputs go; its responses to things that aren’t reward; its role in Parkinson’s disease; etcetera, etcetera. The archival knowledge about dopamine is growing faster than anyone could possibly comprehend.

Also I’d like to read something that isn’t about dopamine. About how reward is encoded in prefrontal and orbitofrontal cortex, perhaps. About habit learning. About reinforcement learning. About cross-validated principal components analysis, because I’m just that cool. About Poppy, her hamster Wilberforce, and their starry-eyed misadventures in DoofusVille avoiding the clutches of Reginald the Malingerer and his horde of ruffians.

Neuroscience has no privileged position in The Flood of papers. In his biography of the insanely prolific mathematician Paul Erdos, Paul Hoffman calculated that close to a quarter of a million mathematical theorems were being published every year. Scientists have nothing on the productivity of mathematicians; but we’re drowning all the same.

Contemplating The Flood leads us to some interesting conclusions.

No one person can link together the existing web of literature about a single topic. And that web may never stop growing. It’s at the point where a single author of 72 papers a year cannot even link together their own papers.

Which means there is no single person who will know “the answer” to a given complex problem. So “the answer” can come in one of two guises.

The first is a computational model of the problem. We see this epitomised in climate research. Huge computer models of the Earth’s climate synthesise an extraordinary amount of data, and bring together a vast array of individual bits of knowledge — about cycles of glacial melt and run-off, of reflections and trapping of sunlight, of feedback between carbon dioxide, temperature, and foliage, to name but a few. The model itself becomes the culmination of a vast research enterprise.

Running those models gives us answers to complex problems: they might answer the questions of how temperature depends on carbon dioxide, of where those temperature changes will hit hardest and first, and of what changes to the Earth’s environment will stop those temperature changes from careening out of control. And building these models tells us what we don’t know, where we need to fill in the gaps to make the models better, leaner, smarter.

The second guise is an AI. Groan. But the idea that an AI will someday “know” the answer should come as no surprise. After all, we already use machine-learning extensively to make sense of data-sets that are too big for one person to comb through. And what bigger data-set is there than the collective scientific knowledge of humanity? (Answer: Lego’s database of all possible permutations of small plastic bricks that make money).

It’s already happening. Do you not use Google or PubMed to search the literature? People are already using machine-learning to do systematic literature searches (Iris.ai, Semantic Scholar), to find links between research findings, and show them in a comprehensible form. These are fancy classifiers, learning to group together published work and data-sets by key words. Developments in language processing are beginning to let the machines link findings together to suggest hypotheses, like linking gene expression changes to mental disorders. The next step after that will be to have the machines write the literature reviews, synthesising existing knowledge into a form we mere mortals can understand, and pointing out to us what we don’t know.
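(How modest can that machinery be? Here is a toy sketch of the “group papers by their keywords” idea: TF-IDF vectors and k-means clustering over a handful of invented abstracts. This is not how Iris.ai or Semantic Scholar work under the hood, it just shows the flavour of the thing.)

```python
# Toy sketch: cluster paper abstracts by their vocabulary.
# The abstracts are invented; a real system would ingest millions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

abstracts = [
    "Dopamine neurons signal reward prediction errors in the striatum.",
    "Reward prediction error coding by midbrain dopamine cells.",
    "CRISPR-Cas9 enables precise editing of genomic DNA sequences.",
    "Guide RNA design improves the efficiency of CRISPR gene editing.",
]

# Turn each abstract into a weighted bag of words.
vectoriser = TfidfVectorizer(stop_words="english")
X = vectoriser.fit_transform(abstracts)

# Group abstracts that share vocabulary into two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for label, abstract in zip(labels, abstracts):
    print(label, abstract)
```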

Building a machine to tell us what we don’t know is exactly what Jessica and Bradley Voytek did. Scraping 3.5 million abstracts from PubMed, and linking them by key-words for brain regions, disorders, and cognitive functions, they built a model of neuroscientific knowledge. This model naturally has a hierarchy: “cortex”, “thalamus”, and “striatum” are all children of “brain”, for example. Which opened up a simple but effective hypothesis generator: find two concepts that share a parent, but have not been linked together in the existing literature. That pair of concepts are then candidates for linking together. (One can imagine generalising this even to a flat web of links, by seeking a pair of concepts that are each strongly associated with a third concept, but not (yet) strongly associated with each other.) Here is a dumb machine that already gives answers no human could possibly find on their own.
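(To give a flavour of how simple such a generator can be, here is a minimal sketch of that flat-web variant. It is not the Voyteks’ actual code; the terms, co-occurrence counts, and thresholds are all invented for illustration.)

```python
# Hypothesis generator over a toy co-occurrence web: flag pairs of terms
# that are each strongly linked to a shared third term, but barely
# linked to each other. All numbers below are made up.
from itertools import combinations

cooccurrence = {
    ("striatum", "dopamine"): 900,
    ("dopamine", "migraine"): 300,
    ("striatum", "migraine"): 2,
    ("serotonin", "migraine"): 800,
    ("striatum", "serotonin"): 150,
}

def count(a, b):
    """Co-occurrence count for an unordered pair of terms."""
    return cooccurrence.get((a, b), cooccurrence.get((b, a), 0))

terms = sorted({t for pair in cooccurrence for t in pair})

STRONG, WEAK = 100, 10   # arbitrary thresholds for "linked" and "not yet linked"
for a, b in combinations(terms, 2):
    if count(a, b) >= WEAK:
        continue   # already studied together, not a new hypothesis
    shared = [c for c in terms if c not in (a, b)
              and count(a, c) >= STRONG and count(b, c) >= STRONG]
    if shared:
        print(f"Candidate hypothesis: {a} <-> {b} (linked via {', '.join(shared)})")
```

Dumb as it is, run on a real co-occurrence web this kind of search surfaces pairings no single reader could spot across millions of abstracts.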

And as The Flood grows, then even in narrow disciplines the machines will be the only ones that know all the links, the only ones that can put together the big picture. So even if we never develop a true AI that can by itself infer new hypotheses and create new ideas, even if that sci-fi scientist is off the table, we will still become dependent on dumb AI for the “answers”.

If you don’t like these far-flung futures, the paper-every-five-days people still teach us a couple of valuable lessons for the here and now. For one, complaining about your work not being cited is churlish. Literatures are vast, people miss stuff, give them a break. Don’t complain, collaborate: point others to work they may have missed, and explain why it is relevant and how it might inform their future work.

Ultimately they teach us that much of what we publish is worthless. I think we can safely say that any paper whose own authors haven’t read it is unlikely to contain ground-breaking information. The lesson to us all is that we should learn to ask: who am I publishing this paper for?

Want more? Follow us at The Spike

Twitter: @markdhumphries
