Detecting Novel Research Abstracts With Natural Language Processing

Michael Burnam-Fink
Published in MBF-data-science · Dec 28, 2018

Scientometrics, evaluating scientists and scientific groups by their publications, is how I got into data science. Many of my projects have used the Stirling Diversity Index (SDI) as a proxy for interdisciplinarity, which is accepted best practice in the field. The SDI measures how the citations of a publication, or of a corpus of publications, are distributed across a set of categories. It’s similar to the Gini coefficient times a distance measure, so that a paper combining art history and mechanical engineering would have a higher SDI than a paper combining biochemistry and molecular biology.
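To make that concrete, here is a minimal sketch of the Rao-Stirling calculation. This is my own illustration rather than code from any of these projects; p is assumed to hold the share of a paper’s citations in each category, and d a matrix of pairwise distances between categories.

import numpy as np

def stirling_diversity(p, d):
    # Rao-Stirling diversity: sum over distinct category pairs of p_i * p_j * d_ij
    # p: citation shares per category (sums to 1); d: pairwise category distances
    p = np.asarray(p, dtype=float)
    d = np.asarray(d, dtype=float)
    all_pairs = p @ d @ p                      # sums over every (i, j) pair
    self_pairs = np.sum(np.diag(d) * p ** 2)   # remove the i == j contributions
    return float(all_pairs - self_pairs)

Citations spread across distant categories push the score up; citations concentrated in neighboring categories keep it low.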

Most applications of the SDI take advantage of the longitudinal stability of Web of Science journal classifications, where every journal is assigned to one to three of 224 categories by the Web of Science team, as well as cosine similarity-based distances prepared by Rafols, Porter, and Leydesdorff. These are widely used measures, but I have a few concerns:

  1. Web of Science categories were designed for librarians, not researchers. They appear valid, but they’re still proprietary, and they are being used in a way they were not intended to be.
  2. There are many reasons to make a citation: connecting to prior work, staking out a theoretical position, refuting an opposing view, or because Reviewer #2 told you to. The SDI conflates all of these reasons. At best, we can say that the authors cited a specific work because they thought it was important to do so.
  3. The Stirling Diversity Index has not, as far as I can tell, been validated against expert judgement of the interdisciplinarity of articles. This would be a great research project for the future, and could be easily parceled out to people who are already reading journal articles regularly.

In short, while the Stirling Diversity Index is the accepted standard, it’s not absolutely bulletproof. Along with citations, a second data source for scientometric analysis is article abstracts. Abstracts are the researchers’ own description of their work, and they are more detailed and closer to the ground truth of what a paper is actually about than its citations are.

I acquired over 70,000 article abstracts and citation records from the past decade, published in the top 40 journals by journal impact factor, stored them in a NoSQL database, and analyzed them using Latent Dirichlet Allocation (LDA) with Gensim. The statistical assumption behind LDA is that each document is drawn from some probability distribution across topics, and that each topic is itself a probability distribution across words.

[Figures: an example LDA breakdown of a document into topics, and then of topics into words]

Of course, this being Python, libraries do all the heavy lifting. My topics had pretty good face validity: they all seemed to be scientific, and different kinds of science. But face validity is not good enough.
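For a sense of what that heavy lifting looks like, here is a rough Gensim sketch of the pipeline, simplified and with hypothetical variable names rather than the project’s actual code; abstracts is assumed to be a list of raw abstract strings.

from gensim import corpora, models
from gensim.utils import simple_preprocess

# Tokenize and lowercase each abstract (stop-word removal and stemming omitted)
texts = [simple_preprocess(abstract) for abstract in abstracts]

# Map each token to a numeric id, then encode each abstract as a bag of words
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit the topic model; the number of topics is a modeling choice
lda = models.LdaModel(corpus, num_topics=50, id2word=dictionary, passes=10)

# Topic distribution for one abstract, as (topic_id, probability) pairs
print(lda.get_document_topics(corpus[0]))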

Taking the journal that an article was published in as the target, and the LDA topic distribution as features, I ran my dataset through a random forest classifier and was able to associate articles with the right journal with 96% accuracy. Even better, examining the confusion matrix showed that the most misclassified journals were Cell, Nature, and Science, major interdisciplinary publications which have no single topic. My model is not only accurate, it makes the same mistakes that humans would!
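For the record, here is roughly what that step looks like in scikit-learn; a sketch rather than the project’s code, where X stands for the per-article topic distributions and y for the journal each article appeared in.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# X: one row of LDA topic probabilities per article; y: journal of publication
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # shows which journals get mistaken for one another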

Then came the frustrating part. I compared the topic distributions to the SDI in various ways, and got no result. No single topic, or use of multiple topics in a journal article, correlated with the Stirling Diversity Index in any comprehensible way.

I generally like my correlations to be something other than zero.

The single most important skill for data scientists is being able to think rigorously about what you’re really measuring.

Mathematically, a Dirichlet distribution is a higher-dimensional generalization of the beta distribution, and each word is stemmed/lemmatized to a numeric identifier token. Behind the scenes, ‘fertilization’, ‘fertilized’, and ‘fertile’ might all be encoded as ‘fertil*’, which is Token 438652. And Token 438652 has some probability of being in a given abstract, equal to the sum of its appearance in the topics that compose the abstract. What the LDA model does is translate a sample of natural text into a probability distribution in a tokenized vector space of scientific language. And probability distributions can be compared using the Jensen-Shannon Divergence.

The math is one of those ugly sum-of-logs things, but I like the Python code to calculate it. If P and Q are vector probability distributions, the Jensen-Shannon Divergence is the mean of the relative entropy between P and the mean of P and Q, and the relative entropy between Q and the mean of P and Q.
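For reference, that sentence as a formula, with D_KL the Kullback-Leibler divergence (relative entropy) and M the midpoint of P and Q:

\[
\mathrm{JSD}(P \parallel Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \parallel M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \parallel M),
\qquad M = \tfrac{1}{2}(P + Q)
\]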

from numpy.linalg import norm
from scipy.stats import entropy

def JSD(P, Q):
    _P = P / norm(P, ord=1)   # normalize P to a probability distribution
    _Q = Q / norm(Q, ord=1)   # normalize Q to a probability distribution
    _M = 0.5 * (_P + _Q)      # midpoint distribution
    return 0.5 * (entropy(_P, _M) + entropy(_Q, _M))   # mean KL divergence to the midpoint

In plain language, what we have is a number between 0 and 1 that describes how much an abstract differs from random jargon in its field, where higher numbers indicate a greater difference.

This metric describes how novel an article abstract is, relative to the field as a whole.
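As a sketch of how such a score could be computed (my reconstruction, not the project’s exact code), reusing the JSD function and the lda model and corpus from the earlier sketches, and assuming the reference point is the field-wide average topic distribution:

import numpy as np

def topic_vector(lda, bow):
    # Dense vector of topic probabilities for one bag-of-words document
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

doc_topics = np.array([topic_vector(lda, bow) for bow in corpus])
field_average = doc_topics.mean(axis=0)  # the 'typical' topic mix for the field

# Novelty: divergence of each abstract's topic mix from the field average
novelty = np.array([JSD(doc, field_average) for doc in doc_topics])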

Novelty Score Distribution

The distribution of novelty scores looks like a beta distribution, which may be a result of their generation from Dirichlet distributions. I’m not sure; the statistical proofs are rather involved.
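One way to eyeball that claim is to fit a beta distribution to the scores with SciPy; a sanity check rather than a proof, using the novelty array from the sketch above.

from scipy import stats

# Fit a beta distribution with its support pinned to [0, 1]
a, b, loc, scale = stats.beta.fit(novelty, floc=0, fscale=1)

# Kolmogorov-Smirnov test of the data against the fitted distribution
print(stats.kstest(novelty, 'beta', args=(a, b, loc, scale)))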

We can, however, compare novelty to citation velocity (number of citations divided by years since publication, a common measure of research hotness), and that gives us an interesting result. Citation velocity is log-normal, which is an expected result, since citations show preferential attachment patterns.
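Citation velocity itself is a simple ratio. Here is a pandas sketch with hypothetical column names, assuming df is a DataFrame with one row per article, ‘citations’ and ‘year’ columns, and data running through 2018, with the novelty array from above in the same article order.

import numpy as np

years_since = (2018 - df['year']).clip(lower=1)             # avoid dividing by zero for 2018 papers
df['citation_velocity'] = df['citations'] / years_since
df['log_velocity'] = np.log10(df['citation_velocity'] + 1)  # tame the log-normal skew

df['novelty'] = novelty   # novelty scores computed earlier
print(df[['novelty', 'log_velocity']].corr())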

Plotting novelty against log(citation velocity) shows a slight but clear negative trend. And this makes sense!

First, science is slightly conservative. Most publications are the result of what Thomas Kuhn, in The Structure of Scientific Revolutions, deemed “normal science”: incremental problem solving within established paradigms. Successful science, defined by what is published in the top journals and cited by other scientists, is likely to use similar terminology to what has come before. Wholly new jargon is like seed cast on hard ground.

Second, novel science that is also successful science, like CRISPR gene editing, changes the terminology of its field. Since the LDA model is trained on the past ten years of science as seen in November 2018, major results appear ‘less novel’ because the model doesn’t understand the causal direction of how scientific jargon changes: the discovery came first, and the field’s vocabulary followed it. A new discovery with very limited take-up would register as more novel in the model. Correcting this artifact would require training iterative models for each year, which is a reasonable next step for this project.
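A rough sketch of what that per-year training could look like, reusing the dictionary and corpus from the Gensim sketch above, and assuming a years list giving each article’s publication year in corpus order (with the corpus covering roughly 2009 to 2018):

from gensim import models

# One model per year, trained only on abstracts published before that year,
# so each abstract is scored against the vocabulary of its own past
models_by_year = {}
for year in range(2009, 2019):
    past = [bow for bow, y in zip(corpus, years) if y < year]
    models_by_year[year] = models.LdaModel(past, num_topics=50, id2word=dictionary, passes=10)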

And third, a fraction of highly cited works are major review papers, efforts by leaders in a field to sum up what is known. For obvious reasons, review papers do not invent new terminology.

Tableau Visualizations

I’ve created some interactive visualizations using Tableau, where you can browse articles by novelty score and citation velocity, and compare how different journals break down by topic. The code for this project is also up on GitHub.

Data science has useful applications in scientometrics. That said, the Stirling Diversity Index is more parsimonious than any LDA model, and I haven’t yet fully delineated where the novelty score might best be used.

Lego Grad Student doesn’t know which journal would be the right fit for his paper.

That said, I think the underlying technology could be useful in the preliminary steps of grant review, or helping decide which of the thousands of journals out there actually publishes stuff like your most recent project. And we’ve shown with data science that science should be novel, but not too novel.

