Enabling editors through machine learning

Examining the data science behind Meta Bibliometric Intelligence

Meta
9 min read · Dec 9, 2016

Executive Summary

Every year, millions of manuscripts are submitted to tens of thousands of journals worldwide. On the front lines of this surge in global research output, editors are under constant pressure to make quick and critical decisions about the manuscripts they are tasked to review.

Some manuscripts get rejected immediately — usually if they are not aligned with the journal or publisher’s core focus. For the rest, an often lengthy process begins in which the manuscripts undergo multiple rounds of reviews and corrections[1]. Months into the process, many still get rejected, forcing the cycle to start again at a different publishing venue.

Outside of scholarly publishing, new developments in machine learning and artificial intelligence are changing the world in which we live. Siri and Google Assistant have transformed our daily interactions with our personal electronic devices. Deep Blue, Watson, and AlphaGo have effectively demonstrated the ability of computers to make smart decisions in a gamified environment[2]. And intelligent recommendation engines are serving us our favourite books, movies, and songs before we even know they exist[3].

Unfortunately, many of these advancements have not been applied to scientific publishing. As a result, editors continue to invest time and energy into work that does not require their seasoned judgement, stealing away their valuable time from their core responsibilities and their own research activities.

This article examines how Meta Bibliometric Intelligence provides quantitative tools that complement the qualitative expertise that editors bring to their tasks. By alleviating bottlenecks, Bibliometric Intelligence allows editors to once again focus on the critical work that only they can do.

Answering three core questions with Bibliometric Intelligence

In order to help streamline the publishing process, Bibliometric Intelligence helps editors quickly answer three core questions for each manuscript they receive:

1) Is my journal an appropriate venue for this manuscript?

In a detailed report generated automatically by the system (see Appendix A), Meta provides a summary of the scientific concepts discussed in the paper, as well as a journal matching score. In the case of a publisher with multiple journals, Meta can further expand on this score to rank the publisher’s journals in order of relevance.

2) What is the potential impact of this paper?

The most common metric for measuring a manuscript’s importance, validity, and impact is its citation count. However, there are limitations to this metric[4]. In recent years, alternative metrics have surfaced, including Relative Citation Ratio[5] and Eigenfactor®[6] (which is currently used at Meta). Meta’s system, described below, takes a holistic look at the manuscript’s text along with its associated metadata to estimate the manuscript’s impact three years post-publication.

3) To whom should I send this paper for review?

The generated report can include a list of suggested reviewers for the editor to contact. Based on the data currently available, Meta suggests only reviewers who have published papers relevant to the subject matter and who do not appear to be associated with the manuscript’s authors. This is an iterative process that Meta will continue to refine as it works with its partners.

By providing answers to these questions, Meta Bibliometric Intelligence can empower editors to quickly make informed decisions. However, it is important to note that the system does not evaluate the quality of the science or the conclusions drawn within the manuscripts. Without human editors and the peer review process, the Sokal scandal would be relived again and again[7].

Predicting manuscript impact

Meta predicts a manuscript’s three-year impact based on a combination of features derived from the manuscript, as well as metadata extracted from Meta’s scientific knowledge graph. Meta extracts 201 metadata-based features, including information about the authors’ past papers and their impact, citations, and institutions, as well as deeper concepts like diagnostic procedures, medical devices, and regulatory activities. These features are concatenated with 200 text-based features representing the topical distribution of the manuscript.

While journal placement can strongly influence the number of citations a manuscript will accrue, the manuscript is evaluated solely on its own merits. Therefore, no journal information (such as journal impact or publisher) is included in the features during testing or evaluation.

The complete set of features is fed into a deep neural network that jointly predicts the paper’s Eigenfactor, its citation count, and whether it will be a top paper, all measured three years post-publication.
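
As a rough illustration of this kind of architecture, the sketch below builds a small multi-task network over the concatenated 401-dimensional feature vector, assuming the 201 metadata features and 200 topic features have already been extracted. The layer sizes, loss weights, and choice of Keras are illustrative placeholders, not a description of Meta’s production model.

```python
# Illustrative multi-task network: a shared trunk over the concatenated
# 401-dimensional feature vector, with three heads for Eigenfactor,
# citation count, and top-paper classification. Sizes are placeholders.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_METADATA, N_TOPICS = 201, 200

inputs = keras.Input(shape=(N_METADATA + N_TOPICS,), name="manuscript_features")
x = layers.Dense(256, activation="relu")(inputs)
x = layers.Dropout(0.3)(x)
x = layers.Dense(128, activation="relu")(x)

eigenfactor = layers.Dense(1, name="eigenfactor")(x)                     # regression head
citations = layers.Dense(1, name="citation_count")(x)                    # regression head
top_paper = layers.Dense(1, activation="sigmoid", name="top_paper")(x)   # classification head

model = keras.Model(inputs, [eigenfactor, citations, top_paper])
model.compile(
    optimizer="adam",
    loss={"eigenfactor": "mse",
          "citation_count": "mse",
          "top_paper": "binary_crossentropy"},
    loss_weights={"eigenfactor": 1.0, "citation_count": 1.0, "top_paper": 1.0},
)

# Toy run with random data; in practice X comes from the knowledge graph and
# manuscript text, and the targets from three-year post-publication outcomes.
X = np.random.rand(32, N_METADATA + N_TOPICS)
model.fit(X,
          {"eigenfactor": np.random.rand(32),
           "citation_count": np.random.poisson(7, 32).astype(float),
           "top_paper": np.random.randint(0, 2, 32).astype(float)},
          epochs=1, verbose=0)
```

Predicting the three targets jointly is a standard multi-task setup: the related outcomes share a common representation in the trunk, which can help each individual head generalise.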

Figure 1: Bibliometric Intelligence model overview.

Training the system

Training and validation of the model were carried out on papers published in 2011, and the results presented here are for papers published in 2012. Evaluating on a later year helps avoid overfitting to trending topics and buzzwords.

Historical snapshots of the knowledge graph were taken for the following dates — June 1, 2011, September 1, 2011, and June 1, 2012.

Figure 2: Meta’s Eigenfactor and citation prediction data snapshot.

This provided a picture of what the landscape of science looked like at the time of publication for all papers in the training set. These snapshots were scrubbed of any information that would not have been available on those dates, including papers, citations, authors, and any derived metrics.

For each snapshot, a three-year post-publication citation graph was captured on June 1, 2014, September 1, 2014, and June 1, 2015, respectively.

These snapshots contain over 150,000 published papers that were used for training and validating Meta’s impact prediction system.
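
The sketch below illustrates, under simplifying assumptions, how such leakage-free training examples could be assembled from the dated snapshots. The record layout and dictionary keys are hypothetical stand-ins for Meta’s knowledge-graph snapshots.

```python
# Illustrative assembly of leakage-free training examples: features come only
# from a snapshot taken at (or before) publication, and the label is the
# citation count observed three years later. All keys are hypothetical.
from datetime import date

SNAPSHOT_PAIRS = [  # (knowledge-graph snapshot, three-year citation graph)
    (date(2011, 6, 1), date(2014, 6, 1)),
    (date(2011, 9, 1), date(2014, 9, 1)),
    (date(2012, 6, 1), date(2015, 6, 1)),
]

def build_examples(papers, citations_at):
    """papers: dicts with hypothetical keys 'id', 'pub_date', and
    'features_by_snapshot' ({snapshot date: feature vector}).
    citations_at: {(paper id, date): citations accrued by that date}."""
    examples = []
    for snap_date, target_date in SNAPSHOT_PAIRS:
        for paper in papers:
            # Skip papers not yet published at the snapshot date, so no
            # post-snapshot information leaks into the features.
            if paper["pub_date"] > snap_date:
                continue
            features = paper["features_by_snapshot"][snap_date]
            label = citations_at.get((paper["id"], target_date), 0)
            examples.append((features, label))
    return examples
```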

Figure 3: Eigenfactor prediction accuracy. The strength of the shaded blue area indicates the confidence interval for the prediction. The 80%, 90%, and 95% intervals are marked with dashed black lines, indicating that 90% of the predictions are within 1 of the true Eigenfactor (EF). Papers with EF > 5 are in the top 1% of all publications.

Demonstrating viability

In order to compare the results, a baseline was created from the journal in which each paper was published. In general, researchers submit manuscripts to the journal with the highest impact factor they think will accept their papers, while editors accept the papers they believe will have the highest impact. The publishing journal can therefore be viewed as the compromise between these two opposing forces. In other words, it is the agreement between editors and authors, and represents the best efforts of humans at sorting articles by impact.

The baseline used was the median citation count and Eigenfactor for the journal where each paper was published. In testing, it was determined that journal median citation count (or median Article Level Eigenfactor) serves as a better predictor than both mean counts and journal impact factor. It is worth repeating that the Bibliometric Intelligence pipeline does not predict results using any journal information.
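
For illustration, the journal-median baseline amounts to something like the following pandas sketch; the column names are hypothetical.

```python
# Illustrative journal-median baseline: predict each test paper's three-year
# citation count as the median over papers previously published in its journal.
import pandas as pd

def journal_median_baseline(history: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """history needs hypothetical columns 'journal' and 'citations_3yr';
    test needs 'journal'."""
    medians = history.groupby("journal")["citations_3yr"].median()
    global_median = history["citations_3yr"].median()
    # Journals unseen in the history fall back to the global median.
    return test["journal"].map(medians).fillna(global_median)
```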

All training and model selection were carried out using five-fold cross-validation on the 2011 data. The results of the final selected model on the 2012 data are presented below:

Figure 4: Performance summary for impact prediction model. The baseline is median citation count and Eigenfactor for the journal where each paper was published.

In one analysis, Meta identified the 572 papers predicted to have the highest impact. This subset went on to accrue an average of 54 citations over a three-year period, compared with an overall average of 7 citations across the entire test set. Of the 572 papers, 185 (32%) were indeed in the top 1% of papers based on citation count, and 367 (64%) were in the top 5% of all publications. By comparison, of the 778 papers in the dataset that were published in the top six biomedical journals (Science, Nature, Cell, PNAS, NEJM, and Lancet), 119 (15%) were among the top 1% and 280 (36%) were among the top 5%, based on citation count. The results of this large-scale trial demonstrate that Meta performs 2.7x better than the best baseline estimator at predicting article-level impact for new manuscripts prior to publication, and 2x better than the baseline at identifying “superstar articles”, the top 1% of high-impact papers, prior to publication.

Journal matching and journal cascading

Another piece of useful feedback for the benefit of both publishers and authors is a compatibility score between the manuscript and the journal to which it was submitted. Additionally, publishers with many journals within their portfolio can benefit from receiving a list of alternative sister journals, ranked in order of compatibility with the projected article-level impact and topical fingerprint of the manuscript. Both of these tasks can be solved in a similar manner.

Figure 5: Using a classification model trained on over 15,000 positive and negative paper-journal matches, Meta ranks the best journal matches for a given manuscript.

The problem was framed as binary classification over a given paper-journal pair. To generate training data, 500 papers were extracted and matched to the journals in which they were published. As discussed previously, the journal in which a paper was published is the one on which the authors, editor, and reviewers could agree. For each paper, negative examples were generated by selecting 10 random journals, as well as 25 journals known to be close matches according to Meta’s journal-to-journal recommendations. Very similar journals were accounted for by adjusting the weights of the negative examples.
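
A sketch of how such positive and negative paper-journal pairs might be assembled is shown below. The sampling counts follow the description above, while the data structures and the down-weighting factor for near-match journals are assumptions.

```python
# Illustrative training-set construction for the paper-journal match classifier:
# one positive pair per paper (its actual journal) plus negatives drawn from
# random journals and from journals known to be close matches.
import random

def build_pairs(papers, all_journals, similar_journals, n_random=10, n_similar=25):
    """papers: dicts with hypothetical keys 'id' and 'journal'.
    similar_journals: {journal: close journals from journal-to-journal
    recommendations} (hypothetical)."""
    pairs = []  # (paper_id, journal, label, weight)
    for paper in papers:
        pairs.append((paper["id"], paper["journal"], 1, 1.0))
        other = [j for j in all_journals if j != paper["journal"]]
        for j in random.sample(other, n_random):
            pairs.append((paper["id"], j, 0, 1.0))
        for j in similar_journals.get(paper["journal"], [])[:n_similar]:
            # Assumed adjustment: down-weight very similar journals so that
            # near-misses are penalised less during training.
            pairs.append((paper["id"], j, 0, 0.5))
    return pairs
```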

Different classifiers were evaluated for performance, including random forests and neural networks. Eventually, gradient boosted trees proved to be highly accurate both at identifying the journal in which the paper was published and at generating a good list of cascading journals, as verified by human curators. The model achieves an area under the ROC curve of 0.984 and an F1 score of 0.92.
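
A minimal sketch of training and evaluating such a classifier with scikit-learn’s gradient boosted trees is shown below; the feature matrix, labels, weights, and hyperparameters are placeholders rather than Meta’s actual setup.

```python
# Illustrative paper-journal match classifier using gradient boosted trees.
# X would hold per-pair features (e.g. topical similarity between the paper and
# the journal), y the match labels, and w the example weights from the sampling
# step above; random placeholders are used so the snippet runs standalone.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 20))          # placeholder pair features
y = rng.integers(0, 2, 1000)        # placeholder match labels
w = np.ones(len(y))                 # placeholder example weights

X_tr, X_te, y_tr, y_te, w_tr, _ = train_test_split(X, y, w, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
clf.fit(X_tr, y_tr, sample_weight=w_tr)

print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
print("F1: ", f1_score(y_te, clf.predict(X_te)))
```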

Reviewer suggestions

For editors, finding ideal reviewers for a paper can be a time-consuming challenge. One common approach is to ask the authors for suggestions; however, there is often little that editors can do to verify the legitimacy of those suggestions, which leaves the practice open to fraudulent activity[8].

A good reviewer is a subject-matter expert who provides meaningful, detailed feedback in a timely manner. Even so, identifying good reviewers is a highly subjective process. While it is difficult to predict the quality of the reviews a researcher would provide, Meta can help identify potential candidates who are experts in their fields. Stringent filters are applied to ensure that only active lead researchers who have not co-authored with any of the manuscript’s authors are considered.

Meta developed a heuristic algorithm that takes many different aspects of candidate reviewers into consideration. These include the topical similarity of the candidate’s past papers to the manuscript, adjusted for recency; the candidate’s impact in the relevant field; and various signals from the candidate’s publication history[9]. The final lists of recommended reviewers are validated by human curators, who continuously work to refine the process, and the recommendations are included in the final report for the editor.
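
A simplified sketch of this style of heuristic scoring and filtering is shown below. The field names, weights, and recency decay are hypothetical choices, not Meta’s algorithm.

```python
# Illustrative reviewer candidate scoring: filter out co-authors, then score
# candidates by recency-weighted topical similarity plus field impact.
# Weights and the decay constant are arbitrary placeholders.
import math

CURRENT_YEAR = 2016

def rank_candidates(manuscript_topics, manuscript_authors, candidates,
                    similarity, w_topic=1.0, w_impact=0.5, half_life=5.0):
    """candidates: dicts with hypothetical keys 'name', 'coauthors',
    'papers' (each with 'topics' and 'year'), and 'field_impact'.
    similarity: function mapping two topic vectors to a score in [0, 1]."""
    ranked = []
    for c in candidates:
        # Exclude anyone who has co-authored with the manuscript's authors.
        if set(c["coauthors"]) & set(manuscript_authors):
            continue
        # Recency-adjusted topical similarity over the candidate's past papers.
        topic_score = sum(
            similarity(manuscript_topics, p["topics"])
            * math.exp(-(CURRENT_YEAR - p["year"]) / half_life)
            for p in c["papers"]
        )
        ranked.append((w_topic * topic_score + w_impact * c["field_impact"], c["name"]))
    return [name for _, name in sorted(ranked, reverse=True)]
```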

Figure 6: To recommend reviewers for peer review, Meta considers the topical similarity of past papers, a reviewer’s impact on the relevant field, and publication history.

Summary

Global scientific output doubles every nine years[10]. As those on the front lines of this exponential surge, editors are under increasing pressure to manage and triage the growing volume of submissions that flow among thousands of journals, through cycles of submission and rejection, on an uncertain path to publication. Meta Bibliometric Intelligence provides an intelligent, scalable tool to help them meet the demands of this evolving manuscript/journal ecosystem. Simply put, it saves editors time so that they can focus on what they alone can do.

For editors who wish to test Meta Bibliometric Intelligence for themselves, contact solutions@meta.com, or visit http://meta.com/publishing.

About the Authors

Liu Yang is a data scientist at Meta. She received her PhD in Molecular Biology and Genetics from Cornell University and her Master’s in Computer Science from the University of Toronto.

Shankar Vembu is a senior data scientist at Meta. He received his PhD in Computer Science from the University of Bonn in Germany with a focus on Machine Learning.

Amr Adawi is a data engineer at Meta with a BSc from the University of Toronto. Amr’s focus is on building distributed AI platforms for discovering patterns in large unstructured data sets.

Ofer Shai is the Chief Science Officer at Meta. He holds a PhD in Computer Engineering from the University of Toronto with a focus on Machine Learning and Computational Biology and has extensive industry experience in genomics, information retrieval, recommendation systems, and analytics.

Appendix A: Sample Bibliometric Intelligence Report
