Every year, researchers publish an average of over 7,000 academic papers that acknowledge the Wellcome Trust. Our guidelines state that a Wellcome-funded researcher should explicitly mention a grant number in all research outputs. In reality, however, at least one quarter of the publications that acknowledge Wellcome are not linked to a grant number. This means that when our analysts and managers review the academic outputs of our funding portfolio, they have to work with the caveat that one quarter of the publications are unaccounted for. Below is a typical acknowledgement statement that doesn’t mention a grant number:
To solve this problem, the data science team at Wellcome got together with the Strategy & Data Insight team to hunt for missing links. We developed a machine-learning model that scans a publication with missing links and predicts the most likely grant that produced it. Now that we are in the last steps of the project, we can say that we expect to recover, with high confidence, at least 60% of the missing links. This means we now know how our funding contributed to 17,000 previously unclaimed publications!
The general idea for Wellcome Link is simple: given a publication that acknowledges Wellcome but does not mention a grant number, what grant in our grants database is most likely to have produced it? For example, looking at when the paper was published, we might search for grants awarded on ‘compatible’ dates, check whether any authors of the publication are also grant holders, or look for grants on similar topics. This is indeed what analysts had to do in the past due to the lack of grant links, with some additional clues that might come from grant reports. Linking manually is not a good use of analyst time, and is hard to apply consistently. This is where machine learning can help.
We ran two workshops with colleagues from various departments at Wellcome, trying to recreate the process of linking grants and publications. For every publication, we asked two reviewers to try to guess which grant produced it. This generated a golden set of correct pairs, and helped us assess how long the task takes. Despite the subjectivity of the task, we saw that reviewers agreed on the correct link for up to 80% of publications.
Grant-publication similarity model
Our idea was then to use the machinery of text similarity to come up with a machine learning model that can automatically check how similar a grant is to a publication. Here is a schematic representation of our first idea:
This model doesn’t quite work off-the-shelf! Some considerations:
- In practice, we don’t compute similarities between all pairs of grants and publications. We keep only grants that share an author with the publication and exclude those whose dates are out of range, which narrows the search down to a ‘feasible’ set. This step is a natural way of filtering the search, and it also greatly reduces the computational cost (a naive approach to computing similarities would be of order O(G × P), where G is the number of grants and P the number of publications).
- Instead of declaring all pairs above a threshold to be linked, we first sort the recommendations for a given publication, and pick the top ones above the threshold. This gives an ‘information-retrieval’ flavour to the problem.
- Since the correct rank of related grants is the most important outcome (i.e. we want the most similar grants to be sorted first), we chose to optimise the area under the curve, a typical ranking metric.
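The filtering and ranking steps described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline we run: the `Grant` and `Publication` classes, the `grace_days` window and the `similarity` callable are all hypothetical names introduced for this example, and the exact definition of a ‘compatible’ date is assumed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Grant:
    grant_id: str
    holders: frozenset
    start: date
    end: date


@dataclass(frozen=True)
class Publication:
    pub_id: str
    authors: frozenset
    published: date


def feasible_grants(pub, grants, grace_days=365):
    """Keep grants that share an author with the publication and have compatible dates.

    Assumption: a date is 'compatible' if the paper appeared after the grant
    started and no later than `grace_days` after it ended.
    """
    kept = []
    for g in grants:
        shares_author = bool(pub.authors & g.holders)
        in_range = pub.published >= g.start and (pub.published - g.end).days <= grace_days
        if shares_author and in_range:
            kept.append(g)
    return kept


def top_links(pub, grants, similarity, threshold=0.5, k=3):
    """Score only the feasible grants, sort by score, keep the top k above the threshold."""
    scored = [(similarity(pub, g), g.grant_id) for g in feasible_grants(pub, grants)]
    scored.sort(reverse=True)
    return [(grant_id, score) for score, grant_id in scored if score >= threshold][:k]
```

Because `similarity` is only called on the feasible set, the number of comparisons per publication depends on how many grants its authors hold, not on the size of the whole grants database.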
There are many ways to transform texts to compute similarity. In the scoping phase of the project, we decided we would experiment with three of them: a simple baseline (TF-IDF); a transformer-based deep learning method, BERT; and a variant of BERT pretrained on a scientific corpus, SciBERT. For BERT and SciBERT, we used the frozen weights of the pretrained models.
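To make the baseline concrete, here is a minimal TF-IDF and cosine-similarity sketch written with the Python standard library only. It is not the implementation we used; the tokeniser, the smoothed IDF formula and the function names are choices made for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse {term: weight} dict per text, using a smoothed IDF."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    # Document frequency: in how many texts does each term appear?
    doc_freq = Counter(term for doc in docs for term in doc)
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up texts, purely for illustration.
grant_text = "Genomic surveillance of malaria parasites in West Africa"
pub_texts = [
    "Whole genome sequencing of malaria parasites from Ghana",
    "Deep learning for retinal image segmentation",
]
vectors = tfidf_vectors([grant_text] + pub_texts)
similarities = [cosine(vectors[0], v) for v in vectors[1:]]
```

In this toy example the first publication shares several terms with the grant and gets a positive score, while the second shares none and scores zero. The BERT and SciBERT variants follow the same pattern, with the sparse vectors replaced by dense embeddings.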
Our hypothesis was that SciBERT would outperform the TF-IDF baseline, since it has been shown to beat the state of the art in many natural language tasks on scientific literature. Below are the results of one of the initial experiments:
Matching@N means that the correct grant was among the first N guesses. None of the models performed particularly well and, surprisingly, SciBERT was not capable of beating the TF-IDF baseline. More importantly, the area under the curve was very poor for the three models and setting a universal threshold was impractical, in part due to the curse of dimensionality.
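For readers who want to reproduce the metric, Matching@N as defined above can be computed along these lines. The function name and the dictionary-based inputs are assumptions made for this sketch.

```python
def matching_at_n(ranked_by_pub, true_grant_by_pub, n):
    """Fraction of publications whose true grant is among the first n ranked guesses.

    ranked_by_pub: {pub_id: [grant_id, ...]} sorted from most to least similar.
    true_grant_by_pub: {pub_id: grant_id} taken from the ground truth.
    """
    if not true_grant_by_pub:
        return 0.0
    hits = sum(
        1
        for pub_id, truth in true_grant_by_pub.items()
        if truth in ranked_by_pub.get(pub_id, [])[:n]
    )
    return hits / len(true_grant_by_pub)
```

A publication with no ranked candidates at all counts as a miss, so the metric also penalises an over-aggressive pre-filter.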
In the previous model, there was one piece of information being massively underused. We were merely computing a vector based on the grant/publication texts and calculating a similarity metric. The ground truth was only being used to evaluate the model!
As I mentioned at the beginning of the text, one quarter of the publications are missing a link. This means that the remaining 75% of publications can be used as ground-truth data to train a supervised model. This model would have to predict whether a publication-grant pair is “semantically similar” or not. The closest task in the natural language processing literature is called “Semantic Textual Similarity”. We thus built a semantic similarity model, using BERT and SciBERT as layers. The code for this model is available in our WellcomeML library (see the docs for a quick example).
Here is a scheme of the supervised model:
- The ground-truth data only consists of ‘positive’ pairs of linked publications/grants. In order to train a semantic similarity model, we need to sample ‘negative’ examples from the universe of unlinked pairs. This has to be done very carefully, otherwise the problem becomes too easy (e.g. if the negative samples are generated from completely different subjects). There are several ways to generate reasonable negative samples, including using TF-IDF. In our case, we picked grant-publication pairs that were definitely not linked, but came from the same authors. This way, we made sure our model was learning a genuinely “hard” task.
- Besides the text, the model is also fed the dates of the publication and the grant, which, intuitively, should boost its performance. For example, we know for a fact that the majority of publications occur during the last year of a grant. Probably due to our pre-filters (we only feed the model pairs which have a compatible date), this feature yielded only a marginal improvement in the area under the curve. However, being able to add metadata when training the transformer model is beneficial in the long run, and for other possible applications, so we decided to keep it.
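The negative-sampling idea from the first point above can be sketched as follows. This is a simplified illustration rather than our training code: the function name and input dictionaries are hypothetical, and it assumes that a grant held by one of the publication's authors but absent from that publication's known links can be treated as a true negative.

```python
import random


def sample_hard_negatives(linked_pairs, grants_by_author, authors_by_pub,
                          per_positive=1, seed=0):
    """For each linked (pub, grant) pair, sample unlinked grants held by the same authors.

    Returns (pub_id, grant_id, 0) triples, where 0 is the 'not linked' label.
    """
    rng = random.Random(seed)
    linked = set(linked_pairs)
    negatives = []
    for pub_id, _ in linked_pairs:
        # Grants held by any author of this publication, minus its known links.
        candidates = sorted({
            grant_id
            for author in authors_by_pub.get(pub_id, ())
            for grant_id in grants_by_author.get(author, ())
            if (pub_id, grant_id) not in linked
        })
        k = min(per_positive, len(candidates))
        negatives.extend((pub_id, grant_id, 0) for grant_id in rng.sample(candidates, k))
    return negatives
```

Publications whose authors hold only the linked grant contribute no negatives, so the final class balance depends on the data rather than on `per_positive` alone.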
This time, the fine-tuned SciBERT model outperformed all three unsupervised models, with the first suggestion being correct up to 60% of the time, and the first three up to 74% (there was a plateau after 5 suggestions). More importantly, the model can return meaningful confidence scores that actually reflect the probability that a pair is related. Its area under the curve was above 92%, meaning we have freedom to trade sensitivity for specificity as required.
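As a reminder of what this number measures, here is a small standard-library sketch of the area under the curve, assuming the curve in question is the ROC curve (which the sensitivity/specificity trade-off suggests). It uses the equivalent ranking definition: the probability that a randomly chosen linked pair scores higher than a randomly chosen unlinked pair, with ties counted as half.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for linked pairs, 0 for unlinked pairs.
    scores: model confidence for each pair, in the same order.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic, which is fine for a validation sample but too slow for large datasets, where a sort-based implementation is preferable.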
Final considerations and caveats
During the development of this project, we assessed algorithmic fairness (using, among other things, the excellent Deon checklist), and ran workshops with the main stakeholders. A couple of important issues were raised and addressed, such as fairness across career stages and groups of grants missing from the analysis. In particular, there are two groups that we miss. Firstly, we can only link publications whose authors are directly involved, and certain funding activities might be harder to link (e.g. an equipment & resource grantholder may not be acknowledged by research using that resource, even if it does acknowledge the resource itself). Secondly, the algorithm only covers publications that acknowledge the Wellcome Trust (but not a grant number). It does not account for unlinked publications that don’t even acknowledge Wellcome, for which we don’t have an estimate.
Since the main objective of this project was to generate trustworthy publication-grant linkage, we decided to set a very high threshold and optimise for precision. For deployment, a pipeline that automatically updates a “staging” database with predicted links can be triggered either manually or scheduled via Argo (in the future, we are looking at incorporating a human-in-the-loop process where reviewers can flag links). The information in this database is then consolidated by the data engineering team with other sources, and exposed internally, alongside the confidence scores. Analysts or any other member of the internal Wellcome community can then make use of the predictions, helping enhance our understanding of academic outputs generated from grants.
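One simple way to pick such a threshold is to scan a labelled validation set and keep the lowest score at which precision still meets a target. The sketch below illustrates that idea only; the function name and the `target_precision` default are assumptions, not the values used in our pipeline.

```python
def threshold_for_precision(labels, scores, target_precision=0.95):
    """Lowest score threshold whose precision on labelled pairs meets the target.

    Returns None if no threshold reaches the target precision.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    best = None
    true_positives = 0
    for i, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # Evaluate only after the last pair in a run of tied scores, because
        # a threshold cannot separate pairs that share the same score.
        is_last_of_tie = i == len(ranked) or ranked[i][0] != score
        if is_last_of_tie and true_positives / i >= target_precision:
            best = score
    return best
```

Choosing the lowest qualifying threshold keeps as many links as possible while staying at or above the target precision on the validation pairs.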
The code for the semantic similarity model mentioned in this blog post is available on WellcomeML. Read the docs for a usage example.