Data Science at the Wellcome Trust: 2020 year in review

Antonio Campello
Published in Wellcome Data
6 min read · Dec 8, 2020

A large portion of the data produced in the context of the Wellcome Trust's grant-making activities is unstructured and consists essentially of text (e.g. grant synopses, academic publications, and policy documents). A big part of the job of the data science team at Wellcome Data Labs is to employ natural language processing (NLP) techniques to make sense of this data.

In this blog post, we will describe different ways in which we applied NLP this year to help grant managers at Wellcome better understand our research portfolio. Here is a non-exhaustive list of applications:

  1. Employed text clustering to help grant managers discover our portfolio in innovative ways
  2. Launched a tool to help grant managers search through the portfolio via disease codes
  3. Used text classification and text clustering to investigate which of our grants produced models, datasets, or other 'hidden' tech
  4. Used deep learning to parse references in policy documents (and used the above to assess how Wellcome’s funded work was influencing the COVID response)
  5. Used named-entity recognition to search for people in policy documents
  6. Used semantic similarity to link publications that acknowledge the Wellcome Trust but lack a grant number

In most of these projects we worked together with social scientists, who carried out algorithmic reviews and ethical analyses of our work, investigating fairness and bias.

Neural networks and BERT

To deliver these projects effectively, we had to keep track of the latest developments in NLP. In the data science world, 2020 saw the popularisation of powerful techniques for natural language processing, such as BERT-like models, with Google claiming to use BERT for at least 1 in every 10 searches. These are general-purpose models, powered by deep learning, that achieve state-of-the-art performance on a variety of tasks, including finding specific entities (names, dates, etc.) in a text, categorising texts, and linking different documents. Major open-source libraries (for instance Hugging Face and spaCy) now offer these models off the shelf.

At Wellcome Data Labs, we saw this as an opportunity to systematise and boost our natural language processing offering, moving more and more towards neural networks and, when appropriate, BERT. Since we did not want to lose sight of classic baselines, which are still very competitive with the newest models for some problems, we developed easy ways to switch between the two. In doing so, we open-sourced most of our common NLP code (check out the WellcomeML package if you haven't!).
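
As a flavour of what this looks like in practice, here is a minimal sketch (not the actual WellcomeML interface) of keeping a classic baseline and a transformer behind a single switch; the function name, model choice and parameters are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def make_text_classifier(kind="baseline"):
    """Return a text classifier; 'kind' toggles baseline vs transformer."""
    if kind == "baseline":
        # Classic TF-IDF + logistic regression: cheap to train and
        # still a strong baseline for many text classification tasks.
        return Pipeline([
            ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
            ("clf", LogisticRegression(max_iter=1000)),
        ])
    if kind == "bert":
        # Transformer branch via Hugging Face; in practice this would
        # be fine-tuned on our own labelled data rather than used as-is.
        from transformers import pipeline as hf_pipeline
        return hf_pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )
    raise ValueError(f"unknown classifier kind: {kind}")
```

Keeping both behind one interface makes it cheap to check, per problem, whether the extra cost of a transformer is actually buying accuracy.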

Below is a summary of some of the NLP projects we developed in 2020.

A map of research fields

Example of cluster map for machine learning fields

Working in collaboration with Wellcome Trust's Science Division, we have developed a reusable method for building, solely from text (such as publication and grant titles/abstracts), interactive charts of the research fields emerging from the grants we fund. This enables grant managers to interact with their portfolio in innovative ways, generating, for instance, qualitative historical insights into the research fields we have funded.

To deal with the scalability requirements of the problem (tens of thousands of grants, and hundreds of thousands to millions of publications), we had to replace typical dimensionality-reduction techniques with more scalable tools such as t-SNE and UMAP. Above is what a zoomed-in snapshot of a map looks like when applied to machine learning research publications.
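
For the curious, the recipe behind such a map boils down to embed, reduce, cluster. The sketch below illustrates that recipe with TF-IDF vectors, UMAP (via the umap-learn package) and k-means; the vectoriser, parameters and toy texts are illustrative assumptions, not the production pipeline:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
import umap

# Toy stand-ins for grant synopses or publication abstracts.
texts = [
    "deep learning for medical image segmentation",
    "convolutional networks in radiology imaging",
    "malaria transmission dynamics in East Africa",
    "mosquito control and malaria prevention",
    "genome sequencing of drug-resistant bacteria",
    "antibiotic resistance in bacterial genomes",
]

# Vectorise the texts (sentence embeddings would work equally well).
X = TfidfVectorizer().fit_transform(texts)

# Project to 2D so each document becomes a point on an interactive chart.
coords = umap.UMAP(n_components=2, n_neighbors=3, metric="cosine").fit_transform(X)

# Group nearby points into candidate research fields.
fields = KMeans(n_clusters=3, n_init=10).fit_predict(coords)
print(fields)
```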

You can read more about this application here.

Tagging biomedical grants

Another way to navigate research fields is by means of Medical Subject Headings (MeSH), a bibliometric vocabulary usually employed to index journal articles and books in the life sciences. The vocabulary includes diseases and pathological conditions, which makes MeSH very appealing for interrogating our grant portfolio.

Unfortunately, the volume of historical grant applications (on the order of hundreds of thousands) and the number of MeSH terms (approximately 27,000) make it impractical for us to tag our grants manually. Instead, we trained a neural network to do this task for us. The model was trained on a dataset of 8M publication titles and abstracts tagged with MeSH, and subsequently transferred to grant titles and synopses. The model is now fully deployed, available as an internal tool that lets grant managers query the machine-learning-generated tags and provide feedback.
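
At its core this is a multi-label classification problem: each text can carry many MeSH terms at once. The sketch below shows that framing with a simple linear baseline (one binary classifier per term) rather than the production neural network; the toy texts and tags are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for publication title+abstract texts and their MeSH tags.
texts = [
    "malaria vaccine trial in children",
    "deep learning for retinal image diagnosis",
    "malaria transmission modelling",
]
tags = [["Malaria", "Vaccines"], ["Deep Learning"], ["Malaria"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # one binary column per MeSH term

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
model.fit(texts, Y)

# Apply the publication-trained model to an unseen grant synopsis.
pred = model.predict(["a study of malaria vaccines in East Africa"])
print(mlb.inverse_transform(pred))
```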

Beyond publications (1): tech

Publications are certainly not the sole research outcomes tracked by Wellcome. In fact, we know that a lot of technology, tools and reusable programming scripts are created as either primary or secondary outputs of grants. However, we did not have a clear way of finding out which grants create this 'tech' output, so we created a machine learning model to identify such grants and estimate their number. This fits into a larger project, in collaboration with the Data for Science and Health priority area and UCL, to evaluate the representativeness of datasets used in research we've funded.

This model was trained on grant descriptions to predict whether a grant has produced 'tech' (i.e. software, models or datasets) in the broad domain of UK health. We tried several modelling approaches, and finally decided to use an ensemble of the four top-performing models. Our model estimates that 'tech' grants account for roughly £2B of funding.
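
To illustrate the ensembling idea, here is a hedged sketch that soft-votes over four classifiers' predicted probabilities; the member models and features are illustrative assumptions, not the four models we actually shipped:

```python
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

tech_classifier = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("ensemble", VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("nb", MultinomialNB()),
            ("rf", RandomForestClassifier(n_estimators=200)),
            ("gb", GradientBoostingClassifier()),
        ],
        voting="soft",  # average class probabilities across the four models
    )),
])

# Hypothetical usage, with grant_descriptions and is_tech_labels
# standing in for the real training data:
# tech_classifier.fit(grant_descriptions, is_tech_labels)
```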

Beyond publications (2): policy documents

In 2020, Wellcome Data Labs launched Reach, a tool to find and track scientific research in policy documents published by institutions such as the UK Government and the World Health Organisation.

As it turns out, finding academic references in such policy documents is frustratingly hard, owing to the lack of a common citation style and the variety of typographic formats. At the backend of Reach is a machine learning model capable of performing this task. Our model, published as an open-source tool (the deep reference parser) in 2020, was largely inspired by an academic publication on parsing references in the arts and humanities.
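
Under the hood, reference parsing is a sequence-labelling problem: every token in a raw reference string receives a component label (author, year, title, and so on). The sketch below illustrates that framing with a simple CRF via the sklearn-crfsuite package; it is a deliberately simplified stand-in for the deep model, and the features, labels and toy example are our own assumptions:

```python
import sklearn_crfsuite


def token_features(tokens, i):
    """Hand-crafted features for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }


# One toy training example: a tokenised reference and its labels.
tokens = ["Smith", "J", "2019", "Malaria", "control", "Lancet"]
labels = ["author", "author", "year", "title", "title", "journal"]

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # predicted label sequence for each reference
```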

The modular approach of our reference parser enabled us to re-purpose it for different applications. For instance, we have used the deep reference parser to help assess how much of Wellcome’s funded work was influencing the COVID response.

More in-depth information about the deep reference parser can be found here.

What is in a name?

Besides linking policy documents to academic references, another possible use case is to search for a particular person (e.g. Chris Whitty or Jeremy Farrar) in a policy document. Although distinctive names produce good results through a keyword search, there are many cases where the search is ambiguous and produces false positives.

Example of true positives and false positives for name search

To overcome this problem, we have trained a neural network-based named-entity recognition model (using spaCy) capable of recognising names in policy texts. Interestingly, off-the-shelf pre-trained models perform poorly on this dataset, so to develop a strong model we manually annotated domain-specific documents. In addition to recognising names, we have also explored a pipeline for disambiguating authors with similar names via text vectorisation of the academic papers in the ORCID database.
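
As a minimal illustration of why NER beats raw keyword matching, the sketch below runs an off-the-shelf spaCy model and keeps only the spans it tags as PERSON; the example sentence is invented, and our production model was additionally trained on annotated policy documents:

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("Professor Chris Whitty advised the committee, "
        "while the Whitty Review was published separately.")

doc = nlp(text)
# Keep only entities the model tags as PERSON, rather than
# every string match on the surname.
people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
print(people)
```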

The authors would like to thank the Wellcome Trust…

Finding people and references inside policy documents leads us naturally to another crucial problem: linking those references to actual grant identifiers. Perhaps surprisingly, this is a long-standing open problem for many funders. Indeed, we conducted an initial analysis of the magnitude of this problem via a large open science platform (EPMC), and concluded that roughly one quarter of the academic publications that acknowledge Wellcome do not provide a grant number. Below is a typical funding statement that acknowledges Wellcome but does not provide grant information.

Example of funding statement that mentions Wellcome Trust but does not acknowledge a specific grant number

Since linking all those publications manually is infeasible, we have trained a BERT model that measures the semantic similarity between an academic publication and a grant application, and suggests candidate links when they are missing from the database. For this problem, BERT beat all baselines by far, as linking documents is highly dependent on semantic context (more information in a forthcoming blog post).
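
The sketch below shows the general shape of such a linker using the sentence-transformers library: embed the publication and the candidate grants, then rank grants by cosine similarity. The model name and toy texts are illustrative assumptions, not our production setup:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

publication = "We describe a randomised trial of a malaria vaccine in children."
grants = [
    "A phase III trial of malaria vaccination in young children",
    "Genome sequencing of antibiotic-resistant bacteria",
]

# Embed the publication and all candidate grant applications.
pub_emb = model.encode(publication, convert_to_tensor=True)
grant_embs = model.encode(grants, convert_to_tensor=True)

# Rank grants by cosine similarity and suggest the closest match.
scores = util.cos_sim(pub_emb, grant_embs)[0]
best = int(scores.argmax())
print(grants[best], float(scores[best]))
```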

What’s next

Besides continuously improving the aforementioned projects, our plans for 2021 include investigating a more systematic graph-based approach to generating embeddings for our grants, working together with data science teams from sister organisations, and looking at innovative ways in which data science can support the new Wellcome strategy. Stay tuned for more updates!
