How Much Does Wellcome Fund Technology?

Liz Gallagher
Published in Wellcome Data
May 7, 2021

This piece was co-written by Becky Knowles.

Data and tech are increasingly important to science. At Wellcome, we support data scientists and software engineers who create tools that use data in an innovative way. We put trust into practice by changing how data and software in research are funded, developed and governed.

Data and tech are increasingly important to science. Photo from https://www.pirbright.ac.uk/news/2019/11/pirbright-enters-bioimaging-collaboration-diamond-light-source

Although we suspected that a significant proportion of Wellcome’s grants have used or created tech in some way, we didn’t have a clear view of Wellcome’s existing tech portfolio. So, we brought together teams from across Wellcome and University College London to find a reproducible method for identifying and tracking these grants. We combined the data science expertise Wellcome uses internally to make better decisions with our externally focused Data for Science and Health team to identify the scale of Wellcome-funded tech.

But first…

What do you mean by ‘tech’?

By tech, we’re referring to coding scripts, software or tools, particularly those that have been shared with others. This can include open-source software, packages, coding scripts that perform a certain function (such as data cleaning, linkage or visualisation), and statistical or data science models.

In this piece we will discuss how we built a machine learning model to predict ‘tech grants’ in our portfolio.

How did we do it?

The Dataset

By reading the title and description of grants we managed to tag around 300 grants as having produced some sort of tech and around 300 as not. This process wasn’t easy: it took several people many hours of reading. The dataset of grants and their descriptions is openly available via 360Giving.

Experimentation

We then used the tagged dataset to train a natural language processing model that takes a grant’s description and predicts whether the grant produced tech or not. We trained 12 models in total, each combining a vectorizer step (turning the words into numbers using methods including TF-IDF, word counts, and BERT and SciBERT pretrained embeddings) with a classifier step (including logistic regression, Naïve Bayes and SVM). The test metrics for these models are given in the table below.
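As a rough illustration of what one vectorizer-plus-classifier combination looks like, the TF-IDF + logistic regression pair can be put together with scikit-learn. This is a minimal sketch rather than our exact training code; the file name and column names are assumptions.

```python
# Minimal sketch of one vectorizer + classifier combination (TF-IDF + logistic
# regression). The file name and column names are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

grants = pd.read_csv("tagged_grants.csv")  # ~600 grants tagged as tech (1) or not (0)
X_train, X_test, y_train, y_test = train_test_split(
    grants["description"], grants["is_tech"], test_size=0.25, random_state=42
)

model = Pipeline([
    ("vectorizer", TfidfVectorizer(stop_words="english", min_df=2)),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```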

Test metrics for our different models, using a prediction probability threshold of 0.55.

We experimented with using different combinations of these models in an ensemble, as well as with different prediction probability thresholds for classifying a grant as a tech grant. The test precision and recall scores of all these experiments are shown in the figure below.

Precision and recall scores for 12 single models (coloured by their vectorizer type, classifier type not shown), and ensemble models (black) with all possible combinations of the 12 models with a range of different probability thresholds.
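One simple way to combine models in an ensemble is to average their predicted probabilities and then sweep the decision threshold. The sketch below assumes a `fitted_models` list of pipelines like the one sketched earlier, along with the same held-out test data; it is illustrative rather than our exact ensemble code.

```python
# Illustrative ensemble: average the 'tech' probability across several fitted
# pipelines and trace precision/recall over a range of thresholds.
# `fitted_models`, `X_test` and `y_test` are assumed from the earlier sketch.
import numpy as np
from sklearn.metrics import precision_score, recall_score

def ensemble_probabilities(fitted_models, texts):
    """Average the predicted probability of the 'tech' class across models."""
    probs = [m.predict_proba(texts)[:, 1] for m in fitted_models]
    return np.mean(probs, axis=0)

avg_probs = ensemble_probabilities(fitted_models, X_test)
for threshold in np.arange(0.30, 0.80, 0.05):
    preds = (avg_probs >= threshold).astype(int)
    print(
        f"threshold={threshold:.2f}  "
        f"precision={precision_score(y_test, preds):.2f}  "
        f"recall={recall_score(y_test, preds):.2f}"
    )
```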

Our model

As you can see, using an ensemble of models can, in general, dramatically increase the test metrics (the top right of the plot is the best area). However, after all this experimentation with ensembles we found that our single BERT vectorization + logistic regression model performed really well, especially when using a prediction probability threshold of 0.55. On our test dataset the model achieved 0.9 precision and recall.

Test set results, 0 stands for not tech and 1 for tech.

Fairness

We performed several group fairness calculations to see whether the model performed better for some groups than others. One important check is that the model isn’t used to reinforce ideas about the “golden triangle” (Oxbridge and London) universities being more worthy of funding than others. We therefore grouped the grants by the grant holder’s organisation and calculated the test metrics for each group; the results were as follows:

Test results when grouping the test data by whether the grant holder’s organisation is in the golden triangle universities or not.
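The grouping itself is straightforward to reproduce: slice the test set by group and recompute the metrics for each slice. A minimal sketch, assuming a `test_df` with a boolean `golden_triangle` column and the fitted `model` from the earlier sketch:

```python
# Per-group fairness check: recompute precision and recall separately for
# grants whose holding organisation is / is not in the golden triangle.
# `test_df` (with 'description', 'is_tech', 'golden_triangle' columns) and
# `model` are assumptions carried over from the earlier sketches.
from sklearn.metrics import precision_score, recall_score

test_df["predicted_tech"] = model.predict_proba(test_df["description"])[:, 1] >= 0.55
for in_triangle, subset in test_df.groupby("golden_triangle"):
    print(
        f"golden_triangle={in_triangle}  "
        f"precision={precision_score(subset['is_tech'], subset['predicted_tech']):.2f}  "
        f"recall={recall_score(subset['is_tech'], subset['predicted_tech']):.2f}"
    )
```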

We have to be careful about how we interpret the model results because the model is slightly better at picking up tech from the golden triangle, performing about 0.05 higher on both precision and recall, which, applied to the whole dataset, would amount to roughly 1,000 grants labelled incorrectly. For universities outside the golden triangle the model is more likely to make errors about whether a grant produced tech. In any future work using this model, we therefore need to avoid drawing definite conclusions about how much golden triangle universities use tech in comparison to other universities.

Evaluation

When research produces technology it may be written about in an academic paper. Furthermore, many researchers have to self-report any outputs from their grants, including technology.

Therefore, using publication data from EPMC linked to Wellcome Trust grant numbers, we labelled 148 publications as demonstrating that technology had been produced. We then ran the model over the corresponding grant descriptions and found that 58% were predicted as tech grants. Similarly, from the self-reported data we have from ResearchFish we found 70 grants which reported producing technology; only 40% of these were predicted as tech grants. Although these numbers are small, the model didn’t have this extra information when making its predictions, so we are pleased it recalls at least some of these tech grants.
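This external check boils down to asking what fraction of the grants we already know produced tech are flagged by the model. A sketch, where `known_tech_descriptions` is an assumed list of the descriptions of those EPMC- or ResearchFish-linked grants:

```python
# Of grants known from publications or self-reports to have produced tech,
# what fraction does the model predict as tech? `known_tech_descriptions` and
# `model` are assumptions carried over from the earlier sketches.
probs = model.predict_proba(known_tech_descriptions)[:, 1]
recalled = (probs >= 0.55).mean()
print(f"{recalled:.0%} of externally reported tech grants predicted as tech")
```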

What are the Tech Grants?

After training the model we used it to make predictions for all 16,854 grants in the openly available dataset, which covers grants from 2005 to 2019. This gave us 3,562 predicted tech grants, 21% of all the grants, accounting for around £2 billion of funding (27% of the total funding for these grants).
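Scoring the full dataset is then a matter of applying the model to every description and summing the awarded amounts. A sketch, with the file and column names assumed rather than taken from our actual pipeline:

```python
# Sketch of scoring all grants in the openly available dataset and summarising.
# The file name and column names are illustrative assumptions.
all_grants = pd.read_csv("wellcome_grants_2005_2019.csv")
all_grants["is_tech_pred"] = (
    model.predict_proba(all_grants["description"])[:, 1] >= 0.55
)
tech = all_grants[all_grants["is_tech_pred"]]
print(f"{len(tech)} predicted tech grants ({len(tech) / len(all_grants):.0%})")
print(f"£{tech['amount_awarded'].sum():,.0f} of associated funding")
```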

The largest number of tech grants from any grant type came from the PhD Studentship (Basic) Award, although tech grants make up only 25% (458 out of 1,856) of all grants of this type. Open Access Awards also include a large number of tech grants, and tech grants make up 99% (424 out of 428) of all grants of this type.

The 20 most common grant types for tech grants (blue) with the number in all grants also given (pink).

Wellcome has some well-established schemes to support technological resources that benefit the wider scientific community, such as the Biomedical Resource Grants and Technology Development Grants. The proportion of tech grants in these is quite high: 80% (127 out of 159) and 74% (42 out of 57) respectively. But, contrary to what we may have expected, many other grant types also produce tech.

Clustering

Using the grants’ descriptions, we can visualise semantic similarities in 2D using TF-IDF vectorization and the UMAP dimensionality reduction algorithm. The DBSCAN clustering algorithm is then applied to this data. This whole process can be implemented quite easily using TextClustering from WellcomeML.
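WellcomeML wraps this workflow up, but the same steps can be sketched directly with scikit-learn, umap-learn and DBSCAN. The parameter values below are illustrative assumptions rather than the settings we used:

```python
# Rough equivalent of the clustering step: TF-IDF vectors, reduced to 2D with
# UMAP, then clustered with DBSCAN. Parameters are illustrative assumptions.
import umap
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words="english", min_df=5)
tfidf_vectors = tfidf.fit_transform(all_grants["description"])

embedding_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(tfidf_vectors)
all_grants["cluster"] = DBSCAN(eps=0.5, min_samples=10).fit_predict(embedding_2d)
```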

The image below shows all the grants in 2D; each point is a grant. The proportion of tech grants was found for each cluster and used to colour the points. The central grants in yellow/peach belong to clusters with a high proportion of tech grants, and grants in dark blue belong to clusters with particularly low proportions of tech grants.

All grants (points) in the dataset plotted in 2D using their semantic similarity. Grants are coloured by the proportion of tech grants in the cluster they are grouped into.
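For completeness, a short sketch (reusing the assumed variables from the clustering sketch above) of how each point can be coloured by the proportion of predicted tech grants in its cluster:

```python
# Colour each grant by the share of predicted tech grants in its cluster.
import matplotlib.pyplot as plt

tech_share = all_grants.groupby("cluster")["is_tech_pred"].transform("mean")
plt.scatter(embedding_2d[:, 0], embedding_2d[:, 1], c=tech_share, s=2, cmap="viridis")
plt.colorbar(label="Proportion of tech grants in cluster")
plt.show()
```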

Clusters with the 5 highest and 5 lowest proportions of tech grants are shown in the table below. Keywords for each cluster are generated from the terms with the highest TF-IDF values. The most tech was found in genomics, neuroscience, urban health, and data resources grants. The areas with the least tech were the history of medicine, and cellular and molecular biology. Although cellular and molecular biology was identified as an area with little tech, this could reflect the tech being more ‘hidden’ in these areas, with researchers not mentioning it in their grant descriptions.

Keywords from clusters with the 5 highest and 5 lowest proportions of tech grants.
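The cluster keywords can be generated by averaging the TF-IDF weights of each term within a cluster and keeping the highest-weighted terms. A sketch, reusing the assumed TF-IDF vectors and cluster labels from the clustering sketch above:

```python
# Keywords per cluster: the terms with the highest mean TF-IDF weight within
# each cluster (DBSCAN labels noise points as -1, which is skipped here).
import numpy as np

terms = np.array(tfidf.get_feature_names_out())
for cluster_id in sorted(all_grants["cluster"].unique()):
    if cluster_id == -1:
        continue
    mask = (all_grants["cluster"] == cluster_id).to_numpy()
    mean_weights = np.asarray(tfidf_vectors[mask].mean(axis=0)).ravel()
    keywords = terms[mean_weights.argsort()[::-1][:5]]
    print(cluster_id, ", ".join(keywords))
```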

Conclusion

Different models had quite different predictive power, but with experimentation we found one which achieved 0.9 precision and recall. Using this model we predicted that 21% of the grants in the dataset were tech grants.

When we applied the model to grants that we knew had produced tech based on other sources (publications or self-reports), we were only able to recall a small portion of these grants. We know the model performs well when the grant description mentions technology, so in these cases it appears that the grant holders produced technology but never wrote about it in their grant description. We think this poor performance speaks to an abundance of ‘hidden tech’ in Wellcome’s portfolio: tools and scripts created by researchers during their grant that the grant holders didn’t feel the need to include in the main description.

A large number of tech grants come from the PhD Studentship (Basic) Award and Open Access Awards; combined, these account for 882 tech grants. Contrary to what was anticipated, tech grants were less commonly Technology Development Grants or Biomedical Resources Grants (combined, these account for 169 tech grants). Instead, the tech grants represented an extensive range of award schemes, reflecting the increasing reliance of science on data and tech across all research areas.

Going forward

Given the abundance of hidden tech across Wellcome’s portfolio, we’d like to support researchers better to allow them to share tech with each other and with Wellcome. As a funding body, we need to:

  • Demonstrate that tech and code are important outputs of research in their own right, not valued only through their association with published papers
  • Give credit to people who share their tech with us and other researchers (for example, those who follow open science principles)
  • Make it easy for people to tell us about their tech — some people were only too keen to give us examples (see this Twitter thread), but they may not have a mechanism for doing so.

Thanks

The grants data used for this analysis is openly accessible here. The project’s code can be found on GitHub.

This project was managed by Becky Knowles. The data science parts of this project were done by Nonie Alexander and Liz Gallagher. Special thanks to Antonio Campello and Nick Sorros for code reviews and their input to WellcomeML, which was used in this project. Thanks also to Aoife Spengeman for fairness discussions and help tagging the training data.
