Finding location entities in Wellcome grants

Arne Robben
Published in Wellcome Data
Jul 18, 2022 · 9 min read

This blog post was written jointly by Matt Upson at MantisNLP and Arne Robben at Wellcome.

Image: world map with pins. Credit: Z (https://unsplash.com/photos/TrhLCn1abMU).

The Wellcome Trust (Wellcome) is a charitable foundation with a focus on funding health research. Over time, Wellcome has accumulated a database of some 130 thousand grant applications from a wide range of projects, focusing on (amongst others) infectious disease, mental health and, more recently, climate change.

Each grant application is represented in the database as a record containing information such as whether the application was successful, the committed amount, the recipient, the organisation related to the recipient, and so on. Whilst the application form submitted by an applicant contains much useful categorical information (information that allows us to categorise a grant), a number of free-text fields are also collected, for instance the application title and project synopsis.

This blog post is about some experiments we conducted to extract useful information from these free-text fields using an approach called Named Entity Recognition (NER). NER can be done in a range of ways: from simply searching verbatim for words of interest (think country names or other geographical locations) to using sophisticated machine learning models that learn to recognise the entities of interest.

This post follows the project timeline: first a wide exploration of all types of entities we could extract, followed by a more focused exploration of extracting location related entities, with the goal of answering two questions:

  1. Are there location entities in free text fields that aren’t captured elsewhere? I.e. can the location information in our database be enhanced?
  2. How can we interpret the newly found entities?

Casting the net wide

Applying NER to Wellcome grant information had not been tried in the past. Since many types of entities can be extracted, we started off by extracting as many entities as possible in order for Wellcome stakeholders to determine where to focus the project. We used an off-the-shelf deep learning model from the spaCy framework to extract all the traditional named entities that are generally collected, such as PERSON, LOCATION, AMOUNT, etc. (see the addendum for a complete list).

We also used four specialised models for processing biomedical and scientific text (Sci spaCy). These models extract specific entities such as SIMPLE_CHEMICAL, CELLULAR_COMPONENT and IMMATERIAL_ANATOMICAL_ENTITY. Some of these entities are so specific to the biomedical field that it can be difficult for a lay person to decipher exactly what they are. For this we sought help from scientists at Wellcome and have included in the addendum a full list of all these entities and their meanings.
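As an illustration, here is a minimal sketch of how these entities can be pulled out of a piece of grant text. The en_core_web_md model is one of those mentioned later in this post; the Sci spaCy model name and the example sentence are stand-ins rather than our exact setup:

```python
import spacy

# General-purpose English model; the Sci spaCy model below (installed separately
# via the scispacy package) is one possible choice, not necessarily the one we used.
general_nlp = spacy.load("en_core_web_md")
bio_nlp = spacy.load("en_ner_bionlp13cg_md")

text = (
    "We will study cardiomyopathy and left ventricular systolic dysfunction "
    "in clinics across Kenya and Tanzania."
)

# Print every entity each model finds, with its label.
for nlp in (general_nlp, bio_nlp):
    for ent in nlp(text).ents:
        print(f"{ent.label_:<25} {ent.text}")
```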

To showcase what was possible, we built a small web-based application to allow stakeholders to see what kind of information they could extract from grants and from other arbitrary text.

Image: screenshot of our Named Entity Recognition tool

As you’ll see in the screenshot, since we used off-the-shelf models for the demo without training them specifically on Wellcome’s grants, they don’t always perform perfectly. Cardiomyopathy is indeed a DISEASE and Africa a LOCATION but Left Ventricular Systolic Dysfunction (LVSD) is not an ORGANISATION! Overall, we can see that the models have done a good job of detecting diseases and locations.
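The internals of that demo are beyond the scope of this post, but spaCy ships a visualiser, displacy, that produces the same kind of inline entity highlighting. A rough sketch, with purely illustrative text and port:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_md")
doc = nlp(
    "Cardiomyopathy prevalence is rising in Africa, particularly "
    "left ventricular systolic dysfunction (LVSD)."
)

# Serves a local HTML page with the detected entities highlighted inline.
displacy.serve(doc, style="ent", port=5000)
```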

Narrowing down our focus

After some conversations to narrow the scope of the project, we found that Wellcome stakeholders were particularly interested in entities that had to do with location. As you might expect, Wellcome already collects a large amount of location information about its grants; however, it gets complicated quite quickly. It’s easy to know where an organisation that is receiving funding is based, but very often that organisation will use the grant to conduct research elsewhere.

As an example, Wellcome has co-funded the World Mosquito Program (https://www.worldmosquitoprogram.org/). This program aims to reduce transmission of mosquito-borne diseases such as Dengue, Zika, Yellow Fever and Chikungunya by introducing Wolbachia bacteria into mosquitoes, which prevents the diseases’ spread. The grant is registered to Monash University, based in Australia. Both the organisation name and location are captured directly from the application form in Wellcome’s database.

However, the World Mosquito Program’s trial sites are situated in Brazil, Colombia, Sri Lanka, Fiji and other countries (reaching over 6M people). These locations are not directly captured in Wellcome’s database but are mentioned in the grant’s summary or synopsis. If we could extract these locations, it would help to answer the question of where the ultimate beneficiaries of the funding are, rather than the people and organisations who receive the funding.

Finding a country that is relevant to a grant but not already recorded in Wellcome’s database is thus a three-step process:

  • Extracting the locations with NER
  • Resolving the locations to countries
  • Comparing to existing Wellcome records

Extracting locations

Extracting locations is in many ways the easy part, as it is a very generic task. Many people have tackled this problem before and made their solutions available online for us to use. Usefully, we also didn’t need expert scientific knowledge to determine what is, and what is not, a location.

We experimented with a number of approaches for extracting location entities:

  • Creating a spaCy PhraseMatcher, which uses a simple list of countries that we curated from a list published by the World Bank (a sketch of this approach follows the list below). Note that this is not as easy as it may sound, as many countries are referred to by more than one name: for example Czechia and the Czech Republic.
  • spaCy’s standard approach to NER, using the en_core_web_md model, which is a medium-sized variant of spaCy’s English language model based on an artificial neural network (this is one of the same models we used before we decided to focus on locations).
  • An alternative, experimental approach using spaCy, called the span categorizer. This approach is similar to the standard spaCy approach, but consists of a function which suggests likely entities, and a model which selects the most promising (or none) from the suggestions.
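To give a flavour of the first (and simplest) option, here is a minimal sketch of a PhraseMatcher built from a hard-coded country list. The real list was curated from World Bank data and is considerably longer:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Stand-in list -- the curated list is much longer and includes
# alternative names such as "Czechia" and "Czech Republic".
COUNTRIES = ["Brazil", "Colombia", "Sri Lanka", "Fiji", "Czechia", "Czech Republic"]

nlp = spacy.load("en_core_web_md")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # match case-insensitively
matcher.add("COUNTRY", [nlp.make_doc(name) for name in COUNTRIES])

doc = nlp("Trial sites are situated in Brazil, Colombia, Sri Lanka and Fiji.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)
```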

Resolving locations to countries

Once a true location is found, it still needs to be resolved. In our case, we want to resolve a location back to a country: for example, if the entity “Paris” is found, we probably want to resolve this back to “Paris, France”. One needs to be careful though, as there is also a “Paris” in Denmark, Kiribati, Panama, Texas and Ontario.

There are a couple of strategies to solve this problem. More generally, this is known as entity resolution, and it is a common task when dealing with entities that do not have unique names: for example, locations and people.

One approach is to train a model to recognise the difference between entities with the same name based on the context where the entity was found. This approach is relatively time consuming, however, so in this project we opted to use a third-party geocoding service, OpenCage, to handle the resolution problem.

A geocoding service simply allows us to send the extracted location entity and receive back a best-guess resolution, including metadata about the location. This approach does not take into account the context of the word, so it would not know the difference between “Paris, a town in Kiribati” and “Paris, a city in France”: both instances of Paris will resolve to the most well known, which is Paris, France.
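For a rough idea of what that looks like in practice, the sketch below uses OpenCage’s Python client (the opencage package). The API key is a placeholder, and the exact response fields we relied on may differ from the ones shown here:

```python
from opencage.geocoder import OpenCageGeocode

geocoder = OpenCageGeocode("YOUR_API_KEY")  # placeholder key

# Ask for the single best-guess match for the extracted entity.
results = geocoder.geocode("Paris", limit=1, no_annotations=1)
if results:
    components = results[0]["components"]
    # Without any surrounding context, "Paris" resolves to Paris, France.
    print(components.get("country"), components.get("ISO_3166-1_alpha-3"))
```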

Comparison with existing records

Finally, once we have resolved locations to countries, they can be compared to the information already available in Wellcome’s database. This is the simplest step, because we can simply compare the three-letter ISO country codes (e.g. GBR for the United Kingdom) for the countries that we have found in the free text fields against those already recorded for a given grant.
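In code, this comparison is essentially a set difference. A minimal sketch, using pycountry (an assumption on our part; any country-code lookup would do) to map resolved country names to ISO alpha-3 codes:

```python
import pycountry

def to_iso3(country_name: str) -> str:
    """Map a country name to its three-letter ISO 3166-1 code."""
    return pycountry.countries.lookup(country_name).alpha_3

# Countries resolved from a grant's free-text fields...
extracted = {to_iso3(name) for name in ["Brazil", "Colombia", "Australia"]}
# ...compared against those already recorded for the grant in the database.
recorded = {"AUS"}

print(extracted - recorded)  # {'BRA', 'COL'} -- the newly found countries
```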

What we found

We set out at the start of this project with two questions:

  1. Are there location entities in free text fields that aren’t captured elsewhere? I.e. can the location information in our database be enhanced?
  2. How can we interpret the newly found entities?

Are there location entities in free text fields that aren’t captured elsewhere?

To answer this question, we manually annotated 651 grants using a tool called Prodigy. This entailed using an NER model to suggest entities, and then manually confirming, correcting, or rejecting them. This created a gold-standard ‘evaluation set’ against which we could compare the results of the workflow we built.

We evaluated all three of the NER approaches, but for this proof of concept we stuck with the simplest: the PhraseMatcher. Whilst this approach found fewer location entities than the others (low recall), it was much less likely to create false positives (high precision), which was preferable in this case.

From these 651 grants, the PhraseMatcher identified 1417 locations, of which 1047 corresponded with human annotations; the remainder were false positives, i.e. cases where the model identified a spurious location.

Of these 1047 locations, 891 were resolved to the correct country by the OpenCage geocoding API, and of these, 283 were unique occurrences (at the grant level). Once we removed the countries already associated with a grant in the Wellcome database (209), we were left with 74 new countries. As a result, we increased the available location data in the sample by 35%.
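Putting those figures together: the PhraseMatcher’s precision on this sample was roughly 1047 / 1417 ≈ 74%, the geocoding step resolved roughly 891 / 1047 ≈ 85% of the matched entities to the correct country, and the 74 new countries correspond to an increase of roughly 74 / 209 ≈ 35% over the location data already recorded for these grants.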

This is an impressive result, and demonstrates that this approach has potential to augment the location data that Wellcome collects. Additionally, whilst in this project we resolved entities to the country level, there is scope to resolve to finer-scale geographies such as region, city or town.

How can we interpret the newly found entities?

Now that we know more location entities can be captured, the question turns to how these newly extracted entities can be interpreted. As mentioned before, location can be difficult to interpret when it comes to grants, as Wellcome funding doesn’t always remain with the recipient institutions: it can flow through to clinical trial sites, other research locations, collaborators, etc.

To evaluate the meaning of the newly found entities, we manually assigned each of them to one of three categories.

  • “Not relevant”, which are locations that are correctly found but are not actually relevant to the grant. For example, a grant might state in its synopsis that the applicants are inspired by work in Tanzania and now want to do similar work in Kenya. Both Kenya and Tanzania will be picked up, but only Kenya is relevant. We found 24% of newly found locations to be not relevant.
  • “Related to applicant, organisation or institution related to applicant or collaborator”, which are locations that are traditionally captured well in our systems, as they refer to the more administrative locations of the grant. They often don’t reflect where the actual research occurs. We found 16% of newly found locations to fall into this category.
  • “Beneficiary location”, which are the locations that relate to where the research actually occurs, where clinical trials or field tests are set up. These are closest to the location of impact that we really would like to capture. We found that 59% of newly found locations fell into this category.

From the above we concluded that a large proportion of the newly found locations relate to beneficiary location, which is an exciting prospect. However, human moderation of the newly extracted entities is advisable at this stage. Given the importance of this information, this is something that could be considered in the future.

Next steps

To wrap this project up, we set up an automated pipeline that runs NER extraction periodically on all Wellcome grants. This pipeline, which currently runs in the development environment, is the first step towards implementing NER as a routine task on Wellcome’s grants, and making this information available to all users.

In this project we focused on using off-the-shelf models, and we only annotated a small amount of data, sufficient to create an evaluation set to measure performance. In a future iteration, we’d like to train our own models using annotated grants to improve the performance of the machine learning approaches. This will enable the pipeline to identify locations it has not seen before, and obviate the need to manually curate a list of entities we would like to recognise. Finally, we would like to explore whether we can automatically categorise which of the interpretations applies to the newly found locations, and how we can involve grant experts in this process.

Both approaches will help us to better answer the question: where is impact occurring for the grants we are funding?

Addendum

Below is a list of all the entities that can be extracted by spaCy NER, reproduced from https://spacy.io/models/en.

Sci spaCy extracts biomedical and scientific entities. The descriptions and examples below have been sourced from subject experts at Wellcome.
