Searching for People (in text)

Liz Gallagher
Published in Wellcome Data
Aug 18, 2020 · 6 min read

Part of Data Labs’ newly released product Reach is the ability to search for research being mentioned by global health organisations — you can read more about Reach here. For example, you might want to see all the policy documents which have something to do with malaria. You also might want to use the tool to search for the times a particular person, e.g. Chris Whitty, has been mentioned in policy documents.

St Bernard dogs effectively search for people in the Alps; we want to effectively search for people in policy text. Head of a St. Bernard dog. Reproduction of an etching by F. Lüdecke. Credit: Wellcome Collection. Public Domain Mark

We are now working towards optimising searching for people. In a simple approach to searching text we can just search for someone’s name and see if there is an exact match for it somewhere. If we do this for ‘Whitty’ we currently find 176 mentions, most of which are to do with a person called Whitty. However, there are cases where the name you are searching for is a bit more ambiguous — if we search for the name ‘Wood’ we find 4,136 mentions, but a large number of these are false positives (i.e. not discussing a person, but rather places like ‘Wood Green’ and ‘Old Wood Farm’). To solve this problem we use named entity recognition (NER) and entity linking.
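To make the problem concrete, here is a minimal sketch of what that naive exact-match approach looks like. The policy texts, the name searched for, and the context window are all placeholders, not the actual Reach pipeline:

```python
import re

def exact_match_mentions(documents, name):
    """Return a snippet of context around every verbatim occurrence of a name."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    mentions = []
    for text in documents:
        for match in pattern.finditer(text):
            start = max(0, match.start() - 60)
            end = match.end() + 60
            mentions.append(text[start:end])
    return mentions

# Placeholder policy texts: real Reach documents are far longer
policy_texts = [
    "Professor Chris Whitty advised on the malaria elimination strategy.",
    "The new clinic is close to Wood Green and Old Wood Farm.",
]
print(exact_match_mentions(policy_texts, "Wood"))  # picks up place names as false positives
```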

Finding people using named entity recognition

Luckily for us we can use a pre-trained named entity recognition (NER) model from spaCy to find people entities within the text of policy documents. spaCy’s model uses convolutional neural networks trained on a large bank of English-language text including telephone conversations, news articles and web blogs. Some of the things this model recognised as people’s names were ‘Prison Finder’, ‘Nine Elms’, ‘Bandstand Marathon’, and ‘Foodborne Diseases’. The reason it doesn’t perform too well is that the text the model was trained on wasn’t from the same domain as the policy document text we are applying it to. But all is not lost: we can tag some people’s names in policy text and give the model this data to retrain on.
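As a rough illustration, applying the pre-trained pipeline looks something like the sketch below. The example sentence is invented rather than taken from a real policy document:

```python
import spacy

# spaCy's small pre-trained English pipeline, trained on general news/web text
nlp = spacy.load("en_core_web_sm")

# An invented example sentence, not a real policy document excerpt
text = ("Dr Jeremy Farrar spoke about dengue research at an event "
        "held near Nine Elms during the Bandstand Marathon.")

doc = nlp(text)
people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
print(people)  # out-of-domain text means some non-people may be tagged as PERSON
```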

Creating training data — tagging names

We used a tool called Prodigy to tag all the names of people within chunks of samples of policy text. So far we have tagged 1,033 people in 316 chunks of policy text. This process involves being presented with a few sentences of text and highlighting all the people you think are in it; an example of this in action is shown below.

Tagging people in policy text using Prodigy
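For anyone curious, starting an annotation session like the one pictured looks roughly like the command below. The dataset name and input file are hypothetical, and the exact Prodigy recipe and settings we used may differ:

```
prodigy ner.manual policy_people en_core_web_sm policy_chunks.jsonl --label PERSON
```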

Named entity recognition results

SpaCy’s pretrained en_core_web_sm model achieves a performance metric (we use the F1 score, where a score of 1 would mean perfect predictions; you can read more about this metric here) of 0.58 for person entities on our tagged data, and retraining it with the extra data we tagged gives us a performance metric of 0.83 for person entities. These scores can be broken down by where in the document the person entity was found:

How well the model performs when the name is in different parts of a policy document. We see it does particularly well for names in a references section, but less well for names in citations within the text.
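For readers unfamiliar with the F1 score quoted above, here is a minimal sketch of how it can be computed for person entities when a prediction only counts as correct on an exact span match. The offsets are made up, and spaCy’s own evaluation is more involved than this:

```python
def person_f1(gold_spans, predicted_spans):
    """F1 for person entities, counting a prediction as correct only on an exact span match."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Character-offset spans for one chunk of text (made-up numbers)
gold = [(0, 14), (52, 60)]       # two true person mentions
pred = [(0, 14), (30, 34)]       # one correct, one spurious
print(person_f1(gold, pred))     # 0.5
```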

Two examples of finding people entities tagged in policy documents using our NER model are:

Here you can see an example of a person being correctly tagged as a person (‘Manisha Shridhar’), but an address in New Delhi (‘Mahatma Gandhi Marg’) being incorrectly tagged as a person (‘Marg’ being the Hindi and Punjabi word for ‘road’).

Here the citation to a reference by ‘Maitland’ is correctly identified as a person, but ‘Kenya’ has been misidentified as a person.

Disambiguating people’s names

The next (and slightly more ambitious) thing we can do is to link the people found with who they are; this is called entity linking or disambiguation. For example, searching for ‘Farrar’ gives us sentences which are to do with a few different people called ‘Farrar’ such as:

… Trends Microbiol. 2015;23(7):429–36. doi:10.1016/j.tim.2015.02.006.69. Farrar D, Duley L, Medley N, Lawlor DA. Different strategies for diagnosing gestational diabetes to improve maternal and infant health. Cochrane Database Syst Rev. 2015;(1):CD007122. Review. 70. Fioretti BT, Re…

and

… School of Nursing, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil Dr Jeremy Farrar, Director, Oxford University Clinical Research Unit in Viet Nam, The Hospital for Tropical Diseases, Ho Chi Minh City, Viet Nam Professor Maria Guzman, Head, Virology Department…

So how do we know that the first of these examples is about the Bradford-based maternal health researcher Diane Farrar, and the second is about the director of the Wellcome Trust, ex-director of the Oxford University Clinical Research Unit in Vietnam and infectious disease researcher, Jeremy Farrar?

To disambiguate them we need a ‘knowledge base’ — a collection of text about each person. For this we use the information publicly available on the website of ORCID (an organisation dedicated to providing unique identifiers for researchers). The way we’ve approached this so far is to download all the ORCID text available for certain names — in this example ‘Farrar’. Then we convert these texts to numbers using TF-IDF (term frequency — inverse document frequency) in order to be able to compare them quantitatively.

Linking people’s names to their ORCID

TF-IDF vectorisation gives weights for each of the words in the ORCID texts based on how frequently they turn up and how unique they are to the document, e.g. the word ‘the’ comes up a lot in Jeremy Farrar’s ORCID profile, but it also comes up a lot in every other ORCID profile, so this word is given a low weighting. The highest weighted words in Jeremy Farrar’s ORCID TF-IDF vector are ‘dengue’, ‘Vietnam’, ‘Vietnamese’, ‘influenza’, ‘meningitis’; and those in Diane Farrar’s ORCID are ‘diabetes’, ‘pregnancy’, ‘gestational’, ‘women’ and ‘Bradford’.
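A minimal sketch of this vectorisation step using scikit-learn is shown below; the profile texts are made-up stand-ins for the text collected from ORCID, not real profiles:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Toy stand-ins for the text collected from each ORCID profile
orcid_texts = {
    "jeremy-farrar": "dengue influenza meningitis Vietnam Vietnamese infectious disease research",
    "diane-farrar": "gestational diabetes pregnancy women Bradford maternal infant health",
    "someone-else": "soil nitrogen crops agriculture yield",
}

vectorizer = TfidfVectorizer(stop_words="english")
profile_matrix = vectorizer.fit_transform(orcid_texts.values())
terms = np.array(vectorizer.get_feature_names_out())

# Show the highest-weighted words for each profile
for orcid_id, row in zip(orcid_texts, profile_matrix.toarray()):
    top_terms = terms[row.argsort()[::-1][:5]]
    print(orcid_id, list(top_terms))
```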

We compute the vectorisation for the sentence containing the person we want to disambiguate, and then find the ORCID profile whose TF-IDF vector is closest to it using cosine similarity. If the similarity to even the closest profile falls below a threshold, we classify this person entity as not having an associated ORCID.
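Putting the pieces together, a simplified version of the linking step might look like the sketch below, again with toy profile texts and an arbitrary similarity threshold; the real implementation may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy ORCID profile texts, as in the sketch above
orcid_texts = {
    "jeremy-farrar": "dengue influenza meningitis Vietnam Vietnamese infectious disease",
    "diane-farrar": "gestational diabetes pregnancy women Bradford maternal infant health",
}
orcid_ids = list(orcid_texts)

vectorizer = TfidfVectorizer(stop_words="english")
profile_matrix = vectorizer.fit_transform(orcid_texts.values())

def link_mention(sentence, threshold=0.2):
    """Return the ORCID most similar to the sentence, or None if nothing is close enough."""
    sentence_vec = vectorizer.transform([sentence])
    similarities = cosine_similarity(sentence_vec, profile_matrix)[0]
    best = similarities.argmax()
    if similarities[best] < threshold:
        return None  # below threshold: treat as a person with no associated ORCID
    return orcid_ids[best]

sentence = ("Farrar D, Duley L. Different strategies for diagnosing gestational "
            "diabetes to improve maternal and infant health.")
print(link_mention(sentence))  # -> diane-farrar
```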

Entity linking results

As a proof of concept we looked at how well we could disambiguate people with the name ‘Farrar’. Of the 168 mentions of a ‘Farrar’ in the policy text, we found 81 were about Jeremy Farrar, 9 were about Diane Farrar, 1 was about John Farrar and 77 were about people who don’t have an ORCID. Our model achieved a performance metric of 0.62.

Next Steps

More training data — we could always do with more training data in order to make these models better. In particular it’d be good to have more training data from people found in parts of policy documents outside of references, since these currently perform less well.

We also want to test how well the entity linking does on people other than those with the name ‘Farrar’, so we need to expand this feature.

We need other sources for the knowledge base. Since not everyone has an ORCID, we could also try to get text from other sources; one option is Wikipedia.

We are still learning about how to improve this feature. If you have any thoughts or you are interested in testing out this feature, then please get in touch.
