Using deep learning to find references in policy documents

Liz Gallagher · Wellcome Data · Jun 25, 2020

[Figure: neurons derived from neural stem cells, the neural networks' namesake. Credit: Yirui Sun, CC BY 4.0]

This blog was written by Matt Upson from Mu Analytics whilst contracting with Wellcome Data Labs.

One of the key concerns for any organisation that does or funds research is: where does that research go, who does it reach, and what impact does it have? At Wellcome Data Labs, we're working on an open source tool called Reach which helps to answer this question by looking for references to academic publications in global policy documents, for example those produced by organisations such as the World Health Organisation, Médecins Sans Frontières, and the UK government.

A key capability of Reach is recognising academic references within this 'grey literature': policy documents that are not captured by traditional academic publishing metrics.

Finding references is disappointingly hard

The main source of difficulty is the great variety of typographic formats in which policy documents are published. Almost universally, these documents are distributed as PDFs, a format that is notoriously difficult to read automatically (there are some great examples of this from FilingDB). In addition, there is a huge variety in the possible ways to present bibliographic information: the Citation Style Language project records over 2,000 reference formats, presenting a challenge to software trying to extract references automatically. Moreover, some policy documents do not adhere to any known citation style at all; sometimes all you get is a web link.

All these factors come together to make reference extraction (finding a reference within a text) and reference parsing (identifying its individual components, like author, title, etc.) quite tricky. In the current implementation of Reach, we use a combination of deterministic and probabilistic methods to deal with this problem. The process looks like this (a sketch in code follows the list):

  1. A PDF is read by Reach, and a simple heuristic determines which pages contain reference sections
  2. The reference sections are passed to a splitter model which uses machine learning to determine where each reference starts and ends, returning a list of all the references found
  3. Each found reference is processed by another machine learning model which classifies whether a token (word) is an author, title, year, volume, etc.
  4. The title is then fuzzy-matched against a database of known publications to return a unique document ID (for example a PMCID or DOI)
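
To make the shape of this pipeline concrete, here is a minimal sketch in Python. The function names are hypothetical stand-ins for the Reach internals, and steps 2 and 3 are reduced to trivial placeholders where the real system uses machine learning models; only the fuzzy matching of step 4 is worked through, using the standard library's difflib:

    from difflib import SequenceMatcher

    def find_reference_pages(pages):
        # Step 1: a simple heuristic, keeping pages that mention a
        # references-style heading (Reach's real heuristic differs).
        headings = ("references", "bibliography", "works cited")
        return [p for p in pages if any(h in p.lower() for h in headings)]

    def split_references(section_text):
        # Step 2: placeholder for the ML splitter model; here we just
        # naively split on line breaks.
        return [ln.strip() for ln in section_text.splitlines() if ln.strip()]

    def parse_reference(reference):
        # Step 3: placeholder for the token classifier, which labels
        # each token as author, title, year, volume, etc.
        return {"title": reference, "authors": [], "year": None}

    def match_title(title, known_publications, threshold=0.8):
        # Step 4: fuzzy-match the parsed title against a database of
        # known publications (here a dict of document ID -> title) and
        # return the ID of the best match above a similarity threshold.
        best_id, best_score = None, 0.0
        for doc_id, known_title in known_publications.items():
            score = SequenceMatcher(
                None, title.lower(), known_title.lower()).ratio()
            if score > best_score:
                best_id, best_score = doc_id, score
        return best_id if best_score >= threshold else None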

Deep Learning

We wanted to see if we could improve on our relatively naive machine learning models (used in steps 2 and 3) with a deep learning approach that would perform better and be simpler to evaluate and maintain. Deep learning is a subset of machine learning that uses artificial neural networks (ANNs) to learn progressively more abstract representations of a problem, allowing a computer to build complex concepts out of simpler ones.

In their 2018 paper, Rodrigues et al. used a deep learning model called a bidirectional long short-term memory (BiLSTM) recurrent neural network (RNN) to solve a problem similar to ours, but in the arts and humanities. This type of model excels at sequence data like text documents. We planned to replicate their model, but trained specifically on medical references in policy documents.
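
For a sense of what such a model looks like, here is a minimal BiLSTM token tagger sketched in Keras. The vocabulary size, layer widths, and label set are placeholders, and the real architecture from Rodrigues et al. is richer (it uses character-level features as well, for instance); this shows only the core idea of predicting one label per token using context from both directions:

    import tensorflow as tf

    N_LABELS = 5  # e.g. author, title, year, volume, o (assumed label set)

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None,)),    # a sequence of token IDs
        tf.keras.layers.Embedding(20_000, 100),  # vocab size is a placeholder
        # One LSTM reads left to right, another right to left, so every
        # token's representation sees context from both directions.
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(100, return_sequences=True)),
        # A softmax over the label set, applied at every token position.
        tf.keras.layers.Dense(N_LABELS, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")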

No bricks without clay

With this new approach, we needed data from policy documents that had been labelled at the token (word) level. This means that every single token in the document needs to be manually annotated. For example, the reference:

Dambisya YM, Kadama P, Matinhure S, Malema N, Dulo C (2013). Literature review on codes of practice on recruitment of health professionals in global health diplomacy…

becomes (where 'o' marks a token that falls outside any labelled field):

(Dambisya, author) (YM, author) (,, author) (Kadama, author) (P, author) (,, author) (Matinhure, author) (S, author) (,, author) (Malema, author) (N, author) (,, author) (Dulo, author) (C, author) ((, o) (2013, year) (), o) (., o) (Literature, title) (review, title) (on, title) (codes, title) (of, title) (practice, title) (on, title) (recruitment, title) (of, title) (health, title) (professionals, title) (in, title) (global, title) (health, title) (diplomacy, title)

This annotation process is manual and laborious. We used a tool called Prodigy and developed a number of our own tools to make it significantly faster, but even so the annotation took a few weeks.
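
For illustration, launching a manual annotation session in Prodigy looks something like the command below. The dataset name, source file, and label set here are made up for the example, and the exact recipe and options we used may have differed:

    prodigy ner.manual policy_refs blank:en reference_sections.jsonl \
        --label AUTHOR,TITLE,YEAR,VOLUME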

Open source for the win

Rodrigues et al. (2018) published their model openly, giving us a great head start in building and training our own. Nonetheless, we needed to make quite a few changes to prepare the model for use in a production setting; details of these changes, and the results, can be found here. After a few attempts we were able to achieve good results on the individual reference splitting and parsing tasks using a combination of Rodrigues et al.'s data and our own. Like Rodrigues et al., we also combined our splitting and parsing models into a single model, which reduces the resource requirements for training and for using the model in Reach. This multi-task model also achieved good results.
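
The multi-task idea can be sketched as a single shared encoder with two token-level output heads: one deciding where references begin and end, the other labelling reference components. Again in Keras, with placeholder sizes and an assumed label set:

    import tensorflow as tf

    tokens = tf.keras.Input(shape=(None,))  # a sequence of token IDs
    x = tf.keras.layers.Embedding(20_000, 100)(tokens)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(100, return_sequences=True))(x)

    # Head 1: is this token the beginning, inside, or end of a reference,
    # or outside any reference? Used to split reference sections.
    split = tf.keras.layers.Dense(4, activation="softmax", name="split")(x)
    # Head 2: which reference component (author, title, year, ...) does
    # this token belong to? Used to parse each reference.
    parse = tf.keras.layers.Dense(5, activation="softmax", name="parse")(x)

    model = tf.keras.Model(tokens, [split, parse])
    model.compile(optimizer="adam",
                  loss={"split": "sparse_categorical_crossentropy",
                        "parse": "sparse_categorical_crossentropy"})

Sharing the encoder means one set of embeddings and one BiLSTM serve both tasks, which is what reduces the training and serving cost compared with running two separate models.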

These models are quite time-consuming to create: each training run takes between 11 and 16 hours on a high-performance machine in the cloud, at a cost of roughly $10 to $14 per run.

We have published these models openly on GitHub. If you have some basic knowledge of Python, you can get started very quickly with just two lines of code.
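
Those two lines look roughly like the following; the package name and command shown here are illustrative, so check the repository's README for the exact, current interface:

    pip install deep_reference_parser
    python -m deep_reference_parser split_parse "Dambisya YM, Kadama P, Matinhure S, Malema N, Dulo C (2013). Literature review on codes of practice on recruitment of health professionals in global health diplomacy"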

Next steps

It has taken a number of months to go from the idea of using deep learning to a workable implementation. Along the way we've learnt some valuable lessons: not least, we had to rethink the backend architecture of Reach to accommodate the more computationally intensive deep learning models, and to develop a way for data scientists to get access to high-powered machines for training them.

The obvious next step is to annotate more data and retrain the model as we expand Reach to take in more sources of policy documents, but for now the results are quite encouraging.

References

Rodrigues Alves, D., Colavizza, G., & Kaplan, F. (2018). Deep Reference Mining From Scholarly Literature in the Arts and Humanities. Frontiers in Research Metrics and Analytics, 3(July), 1–13. https://doi.org/10.3389/frma.2018.00021
