Open-source machine learning models for Medical Subject Headings

Antonio Campello
Wellcome Data
Jul 11, 2022


Last month we released open-source machine learning models that allow automated tagging of grant applications (as well as academic publications) with over 29,000 medical subjects. The models were co-developed by Wellcome together with the natural language processing experts at Mantis. In this post, we will explain the methodology for developing the models, their performance, how they were evaluated, and how data scientists can contribute to improving them.

Why did we build these models and why are they open?

Credit: Viktor Forcags (Unsplash https://unsplash.com/photos/LNwIJHUtE)

We built these models to enable Wellcome to better evaluate their scientific funding portfolio and to enable fine-grained search through grants and applications. Medical Subject Headings (MeSH) are a very rich hierarchical vocabulary, with descriptors ranging from general terms such as “Infections” and “Anatomy” to very specific ones such as “Sleep Disorders, Circadian Rhythm”.

We have released our models and code publicly because we believe they could be of interest to other organisations in the sector that want to automatically classify their portfolios. Manually tagging applications with such a rich vocabulary is not practical, even for small research portfolios, yet tagging can greatly improve the way organisations interrogate their data.

What are the models based on?

As with most systems that process natural language nowadays, models are rarely trained from scratch. In our case, we investigated two types of models:

  • General “cheap” models that perform well for a huge number of labels. A breakthrough in the field happened in 2020, when a research group from Amazon published the first draft of PECOS (Prediction for Enormous and Correlated Output Spaces) [4]. “Enormous” and “correlated” output spaces sounded appropriate to our problem. PECOS works by clustering the label space and recursively applying linear models to identify the correct cluster (see picture below). We will refer to the model built on this framework as XLinear; a minimal training sketch follows this list.
XLinear model representation
  • Models that performed well on similar benchmarks, such as the Biomedical Language Understanding and Reasoning Benchmark (BLURB) [3]. Most recent machine learning frameworks trained for such benchmarks are based on neural networks and allow for “transfer learning”, i.e., re-purposing for a slightly different task. We decided to follow the BertMeSH architecture proposed in a recent academic article [2] and fine-tune PubMedBERT, a large neural language model that performs well on BLURB. We will refer to this model as WellcomeBertMesh.
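For readers who want to experiment with the framework itself, here is a minimal sketch of training an XLinear-style model with the open-source pecos library (installable as libpecos). The toy data, parameters, and API calls follow the pecos README quickstart as we understand it, not our production pipeline; check the library's documentation for the current interface.

```python
# Minimal sketch of training a PECOS XLinear model on toy data.
# Assumptions: the `libpecos` package (import name `pecos`) and the
# quickstart API from its README; our real pipeline differs in detail.
import numpy as np
import scipy.sparse as sp
from pecos.xmc import Indexer, LabelEmbeddingFactory
from pecos.xmc.xlinear.model import XLinearModel

# Toy TF-IDF-like features (X) and multilabel targets (Y), both sparse.
X = sp.random(1000, 5000, density=0.01, format="csr", dtype=np.float32)
Y = sp.random(1000, 200, density=0.02, format="csr", dtype=np.float32)
Y.data[:] = 1.0  # binary relevance matrix

# 1. Build label embeddings (PIFA: aggregated features of positive instances).
label_feat = LabelEmbeddingFactory.create(Y, X, method="pifa")

# 2. Recursively cluster the label space into a hierarchy.
cluster_chain = Indexer.gen(label_feat)

# 3. Train linear rankers along the cluster chain.
model = XLinearModel.train(X, Y, C=cluster_chain)

# 4. Predict the top labels for new documents.
X_test = sp.random(5, 5000, density=0.01, format="csr", dtype=np.float32)
Y_pred = model.predict(X_test, only_topk=10)
print(Y_pred)  # sparse matrix of scores for the top-10 labels per document
```

Step 2 is exactly where the recursive clustering of the label space described above happens; everything downstream is plain linear models, which is what keeps the approach cheap.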

The WellcomeBertMesh model was trained using the excellent framework developed by Hugging Face (yes, that’s the real name of the company!) and released on their hub, allowing for frictionless usage with a couple of lines of code.
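As a hedged illustration of those couple of lines, the snippet below loads the published model from the hub. The model id Wellcome/WellcomeBertMesh and the return_labels argument are taken from the public model card; if the card has changed since, follow its current instructions.

```python
# Sketch of loading WellcomeBertMesh from the Hugging Face hub.
# The custom multilabel head ships with the model repository, hence
# trust_remote_code=True; return_labels=True is assumed from the model card.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Wellcome/WellcomeBertMesh")
model = AutoModel.from_pretrained(
    "Wellcome/WellcomeBertMesh", trust_remote_code=True
)

text = "This grant proposes to study circadian rhythm sleep disorders."
inputs = tokenizer([text], padding="max_length", return_tensors="pt")
labels = model(input_ids=inputs["input_ids"], return_labels=True)
print(labels)  # expected: a list of predicted MeSH descriptors per input
```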

Training and evaluation

The models use training data from BioASQ (http://bioasq.org), an annual challenge for biomedical semantic indexing. The training data consist of about 2.5 million academic publications from 2016 to 2019, whereas the evaluation data consist of 220,000 publications from 2020 to 2021.
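For context on the numbers below: multilabel taggers like these are typically scored with micro-averaged precision, recall, and F1 (the standard metrics in BioASQ-style evaluations, and an assumption on our part here). The sketch below shows how such a score is computed with scikit-learn; the toy matrices are illustrative, not our data.

```python
# Hedged sketch: micro-averaged F1 for multilabel tagging, the usual
# BioASQ-style metric (assumed, not stated, to be the one quoted below).
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy binary indicator matrices: rows are documents, columns are MeSH terms.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 1],
                   [1, 1, 0, 1]])

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
print(f"precision={p:.2f} recall={r:.2f} micro-F1={f1:.2f}")
```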

Performance difference between models

We have also evaluated the production system (based on XLinear) on grant data manually annotated by in-house domain experts. Interestingly, the model performs about 5% better on grants than on publications. There might be a couple of reasons for that: grant text is often simpler than publication text, our grants only cover a small subset of MeSH, and, for our application, the annotators are more forgiving of small errors.

We felt that the 6% performance advantage of WellcomeBertMesh did not justify deploying such a large model, so the current production model is XLinear.

How to get involved?

Despite many advances in the area, automatically tagging academic text with huge vocabularies is still an open problem. We believe this model can be relevant to searching through biomedical research in general.

Contributions to the Hugging Face model are welcome, in particular on three fronts:

  • For the XLinear model, investigate whether using the inherent MeSH hierarchy as “seeds” for the clustering step improves performance (a hypothetical starting point is sketched after this list).
  • For the neural model, the literature [2] suggests that about 70% is achievable on the benchmark; however, we were not able to reproduce this result. Any contributions in this direction are welcome.
  • Improve production-level inference time of WellcomeBertMesh so we can replace the linear model.
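On the first front, one hypothetical way to derive seed clusters is from MeSH tree numbers, whose dotted prefixes encode the hierarchy. Everything in the sketch below (the example tree numbers, the group_by_depth helper) is illustrative, and wiring the resulting groups into pecos's indexing step is the open question.

```python
# Hypothetical sketch: derive seed clusters from MeSH tree numbers.
# MeSH descriptors carry tree numbers such as "C10.281.800" whose dotted
# prefixes encode the hierarchy; grouping labels by a prefix of fixed depth
# gives a coarse clustering that could seed PECOS's indexing step.
from collections import defaultdict

# Illustrative descriptor -> tree number mapping (real data comes from the
# MeSH releases at https://www.nlm.nih.gov/mesh/).
tree_numbers = {
    "Sleep Disorders, Circadian Rhythm": "C10.281.800",
    "Infections": "C01",
    "Anatomy": "A01",
}

def group_by_depth(mapping, depth=1):
    """Group descriptors by the first `depth` components of the tree number."""
    clusters = defaultdict(list)
    for label, tree in mapping.items():
        prefix = ".".join(tree.split(".")[:depth])
        clusters[prefix].append(label)
    return dict(clusters)

print(group_by_depth(tree_numbers, depth=1))
# {'C10': ['Sleep Disorders, Circadian Rhythm'], 'C01': ['Infections'], 'A01': ['Anatomy']}
```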

And if you have any questions about the models, the code, or how to get started, don’t hesitate to get in touch directly or to raise a GitHub issue.

References

[1] — Nick Sorros, “Extreme Multilabel Classification in the NLP Domain”, presentation at PyData London 2022
[2] — BertMeSH model — https://pubmed.ncbi.nlm.nih.gov/32976559/
[3] — BLURB benchmark — https://microsoft.github.io/BLURB/
[4] — PECOS model — https://arxiv.org/pdf/2010.05878.pdf
