MeSH indexing for biomedical literature

Large scale multi-label classification using ExtremeText

Synapse Medicine
8 min read · Jul 30, 2019


In this post, we describe a method used at Synapse Medicine to automatically translate a natural language query into MeSH terms.

by Axelle Rondepierre and Julien Jouganous from Synapse Medicine

What is the purpose of the project?

As part of my master’s in Applied Mathematics and Data Science at INSA Toulouse, I joined Synapse Medicine for a 6-month internship. Synapse Medicine is a French startup whose purpose is to develop an AI platform that organizes drug information and makes it accessible. It provides physicians with an intelligent virtual assistant able to answer any question about drugs. It can also analyze a prescription from a picture and give information about contraindications. Synapse Medicine is working on multiple projects to meet the needs of physicians, pharmacists and patients.

One of these projects concerns the PubMed database. PubMed includes several million biomedical articles, each indexed by a controlled vocabulary called Medical Subject Headings (MeSH). This vocabulary gives detailed information about the themes covered in an article and allows for an optimal search of the database. However, this system has drawbacks: 1) each article is manually indexed by human professionals, which causes considerable indexing delays, 2) the MeSH vocabulary is updated annually, but already-indexed articles are not reviewed for updates, 3) a truly effective search requires knowledge of the MeSH vocabulary.

Given these limitations, we wondered whether it was possible to automatically translate a natural language query into MeSH terms. This would provide a new search tool for the platform, with the added benefit of automatically indexing new articles from PubMed.

What is the MeSH vocabulary?

The National Library of Medicine created the MeSH vocabulary and it has been in use since 1960. The 2019 MeSH thesaurus contains 29,351 terms called descriptors and is updated annually (modification, deletion, term migration). The MeSH terms refer to major topics covered in an article and allow for a more effective search in PubMed.

The MeSH terms are arranged in a tree structure, built from 16 categories.

Within each of these broad categories are further tree structures, containing terms that range from general to very specific. Each term can be represented by a number indicating its location in the tree: a letter for the main category followed by a series of numbers. However, a term can appear in several locations in the tree, which means that one term can have several tree numbers. To prevent confusion, each term also has a unique identifier.

1 MeSH term → 1 ID → 1 or more tree numbers
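To make this relationship concrete, here is a small sketch (the descriptor IDs, terms and tree numbers below are made up for illustration) showing how one ID can map to several tree numbers, and how a tree number encodes its own parent:

```python
# Illustrative only: these IDs, terms and tree numbers are fictitious,
# chosen to show the one-ID-to-many-tree-numbers relationship.
mesh = {
    "D000001": {"term": "Example Disease", "tree_numbers": ["C01.123", "C04.456.789"]},
    "D000002": {"term": "Example Organ", "tree_numbers": ["A02.100"]},
}

def parent_tree_number(tree_number):
    """Drop the last dotted segment to move one level up the tree;
    a top-level number like 'C01' has no parent."""
    head, sep, _ = tree_number.rpartition(".")
    return head if sep else None

print(parent_tree_number("C04.456.789"))  # C04.456
```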

Tree structure of a branch

Dataset and label preprocessing

The NLM provides a database of several million articles as XML files. Information is given for each article, such as title, abstract (if any), publication and review dates, MeSH ID and associated terms. For our needs, we decided to keep the title, abstract, MeSH ID and terms in a file of almost 9 million articles.
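A minimal sketch of that extraction step, using Python’s standard library on a trimmed-down record (the element names follow the MEDLINE citation format, but the real files contain many more fields):

```python
import xml.etree.ElementTree as ET

# Simplified sketch: extract title, abstract and MeSH headings from a
# PubMed-style XML record (heavily trimmed compared to the real DTD).
record = """
<PubmedArticle>
  <MedlineCitation>
    <Article>
      <ArticleTitle>Example title</ArticleTitle>
      <Abstract><AbstractText>Example abstract.</AbstractText></Abstract>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName UI="D000001">Example Term</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>
"""

def parse_record(xml_string):
    root = ET.fromstring(xml_string)
    title = root.findtext(".//ArticleTitle")
    abstract = root.findtext(".//AbstractText") or ""  # abstract may be absent
    mesh = [(d.get("UI"), d.text) for d in root.iter("DescriptorName")]
    return {"title": title, "abstract": abstract, "mesh": mesh}

print(parse_record(record)["title"])  # Example title
```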

Example of original dataset

Dataset distribution

Across almost 9 million articles, 25,940 different terms are referenced, with a very unequal distribution. The imbalance between the MeSH terms is due to:

  1. Under-representation of certain categories:
Distribution of the 30 most common terms for almost 9 million articles

2. Low frequency usage of certain MeSH terms:

MeSH term frequency for 9 million articles

Label preprocessing for the imbalance

1. To address the problem of the under-represented categories, we chose to remove the two most common categories (B01: Eukaryota, M01: Persons) and their child categories.

2. To address the problem of low MeSH term usage frequency, we chose to move up the tree to group some terms under the same, more general term. Each MeSH term appearing in fewer than 100 articles was replaced with its parent term in the tree. This was repeated until the term appeared in at least 100 articles or reached the top of its branch. As shown in the table below, several terms can be replaced by the same, more general term.
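The roll-up step described above can be sketched as follows (the counts and tree numbers are toy values, and a real implementation would recompute counts as terms get merged into their parents):

```python
# Sketch of the roll-up: a term used in fewer than 100 articles is
# replaced by its parent in the tree, repeatedly, until the replacement
# is frequent enough or we reach the top of the branch.
MIN_COUNT = 100

def parent(tree_number):
    head, sep, _ = tree_number.rpartition(".")
    return head if sep else None

def roll_up(tree_number, counts):
    """Walk up the tree until the term appears in >= MIN_COUNT articles.
    `counts` maps a tree number to its article count (toy values here;
    in practice counts would be updated as terms are merged)."""
    current = tree_number
    while counts.get(current, 0) < MIN_COUNT:
        up = parent(current)
        if up is None:      # already at the top of the branch
            break
        current = up
    return current

counts = {"C04": 5000, "C04.456": 300, "C04.456.789": 12}
print(roll_up("C04.456.789", counts))  # C04.456
```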

Example of new term generated after moving up in the tree

After these steps, the number of MeSH terms decreased significantly. Moreover, having less precise terms is acceptable: since terms are manually assigned, different indexers might choose terms of different specificity for the same article, and grouping terms mitigates this inconsistency.

Sample of the dataset after label preprocessing

Multi-label classification with ExtremeText

Our problem now is to assign a list of MeSH terms to a given query. We used a training dataset of 4,508,105 articles and 20,042 different terms.

We first tried to train a logistic regression using the One-vs-Rest method. This means one classifier is trained per label, i.e., 20,042 classifiers, which makes the computation time far too long (more than 15 hours for only 300,000 articles). Therefore, we looked into neural networks.

We then tried to train an LSTM network, the current gold standard for NLP tasks. Once again, the computation time was far too long (more than 12 hours for the entire training dataset) and the network appeared to saturate: from the very beginning of training, the loss was already very low. This may be explained by the sparsity of the output vector (more than 20,000 labels but only 11 per article on average): to achieve a low error rate, the model only has to predict 0 for all labels. The saturation could also have been due to an excessively small network, but with my equipment I could not train a bigger network in reasonable time.

Therefore, we had to find a more suitable solution for our problem: ExtremeText!

What is ExtremeText?

ExtremeText is an extension of FastText, an open-source library developed by Facebook for text classification and word embedding. ExtremeText is adapted for multi-label classification, even with a very large number of labels. It runs on CPUs only and has excellent computation-time performance, because training is parallelized across as many threads as are available.

ExtremeText architecture

ExtremeText is a neural network with a single hidden layer. Its architecture is made up of two parts: one for preprocessing and one for classification. It is a simple linear classifier. The preprocessing part transforms the input document into a vector by averaging its word representations. This vector is then fed to the hidden layer. Finally, the predicted labels are produced by the classification layer, a probabilistic label tree (PLT), which is suitable for a very large number of labels (a sigmoid layer can be used instead of the PLT when the number of labels is small enough).
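To make the architecture concrete, here is a toy forward pass in plain Python: word embeddings are averaged into a document vector, then a linear layer scores each label. A per-label sigmoid stands in for the PLT here, and the vocabulary, dimensions and weights are made up for illustration:

```python
import math
import random

# Toy sketch of the fastText/ExtremeText-style forward pass: average the
# word embeddings of a document, apply a linear layer, then a sigmoid per
# label (a stand-in for the probabilistic label tree used at scale).
random.seed(0)
DIM, LABELS = 4, 3
vocab = {"liver": 0, "cirrhosis": 1, "study": 2}
embeddings = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in vocab]
W = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(LABELS)]

def forward(tokens):
    vecs = [embeddings[vocab[t]] for t in tokens if t in vocab]
    avg = [sum(col) / len(vecs) for col in zip(*vecs)]             # document vector
    scores = [sum(w * x for w, x in zip(row, avg)) for row in W]   # linear layer
    return [1 / (1 + math.exp(-s)) for s in scores]                # sigmoid per label

probs = forward(["liver", "cirrhosis", "study"])
```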

How does ExtremeText work?

First of all, ExtremeText needs a special data format for training and testing. The data have to be in txt files where each line represents an article, in the following format:

__label__<X> __label__<Y> … <Text>

where X and Y are the labels for the article (text), as shown in the figure below.
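A tiny helper to produce such lines (the label names below are fictitious):

```python
# Turn (labels, text) pairs into the line format expected by
# fastText/ExtremeText: __label__<X> __label__<Y> ... <Text>
def to_line(labels, text):
    tags = " ".join(f"__label__{label}" for label in labels)
    return f"{tags} {text}"

line = to_line(["Liver_Cirrhosis", "Thalassemia"],
               "pathogenetic mechanisms in hepatic cirrhosis")
print(line)
# __label__Liver_Cirrhosis __label__Thalassemia pathogenetic mechanisms in hepatic cirrhosis
```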

The library provides two methods to evaluate the model performance: test and predict.

  1. test method: This method takes a txt file (formatted as described above) and the number k of labels to predict. It returns the number of examples, the precision and the recall. The precision is the proportion of predicted labels that are relevant: it measures the ability of the system to reject irrelevant labels. The recall is the proportion of relevant labels that are found: it measures the ability of the system to return all the relevant labels.

2. predict method: This method takes a string (or a list of strings), a number of labels to predict k and a threshold. It returns the k predicted labels whose probability exceeds the chosen threshold, together with the corresponding probabilities (or a list of such results when given a list of strings).
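As a reference for what these numbers mean, here is a plain-Python computation of precision and recall on toy predictions (this is not the library’s own code, just the definitions used above):

```python
# Precision = fraction of predicted labels that are correct;
# recall = fraction of true labels that were predicted.
def precision_recall(predicted, true):
    hits = len(set(predicted) & set(true))
    return hits / len(predicted), hits / len(true)

p, r = precision_recall(predicted=["A", "B"], true=["A", "C", "D"])
print(p, r)  # 0.5 0.3333333333333333
```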

Results

Once the data is in the expected format, we can play with the different parameters (learning rate, epochs, n-grams…). I used the test method to compare models trained with different parameter values. Between precision and recall, we are more interested in precision, because we want the predicted labels to be correct.

The table below shows the precision and recall for different models and for k=2 and k=5.

Performance comparison of ExtremeText when varying parameters

The model with preprocessed data (red line) serves as the baseline, because the model trained without preprocessing gives worse results and takes longer to train. The preprocessing consists of stemming words and removing stopwords and punctuation. In what follows, we compared models by varying one parameter at a time (always using preprocessed data).

  • For the word-vector size (increased from the default 100 to 300) and the arity, i.e., the maximum number of child nodes per tree node (increased from the default 2 to 10), the results showed a slight improvement in precision, but we kept the default values because the computation time increased significantly.
  • With n-grams, we eventually obtained good results: a precision of 0.83 for the last model in the table, using default values for all other parameters and 50 epochs.
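The text preprocessing mentioned above can be sketched as follows (the suffix stemmer here is a naive stand-in for a proper stemmer such as Porter’s, and the stopword list is a toy subset):

```python
import string

# Minimal preprocessing sketch: lowercase, strip punctuation, drop
# stopwords, then apply a naive suffix stemmer (for illustration only;
# a real pipeline would use a proper stemmer and stopword list).
STOPWORDS = {"the", "of", "in", "and", "a", "an", "for"}

def naive_stem(word):
    for suffix in ("ation", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(naive_stem(w) for w in text.split() if w not in STOPWORDS)

print(preprocess("Pathogenetic mechanisms in hepatic cirrhosis."))
# pathogenetic mechanism hepatic cirrhosi
```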

Let’s look at the precision and recall as the threshold varies from 0 to 1 for the last model, using 3-grams and trained for 50 epochs (because of the sparse output vector, the ROC curve is not a good metric here).

The figure above shows that a threshold around 0.4 produces good results, with a precision around 0.95 and a recall around 0.43. As explained above, having a good precision is more important for us because we want to be confident in our prediction.

Here’s a prediction obtained for a threshold of 0.4 on an instance for the PubMed test dataset:

title: ‘Pathogenetic mechanisms in hepatic cirrhosis of thalassemia major: light and electron microscopic studies.’

As expected with the chosen threshold, this example shows that the model does not predict all the expected labels, but the labels it does predict are correct! The fact that a simple linear classifier works so well on a multi-label classification problem suggests that we don’t yet know enough about textual data classification to build effective non-linear classifiers!

Synapse Medicine

We are on a mission to make reliable medical information easily accessible to patients and health professionals.