Embeddings for Medical Literature

A strong baseline model for semantic search and more

Published in NeuML on Oct 18, 2023

Semantic search is a new category of search built on recent advances in Natural Language Processing (NLP). Large-scale general language models have rapidly pushed the field forward in ways unimaginable only a few years ago.

Transformers models have enabled more robust text classification and text generation, and have even extended into other data modalities such as images and audio. Baseline Transformers models can compare text for similarity, but doing so is expensive at scale.

Sentence Transformers addresses this challenge with a paradigm that fine-tunes a model and pools the outputs into a single fixed-dimensional vector. These vectors can then be compared using similarity measures such as cosine similarity and dot product. There are a number of high-performing, generalized embeddings models available on the Hugging Face Hub.
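To make the pooling idea concrete, here is a minimal sketch in plain Python. It assumes mean pooling (one common choice) and uses tiny made-up token vectors rather than real model outputs. The helper names `mean_pool` and `cosine` are illustrative, not part of any library.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    dims = len(token_vectors[0])
    total = [0.0] * dims
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / count for value in total]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product of the two vectors divided by their norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy token-level outputs: 3 tokens x 2 dimensions, the last token is padding
tokens_a = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
tokens_b = [[0.0, 2.0], [0.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

vec_a = mean_pool(tokens_a, mask)
vec_b = mean_pool(tokens_b, mask)
print(vec_a, vec_b, cosine(vec_a, vec_b))
```

Because each text is reduced to one vector up front, comparing a query against a large collection only needs cheap vector math instead of a model forward pass per pair.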

While generalized models do a great job on a wide range of tasks, past work has shown that training domain-specific models can improve overall performance. This article introduces PubMedBERT Embeddings, a fine-tuned Sentence Transformers model trained using medical literature.

Training Data

The first step in building PubMedBERT Embeddings is creating the training dataset. The PubMed baseline dataset is publicly available and has a wide range of medical literature metadata.

paperetl supports processing these raw baseline files and storing the parsed content into a database. The following image illustrates how paperetl works.

After parsing the data with paperetl, the next step is to take the post-processed data and build a BM25 index over all titles. Then a random sample of articles is selected. For each sampled article, (title, abstract) and (title, similar title) pairs are created, with the similar title retrieved from the BM25 index.
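The pair-building step can be sketched as follows. This is a self-contained stand-in, not the pipeline used for the model: the article does not say which BM25 implementation or parameters were used, so the small `BM25` class, the `k1` and `b` values, the toy articles and the choice of the single top-scoring title are all assumptions.

```python
import math
import random
from collections import Counter

def tokenize(text):
    return text.lower().split()

class BM25:
    """Minimal Okapi BM25 over a small in-memory list of titles."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(doc)) for doc in docs]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avgdl = sum(self.lengths) / len(self.docs)
        # Document frequency: number of titles each term appears in
        df = Counter(term for doc in self.docs for term in doc)
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, index):
        doc, length = self.docs[index], self.lengths[index]
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf:
                norm = tf + self.k1 * (1 - self.b + self.b * length / self.avgdl)
                total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def most_similar(self, index, titles):
        """Return the best scoring title other than the query title itself."""
        scores = [(self.score(titles[index], i), i) for i in range(len(titles)) if i != index]
        return titles[max(scores)[1]]

# Hypothetical parsed articles standing in for the paperetl output
articles = [
    {"title": "Aspirin for prevention of heart attack", "abstract": "Abstract text A"},
    {"title": "Aspirin and heart attack risk in adults", "abstract": "Abstract text B"},
    {"title": "Sleep quality in night shift nurses", "abstract": "Abstract text C"},
    {"title": "Shift work and sleep disorders in nurses", "abstract": "Abstract text D"},
]
titles = [article["title"] for article in articles]
index = BM25(titles)

# Sample articles at random and build both kinds of pairs for each one
random.seed(0)
pairs = []
for i in random.sample(range(len(articles)), 2):
    pairs.append((titles[i], articles[i]["abstract"]))
    pairs.append((titles[i], index.most_similar(i, titles)))
```

At PubMed scale a dedicated search index would replace the in-memory scoring loop, but the shape of the output pairs stays the same.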

Model Training

Now that we have a training dataset, we can create a model. We use the standard training methods provided by Sentence Transformers. The following shows the parameters used.

from sentence_transformers import SentenceTransformer, losses, models
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader

# train and valid are lists of sentence_transformers.InputExample objects
# built from the text pairs described in the Training Data section

# Embeddings model
embeddings = models.Transformer(
  "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)

# Pooling model (mean pooling is the default mode)
pooling = models.Pooling(embeddings.get_word_embedding_dimension())

# Create sentence-transformers model
model = SentenceTransformer(modules=[embeddings, pooling])

# Training dataloader
dataloader = DataLoader(train, shuffle=True, batch_size=24)

# Training loss function
loss = losses.MultipleNegativesRankingLoss(model)

# Training evaluator
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(valid)

# Train model and save it to path
path = "pubmedbert-base-embeddings"
model.fit(
  train_objectives=[(dataloader, loss)],
  evaluator=evaluator,
  evaluation_steps=500,
  epochs=1,
  output_path=path
)
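To make the loss function in the training code more concrete, here is a rough plain-Python sketch of the idea behind MultipleNegativesRankingLoss: for each anchor text, the matching text in the batch is the correct answer and every other text in the batch acts as a negative. This is an illustration of the technique rather than the library's implementation, and the scale value of 20.0 and the toy vectors are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct "class" for anchor i is positive i."""
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, positive) for positive in positives]
        # Log-sum-exp with the max subtracted for numerical stability
        top = max(scores)
        log_total = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_total - scores[i])
    return sum(losses) / len(losses)

# Toy embeddings: a batch of two (anchor, positive) pairs
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor

print(multiple_negatives_ranking_loss(anchors, aligned))
print(multiple_negatives_ranking_loss(anchors, swapped))
```

The loss is near zero when each anchor is most similar to its own pair and grows when another text in the batch scores higher, which is why larger batches provide more negatives per step.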

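A portion of the dataset was held out as validation and test sets. Training uses only positive pairs, since MultipleNegativesRankingLoss treats the other pairs in each batch as negatives. The validation and test sets need explicit negatives, so random text pairs were matched and given a 0 label.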
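A minimal sketch of how such negative entries could be generated is shown below. The article does not describe the exact sampling procedure, so the `add_negatives` helper, the one-negative-per-positive ratio, the 1.0 label for true pairs and the example texts are all assumptions.

```python
import random

def add_negatives(positives, seed=42):
    """Label true pairs 1.0 and add one randomly matched pair labeled 0.0 for each."""
    rng = random.Random(seed)
    rows = [(a, b, 1.0) for a, b in positives]
    for i, (a, _) in enumerate(positives):
        # Take the second text from a different pair so the match is not a true one
        j = rng.choice([k for k in range(len(positives)) if k != i])
        rows.append((a, positives[j][1], 0.0))
    return rows

# Hypothetical held-out (title, abstract) pairs
held_out = [
    ("Title A", "Abstract A"),
    ("Title B", "Abstract B"),
    ("Title C", "Abstract C"),
]

rows = add_negatives(held_out)
for row in rows:
    print(row)
```

Rows in this (text, text, label) shape map directly onto the labeled examples an evaluator such as EmbeddingSimilarityEvaluator expects.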

The fine-tuned model reached a Pearson correlation coefficient of 96.16 on the test dataset (0.9616 when expressed on the usual -1 to 1 scale).

Evaluation

Performance of PubMedBERT Embeddings compared to the top base models on the MTEB leaderboard is shown below. A popular smaller model was also evaluated along with the most downloaded PubMed similarity model on the Hugging Face Hub.

The following datasets were used to evaluate model performance.

For PubMedQA, the pqa_labeled subset and the train split were used, with (question, long_answer) pairs.

For PubMed Subset, the test split was used, with (title, text) pairs.

For PubMed Summary, the pubmed subset and the validation split were used, with (article, abstract) pairs.

Evaluation results are shown below. The Pearson correlation coefficient is used as the evaluation metric.
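As a reminder of what the metric measures, here is a small sketch that computes the Pearson correlation between model similarity scores and gold labels. The scores and labels are made up for illustration, and multiplying by 100 is an assumption about how values such as 96.16 are reported.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical cosine similarities from a model and their gold labels
scores = [0.91, 0.85, 0.12, 0.30]
labels = [1.0, 1.0, 0.0, 0.0]

r = pearson(scores, labels)
print(round(r * 100, 2))
```

A model scores well on this metric when true pairs receive consistently higher similarity than randomly matched pairs.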

PubMedBERT Embeddings has the best overall performance on all the datasets tested.

General Text Embeddings (gte-base) is also a strong performer. This highlights the importance of testing models against your own data as gte-base isn’t the leading model on the MTEB leaderboard. Benchmark datasets are only a guide.

S-PubMedBert-MS-MARCO is another Sentence Transformers model, but it is fine-tuned with MS-MARCO rather than medical literature. The evaluation results are further evidence that fine-tuning with a domain-specific dataset often leads to better performance.

Wrapping up

This article introduced PubMedBERT Embeddings, a fine-tuned Sentence Transformers model trained with medical literature. It is a strong baseline model for domain-specific semantic search with medical literature.

The medical space is vast, and further fine-tuning on subdomains would likely lead to even better performance. Reach out to discuss how this can be done for medical and other domains.