McGill University, Facebook & Mila Release 14M Article NLP Pretraining Dataset for Medical Abbreviation Disambiguation

Synced · Published in SyncedReview · Nov 23, 2020

At the EMNLP 2020 Clinical NLP Workshop last week, a Montreal-based research team introduced a large medical text dataset designed to boost abbreviation disambiguation in the medical domain.

Nowhere is correct terminology more critical than in medicine and health care, where text mining and natural language processing power deep learning models for diagnosis prediction and other tasks. Unfortunately, research and clinical applications in this area have suffered from a lack of publicly available pretraining data due to privacy restrictions, and from a glut of non-standard abbreviations in the data that is available. Patient-safety organization the Institute for Safe Medication Practices earlier this year listed no fewer than 55,000 medical abbreviations that could “fail to communicate with any certainty their intended meaning and present possible dangers to the health of patients.”

The researchers, from McGill University, a Facebook CIFAR AI Chair and Mila (Quebec Artificial Intelligence Institute), introduced the Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) to sort out all those contradictory, ambiguous and potentially dangerous abbreviations.

Created from PubMed abstracts released in the 2019 annual baseline, MeDAL is a large dataset of medical texts curated for medical abbreviation disambiguation tasks that can be used to pretrain natural language understanding models. The dataset comprises 14,393,619 articles, with an average of three abbreviations per article. The researchers say pretraining on MeDAL improves both model performance and convergence speed when fine-tuning on downstream medical tasks.
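For readers who want to poke at the data, the public release ships as plain CSV files. The snippet below is a minimal sketch, assuming the files expose TEXT, LOCATION and LABEL columns (the token sequence, the index of the abbreviation, and its intended expansion); the file path is hypothetical, so check both against your download.

```python
# Minimal sketch: inspect one MeDAL-style example.
# Assumes CSV columns TEXT / LOCATION / LABEL (verify against the release).
import pandas as pd

df = pd.read_csv("medal/train.csv")  # hypothetical path to a downloaded split

row = df.iloc[0]
tokens = row["TEXT"].split()
idx = int(row["LOCATION"])  # token position of the abbreviation

print("abbreviation:", tokens[idx])
print("context:", " ".join(tokens[max(0, idx - 5):idx + 6]))
print("expansion (label):", row["LABEL"])
```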

While existing work on medical abbreviation disambiguation focuses on improving performance on the disambiguation task itself, the proposed approach treats abbreviation disambiguation as a pretraining task for transfer learning to other clinical tasks. Because existing medical abbreviation disambiguation datasets are very small compared with those used for general language model pretraining, the team built a dataset large enough for effective pretraining.
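One way to realize this idea, sketched below in PyTorch, is to encode each abstract, take the hidden state at the abbreviation’s position, and classify it over the vocabulary of candidate expansions. This is an illustrative sketch, not the authors’ exact architecture; the class name and all dimensions are assumptions.

```python
# Illustrative sketch (not the authors' exact architecture): encode the
# abstract, take the hidden state at the abbreviation's position, and
# classify it over the vocabulary of candidate expansions.
import torch
import torch.nn as nn

class DisambiguationPretrainer(nn.Module):
    def __init__(self, vocab_size, num_expansions, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True,
                               bidirectional=True)
        self.head = nn.Linear(2 * hid_dim, num_expansions)

    def forward(self, token_ids, abbrev_pos):
        # token_ids: (batch, seq_len); abbrev_pos: (batch,) abbreviation index
        states, _ = self.encoder(self.embed(token_ids))
        at_abbrev = states[torch.arange(states.size(0)), abbrev_pos]
        return self.head(at_abbrev)  # logits over candidate expansions

# Toy usage with random data, just to show the shapes involved.
model = DisambiguationPretrainer(vocab_size=30_000, num_expansions=20_000)
logits = model(torch.randint(0, 30_000, (8, 128)), torch.randint(0, 128, (8,)))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 20_000, (8,)))
```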

The team conducted evaluations on tasks such as mortality prediction and diagnosis prediction using LSTM, LSTM + self-attention and transformer models. On the mortality prediction task, all three pretrained models outperformed their from-scratch counterparts. On the diagnosis prediction task, the performance of both the LSTM and the LSTM + self-attention models increased by more than 70 percent.
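Continuing the hypothetical sketch above, the transfer step keeps the pretrained embedding and encoder weights, discards the expansion head, and attaches a fresh task-specific head, here a binary classifier for mortality prediction:

```python
# Sketch of the transfer step (hypothetical, builds on the class above):
# reuse the pretrained encoder, attach a fresh binary-classification head.
class MortalityClassifier(nn.Module):
    def __init__(self, pretrained: DisambiguationPretrainer, hid_dim=512):
        super().__init__()
        self.embed = pretrained.embed          # pretrained weights carried over
        self.encoder = pretrained.encoder
        self.head = nn.Linear(2 * hid_dim, 2)  # new, randomly initialized head

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))
        return self.head(states.mean(dim=1))   # mean-pool over the sequence

clf = MortalityClassifier(model)  # `model` from the pretraining sketch
logits = clf(torch.randint(0, 30_000, (8, 128)))
```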

The results suggest that pretraining on the MeDAL dataset can generally improve models’ language understanding capabilities in the medical domain.

The paper MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining is on ACL Anthology. The code is on GitHub. The MeDAL dataset is on Kaggle or Zenodo. The EMNLP 2020 (Empirical Methods in Natural Language Processing) website is here.

Synced Report | A Survey of China’s Artificial Intelligence Solutions in Response to the COVID-19 Pandemic — 87 Case Studies from 700+ AI Vendors

This report offers a look at how China has leveraged artificial intelligence technologies in the battle against COVID-19. It is also available on Amazon Kindle. Along with this report, we introduced a database covering an additional 1,428 artificial intelligence solutions across 12 pandemic scenarios.

Click here to find more reports from us.

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.


AI Technology & Industry Review — syncedreview.com | Newsletter: http://bit.ly/2IYL6Y2 | Share My Research http://bit.ly/2TrUPMI | Twitter: @Synced_Global