Apr 26, 2023

Easily Download PubMed Abstracts

An easy method to download all of the PubMed abstracts (plus some other info)

I work with these data quite a lot for NLP and found it somewhat difficult to scrape them directly from the PubMed website, so I created a script to help others avoid this issue. See my GitHub for the code.

Summary

Hugging Face has this dataset available for easy access under the name 'pubmed'. The issue is that it’s 360GB+ to download directly with:

from datasets import load_dataset
pubmed = load_dataset('pubmed')

So instead we’ll load it in streaming mode, which fetches entries on the fly as we iterate rather than downloading everything up front:

from datasets import load_dataset
pubmed = load_dataset('pubmed', streaming=True)
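
Before looping over the whole thing, it can help to look at just a few entries. Here is a minimal sketch: peek is a hypothetical helper name (not part of the datasets library) that takes the first n items from any iterable, which is how a streamed split behaves. Depending on your installed datasets version, script-based datasets like this one may also need extra arguments such as trust_remote_code=True, so treat the commented usage as an assumption.

```python
from itertools import islice


def peek(stream, n=1):
    """Return the first n items of an iterable as a list.

    islice stops pulling from the stream after n items, so a streamed
    dataset is not read any further than needed.
    """
    return list(islice(stream, n))


# Usage (assumes `pubmed` was loaded with streaming=True as above):
# for entry in peek(pubmed['train'], n=2):
#     print(entry['MedlineCitation']['PMID'])
```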

We can then iterate through each entry and pull out the fields we want to save:

for idx, entry in enumerate(pubmed['train']):
  # print(entry)  # uncomment to inspect the full record
  # each record nests everything under 'MedlineCitation'
  pmid = entry['MedlineCitation']['PMID']
  year = entry['MedlineCitation']['DateCompleted']['Year']
  abstract_text = entry['MedlineCitation']['Article']['Abstract']['AbstractText']
  abstract_title = entry['MedlineCitation']['Article']['ArticleTitle']
  # last names of the authors
  abstract_authors_list = entry['MedlineCitation']['Article']['AuthorList']['Author']['LastName']

Note that I’m only taking the PMID, year, abstract text, abstract title, and author last names. You can see what else is available by either visiting the Hugging Face dataset page or printing one of the entries in the dataset.
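
The loop above pulls the fields out but stops short of writing them anywhere. Here is a rough sketch of one way to save them, one JSON object per line. The names extract_fields and save_entries, the output file name, and the fallback to None or an empty list when a key is missing are all my assumptions for illustration, not necessarily what the script on GitHub does.

```python
import json
from itertools import islice


def extract_fields(entry):
    """Pull the five fields used in the loop above out of one entry.

    Assumption: missing or empty sections become None (or an empty
    list for authors) instead of raising KeyError.
    """
    citation = entry.get('MedlineCitation') or {}
    article = citation.get('Article') or {}
    authors = (article.get('AuthorList') or {}).get('Author') or {}
    return {
        'pmid': citation.get('PMID'),
        'year': (citation.get('DateCompleted') or {}).get('Year'),
        'title': article.get('ArticleTitle'),
        'abstract': (article.get('Abstract') or {}).get('AbstractText'),
        'authors': list(authors.get('LastName') or []),
    }


def save_entries(entries, path, limit=None):
    """Write entries to a JSON Lines file and return how many were written.

    limit=None writes everything; an integer stops after that many entries.
    """
    count = 0
    with open(path, 'w', encoding='utf-8') as f:
        for entry in islice(entries, limit):
            f.write(json.dumps(extract_fields(entry), ensure_ascii=False) + '\n')
            count += 1
    return count


# Usage (assumes `pubmed` was loaded with streaming=True as above):
# save_entries(pubmed['train'], 'pubmed_abstracts.jsonl', limit=1000)
```

Writing line by line keeps memory use flat, which matters when the full dataset is hundreds of gigabytes. Starting with a small limit is a cheap way to check the output before committing to a full pass.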

Written by Sagi Shaier