Scraping big data from public research repositories e.g. PubMed, arXiv (2/)

Kaicheng Liang
7 min read · May 4, 2019


This is part 2 of the series “A Twitter bot for keeping up with your academic research subfield”. You will need very basic proficiency in Python.

I wanted to try training a machine learning model that would evaluate a research paper’s relevance to biophotonics (my research subfield) simply by looking at the title. This is exactly what I would do when skimming an RSS feed or email alert from a journal — nobody has time to read the abstract, let alone the full text, for over a hundred papers that show up in my Feedly or inbox. (20–5–19 edit: we don’t have time to read the abstracts, but the bot does! I eventually added abstract data, which improved performance.) Before training a model, I needed data. As many examples of research paper titles as possible, and even better if I didn’t have to label them one by one.

Scraping from PubMed

PubMed is an amazing, gigantic source of metadata for mostly biomedical research papers. (Open-access papers are found in PubMed Central, a related database that is easy to confuse with PubMed.) In addition to a search engine, it also comes with a powerful set of ‘E-utilities’ serving as an API for researchers to freely download large amounts of data. Queries are made by HTTP requests, i.e. a URL that you type into your browser returns your requested data in a clean XML format. The E-utilities are NCBI’s own URL-based interface. (PubMed Central and arXiv also offer a separate bulk-harvesting interface based on the Open Archives Initiative Protocol for Metadata Harvesting v.2.0 (OAI-PMH 2.0), but that is not what is used here.)

The PubMed front page, with a powerful search function.

In Python you could use the requests library to make HTTP requests, or you could use the Biopython library, whose Entrez module is designed for querying the databases at the National Center for Biotechnology Information (NCBI), including PubMed and PubMed Central.
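To make the "URL in, XML out" idea concrete, here is a minimal sketch that builds an esearch URL using only the standard library. The helper name build_eutils_url is made up for illustration; the base URL and the db, term and retmax parameters are the standard E-utilities ones. Nothing is downloaded here: paste the printed URL into a browser to see the XML.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def build_eutils_url(utility, **params):
    """Return the URL for one E-utility call, e.g. utility='esearch'."""
    return EUTILS_BASE + utility + ".fcgi?" + urlencode(params)

# Hypothetical query: the first 20 PubMed records matching 'fever'.
esearch_url = build_eutils_url("esearch", db="pubmed", term="fever", retmax=20)
print(esearch_url)
```

The same helper would work for efetch by swapping the utility name and passing an id parameter instead of term.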

Some of the below was inspired by this excellent blog post by Marco Bonzanini, and his code gist here:

To run this gist, click through to GitHub (give it a star!) and read the comment thread — line 32 needs a fix.

The esearch method returns a list of unique IDs (UIDs) that are then fed back to the efetch method, which then delivers the requested metadata for each of the UIDs. esearch is a powerful tool with several parameters that you can learn more about here. For scraping large numbers of papers, simply set retmax to whatever you need — I’ve tried up to 15000 with no issues. You can also set the parameter retstart to return results starting from some index.
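As a rough illustration of what esearch hands back and how retstart pages through a large result set, the sketch below parses a hand-written sample reply with the standard library instead of Biopython. The sample XML follows the general shape of an esearch response, but the UIDs are invented, and parse_esearch and page_starts are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an esearch reply (the IDs are made up).
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>5</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""

def parse_esearch(xml_text):
    """Return (total number of matches, list of UIDs on this page)."""
    root = ET.fromstring(xml_text)
    total = int(root.findtext("Count"))
    uids = [node.text for node in root.findall("./IdList/Id")]
    return total, uids

def page_starts(total, page_size):
    """retstart values needed to walk through `total` results."""
    return list(range(0, total, page_size))

total, uids = parse_esearch(SAMPLE_ESEARCH_XML)
starts = page_starts(total, page_size=2)
print(total, uids, starts)
```

Each value in starts would be passed as retstart (with retmax set to the page size) in a separate esearch call.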

efetch is a little more finicky. The parser Entrez.read sometimes runs into errors when the XML returned by efetch is not in the format it expects (you can google ‘pubmed dtd’ for information on this, but I never really needed to figure this out). The parser also seems to dislike overly long XML input, say for thousands of documents. I recommend fetching in small chunks of 50 to 100 UIDs. For example, extending from the gist,

results = search('fever')  # or any query you like
id_list = results['IdList']  # list of UIDs
chunk_size = 50  # whatever you like
for chunk_i in range(0, len(id_list), chunk_size):
    chunk = id_list[chunk_i:chunk_i + chunk_size]
    try:
        papers = fetch_details(chunk)
        for i, paper in enumerate(papers['PubmedArticle']):
            do_something()  # perhaps write to a csv file
    except Exception:  # occasionally a chunk might annoy your parser
        pass
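The loop above silently drops any chunk that fails. A small variation, sketched here with a stand-in fetch function so that it runs offline, keeps the failed chunks so they can be retried later (for instance with a smaller chunk size). fetch_in_chunks and fake_fetch are hypothetical names; in practice you would pass in a small wrapper around fetch_details from the gist that returns the list under 'PubmedArticle'.

```python
def fetch_in_chunks(id_list, fetch, chunk_size=50):
    """Call fetch(chunk) on slices of id_list; return (records, failed_chunks)."""
    records, failed_chunks = [], []
    for start in range(0, len(id_list), chunk_size):
        chunk = id_list[start:start + chunk_size]
        try:
            records.extend(fetch(chunk))
        except Exception as err:  # remember the chunk instead of discarding it
            failed_chunks.append((chunk, repr(err)))
    return records, failed_chunks

def fake_fetch(chunk):
    """Stand-in for a real fetch: fails on any chunk containing 'bad'."""
    if "bad" in chunk:
        raise ValueError("parser choked")
    return [{"uid": uid} for uid in chunk]

records, failed_chunks = fetch_in_chunks(
    ["1", "2", "bad", "4", "5"], fake_fetch, chunk_size=2)
print(len(records), [chunk for chunk, _ in failed_chunks])
```

Note that one bad UID takes its whole chunk down with it, which is another argument for keeping chunks small.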

Scraping from arXiv

arXiv is another gigantic repository of articles, mostly in physics and computer science, where people post articles before they are peer-reviewed and published in a journal (known as ‘pre-prints’). This is fast becoming the preferred model for disseminating research results in a timely and open-access fashion. There is also a related pre-print server called bioRxiv, which caters to the life sciences crowd. arXiv’s API is also queried through plain HTTP URLs, so the process is very similar. There are also Python libraries available for scraping, such as here and here. I didn’t end up using a library for this, because running an HTTP request through Python seemed simple enough and I wanted to give it a try. The API user manual is great.

The arXiv front page.

One small difference from PubMed is that arXiv search doesn’t seem to have an option of returning UIDs. It simply returns the XML. What’s even better about arXiv is that the XML is in standard Atom 1.0 format, a clean and recognized format that can be interpreted by virtually all XML parsers. I ended up using the same parser that I use for parsing RSS feeds, the excellent feedparser library.

import feedparser

n_papers = 30000
chunk_size = 50
category = 'physics.optics'
for chunk_i in range(0, n_papers, chunk_size):
    feed = feedparser.parse('http://export.arxiv.org/api/query?search_query=cat:%s&start=%d&max_results=%d' % (category, chunk_i, chunk_size))
    for entry in feed.entries:
        title = entry.title.replace('\n', "")  # removes newlines
        do_something()  # perhaps write to a csv file
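Two small refinements are worth sketching, both of them assumptions rather than anything the original script did. First, building the query string with urlencode avoids hand-formatting mistakes. Second, a title that wraps across lines can carry indentation after the newline, so deleting '\n' alone may leave double spaces (or fuse two words when there is no indentation); collapsing every run of whitespace is safer. The names arxiv_query_url and clean_title are hypothetical, and the sample title is made up.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_query_url(category, start, max_results):
    """Build the query URL for one page of results in an arXiv category."""
    params = {"search_query": "cat:" + category,
              "start": start,
              "max_results": max_results}
    # safe=":" keeps the 'cat:' prefix readable instead of percent-encoding it
    return ARXIV_API + "?" + urlencode(params, safe=":")

def clean_title(raw_title):
    """Collapse newlines and repeated spaces into single spaces."""
    return " ".join(raw_title.split())

url = arxiv_query_url("physics.optics", start=0, max_results=50)
title = clean_title("Deep learning for\n  optical  microscopy")
print(url)
print(title)
```

When looping over many pages, a short time.sleep between requests is a polite addition, since the API manual asks heavy users to space out their calls.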

If you wandered into this article looking for help with scraping PubMed or arXiv, you now have all the information you need. If you’re following the series on my Twitter bot, read on!

I wanted to scrape article titles that could then be used to train a classifier. The classifier would label 3 classes (0: non-optics, 1: non-biomedical optics, and 2: biomedical optics). I also wanted to train my model on as much data as possible, likely tens of thousands of titles. Unsurprisingly, I was a lot less excited about the prospect of manually labeling tens of thousands of titles.

My strategy was to use PubMed’s Relevance search (the sort='relevance' parameter in esearch) to try and get titles automatically grouped. The 0 class would be just a spread of scientific/engineering articles. The 1 class would be optics articles, hopefully with a physics/engineering slant. And the 2 class would be biomedical optics. In my first attempt, I used the following queries in search:

groups = [['physics', 'biology', 'engineering'],
          ['physics optics', 'engineering optics'],
          ['biomedical optics', 'biophotonics']]
for label, group in enumerate(groups):  # group 0 queries labeled 0
    if label == 0:
        retmax = 4000
    else:
        retmax = 6000
    for query in group:
        search(query, retmax)  # rewrite function to take retmax
        ...
        save_titles_with_label(label)

I saved ~12,000 titles in each class. The eventual trained model turned out OK (~70% accuracy, see next article) and was deployed in the first version of my bot. However, a good reminder here is that you should always examine your data carefully before any model training. When I finally thought to look at the data I had scraped, I was annoyed to find that my class 1 data contained many biophotonics articles. This really should have been no surprise because the database is PubMed, where articles are largely funded by the National Institutes of Health and have a substantial biomedical slant.
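One cheap way to examine the data is to print a handful of random titles from each class and eyeball them before training. The sketch below assumes the scraped data is available as (title, label) pairs; sample_per_class is a hypothetical helper and the rows are placeholder data.

```python
import random

def sample_per_class(rows, k, seed=0):
    """Return {label: up to k randomly chosen titles} from (title, label) rows."""
    by_label = {}
    for title, label in rows:
        by_label.setdefault(label, []).append(title)
    rng = random.Random(seed)  # fixed seed so the spot check is repeatable
    return {label: rng.sample(titles, min(k, len(titles)))
            for label, titles in by_label.items()}

# Tiny made-up dataset standing in for the scraped titles.
rows = [("title a", 0), ("title b", 0), ("title c", 1),
        ("title d", 1), ("title e", 1), ("title f", 2)]
peek = sample_per_class(rows, k=2)
for label in sorted(peek):
    print(label, peek[label])
```

Even a few dozen titles per class would likely have been enough to reveal the biophotonics papers hiding in class 1.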

These noisy labels also meant that my accuracy estimates would be off, and some predictions were pretty dramatic failures. “High-efficiency femtosecond ablation of silicon with GHz repetition rate laser source”, which has nothing to do with biophotonics, was scored as 69% class 2, while “Quantifying cellular forces and biomechanical properties by correlative micropillar traction force and Brillouin microscopy”, pretty much quintessential biophotonics, was scored as only 22% class 2. I briefly flirted with the idea of ‘quickly’ going through my class 1 data and fixing the labels; I managed to check 200 titles in 15 minutes. At that rate (~5 seconds per title) it would take me 15 hours to get through all ~12,000 titles. No thanks.
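The 15-hour figure follows from simple arithmetic, assuming the ~12,000 titles per class mentioned earlier is the number that would need re-checking:

```python
titles_checked = 200
minutes_spent = 15
titles_to_check = 12000  # assumption: approximate size of one class

seconds_per_title = minutes_spent * 60 / titles_checked
hours_needed = titles_to_check * seconds_per_title / 3600
print(seconds_per_title, hours_needed)  # 4.5 15.0
```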

Then I realized that arXiv’s physics.optics category would be the perfect class 1 (non-biomedical optics) data. In recent years people have started to post biophotonics articles there too, but these are a tiny fraction of the papers. So for class 1 I pulled about 20,000 articles from physics.optics. For my class 2 data I used the PubMed queries ‘biomedical optics’, ‘biophotonics’, ‘biology optics’ and ‘biomedical microscopy’. The last query is a little vague, since many microscopy techniques are non-optical, but I wanted to add a stronger microscopy flavor (for the large Twitter audience in that niche) and more diversity from the ‘biophotonics’ results, which were dominated by papers in optical coherence tomography and photoacoustic tomography. I also did not account for repeated titles in these collated searches, mainly because I was already short on examples in class 2 and did not want to pare down too much.
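For anyone who does want to drop repeated titles, a minimal sketch (not part of the pipeline described here) could key each title on a case- and whitespace-normalized form and keep only the first occurrence. dedupe_titles is a hypothetical helper and the titles below are made up.

```python
def dedupe_titles(rows):
    """Keep the first (title, label) row per title, ignoring case and spacing."""
    seen = set()
    unique_rows = []
    for title, label in rows:
        key = " ".join(title.lower().split())
        if key not in seen:
            seen.add(key)
            unique_rows.append((title, label))
    return unique_rows

rows = [("Optical coherence tomography of the retina", 2),
        ("optical coherence  tomography of the retina", 2),
        ("Photoacoustic imaging in vivo", 2)]
unique_rows = dedupe_titles(rows)
print(len(rows), "->", len(unique_rows))
```

Comparing the row count before and after also gives a quick measure of how much the overlapping queries inflate the class.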

The data I eventually used for the model.

For each paper I found, I wrote a line of data to a csv file. Occasionally a few titles, particularly from arXiv, contain unicode characters, which cause problems when writing to csv and/or interpretation by humans/ML. There is a nice library, unidecode, which you can install with pip install unidecode; it converts unicode characters to a readable ASCII approximation.

import csv
from unidecode import unidecode
...
with open('data.csv', mode='w', newline='') as data_file:
    file_writer = csv.writer(data_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for paper in papers:
        # title and class_label come from the parsing steps elided above
        file_writer.writerow([unidecode(title), str(class_label)])
Code for this section is in scrape_from_pubmed_arxiv.py, soon to be posted on GitHub.
