Extracting Metadata From Research Articles: Metatagger Versus Grobid

Vu Ha
2 min read · Feb 1, 2015


Semantic Scholar (codename S2) is a new service being developed at the Allen Institute for Artificial Intelligence for scientific literature search and discovery, focusing on semantics and textual understanding. This search engine allows users to find key survey papers about a topic or to produce a list of important citations or results in a given paper. See this Quora answer for more information about how Semantic Scholar works.

One of the most important components in S2 is the extraction of key metadata, such as titles and author names, from research papers in PDF format. This is typically done using a machine learning technique called Conditional Random Fields (CRFs). The S2 team recently evaluated two freely available CRF-based tools: Metatagger and Grobid.
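To make the approach concrete, here is a minimal sketch of CRF-based header labeling: each token on a paper's first page is mapped to a feature dictionary, and the model predicts a label sequence marking title, author, and affiliation spans. The sketch uses the third-party sklearn-crfsuite package with toy features and toy data; the actual feature sets in Metatagger and Grobid are far richer, and neither tool is implemented this way.

```python
# Minimal CRF header-labeling sketch (illustrative only; uses the
# third-party sklearn-crfsuite package, not Metatagger or Grobid).
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple per-token features; real systems also use layout, fonts, lexicons."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title_case": tok.istitle(),
        "is_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "position": i,  # tokens near the top of the page tend to be title/authors
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Toy training data: one tokenized header with BIO-style labels.
tokens = ["Extracting", "Metadata", "From", "Research", "Articles",
          "Vu", "Ha", "Allen", "Institute", "for", "AI"]
labels = ["B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE", "I-TITLE",
          "B-AUTHOR", "I-AUTHOR", "B-AFFIL", "I-AFFIL", "I-AFFIL", "I-AFFIL"]

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])  # predicted label sequence for the header tokens
```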

Our evaluation dataset consists of about fifteen thousand research articles obtained from the Association for Computational Linguistics (ACL). We focused only on evaluating the accuracy of the extracted titles and author names, leaving the evaluation of other metadata fields, such as abstracts, venues, and references, to future work.

In our preliminary analysis, Metatagger is currently ahead of Grobid on author name accuracy (by 10 percentage points on exact match and 3 percentage points on edit distance) while trailing Grobid on title accuracy by 6 percentage points. We have identified one issue: Metatagger struggles to extract titles from certain papers whose first pages also contain the titles of the proceedings they appear in.
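This post does not spell out how the two accuracy measures are computed, so the sketch below is one plausible reading: exact match is strict string equality after whitespace and case normalization, and the edit-distance score for each pair is 1 minus the Levenshtein distance divided by the longer string's length, averaged over the corpus.

```python
# Hedged sketch of the two accuracy measures; the normalization and the
# 1 - distance/length scoring are assumptions, not the team's exact setup.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def normalize(s: str) -> str:
    return " ".join(s.split()).lower()

def exact_accuracy(pred, gold):
    return sum(normalize(p) == normalize(g) for p, g in zip(pred, gold)) / len(gold)

def edit_distance_accuracy(pred, gold):
    scores = []
    for p, g in zip(pred, gold):
        p, g = normalize(p), normalize(g)
        denom = max(len(p), len(g)) or 1  # guard against two empty strings
        scores.append(1 - levenshtein(p, g) / denom)
    return sum(scores) / len(scores)

pred = ["Extracting Metadata from Research Articles", "Metataggr vs Grobid"]
gold = ["Extracting Metadata From Research Articles", "Metatagger vs Grobid"]
print(exact_accuracy(pred, gold))          # 0.5: only the first pair matches
print(edit_distance_accuracy(pred, gold))  # ~0.975: one edit in the second pair
```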

For a detailed comparison, including breakdowns by dimensions such as venue and year of publication, follow this link on Tableau Public (alas, Medium currently does not support embedding Tableau visualizations).

