More about open citations — Citation Gecko, Citation extraction from PDF & LOC-DB

The Baader-Meinhof phenomenon describes a cognitive bias in which learning a new word makes one notice that word being used more often.

It might be the same with new ideas and concepts. Since I started learning about open citations, I have been seeing them everywhere.

For example, I recently learnt about Citation Gecko, a novel literature mapping tool that allows you to map out your research using “seed papers”.

The key here is that Citation Gecko is able to leverage citation links between seed papers and other papers to highlight possible papers of interest.

For instance, it can reveal papers that are cited by many of your seed papers, or conversely papers that cite many of your seed papers, among other tricks. These are the kinds of links that co-citation and bibliographic coupling analyses build on.
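To make the idea concrete, here is a minimal sketch of how such suggestions could be ranked from a plain list of citation links. It is not Citation Gecko's actual code; the function name `rank_candidates` and the paper identifiers are made up for illustration.

```python
from collections import Counter


def rank_candidates(citations, seeds):
    """Rank non-seed papers by how strongly they are linked to the seeds.

    citations: iterable of (citing_id, cited_id) pairs (hypothetical ids).
    seeds: collection of seed paper ids.
    Returns two lists of (paper_id, count), highest count first:
    papers cited by the seeds, and papers citing the seeds.
    """
    seeds = set(seeds)
    cited_by_seeds = Counter()  # how many seeds cite this paper
    citing_seeds = Counter()    # how many seeds this paper cites

    for citing, cited in set(citations):  # ignore duplicate links
        if citing in seeds and cited not in seeds:
            cited_by_seeds[cited] += 1
        elif cited in seeds and citing not in seeds:
            citing_seeds[citing] += 1

    def ordered(counter):
        # Sort by count (descending), then by id for a stable result.
        return sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))

    return ordered(cited_by_seeds), ordered(citing_seeds)


links = [
    ("S1", "A"), ("S2", "A"), ("S1", "B"),   # seeds citing other papers
    ("C", "S1"), ("C", "S2"), ("D", "S2"),   # other papers citing seeds
    ("S1", "S2"),                            # seed-to-seed link, ignored
]
cited, citing = rank_candidates(links, {"S1", "S2"})
print(cited)   # papers the seeds cite most
print(citing)  # papers that cite the most seeds
```

With real data the links would come from an open citation source rather than a hand-written list, but the counting step stays the same.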

What is Citation Gecko

But where is the data from?

An earlier tool I was familiar with, called whocites, did something similar but scraped the data from Google Scholar, which was very slow and triggered a lot of CAPTCHAs.

Citation Gecko is a lot faster and uses open citations from OpenCitations, Crossref and Microsoft Academic.

In particular, I suspect a large share of the citations come from Microsoft Academic, but in my testing it was pretty easy to hit that API's rate limits.

Project LOC-DB: Can libraries capture, process and release linked open citations?

I was recently chatting with @Zuphilip, a librarian in Mannheim, Germany, and he mentioned the LOC-DB project.

You can read the paper here, but my understanding is that it is a very interesting project by a few German libraries to scan print books and process electronic journals in areas such as the social sciences, extract the citations in those items, and process them into a “standardized, structured format that is fit for sharing and reuse in Linked Open Data contexts”.

In the long run, if the process is efficient enough and enough libraries take part, one could imagine a crowdsourced citation index along the lines of Web of Science emerging.

But obviously this isn’t an easy undertaking, and it requires embedding the process into the library workflow.

LOC-DB front end

First, a lot of work went into improving on the state of the art for automatic reference extraction from ingested full text. Their solution, if I understand it correctly, is to train a “layout driven” model that learns which sections of the page are likely to be references and which lines belong to the same reference (“reference segmentation”), before running OCR on them and passing the resulting strings to ParsCit for metadata extraction.

This improves on the state of the art, which is to use ParsCit directly to detect references in text.
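To illustrate what “reference segmentation” means, here is a toy sketch that groups the lines of a reference list into one string per reference. It uses a hand-written rule (a new reference starts at a `[1]` or `1.` style marker), which is only a stand-in I chose for illustration; the LOC-DB approach described above learns this from page layout instead. The sample lines are made up.

```python
import re

# Assumption for this sketch: references start with "[12]" or "12." markers.
START = re.compile(r"^\s*(\[\d+\]|\d+\.)\s+")


def segment_references(lines):
    """Group raw text lines into one string per reference."""
    refs = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines between references
        if START.match(line) or not refs:
            # A marker opens a new reference; drop the marker itself.
            refs.append(START.sub("", line).strip())
        else:
            # Anything else is treated as a continuation of the previous one.
            refs[-1] = refs[-1] + " " + text
    return refs


page = [
    "[1] Smith, J. (2010). A study of",
    "    citation networks. Journal X.",
    "",
    "[2] Doe, A. (2012). Another paper.",
]
for ref in segment_references(page):
    print(ref)
```

A rule like this breaks as soon as a reference list has no markers or a continuation line happens to start with a number, which is presumably part of why a trained model is attractive.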

The whole process is only semi-automated though, because you still need an editorial system for librarians to confirm and match the extracted citation strings to items from internal and external databases (e.g. Crossref, the OpenCitations SPARQL endpoint, Google Scholar etc.), as well as to occasionally correct metadata.

Editorial system for librarians to confirm citations and metadata with external and internal databases
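As a rough picture of the matching step, the sketch below scores candidate records against an extracted citation string and keeps the plausible ones for a person to confirm. The scoring rule (share of title words found in the citation), the 0.6 threshold, and the record fields `id` and `title` are all my own assumptions, not how the LOC-DB editorial system works.

```python
import re


def _tokens(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def suggest_matches(citation, candidates, threshold=0.6):
    """Return (record_id, score) pairs worth showing to a librarian.

    candidates: list of dicts with hypothetical 'id' and 'title' keys.
    score: fraction of the record's title words present in the citation.
    """
    citation_words = _tokens(citation)
    suggestions = []
    for record in candidates:
        title_words = _tokens(record["title"])
        if not title_words:
            continue  # nothing to compare against
        score = len(title_words & citation_words) / len(title_words)
        if score >= threshold:
            suggestions.append((record["id"], score))
    # Best suggestions first; ties broken by id for a stable order.
    return sorted(suggestions, key=lambda pair: (-pair[1], pair[0]))


extracted = "Smith, J. (2010). A study of citation networks. Journal X, 5(2), 1-10."
records = [
    {"id": "rec-1", "title": "A Study of Citation Networks"},
    {"id": "rec-2", "title": "Open citations in libraries"},
]
print(suggest_matches(extracted, records))
```

The function only suggests matches; the final decision stays with the person reviewing them, which is the semi-automated part.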

Again, I highly recommend you read the paper, as it goes into more detail on the nuts and bolts, with a final section focusing on how long the process takes and whether it is realistic and worthwhile to do this.

The upshot is that, at the current rate, Mannheim University Library would need between 6 and 12 people to process all the social sciences literature it bought in 2011.

Reference extraction for fun & profit — or Crowd-source citations?

I have recently been playing with Scholarcy, a Chrome extension that extracts and provides summaries of papers from PDFs. One of its more nifty features is that it can also extract references from the PDF (in a similar manner to Project LOC-DB) and export everything into a RIS or BibTeX file.

Export extracted references from PDF in RIS or BibTeX in Scholarcy
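For readers who have not looked inside a RIS file, here is a small sketch of what writing one extracted reference as a RIS record might look like. This is not Scholarcy's code; the input dictionary keys are hypothetical, and the sketch assumes every reference is a journal article (`TY  - JOUR`).

```python
def to_ris(ref):
    """Format one reference (a dict with hypothetical keys) as a RIS record."""
    lines = ["TY  - JOUR"]  # assumption: treat everything as a journal article
    for author in ref.get("authors", []):
        lines.append(f"AU  - {author}")
    if ref.get("title"):
        lines.append(f"TI  - {ref['title']}")
    if ref.get("journal"):
        lines.append(f"JO  - {ref['journal']}")
    if ref.get("year"):
        lines.append(f"PY  - {ref['year']}")
    if ref.get("doi"):
        lines.append(f"DO  - {ref['doi']}")
    lines.append("ER  - ")  # end-of-record tag
    return "\n".join(lines) + "\n"


# Made-up reference, purely for illustration.
print(to_ris({
    "authors": ["Smith, J."],
    "title": "A study of citation networks",
    "journal": "Journal X",
    "year": 2010,
    "doi": "10.1234/abc",
}))
```

Several records written one after another form a file that most reference managers can import.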

I can see such a feature being useful in its own right for researchers doing systematic reviews, who could quickly import the list of references from seed papers into their reference managers.

Citations imported from Scholarcy into Mendeley aren’t always perfect; improve them by doing Crossref lookups of the DOI

The extraction from Scholarcy isn’t always 100% correct, so, as with Project LOC-DB, you may need to verify the data. In Mendeley, if a DOI exists you can do a Crossref lookup.
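Outside a reference manager, a DOI lookup of this kind can be sketched against the public Crossref REST API. I am not claiming this is what Mendeley does internally; the helper names below are my own, and the sample payload is made up to mirror the general shape of a Crossref `works` response.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def crossref_url(doi):
    """Build the Crossref REST API URL for a single DOI."""
    return "https://api.crossref.org/works/" + quote(doi)


def fetch_work(doi):
    """Fetch the raw JSON record for a DOI (needs network access)."""
    with urlopen(crossref_url(doi), timeout=30) as response:
        return json.load(response)


def parse_work(payload):
    """Pull a few clean fields out of a Crossref 'works' response."""
    msg = payload["message"]
    authors = [
        ", ".join(part for part in (a.get("family"), a.get("given")) if part)
        for a in msg.get("author", [])
    ]
    date_parts = msg.get("issued", {}).get("date-parts") or [[None]]
    return {
        "doi": msg.get("DOI"),
        "title": (msg.get("title") or [None])[0],
        "authors": authors,
        "journal": (msg.get("container-title") or [None])[0],
        "year": date_parts[0][0] if date_parts[0] else None,
    }


# Made-up payload, so the example runs without a network call.
sample = {
    "message": {
        "DOI": "10.1234/abc",
        "title": ["A study of citation networks"],
        "author": [{"given": "J.", "family": "Smith"}],
        "container-title": ["Journal X"],
        "issued": {"date-parts": [[2010, 5]]},
    }
}
print(parse_work(sample))
```

In practice you would call `parse_work(fetch_work(doi))` and use the result to overwrite whatever the PDF extraction got wrong.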

I wonder if a reference manager that can ingest PDFs and parse their citations (e.g. Zotero with a plugin) could double as a way to crowdsource open citations, feeding them back into an open citations system like OpenCitations.

A wild idea would be a system with the features of Citation Gecko, except that instead of the references and papers coming only from Crossref etc., you could also add references from this PDF extraction system.

For instance, you might have a seed paper not known to the usual sources. No problem: you upload the references extracted from its PDF, associate them with records in Crossref etc., and you are in business (at least in terms of seeing what it cites on your citation map).