Extending the Global PID Graph with Non-Persistent IDs

PIDs Are Not Silver Bullets

Luc Boruta
Nov 4 · 3 min read
Figure: Simplified representation of URI transmutation in Cobaltmetrics. When presented with an identifier, our API automatically collates citations to all PIDs and URIs known to identify the same resource.

Nothing lasts forever on the web. Persistent identifiers—a.k.a. PIDs—are designed to slow down link rot for both scientific and non-scientific data. Guiding principles for scientific data management like the FAIR Principles or the Metadata 2020 Principles directly or indirectly advocate the use of PIDs.

We wholeheartedly support that, but persistence is purely a matter of service, and PIDs are not silver bullets. There are billions of research objects that will never be assigned a PID — e.g. works published before the advent of DOIs, and most of the works that fall under the grey literature label — and objects that were assigned PIDs are not necessarily cited with mentions of these PIDs.

PID Fixation

The web is not FAIR, and this is important in the context of scientometrics. Metrics are a sampling game: selection biases are an issue, and imbalanced datasets reinforce discrimination. Moreover, we think end-users cannot be expected to know whether a given identifier is persistent, or whether a given URL is canonical. Citations are also sometimes hidden behind indirection mechanisms like short links and proxy URLs, and different databases will use different identifiers for the same object.

With that in mind, can metrics based on data collection efforts that only or mostly track PIDs ever be inclusive and fair—the regular kind of fair—no matter how extensive their coverage of the scholarly web appears to be?

Extending the PID Graph

With Cobaltmetrics, we consider that tracking PID citations is not enough: other identifiers and hyperlinks are also valid citations, including but not limited to web pages (e.g. landing pages on publisher websites) and non-canonical identifiers like short URLs or proxy URLs from services like EZproxy or Sci-Hub. To capture them, we index every PID or URI mentioned in the sources that we monitor. Then, when users query our citation index, our URI transmutation API automatically collates citations to all PIDs and URIs known to identify the same resource, whether or not the cited resource was assigned a PID, and whether the citing resource used that PID or a non-persistent ID.
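A minimal sketch of that index-then-collate flow, under toy assumptions: the identifiers, the citing sources, and the function names `index_mention` and `collate` are all made up for illustration and are not the Cobaltmetrics API. The equivalence groups are simply given here; how they might be built is a separate question.

```python
from collections import defaultdict

# Toy citation index: cited identifier -> set of citing sources.
# Every identifier and source below is hypothetical.
citation_index = defaultdict(set)


def index_mention(citing_source: str, cited_identifier: str) -> None:
    """Record that a monitored source mentions an identifier (PID or URI)."""
    citation_index[cited_identifier].add(citing_source)


def collate(identifier: str, same_resource: dict) -> set:
    """Collect citations to `identifier` and to every identifier known
    to identify the same resource."""
    equivalents = same_resource.get(identifier, set()) | {identifier}
    citations = set()
    for equivalent in equivalents:
        # .get() avoids creating empty entries in the defaultdict.
        citations |= citation_index.get(equivalent, set())
    return citations


# Three sources cite the same work through three different identifiers.
index_mention("blog-post-1", "https://doi.org/10.1234/example")
index_mention("tweet-2", "https://publisher.example/articles/42")
index_mention("wiki-page-3", "https://short.example/abc")

# Assumed to be known already: these identifiers denote one resource.
group = {
    "https://doi.org/10.1234/example",
    "https://publisher.example/articles/42",
    "https://short.example/abc",
}
same_resource = {identifier: group for identifier in group}

print(sorted(collate("https://short.example/abc", same_resource)))
```

Querying with the short URL returns all three citing sources, even though only one of them used that URL. A PID-only index would have surfaced a single citation.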

In practical terms, the URI transmutation API combines PID-to-PID mappings, PID-to-URL resolvers, and (my favorites) URL-to-PID unresolvers. The resulting knowledge base is a very large but simple graph with a single relationship between nodes, namely "identifies the same resource as", which is similar to, yet less strictly defined than, owl:sameAs.
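One way to picture such a single-relationship graph is a union-find (disjoint-set) structure, sketched below. This is an illustration under stated assumptions, not a description of how Cobaltmetrics stores its graph: the class `SameResourceGraph`, the helper `unresolve_doi`, the DOI-like regular expression, and every identifier are hypothetical. Union-find also makes the relationship fully transitive, which a loosely defined "identifies the same resource as" may not be in practice.

```python
import re
from urllib.parse import urlparse

# Heuristic DOI-like pattern (an assumption; real DOIs can be messier).
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#]+")


class SameResourceGraph:
    """Union-find over identifiers, with a single relationship:
    'identifies the same resource as'."""

    def __init__(self):
        self.parent = {}

    def find(self, identifier):
        """Return the representative of the identifier's group."""
        self.parent.setdefault(identifier, identifier)
        root = identifier
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[identifier] != root:
            next_identifier = self.parent[identifier]
            self.parent[identifier] = root
            identifier = next_identifier
        return root

    def link(self, a, b):
        """Assert that a and b identify the same resource."""
        self.parent[self.find(a)] = self.find(b)

    def equivalents(self, identifier):
        """Return every identifier in the same group, including itself."""
        root = self.find(identifier)
        return {node for node in list(self.parent) if self.find(node) == root}


def unresolve_doi(url):
    """Toy URL-to-PID unresolver: recover a DOI embedded in a URL path."""
    match = DOI_PATTERN.search(urlparse(url).path)
    return f"doi:{match.group(0)}" if match else None


graph = SameResourceGraph()

# PID-to-PID mapping (hypothetical pair of identifiers).
graph.link("doi:10.1234/example", "pmid:00000000")

# PID-to-URL resolver (hypothetical landing page for that DOI).
graph.link("doi:10.1234/example", "https://publisher.example/articles/42")

# URL-to-PID unresolver: a proxy-style URL that still embeds the DOI.
url = "https://proxy.example/doi/full/10.1234/example"
pid = unresolve_doi(url)
if pid is not None:
    graph.link(url, pid)

print(sorted(graph.equivalents(url)))
```

Starting from the proxy-style URL, the lookup reaches the DOI, the other PID, and the landing page, because each kind of edge feeds the same graph.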

In that regard, our knowledge graph is a natural extension of other scholarly graphs like Research Graph or FREYA’s PID Graph. These graphs are extremely important for the future of research on research, but they focus on heavily curated (and thus, from our point of view, idealized) scholarly metadata. Cobaltmetrics adds an interface that reduces the friction between the PID-centric scholarly web and the web at large, which is merely regulated by the HTTP(S) protocol.

Initiatives like FREYA, PIDapalooza, or Research Graph advocate the adoption of PIDs for all scholarly entities and, again, we support that. However, as long as users copy-paste non-persistent IDs from the address bar of their browsers, and until PIDs become the default, our mission is to ensure that entities that were not blessed with PIDs can still be linked with the rest of the scholarly graph. Drop us a line if you want to co-organize the first URLapalooza!

Interested in learning more about Cobaltmetrics? Try it out, check out the public API, join our newsletter, and reach out at contact@thunken.com!

Thunken

Reflections on data, science, and data science.

Written by Luc Boruta

Chief nerd at Thunken, natural language processor, PhD in computational linguistics. I eat language models for breakfast. 🐐
