Web-Scale Citation Tracking: Can You Read the Signs?

Listen to That Little Voice in Your HEAD Requests

With Cobaltmetrics, Thunken is on a mission to make alternative bibliometrics genuinely alternative. We have discussed at length the lack of diversity in existing altmetrics aggregators: altmetrics are not alt enough. As Sugimoto et al. (2017) remark:

One of the critical issues is that these aggregators concentrate on documents that have a unique object identifier, which inevitably neglects certain document types […]. For example, Altmetric.com — arguably the most prominent altmetrics aggregator — focuses its data collection on DOIs, which has led to a de facto reduction of altmetrics studies to journal articles, excluding many types of documents and journals […] as well as most second‐order events, such as the discussion of an article in a blog post or newspaper articles […].

Cobaltmetrics fixes that, and we are just getting started. Earlier this week, we released an update to our URI transmutation API that leverages Signposting, an elegant approach to making the web more friendly to machines.

Better Any URI Today Than a FAIR Identifier Tomorrow

One of the guiding principles behind Cobaltmetrics is that it is not up to bibliometrics aggregators to decide what is citable. Our end-users might eventually apply different weights to different citation patterns when computing metrics, but our role is to observe all patterns on the web. It follows that we cannot merely track documents with persistent identifiers and permalinks. Moreover, even documents that were assigned persistent identifiers are not necessarily cited using these identifiers. The web is not FAIR—and will most likely never be—and that is just fine.

To produce a corpus that is diverse and inclusive, we track all URIs. This is the cornerstone of our approach: every hyperlink, every occurrence of a URI is a citation. Of course, with that approach, many citations in our corpus will never be relevant in a scholarly context, but that is not an issue. Cobaltmetrics is in no way restricted to the scholarly web, and we hope the corpus will be useful to other communities. Most importantly, it is always easier to filter out part of the observations than to estimate statistics for events that were never sampled.

Nothing Is Lost, Nothing Is Created, Everything Is Transformed

There are often many—more or less desirable—URIs that can be used to cite a given document: the landing page on the publisher’s website, the PDF on that same website, the bibliographic records in thematic or institutional repositories, various preprints on personal websites, etc. One of our biggest challenges is to discover URIs that directly or indirectly identify the same resource, so that citation counts and attention scores can be accurately tallied. We want our users to use the identifiers that they are most comfortable with, and then defer to us for the heavy lifting.

In Cobaltmetrics, we refer to that process as URI transmutation. We combine different sources to achieve optimal results, and earlier this week we started relying on some Signposting patterns. Signposting is a set of simple yet powerful ideas to make the web even more friendly to machines, which in turn makes citation tracking even more friendly to gray literature and documents at the frontiers of the scholarly web.
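To make the idea of tallying counts across equivalent URIs concrete, here is a minimal Python sketch. It is not how Cobaltmetrics is implemented: the union-find structure, the names UriGroups and consolidate, and the citation counts are all assumptions made for illustration. The two URIs are the Annals of Geophysics pair discussed in this post.

```python
from collections import defaultdict


class UriGroups:
    """Union-find over URI strings (illustrative, not the Cobaltmetrics internals)."""

    def __init__(self):
        self._parent = {}

    def find(self, uri):
        """Return the representative URI of the group that contains `uri`."""
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited URI straight at the root.
        while self._parent[uri] != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def link(self, preferred, other):
        """Record that both URIs identify the same resource."""
        root_a, root_b = self.find(preferred), self.find(other)
        if root_a != root_b:
            self._parent[root_b] = root_a


def consolidate(counts, equivalences):
    """Sum per-URI citation counts over groups of equivalent URIs."""
    groups = UriGroups()
    for preferred, other in equivalences:
        groups.link(preferred, other)
    totals = defaultdict(int)
    for uri, n in counts.items():
        totals[groups.find(uri)] += n
    return dict(totals)


DOI = "https://doi.org/10.4401/ag-7507"
LANDING = "https://www.annalsofgeophysics.eu/index.php/annals/article/view/7507"

# Made-up counts, for illustration only.
counts = {LANDING: 3, DOI: 2, "https://example.org/unrelated": 1}
print(consolidate(counts, [(DOI, LANDING)]))
```

Because equivalence is transitive, URIs that are only indirectly related (A to B, B to C) end up in the same group without any extra bookkeeping.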

Specifically, our URI transmutation API can now extract information from typed links provided in HTTP Link headers. A Link header points to a resource that is related to the requested resource, and the relation type of the link specifies how the two resources are related. The cite-as relation type, for instance, provides a canonical (and hopefully persistent) URI that should be used to cite a given document. The HTTP headers for https://www.annalsofgeophysics.eu/index.php/annals/article/view/7507 inform us that the preferred URI for citations is https://doi.org/10.4401/ag-7507. What this means in Cobaltmetrics is that either URI can be used to cite that publication on the web, and we will be able to consolidate the data.
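For readers who want to look at typed links themselves, the following Python sketch shows one way to read them using only the standard library. It is a simplified illustration, not our production parser: the function names are hypothetical, the regular expressions ignore edge cases such as a "<" inside a quoted parameter, and the sample header value was written by hand for this example rather than captured from a server.

```python
import re
import urllib.request

# One link-value: "<target>" followed by its parameters, up to the next "<".
_LINK = re.compile(r'<([^>]*)>([^<]*)')
# One parameter: ; name="quoted value"  or  ; name=bare-value
_PARAM = re.compile(r';\s*([\w*-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')


def parse_link_header(value):
    """Return (target_uri, relation_type) pairs found in a Link header value."""
    links = []
    for target, params in _LINK.findall(value):
        for name, quoted, bare in _PARAM.findall(params):
            if name.lower() == "rel":
                # A single rel parameter may list several space-separated types.
                for rel in (quoted or bare).split():
                    links.append((target, rel.lower()))
    return links


def fetch_link_header(uri, timeout=10):
    """Send a HEAD request and return all Link header values joined into one string."""
    request = urllib.request.Request(uri, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return ", ".join(response.headers.get_all("Link") or [])


# Hand-written sample value (an assumption, not an actual server response).
sample = (
    '<https://doi.org/10.4401/ag-7507>; rel="cite-as", '
    '<https://example.org/record.json>; rel="describedby"; type="application/json"'
)
for target, rel in parse_link_header(sample):
    print(rel, target)
```

A HEAD request is enough here because the typed links travel in the response headers, so the body of the document never needs to be downloaded.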

cite-as is not the only relation type that is relevant to URI transmutation. We currently use the following types, when available: alternate, bookmark, canonical, cite-as, duplicate, identifier, latest-version, memento, predecessor-version, self, successor-version, working-copy-of. For more information regarding relation types, see the registry.
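As a rough sketch of how such a list might be applied, the Python snippet below keeps only the link targets whose relation type appears in the list above. The function name and the input format, (target, relation type) pairs, are assumptions for illustration, and treating all twelve types identically is a simplification rather than a description of what the API does with each of them.

```python
# Relation types listed in this post.
TRANSMUTATION_RELS = frozenset({
    "alternate", "bookmark", "canonical", "cite-as", "duplicate",
    "identifier", "latest-version", "memento", "predecessor-version",
    "self", "successor-version", "working-copy-of",
})


def related_uris(requested_uri, typed_links):
    """Return distinct targets with a listed relation type, in first-seen order.

    `typed_links` is an iterable of (target_uri, relation_type) pairs.
    The requested URI itself is skipped, since it adds no new information.
    """
    seen = set()
    result = []
    for target, rel in typed_links:
        if rel.lower() not in TRANSMUTATION_RELS:
            continue
        if target == requested_uri or target in seen:
            continue
        seen.add(target)
        result.append(target)
    return result


requested = "https://www.annalsofgeophysics.eu/index.php/annals/article/view/7507"
links = [
    ("https://doi.org/10.4401/ag-7507", "cite-as"),
    ("https://example.org/style.css", "stylesheet"),  # not a listed type
    (requested, "self"),                              # same as the request
    ("https://doi.org/10.4401/ag-7507", "canonical"), # duplicate target
]
print(related_uris(requested, links))
```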

Reproducibility over Time

Because we aggregate data from many different sources, reproducibility is a challenge. Signposting in particular depends on live HTTP requests. We cannot guarantee that third-party servers will return the same headers over time for a given URI, and there is no reason to assume they will over long periods. Moreover, we are reluctant to cache responses and risk returning stale data. For that reason, Signposting-based transmutation is not enabled by default: it requires the X-Release: unstable header in your requests to the API. See our documentation for more information.
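Opting in might look like the Python sketch below, which only builds the request and does not send it. The X-Release: unstable header comes from this post; the endpoint URL and the uri query parameter are placeholders invented for the example, so please take the real values from our documentation.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint (an assumption): replace with the URL given in the API docs.
API_URL = "https://api.example.org/transmute"


def build_request(uri, unstable=False):
    """Build a GET request for the given URI, optionally opting in to unstable features."""
    # The "uri" parameter name is hypothetical.
    query = urllib.parse.urlencode({"uri": uri})
    request = urllib.request.Request(f"{API_URL}?{query}")
    if unstable:
        # Signposting-based transmutation is off unless this header is sent.
        request.add_header("X-Release", "unstable")
    return request


request = build_request(
    "https://www.annalsofgeophysics.eu/index.php/annals/article/view/7507",
    unstable=True,
)
# urllib stores header names capitalized as "X-release"; HTTP header names are
# case-insensitive, so the server sees the same header.
print(request.full_url)
print(request.header_items())
```

Keeping the opt-in on a per-request header means that default calls remain reproducible, while callers who accept time-dependent results can ask for them explicitly.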

What’s Next?

We are expanding our corpus to monitor even more websites. Please take our one-question survey and let us know which websites you would like to see included as a priority in Cobaltmetrics: archive sites, blogs, corporate sites, government sites, message boards, news websites, anything!

Interested in learning more about Cobaltmetrics? Try it out, check out the public API, join our newsletter, and reach out at contact@thunken.com!