Toward Privacy-Preserving Altmetrics Exploration with Cobaltmetrics and ORCID
Better Any URI Today Than a FAIR Identifier Tomorrow
You’ve Got Mail
Collecting altmetrics data involves cross-referencing disparate data sources to link, for example, research outputs to the corresponding contributors. Email addresses are used as a de facto standard to uniquely identify persons on the web, especially in sources that do not integrate with identifier systems like ORCID. While email addresses are inexpensive to obtain and maintain, they suffer from significant shortcomings when used as persistent contributor identifiers.
First, email addresses were not designed to be unique or persistent: domain name owners can reassign or retire email addresses, for example after a person leaves an organization. Second, email addresses are not only a mechanism to identify a person, but, more than anything else, a mechanism to directly contact (and potentially spam) that person. Moreover, because altmetrics aggregators strive to collect information about all researchers and all research outputs, mass collection and redistribution of email addresses through automated means can violate individual privacy.
Identifiers in Cobaltmetrics
Cobaltmetrics is powered by a knowledge graph that contains billions of identifiers linked by billions of properties. We combine many different sources to build the graph, generally in the form of linked metadata shared by publishers (e.g. Springer Nature’s Scigraph), trusted repositories (e.g. PubMed Central’s identifier mapping), and identifier registries like ORCID. The graph includes, for example, cliques of article identifiers that identify a given article, or groups of article identifiers linked to the corresponding authors.
There is, in a sense, a lot of redundancy in the resulting graph, as a given document or entity can be accessed by virtually any of the identifiers or URLs that were assigned to it. The rationale is that we want our users to search for the identifiers that they are most comfortable with, and then defer to us for the heavy lifting. For example, we know from Scigraph that doi:10.1038/nature17160 was published in issn:1476-4687, and we know from PubMed Central that doi:10.1038/nature17160 is also referenced as pmcid:4817241, so we can safely deduce that pmcid:4817241 was published in issn:1476-4687. Our users can start from the article’s DOI, its PMCID, its PMID, the URL of the article’s landing page on the publisher’s website, or even a short URL pointing to that landing page. To learn more about the knowledge graph, see our documentation on URI transmutation.
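A minimal sketch of this kind of deduction, assuming a toy union-find structure over identifier strings. The class and method names (`IdentifierGraph`, `same_as`, `add_property`, `get_properties`) and the `published_in` property name are hypothetical and chosen for illustration; they do not describe how Cobaltmetrics is actually implemented.

```python
from collections import defaultdict


class IdentifierGraph:
    """Toy union-find over identifier strings (hypothetical, for illustration)."""

    def __init__(self):
        self.parent = {}
        # Maps the representative of each group to its (property, value) pairs.
        self.properties = defaultdict(set)

    def find(self, identifier):
        """Return the representative of the group that contains `identifier`."""
        self.parent.setdefault(identifier, identifier)
        root = identifier
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[identifier] != root:
            self.parent[identifier], identifier = root, self.parent[identifier]
        return root

    def same_as(self, a, b):
        """Record that `a` and `b` identify the same thing."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a
            # Merge the properties of both groups under the new representative.
            self.properties[root_a] |= self.properties.pop(root_b, set())

    def add_property(self, identifier, name, value):
        self.properties[self.find(identifier)].add((name, value))

    def get_properties(self, identifier):
        return set(self.properties.get(self.find(identifier), set()))


graph = IdentifierGraph()
# Statement attributed to Scigraph in the text above.
graph.add_property("doi:10.1038/nature17160", "published_in", "issn:1476-4687")
# Mapping attributed to PubMed Central in the text above.
graph.same_as("doi:10.1038/nature17160", "pmcid:4817241")

print(graph.get_properties("pmcid:4817241"))
```

Because properties are attached to the group rather than to a single identifier, a query that starts from the PMCID sees the journal statement that was originally recorded against the DOI.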
Remixing ORCID’s Public Data
One of our preferred sources for contributor identifiers in Cobaltmetrics is ORCID’s Public Data File. The Public Data File is a periodic snapshot of all public data in the ORCID Registry. Thunken is not yet a member of ORCID, so we rely on the yearly releases; at the time of writing, the latest is the file released in October 2017.
Currently, we only remix a subset of the Public Data File into Cobaltmetrics. Namely, for each ORCID record, we extract email addresses as well as contributor identifiers from the following registries: Google Scholar, ISNI, ORCID, ResearcherID, and Scopus. Because we use the Public Data File, we know that every email address and identifier was marked as public by the corresponding users in their ORCID records. Additionally, we only add email addresses that were also verified by the users.
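The selection rules above can be sketched as a small filter. This is only an illustration under stated assumptions: the record layout (`orcid`, `external_ids`, `emails`, `visibility`, `verified`) is a simplified, hypothetical structure and not the actual schema of the Public Data File, and the sample values are placeholders.

```python
# External registries kept from each record, as listed in the text above.
KEPT_REGISTRIES = {"Google Scholar", "ISNI", "ResearcherID", "Scopus"}


def extract_identifiers(record):
    """Return the (type, value) pairs that would be kept for one record.

    `record` is a hypothetical, simplified dict, not the real ORCID schema.
    """
    identifiers = [("ORCID", record["orcid"])]
    for external in record.get("external_ids", []):
        if external["registry"] in KEPT_REGISTRIES:
            identifiers.append((external["registry"], external["value"]))
    for email in record.get("emails", []):
        # Everything in a public snapshot should already be public, so the
        # visibility check is only defensive; verification is the real filter.
        if email["visibility"] == "everyone" and email["verified"]:
            identifiers.append(("email", email["address"]))
    return identifiers


record = {
    "orcid": "0000-0000-0000-0000",  # placeholder, not a real ORCID iD
    "external_ids": [
        {"registry": "Scopus", "value": "1234567890"},
        {"registry": "Some Other Registry", "value": "abc"},
    ],
    "emails": [
        {"address": "jane@example.org", "visibility": "everyone", "verified": True},
        {"address": "jane@example.com", "visibility": "everyone", "verified": False},
    ],
}

clique = extract_identifiers(record)
print(clique)
```

In this sketch the unverified address and the identifier from the unlisted registry are dropped, while the ORCID identifier is always kept.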
The 2017 Public Data File contains 3,979,420 ORCID records. By extracting all aforementioned identifiers, we added a total of 4,725,354 identifiers to the knowledge graph that powers Cobaltmetrics. These identifiers are organized into 3,979,420 strongly connected components (in effect cliques, one per record), each made of one or more identifiers known to identify the same person. Not all types of identifiers are equally represented in the Public Data File, and we observe the following breakdown by type:
- 3,979,420 ORCID identifiers (84%, exactly one per record, no surprise here)
- 351,798 Scopus author identifiers (7%)
- 351,143 ResearcherIDs (7%)
- 41,976 email addresses (1%, from 8,686 unique domain names)
- 859 ISNIs (<1%)
- 158 Google Scholar author identifiers (<1%)
Regarding the number of identifiers per record, the distribution is heavily skewed, as most ORCID records are not yet linked with the corresponding records in other registries:
- 3,392,796 records (85%) include a single identifier
- 586,015 records (15%) include between 2 and 5 identifiers
- 609 records (<1%) include more than 5 identifiers
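For readers who want to check the arithmetic, the two breakdowns above can be recomputed from the raw counts. The snippet below is a quick sanity check; the variable and function names are illustrative only.

```python
# Counts copied from the two lists above.
identifiers_by_type = {
    "ORCID": 3_979_420,
    "Scopus": 351_798,
    "ResearcherID": 351_143,
    "email": 41_976,
    "ISNI": 859,
    "Google Scholar": 158,
}
records_by_size = {
    "1 identifier": 3_392_796,
    "2 to 5 identifiers": 586_015,
    "more than 5 identifiers": 609,
}


def shares(counts):
    """Return each count as a fraction of the total."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


total_identifiers = sum(identifiers_by_type.values())
total_records = sum(records_by_size.values())

# The per-type counts add up to the total number of identifiers, and the
# per-size counts add up to the number of records (one ORCID iD per record).
assert total_identifiers == 4_725_354
assert total_records == identifiers_by_type["ORCID"] == 3_979_420

for label, share in shares(identifiers_by_type).items():
    print(f"{label}: {share:.2%}")
for label, share in shares(records_by_size).items():
    print(f"{label}: {share:.2%}")

# Average number of identifiers per record.
print(f"{total_identifiers / total_records:.2f} identifiers per record")
```

Both lists are consistent with the totals quoted in the text: the identifier counts sum to 4,725,354 and the record counts sum to 3,979,420.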
Only 1% of the identifiers we collected are email addresses, but it is worth emphasizing that this does not mean that only 42,000 email addresses are present in the Registry. Each ORCID record must include at least one email address, and ORCID users can set the visibility of each of their email addresses to one of three settings: everyone, trusted parties, or only me. Again, because we used the Public Data File, Cobaltmetrics only includes the subset of email addresses that are both verified and visible to everyone.
Toward Privacy-Preserving Altmetrics Exploration
Supporting email addresses in Cobaltmetrics took a lot of consideration and planning to balance obvious privacy concerns against important upsides. We did not want to risk redistributing thousands of email addresses via our APIs, even if these addresses were marked as public in the ORCID Registry.
However, we also wanted our users to be able to query our service using email addresses, for two reasons. First, they are a de facto standard to identify authors in many documents, including older publications whose metadata will, unfortunately, most likely never be updated with ORCID identifiers. Second, and most importantly, we think email addresses are often easier to remember than numeric identifiers. See the Domain Name System, XKCD #936, and What3words for other examples of words being used as mnemonic devices. We believe that email addresses and other non-FAIR identifiers are acceptable entry points to explore knowledge graphs like the one in Cobaltmetrics, provided that users are presented with, and encouraged to use, the corresponding persistent identifiers.
We decided to use a simple strategy to harness both the power of ORCID identifiers and the usability of email addresses: Cobaltmetrics will never expose an email address unless it was explicitly entered into our search bar or sent to our APIs. In other words, for email addresses, we only disclose what the user already knows.
Let’s take an example. My own ORCID record now includes my email address. Once this information propagates to Cobaltmetrics via the next Public Data File, any user will be able to find my ORCID identifier starting from my email address, but they will not be able to find my email address starting from my ORCID identifier, whether they use the web application or the APIs.
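The asymmetry in this example can be sketched as a filter applied to lookup results. This is a simplified illustration of the rule "only disclose what the user already knows", not the actual Cobaltmetrics code: the `mailto:` and `orcid:` prefixes, the `lookup` function, and the sample identifiers are all assumptions made for the example.

```python
def is_email(identifier):
    """Treat identifiers with a (hypothetical) mailto: prefix as emails."""
    return identifier.startswith("mailto:")


def lookup(cliques, query):
    """Return the identifiers linked to `query`, hiding unknown emails.

    An email address is only included in the result if it is exactly the
    identifier that the caller supplied.
    """
    for clique in cliques:
        if query in clique:
            return sorted(
                identifier
                for identifier in clique
                if not is_email(identifier) or identifier == query
            )
    return []


# One group of identifiers known to identify the same (fictional) person.
cliques = [
    {"orcid:0000-0000-0000-0000", "scopus:1234567890", "mailto:jane@example.org"},
]

# Starting from the email address, the persistent identifiers are returned,
# along with the address that the caller already knows.
from_email = lookup(cliques, "mailto:jane@example.org")

# Starting from the ORCID identifier, the email address is withheld.
from_orcid = lookup(cliques, "orcid:0000-0000-0000-0000")

print(from_email)
print(from_orcid)
```

The design choice here is to keep email addresses in the graph so they can serve as entry points, and to enforce the privacy rule at the point where results are returned.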
We are currently working on contributor-level altmetrics aggregation in Cobaltmetrics. Our goal is to showcase what we know about any contributor from all the sources that we monitor. Of course, we also continue working on adding data from other registries and sources to make altmetrics exploration easier than ever. We keep privacy in mind as we design future features, and we make sure that email addresses can be used as bridges between documents without the addresses themselves ever being exposed to third parties.
If you want your data to be remixed into Cobaltmetrics or other applications, make sure to update the public data in your ORCID record! If the community shows interest in this feature, we will gladly consider working on a deeper integration with ORCID’s API to pull fresh data into our knowledge graph as often as possible.
Thanks to Gabriela Mejias, Alice Meadows, and Laurel Haak for their comments on an early draft. For more examples of projects that remix ORCID data, see Free for Everyone, Always: The ORCID Public API and Data File.