Understanding the implications of Open Citations — how far along are we?

Published in Academic librarians and open access, Apr 30, 2018

This post was substantially revised on 26th May 2018 with valued advice from David Shotton, Director, OpenCitations, though responsibility for any misstatements that may still exist is solely mine. I would like to thank him for his patience in clarifying misconceptions I had, and I apologise for getting the name of his organization, OpenCitations, mixed up and misrepresenting some of the important work the organization is doing.

The academic discovery space seems to be buzzing again. This space has become relatively stable after the introduction and maturity of Web Scale Discovery between 2009–2013, but things seem to be hotting up once again.

With the recent interest in integrating discovery of open access, as well as linked data (with a dash of machine learning and text mining), we have the beginnings of an interesting situation. A third development, which was harder to foresee, is the rise of the open citation movement, which I will focus on in this post.

How did this movement begin? How does the size of the pool of open citations compare with gold standards like Scopus and Web of Science? Who are the players that use it (e.g. Digital Science’s Dimensions) and how might it develop in the future?

Introduction

In recent years, we have seen the launch of many innovative discovery search engines, such as Yewno, Semantic Scholar, Meta, Open Knowledge Maps.

But in terms of citation indexes, we have seen the official launch of 2 new comprehensive citation indexes, Microsoft’s Academic and Digital Science’s Dimensions, taking their places alongside the big 3, namely Web of Science, Scopus and Google Scholar.

Microsoft’s Academic was in beta for 2 years before finally launching officially in late 2017. I covered the preview version in a blog post in May 2017. An interesting thing to note is that unlike its more famous rival Google Scholar, Microsoft’s data is somewhat open and is available via the Academic Knowledge API.

But let’s focus on the latest one to join the fray: Digital Science’s Dimensions.

Digital Science’s Dimensions

I’ll do a full review in a later post, but the bit I want to focus on is that Dimensions uses open citations and other metadata from Crossref (which the Initiative for Open Citations (I4OC) has lobbied publishers to make open) as well as from other sources like ORCID, oaDOI and GRID.

Thanks to @i4oc_org for making things like @DSDimensions possible! Really exciting to be involved with such a fantastic initiative #highered
— Dimensions (@DSDimensions) January 19, 2018

Open citations? What manner of beast is that?

If you haven’t been keeping track of this development, this post is for you.

Open citations

While many are aware of the push for Open Access, I suspect fewer are aware of the push for open citations. This is a call put out by The Initiative for Open Citations (I4OC), to make citations open.

But let’s take a step back.

For a long time, the only way to get citation data was via paid citation indexes — either via Clarivate’s Web of Science or Elsevier’s Scopus.

But this changed fairly recently. Firstly, PMC allows one to extract open citations (with some work) but that obviously covers only the life sciences. But what about the other subjects? Where does one get a more complete set of open citations?

Open citations in Crossref

You might not know this, but a lot of citations to journal articles and book chapters are actually open and available in Crossref.

When publishers submit their article metadata (e.g. title, author, journal) to Crossref for DOI registration many of them (around 1/3 of publishers including most of the big ones) also choose to submit references of articles in their journals.

Why would publishers do that? Because doing so gives them access to Crossref’s cited-by service for publishers, which has been in operation since 2007.

So what does it do? It is a service run by Crossref to help publishers check which items are citing their articles.

Publishers who are allowed to use the cited-by service can use an API to retrieve the information needed to display cites on papers on their website.

They can see not just the total number of cites from other items in Crossref but also the actual references. Do note that by default, they can only see cites to their own items and not cites to other publishers’ items.

This cited-by service is offered free to Crossref publishers by Crossref but there is one obligation.

To use this service, besides depositing the usual article metadata (title, author etc) into Crossref, they will also need to deposit the references of the articles.

https://www.youtube.com/watch?v=31u6Iz_ENC8

Do note that a publisher choosing to deposit references into Crossref isn’t sufficient to make them open to everyone. It merely gives that publisher access to the cited-by service.

While anyone can access the citation counts via the usual Crossref API, the citations themselves need to be explicitly made open by the publisher depositing the references.

The reference distribution policy by Crossref dated Jan 2018 allows publishers to set their references to one of the following levels.

Open: anyone can access the citations via the standard Crossref API.
Limited: only accessible via the new paid Crossref Metadata API plus.
Closed: not usable by anyone else. Used only in the cited-by service, i.e. only the publisher of the item that was cited will see the citation.
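To make the difference between "counts" and "the citations themselves" concrete, here is a minimal Python sketch of how one might inspect a single record from the public Crossref REST API. The field names (is-referenced-by-count, reference-count, reference) are as I understand the API's JSON output, and the helper names are my own invention for illustration. The assumption is that a record whose publisher has not made references open simply comes back without a reference list, while the counts are still present.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def work_url(doi: str) -> str:
    """Build the Crossref REST API URL for a single DOI."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def summarise_references(record: dict) -> dict:
    """Summarise the citation-related fields of one parsed Crossref response."""
    message = record.get("message", {})
    # Assumption: the "reference" list is only present when references are open.
    references = message.get("reference", [])
    return {
        "cited_by_count": message.get("is-referenced-by-count", 0),
        "reference_count": message.get("reference-count", 0),
        "references_visible": bool(references),
        # Not every deposited reference has been matched to a DOI.
        "cited_dois": [ref["DOI"] for ref in references if "DOI" in ref],
    }


def fetch_summary(doi: str) -> dict:
    """Fetch one work from Crossref and summarise it (needs network access)."""
    with urlopen(work_url(doi)) as response:
        return summarise_references(json.load(response))
```

Calling fetch_summary with the DOI of an article you care about should show the cited-by count in either case, but a non-empty cited_dois list only when the publisher has chosen the open level.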

To see which publishers have submitted references and the distribution status refer to the following list maintained by Crossref.

It is this subset of references, deposited by publishers and made open, that makes up the bulk of the open citations used by Dimensions and other consumers of open citations.

Impact of I4OC on open citations

But where does the Initiative for Open Citations (I4OC) come in?

It is one of the major goals of I4OC (of which OpenCitations (OC) which we will mention later is a founding member) to encourage more publishers to make the references they deposit with Crossref open.

I4OC has achieved great success in encouraging publishers to make the references they submit to Crossref open. As of Jan 2018, publishers have made open the references of “more than 50% out of 38 million articles with references deposited with Crossref.”

When they first started it was 1%.

The list of major publishers who have deposited references and made their citations open is amazing. Most of the big publishers such as Springer Nature, Taylor and Francis, Wiley and Sage are already doing this. See list of publishers here.

How significant is this achievement relatively speaking?

First, notice that the 50% open citations figure above refers to 50% of “articles with references that are deposited in Crossref” and this excludes articles that do not have references deposited.

How do things look after we take that into account?

Data visualized (not completely to scale) from https://opencitations.wordpress.com/2017/11/24/milestone-for-i4oc-open-references-at-crossref-exceed-50/

The latest analysis I could find is an analysis of Sept 2017 Crossref data in “Milestone for I4OC — open references at Crossref exceed 50%” and it states that 51.7% of journal articles in Crossref lack references.

Of the articles that have references (100–51.7=48.3% of total), 50.7% are open, which implies (48.3%*50.7%) 24.5% of journal articles deposited into Crossref have references that are open.
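For anyone who wants to redo this back-of-envelope calculation when newer figures come out, here is a small Python sketch. The function name is my own, and the two inputs are simply the Sept 2017 percentages quoted above.

```python
def open_share_of_all(pct_without_refs: float, pct_open_given_refs: float) -> float:
    """Percentage of all journal articles in Crossref whose references are open.

    pct_without_refs: percentage of articles with no references deposited.
    pct_open_given_refs: percentage of articles WITH references whose
    references are open.
    """
    pct_with_refs = 100.0 - pct_without_refs
    return pct_with_refs * pct_open_given_refs / 100.0


# Sept 2017 figures quoted in the text above.
share = open_share_of_all(pct_without_refs=51.7, pct_open_given_refs=50.7)
print(f"{share:.1f}% of journal articles in Crossref have open references")
```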

For non-journal items (mostly book chapters) only 20.4% are deposited with references.

How do the citations in Crossref (both open and non-open) compare with Scopus and Web of Science?

While the above analysis is interesting, the traditional gold standards for citation indexes are Web of Science and Scopus. How do the references deposited into Crossref compare?

The CWTS analysis — “Crossref as a new source of citation data: A comparison with Web of Science and Scopus” gives a very detailed look by comparing against publications in Web of Science and Scopus from the period 2012–2016.

The upshot is that around 39.7% of references in Web of Science match an open reference, and this figure is 34.8% for Scopus.

If all references in Crossref were included (both closed and open), this would rise to 77.1% and 69.1% respectively. This isn’t too bad, particularly since the authors note that, due to difficulties in matching DOIs, these figures are a lower bound on the actual figure.

Improving coverage of open citations — 2 ways

There are two ways to improve coverage of open citations. Firstly, get publishers who already deposit references with Crossref but keep them closed (see list here) to make them open. Secondly, get publishers who are not depositing references at all to do so.

“Elsevier references dominate those that are not open at Crossref”

The first way, making citations already in Crossref open, seems to be low-hanging fruit, and this is what I4OC focuses on. After all, all a publisher needs to do is email Crossref support and give permission.

So who are the major hold-outs? There are a few but the major culprit here appears to be Elsevier.

In a post entitled “Elsevier references dominate those that are not open at Crossref”, the authors find that of the 470 million references in journal articles deposited in Crossref that are not made open, an impressive 65.1% are from Elsevier articles!

But of course, as you can see from the above CWTS report, even if 100% of the citations deposited in Crossref were made open, that would only get us to 69.1% coverage of Scopus.

The second way to improve coverage of open citations is to focus on articles deposited in Crossref that do not have references. The analysis in the already mentioned CWTS article has an intriguing finding.

The top two publishers with missing references were in fact publishers who officially support the I4OC call to make citations open! Springer Nature for example has 10 million references missing (mostly from book chapters).

Notice something even more interesting. There’s a very significant omission in the list of top 15 publishers — Elsevier!

This implies that Elsevier has relatively few missing citations (needed to match Scopus) that are not already deposited in Crossref.

Ludo Waltman, one of the authors of the CWTS paper, agrees.

Indeed, Elsevier is carefully depositing its references in @CrossrefOrg, but it does not make references openly available; Springer Nature does make references openly available, but a large number of references in books have not been deposited in @CrossrefOrg at all
— Ludo Waltman (@LudoWaltman) January 20, 2018

He estimates that if Elsevier made all its citations already in Crossref open, the open citations coverage of Scopus citations would jump from 35% to something in the range of 55–60% (for all material). Add some retrospective coverage of missing references from Springer Nature, and open citations are already within striking distance of the Scopus citation index.

Of course, the sticking point is whether Elsevier will make their citations open. Understandably, they will be reluctant to do so because it helps their competitors, particularly Digital Science’s Dimensions, strengthen their services relative to Elsevier’s Scopus.

How are open citations currently generated and used?

Making citations open and available for extraction is one thing, but creating services that actually extract them and make them usable by ordinary people is another matter. Here are some use cases of open citations that I’m familiar with.

OpenCitations Corpus — via SPARQL

One of the main publishers of open citations currently is the scholarly infrastructure organization OpenCitations, which publishes a database of open citation data called the OpenCitations Corpus (OCC). It is also a founding member of I4OC.

Currently, OpenCitations has been ingesting open references from the Open Access Subset of PubMed Central. The Crossref API and the ORCID API are then queried to check and enrich the metadata.

At the time of writing, this organization does not ingest open references directly from Crossref, but this is being planned. They are moving the system to more powerful hardware to increase the ingestion rate, in the hope of expanding their coverage to eventually match Web of Science and Scopus.

The data from OCC is available in linked data format, which has traditionally been difficult to query as few people are comfortable with SPARQL.

As such, they have just launched a new interface, OSCAR, at http://opencitations.net/search that might be easier to use.


VOSviewer — bibliometric mapping tool

I have mentioned VOSviewer a couple of times in the past. It works with the Crossref API, so the more citations are made open, the richer the information users will see.

VOSviewer uses the Crossref API

Citation Gecko — create citation maps from seed papers (Newly added June 2018)

Possibly the most interesting development lately for me was the launch of Citation Gecko, a novel literature mapping tool that allows you to map out your research using “seed papers”.

The key here is that Citation Gecko is able to leverage citation links between seed papers and other papers to help highlight possible papers of interest.

For instance, it could reveal papers that were cited frequently by your seed papers (co-citations) or, conversely, papers that cited a lot of your seed papers (bibliographic coupling), among other tricks.
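To give a feel for the idea, here is a toy Python sketch of this kind of seed-based ranking. To be clear, this is not Citation Gecko's actual code, which I have not looked at. It simply assumes you already have citation links as (citing, cited) pairs of paper identifiers, and the function name and sample data are made up for illustration.

```python
from collections import Counter


def suggest_from_seeds(edges, seeds):
    """Rank non-seed papers by how strongly they are linked to the seed papers.

    edges: iterable of (citing_id, cited_id) pairs.
    seeds: iterable of seed paper ids.
    Returns two ranked lists of (paper_id, number_of_seeds):
    papers cited by the seeds, and papers that cite the seeds.
    """
    seeds = set(seeds)
    cited_by_seeds = Counter()
    citing_seeds = Counter()
    for citing, cited in set(edges):  # set() drops duplicate links
        if citing in seeds and cited not in seeds:
            cited_by_seeds[cited] += 1
        if cited in seeds and citing not in seeds:
            citing_seeds[citing] += 1

    def ranked(counter):
        # Highest count first, ties broken alphabetically for stable output.
        return sorted(counter.items(), key=lambda item: (-item[1], item[0]))

    return ranked(cited_by_seeds), ranked(citing_seeds)


if __name__ == "__main__":
    toy_edges = [
        ("S1", "A"), ("S2", "A"), ("S1", "B"),  # the seeds cite A and B
        ("X", "S1"), ("X", "S2"), ("Y", "S1"),  # X and Y cite the seeds
        ("S1", "S2"),                           # seed-to-seed link, ignored
    ]
    cited, citing = suggest_from_seeds(toy_edges, ["S1", "S2"])
    print("Cited by seeds:", cited)
    print("Citing seeds:", citing)
```

The richer the pool of open citation links you can feed in, the better such suggestions become, which is exactly why the source of the data matters.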

But where is the data from?

An earlier tool I was familiar with called whocites did something similar but scraped the data from Google Scholar, which was very slow and led to a lot of captchas.

Citation Gecko is a lot faster and uses open citations from OpenCitations, Crossref and Microsoft Academic.

In particular, I suspect a ton of citations come from Microsoft Academic, but in my testing it was sadly pretty easy to hit the API rate limits for that, which shows the importance of making citations open.

In Wikidata and Scholia

Another even less known use of open citations is via the Wikidata, WikiCite and Scholia projects. I’m planning a long series of posts on linked data focusing on this, so I won’t discuss this much here.

But essentially, open citation data can be ported into Wikidata so one can do SPARQL queries like “Top cited female researchers in Denmark”, or create citation graphs of articles or people.
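As a taste of what such a query looks like, here is a small Python sketch that builds a much simpler SPARQL query than the example above: list the works that cite one given paper. It relies on the Wikidata property P2860 ("cites work") and the public query service at query.wikidata.org. The function names are my own, and you would need to supply the Wikidata item id (QID) of the paper you are interested in.

```python
import re
from urllib.parse import urlencode

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def citing_works_query(work_qid: str, limit: int = 20) -> str:
    """Build a SPARQL query listing works that cite the given Wikidata item."""
    if not re.fullmatch(r"Q[1-9]\d*", work_qid):
        raise ValueError(f"not a Wikidata item id: {work_qid!r}")
    return (
        "SELECT ?citing ?citingLabel WHERE {\n"
        # P2860 is the Wikidata property "cites work".
        f"  ?citing wdt:P2860 wd:{work_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def query_url(sparql: str) -> str:
    """Build a GET URL for the Wikidata Query Service asking for JSON results."""
    return WIKIDATA_SPARQL + "?" + urlencode({"query": sparql, "format": "json"})
```

Pasting the text returned by citing_works_query into the query service's web interface works too, which is probably the easier route for most librarians.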

In Primo via Citation Trails

I think, though, that it’s more likely users have encountered open citations in the library discovery service Primo without realising it.

Primo has a citation trails feature that lists Crossref as one of its sources. Surely these are citations made open by publishers. As noted by Ex Libris, there are fewer of these citations than in Scopus and Web of Science.

But by far, I think going forward the main way people are going to access data based on open citations will be via Digital Science’s Dimensions.

Digital Science Dimensions and its use of open citations

I won’t write a lot about the background of Dimensions. Roger Schonfeld has a good piece breaking the news about it, and I will be reviewing it soon.

But for the purposes of this piece, the most significant thing about Dimensions is that the data is at least partly based on open citations from I4OC.

I’m pretty sure Dimensions goes beyond that, as it is a combination of input and expertise from 6 different teams including ReadCube, Altmetric, Figshare, Symplectic, DS Consultancy and ÜberResearch, plus other publisher partners.

It currently boasts 89 million publications and 870 million citations, which I believe is substantially beyond the number of open citations in Crossref.

When I enquired on how much more was in Dimensions compared to via open citations, Dimensions had this to say.

@aarontay great question — quick answer: in addition to I4OC Dimensions is built on improving discoverability of +50 million records by processing the full-text — not only references but also acknowledgements. Some of them are part of I4OC data, some not. #moretofollow #takestime https://t.co/NMrynOBBq2
— Dimensions (@DSDimensions) January 20, 2018

On a side note, Dimensions is taking an inclusive approach, so it appears to have as many items as Scopus if not more, though the number of citations is currently still substantially lower than in Scopus.

However, one wonders if there is an Elsevier-sized hole in the citation data in Dimensions, given that those references are not made open. Are the additional layers that Digital Science builds on top of open citations sufficient to fill this gap? Interesting questions to ponder.

Another point to consider: while you can access the citation data from Dimensions (including the open citations) for free, there are limits to what you can do.

Bianca Kramer, a leading librarian in scholarly communication, makes a distinction between products making use of open data and those that are truly open.

In a comment on Roger Schonfeld’s piece on Dimensions, she writes:

“In practice though, Dimensions, while perhaps partly building on publicly available data (e.g. from oaDOI), is not contributing to it. The freely accessible version of Dimensions might be very useful for certain purposes, but it doesn’t allow access, export and (re)use of the underlying (meta)data, thereby remaining a commercial party’s closed silo. This is very different from building on open data and, as one business model, charging for the value of all (in a paid model) or some (in a freemium model) of these functionalities, while ensuring that the underlying data are and will remain publicly available. Then citation data would also no longer be a commodity, but truly a public good.”

As the pool of open citations grows, more and more services will sprout up to exploit the data. It will be interesting to see what business models these new services adopt.

Conclusion

I hope this tour of open citations, their scale compared to other citation indexes and how they are used has been useful.

It also seems a new rivalry might be brewing. The library world has long witnessed the struggle between ProQuest and EBSCO in the library discovery space. Both serve as library discovery providers (with a central index) while owning a portfolio of content. This has famously led to stand-offs where both sides refused (and still refuse) to share metadata and full text with each other’s central index, and to poor or totally absent integration between the products and services (e.g. link resolvers, library management systems such as Alma and Folio) that belong to their respective stables.

Roger Schonfeld proposes that a similar lock-in situation, with perhaps even more far-reaching consequences around researcher workflow, might be emerging in the form of a duopoly, with Elsevier and its stable of services on one side and potentially Digital Science (possibly with the aid of co-owned Springer Nature) on the other. Dimensions vs Scopus could just be the first salvo in a long battle ahead.

This was first posted on Jan 22, 2018 at http://musingsaboutlibrarianship.blogspot.sg/2018/01/understanding-implications-of-open.html