Analyzing DOI Citations in English Wikipedia

Building on my previous look at book (ISBN) citations in English Wikipedia, and using the same recently released data, I turn to the other prominent citation type in the dataset: DOIs. These DOIs mostly represent journal articles referenced on Wikipedia.

Numbers & Process

As with the book citation analysis, I’m only looking at the English (en) citations in the release.

To gather the data I politely used the CrossRef API to retrieve the metadata they hold for each DOI. The API returns a lot of data that can be aggregated. I was curious about many of the same features I looked at for ISBNs, but since the majority of these references are journal articles, this also raises the questions of publisher and access (these apply to monographs too, but access is more of an issue for journals).
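As a rough illustration of what a "polite" lookup can look like, here is a minimal sketch, not the script used for this analysis. It assumes the public works route at api.crossref.org and the documented convention of identifying yourself with a mailto parameter; the function names and the contact address are made up for the example.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi, mailto):
    """Build the works URL for one DOI, including a contact address."""
    # DOIs may contain characters that are not URL-safe, so percent-encode
    # everything except the slash between prefix and suffix.
    encoded_doi = urllib.parse.quote(doi, safe="/")
    query = urllib.parse.urlencode({"mailto": mailto})
    return CROSSREF_WORKS + encoded_doi + "?" + query


def fetch_work(doi, mailto):
    """Fetch the CrossRef metadata record for one DOI (performs a network call)."""
    with urllib.request.urlopen(crossref_url(doi, mailto), timeout=30) as resp:
        # The record itself sits under the "message" key of the response.
        return json.load(resp)["message"]
```

A real run over this many DOIs would also want pauses between requests and some handling for DOIs that CrossRef does not know about.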

Citation Type

A DOI can point to anything, but as expected the majority of these DOIs pointed to a journal article:

DOIs by Document Type — View interactive chart
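The aggregation behind a chart like this can be sketched in a few lines. This is an assumption about the approach rather than the original code: it relies on each CrossRef record being a dictionary with a "type" field (for example "journal-article"), and the helper name count_by is hypothetical.

```python
from collections import Counter


def count_by(works, field):
    """Count records by the value of one top-level field."""
    # Records missing the field are grouped under "unknown" rather than dropped.
    return Counter(work.get(field, "unknown") for work in works)


sample = [
    {"type": "journal-article"},
    {"type": "journal-article"},
    {"type": "book-chapter"},
    {},
]
type_counts = count_by(sample, "type")
```

The same helper would work for the "publisher" field used in the next section.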

Publisher and Database Access

Who is the gatekeeper for all these referenced articles? I aggregated by publisher and grouped the results:

DOIs by Publisher — View interactive chart

As you would probably expect, the largest publisher is Elsevier, followed by Springer Nature and Wiley-Blackwell. You can view the full list in this sheet.

But I was also curious about how you gain access to the articles, so for each publisher I resolved an article’s DOI and noted which website it landed on. This is the site where you would need institutional access, or would have to pay, to view the article (if it is not free or open access). This differs from the publisher stats because multiple publishers might be made available through the same full text service:

DOIs by Database/Full Text access — View interactive chart
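Resolving a DOI to its landing site can be sketched as follows. This is an illustration under assumptions, not the original script: it follows the redirects from the doi.org resolver and keeps only the hostname of the final URL. Both function names are hypothetical, and some publisher sites may refuse automated requests.

```python
import urllib.parse
import urllib.request


def landing_host(url):
    """Reduce a landing-page URL to its hostname, dropping a leading 'www.'."""
    host = urllib.parse.urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def resolve_doi(doi):
    """Follow the doi.org redirects and return the final hostname (network call)."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    with urllib.request.urlopen(url, timeout=30) as resp:
        # urlopen follows redirects; geturl() gives the URL it ended up on.
        return landing_host(resp.geturl())
```

Grouping by hostname rather than by publisher name is what lets several publishers collapse into one full text service.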

The percentages go up a few points for the large publishers. View full data in this sheet.

Rates of course vary from publisher to publisher and journal to journal for individual access, and not all of these articles are behind a paywall. But if you took a pretty lowball price of $25 per article, you could estimate it would cost $3.7M to access all the Elsevier content alone. It would be an interesting project to get the access cost for each article and literally put a price on the knowledge cited in English Wikipedia.
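The arithmetic behind that estimate is simple enough to spell out. The Elsevier DOI count is not quoted above, so the sketch below backs it out of the two figures that are given. Because $3.7M is a rounded number, the implied count is approximate; the names are made up for illustration.

```python
PRICE_PER_ARTICLE = 25       # lowball per-article price in dollars, from the text
ESTIMATED_TOTAL = 3_700_000  # rounded Elsevier estimate in dollars, from the text

# Article count implied by the two figures above (approximate, since the
# total is rounded).
implied_articles = ESTIMATED_TOTAL // PRICE_PER_ARTICLE


def access_cost(article_count, price=PRICE_PER_ARTICLE):
    """Flat-rate cost of buying access to every article individually."""
    return article_count * price
```

With per-article prices in hand, the flat rate could be replaced by a sum over the actual prices.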

Publish Year

Similar to the ISBN results, the bulk of the DOIs cited are from 2000–2015.
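One way to pull a publish year out of a CrossRef record is sketched below. It assumes the year is the first element of the "issued" date-parts array, which is how CrossRef records usually carry it, and that missing dates show up as an empty or null entry. The helper names are hypothetical.

```python
from collections import Counter


def publish_year(work):
    """Return the issued year of a record, or None if it is missing."""
    parts = (work.get("issued") or {}).get("date-parts") or [[None]]
    year = parts[0][0] if parts[0] else None
    return year if isinstance(year, int) else None


def year_histogram(works):
    """Count records per publish year, skipping records with no usable year."""
    years = (publish_year(work) for work in works)
    return Counter(year for year in years if year is not None)
```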

I did not investigate the spike in 2015 references, which is an outlier in the pattern. View all year data in this sheet.

UPDATE: I did look into the 2015 spike. It appears that a lot of citations (17,668) were added from the 2015 edition of the “IUCN Red List of Threatened Species”. These are fact sheet pages about threatened species, like these: (Gray-bellied night monkey) (Nervous shark)


The most commonly cited journals were science publications:

DOIs by Journal — View interactive chart

View top 10K data in this sheet.

Cited by Others

CrossRef offers a value for the number of other resources that cite the resource in question: for example, how many times other journal articles cite this specific article.

DOIs by cited count

Most of the DOIs were cited by 0–20 other DOIs.
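A claim like this can be checked with a small helper. The sketch assumes the cited-by value is the "is-referenced-by-count" field of each CrossRef record and treats a missing value as zero. The function name and the sample records are made up for illustration.

```python
def share_cited_at_most(works, limit=20):
    """Fraction of records cited by at most `limit` other resources."""
    counts = [work.get("is-referenced-by-count", 0) for work in works]
    if not counts:
        return 0.0
    return sum(count <= limit for count in counts) / len(counts)


sample = [
    {"is-referenced-by-count": 0},
    {"is-referenced-by-count": 5},
    {"is-referenced-by-count": 20},
    {"is-referenced-by-count": 150},
]
low_cited_share = share_cited_at_most(sample)
```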


Here is a link to the data used in this process. It is a newline-delimited JSON file of the CrossRef response for each DOI (‘wikidoi’) and the Wikipedia page it was found on (‘wikipage’).

Download Data (⚠️ 1GB which expands to 4GB)
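Given the size of the file, it is probably easiest to stream it one line at a time rather than load it whole. A minimal sketch, assuming one JSON object per line; the function name and the file name in the usage note are hypothetical:

```python
import json


def read_ndjson(lines):
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)
```

Because a file object is itself an iterable of lines, usage would look something like opening the download with open("wikipedia_dois.ndjson", encoding="utf-8") and looping over read_ndjson(fh), reading record["wikidoi"] and record["wikipage"] as described above.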

This data, combined with the ISBN data, provides a fairly complete view of the materials cited on English Wikipedia. I would like to work with the combined DOI/ISBN data to connect bibliographic systems and the en Wikimedia ecosystem more closely.