Analyzing DOI Citations in English Wikipedia
Building on my previous look at book (ISBN) citations in English Wikipedia, and using the same recently released data, I turn to the other prominent citation type in the dataset: DOIs. These DOIs mostly represent journal articles referenced on Wikipedia.
Numbers & Process
As with the book citation analysis, I’m only looking at the English (en) citations in the release.
3.79 Million citations
1,211,807 DOI citations
835,517 Unique/resolvable DOI citations analyzed
To gather the data I politely used the CrossRef API to retrieve the metadata it holds for each DOI. The API returns a lot of data that can be aggregated. I was curious about many of the same features I looked at for ISBNs, but since the majority of these references are journal articles, the questions of publisher and access also come up (not that they don’t apply to monographs, but access is more of an issue for journals).
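As a rough illustration of the lookup step, here is a minimal Python sketch. It assumes the public Crossref REST endpoint `https://api.crossref.org/works/{doi}` and reads "politely" as passing a contact address in the `mailto` parameter; the function names and the choice of fields are mine, not necessarily what was used for this analysis.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"

def crossref_url(doi, mailto):
    """Build the works URL for one DOI; `mailto` identifies the caller."""
    return (CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")
            + "?" + urllib.parse.urlencode({"mailto": mailto}))

def fetch_work(doi, mailto, timeout=30):
    """Fetch the Crossref record for one DOI (this makes a network call)."""
    with urllib.request.urlopen(crossref_url(doi, mailto), timeout=timeout) as resp:
        return json.load(resp)["message"]

def summarize(message):
    """Pull out the fields aggregated in this post; any of them may be missing."""
    date_parts = message.get("issued", {}).get("date-parts") or [[]]
    titles = message.get("container-title") or [None]
    return {
        "type": message.get("type"),
        "publisher": message.get("publisher"),
        "journal": titles[0],
        "year": date_parts[0][0] if date_parts[0] else None,
        "cited_by": message.get("is-referenced-by-count"),
    }
```

Calling `summarize(fetch_work(doi, "you@example.org"))` for each DOI, with a pause between requests, would give one small record per citation to aggregate on.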
A DOI can point to anything, but as expected the majority of these DOIs pointed to a journal article:
Publisher and Database Access
Who is the gatekeeper for all these referenced articles? I aggregated by publisher and grouped the results:
As you would probably expect, the largest publisher is Elsevier, followed by Springer Nature and Wiley-Blackwell. You can view the full list in this sheet.
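The grouping itself can be sketched in a few lines of Python. This assumes each record is a Crossref `message` dict with a `publisher` field; the function name and the "(unknown)" label are illustrative.

```python
from collections import Counter

def publisher_shares(messages):
    """Return (publisher, count, percent of total) tuples, largest first."""
    counts = Counter(m.get("publisher") or "(unknown)" for m in messages)
    total = sum(counts.values())
    return [(name, n, round(100 * n / total, 1))
            for name, n in counts.most_common()]
```

Note that Crossref publisher names are not normalized across imprints, so some manual grouping would likely still be needed before comparing the large publishers.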
But I was also curious as to how you gain access to the articles, so for each publisher I resolved an article’s DOI and took note of what website it resolved to. This is the site that you would need to have institutional access to or pay to view the article (if it is not free or open access). This is different from the publisher stats because multiple publishers might be made available through the same full text service:
The percentages go up a few points for the large publishers. View full data in this sheet.
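Resolving a DOI to the site that hosts it could look something like the sketch below. It assumes the `https://doi.org/` resolver and simply follows redirects; the helper names and the User-Agent string are made up for illustration, and some publisher sites may refuse or redirect scripted requests differently than a browser.

```python
import urllib.request
from urllib.parse import urlsplit

def landing_host(url):
    """Reduce a landing-page URL to its host name, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def resolve_doi(doi, timeout=30):
    """Follow the doi.org redirect chain and report the final host (network call)."""
    req = urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"User-Agent": "doi-host-check (hypothetical script)"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return landing_host(resp.geturl())
```

Since only one article per publisher needs to be resolved, this stays a small number of requests even for a long publisher list.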
Rates of course vary from publisher to publisher and journal to journal for individual access, and not all of these articles are behind a paywall. But if you took a pretty lowball price of $25 per article, you could estimate it would cost $3.7M to access all the Elsevier content alone. It would be an interesting project to get the access cost for each article and literally put a price on the knowledge cited in English Wikipedia.
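The estimate is plain multiplication. Working backward from the rounded figures in the paragraph above (the exact Elsevier count lives in the linked sheet, so the count derived here is only what the rounded total implies):

```python
PRICE_PER_ARTICLE = 25          # USD, the lowball per-article price from the text
ELSEVIER_ESTIMATE = 3_700_000   # USD, the rounded total from the text

def access_cost(article_count, price=PRICE_PER_ARTICLE):
    """Total cost of buying every article individually at a flat price."""
    return article_count * price

# Number of Elsevier articles implied by the two rounded figures above.
implied_articles = ELSEVIER_ESTIMATE // PRICE_PER_ARTICLE
```

With per-article prices in hand, the same function could be replaced by a sum over actual prices.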
Similar to the ISBN results, the bulk of the DOIs cited are from 2000–2015:
I did not initially investigate the spike in 2015 references, which is an outlier in the pattern. View all year data in this sheet.
UPDATE: I did look into the 2015 spike. It appears a lot of citations (17,668) were added from the 2015 edition of the “IUCN Red List of Threatened Species”. These are fact sheet pages about threatened species, like these:
http://doi.org/10.2305/IUCN.UK.2008.RLTS.T1808A7651803.en (Gray-bellied night monkey)
http://doi.org/10.2305/IUCN.UK.2003.RLTS.T41733A10550550.en (Nervous shark)
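One way a spike like this could be traced is to count records per year and then, for the outlier year, count by DOI prefix (the part before the first slash, such as 10.2305 in the IUCN examples above). The sketch below assumes Crossref `message` dicts with `issued` and `DOI` fields; it is not necessarily how the spike was actually tracked down.

```python
from collections import Counter

def issued_year(message):
    """Year from Crossref's issued.date-parts, or None if absent."""
    parts = message.get("issued", {}).get("date-parts") or [[]]
    return parts[0][0] if parts[0] else None

def year_counts(messages):
    """How many cited DOIs were issued in each year."""
    return Counter(issued_year(m) for m in messages)

def prefixes_in_year(messages, year):
    """DOI prefixes (registrants) behind the citations issued in one year."""
    return Counter(
        m["DOI"].split("/", 1)[0]
        for m in messages
        if "DOI" in m and issued_year(m) == year
    )
```

A single prefix dominating `prefixes_in_year(messages, 2015).most_common(5)` would point at one source being added in bulk.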
The most commonly cited journals were science publications:
Cited by Others
CrossRef offers a value for the number of other resources that cite the resource in question: for example, how many times other journal articles cite this specific article.
The results are that most of the DOIs were cited by 0–20 other DOIs.
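A distribution like this can be produced by binning the cited-by value. The sketch below assumes the value comes from Crossref's `is-referenced-by-count` field and uses buckets of 20 (0-20, 21-40, and so on); the exact bucket edges are my assumption.

```python
from collections import Counter

def bucket_label(count, width=20):
    """Label the bucket a cited-by count falls in: 0-20, 21-40, 41-60, ..."""
    if count <= width:
        return f"0-{width}"
    low = ((count - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"

def cited_by_histogram(messages, width=20):
    """Count records per cited-by bucket; a missing value is treated as 0."""
    counts = (m.get("is-referenced-by-count", 0) for m in messages)
    return Counter(bucket_label(c, width) for c in counts)
```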
Here is a link to the data used in this process. It is a newline-delimited JSON file containing the CrossRef response for each DOI (‘wikidoi’) and the Wikipage it was found on (‘wikipage’).
Download Data (⚠️ 1GB which expands to 4GB)
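Because the file is large, it is easier to stream it one line at a time than to load it whole. A minimal sketch, assuming each line is a JSON object with a ‘wikipage’ key as described above (the exact record layout and the file name in the usage note are assumptions):

```python
import json
from collections import Counter

def iter_records(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def citations_per_page(records):
    """Count how many DOI citations were found on each Wikipedia page."""
    return Counter(rec.get("wikipage") for rec in records)
```

Usage would look like `with open("wikidoi.ndjson", encoding="utf-8") as fh: top = citations_per_page(iter_records(fh)).most_common(10)`, which keeps only the counter in memory rather than the full 4GB.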
This data, combined with the ISBN data, provides a fairly complete view into the materials cited on English Wikipedia. I would like to work with the combined DOI/ISBN data to connect bibliographic systems and the en Wikimedia ecosystem more closely.