Citation data from Wikipedia was recently released, connecting the identifiers of source materials to the Wikipedia articles that use them as references.
I was curious about the type of books being used in the Wikipedia ecosystem: when they were published, which authors are prevalent or influential, which subjects are most common, and so on. Using the released citation data and OCLC metadata APIs, I gathered some stats about books cited in English Wikipedia articles.
Citations were released for each Wikipedia language site. To keep the scope manageable, I looked only at the English (en) Wikipedia articles. There were still quite a lot of citations:
3.79 million citations
1.7 million ISBN citations (books)
684,965 unique ISBNs
This means that out of all the book citations on English Wikipedia there were about 685K unique books referenced. I took these ISBNs and ran them through various APIs to gather metadata about each book.
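A minimal sketch of how unique ISBNs could be counted, assuming the raw citation values mix hyphenated ISBN-10 and ISBN-13 forms. The post does not say how the deduplication was actually done; the function names `to_isbn13` and `count_unique_isbns` are hypothetical, and folding ISBN-10s into their 978-prefixed ISBN-13 form is my assumption.

```python
import re
from typing import Iterable, Optional


def to_isbn13(raw: str) -> Optional[str]:
    """Normalize an ISBN-10 or ISBN-13 string to 13 bare digits, or None."""
    # Drop hyphens, spaces, and anything else that is not a digit or X.
    s = re.sub(r"[^0-9Xx]", "", raw).upper()
    if len(s) == 13 and s.isdigit():
        return s
    if len(s) == 10 and s[:9].isdigit():
        # Standard conversion: prefix 978, drop the old check digit,
        # then recompute the ISBN-13 check digit (weights 1,3,1,3,...).
        core = "978" + s[:9]
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return None


def count_unique_isbns(raw_isbns: Iterable[str]) -> int:
    """Count distinct ISBNs after normalization, skipping unparseable values."""
    return len({n for n in map(to_isbn13, raw_isbns) if n is not None})
```

With this normalization, `0-306-40615-2` and `978-0-306-40615-7` count as the same book rather than two.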
When were most of the books being cited published?
The majority of the books being used were published between 2000 and 2013, peaking in 2007:
You can see the full range of data on this Google Sheet.
I made a list of pre-1900 books cited, to see which early books were being used on which articles. There are some interesting examples but also a lot of bad date metadata.
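As a rough sketch, the publication-year tally could be computed along these lines while setting aside records with unusable dates. It assumes each record is a dict with a 'year' field as in the data file described at the end of this post; the function name and the `min_year`/`max_year` cutoffs are hypothetical choices for illustration, not values used in the original analysis.

```python
from collections import Counter


def year_histogram(records, min_year=1450, max_year=2015):
    """Count books per publication year; return (counts, number_rejected)."""
    counts = Counter()
    rejected = 0
    for rec in records:
        try:
            # Keep only the first four characters, e.g. "2007" from "2007-05".
            year = int(str(rec.get("year", "")).strip()[:4])
        except ValueError:
            rejected += 1  # missing or non-numeric year such as "19uu"
            continue
        if min_year <= year <= max_year:
            counts[year] += 1
        else:
            rejected += 1  # implausible year, likely bad metadata
    return counts, rejected
```

`counts.most_common(1)` then gives the peak year, and filtering `counts` for keys below 1900 gives the early books worth checking by hand.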
We can think about people or organizations who author cited books as influential in two different ways.
- A lot of their unique works are cited on Wikipedia (quantity, many works represented)
- A lot of different articles cite their works (maybe the same work cited in 1000 different articles)
In this graph we can see, for example, that R.L. Stine has 362 works cited on Wikipedia (or ISBNs included in the page) but on only 18 different articles. By contrast, the #1 spot, American Council of Learned Societies, has 1,250 works cited across 3,858 different article pages. You can see the top 10,000 authors by work count in this sheet.
The second approach:
In this case an author is influential not because of the quantity of unique works cited but because of the number of articles citing their work. For example, Holmesby, Russell has nine books cited, but in over 7,000 articles. He wrote The encyclopedia of AFL footballers : every AFL/VFL player since 1897 and is cited on every football player’s article.
You can view the top 10,000 authors by article count in this Google Sheet.
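The two notions of influence can be sketched as a pair of counts per author. This assumes records shaped like the data file described below ('authors', 'pages', 'isbn13', 'isbn10') and treats each ISBN as one work, which is a simplification on my part; `author_influence` is a hypothetical name.

```python
from collections import defaultdict


def author_influence(records):
    """Map each author to (number of distinct works, number of distinct articles)."""
    works = defaultdict(set)     # author -> set of ISBNs
    articles = defaultdict(set)  # author -> set of article titles
    for rec in records:
        key = rec.get("isbn13") or rec.get("isbn10")
        for author in rec.get("authors") or []:
            if key:
                works[author].add(key)
            articles[author].update(rec.get("pages") or [])
    return {a: (len(works[a]), len(articles[a])) for a in articles}
```

Sorting the result by the first element of the tuple ranks authors by work count; sorting by the second ranks them by article count.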
I was curious whether the unique books being cited were widely held at libraries and other institutions. I found this stat by using OCLC’s Classify Eholding and Holding counts for each work.
We can see in this graph that the majority of books cited are held by 0–202 institutions. We can break down that 0–202 group even more:
The result supports the idea that most books cited on EN Wikipedia are not widely held (based on OCLC’s data). You could rationalize this by saying articles require domain-specific literature, which would not necessarily be held by a large number of institutions.
This is probably the least interesting aspect, but we can see which (FAST) subject headings are most prevalent among these books:
You can see the top 10K list in this Google Sheet.
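A sketch of how holdings counts could be grouped into fixed-width ranges like the 0–202 group above. The default width of 203 is only inferred from that first range and is an assumption, as is the function name `bucket_holdings`; records with a missing or non-numeric 'holdings' value are skipped rather than counted as zero.

```python
from collections import Counter


def bucket_holdings(records, width=203):
    """Count records per holdings range; keys are (low, high) inclusive."""
    counts = Counter()
    for rec in records:
        value = rec.get("holdings")
        if value is None:
            continue  # no holdings information for this record
        try:
            holdings = int(value)
        except (TypeError, ValueError):
            continue  # unparseable holdings value
        if holdings < 0:
            continue
        low = (holdings // width) * width
        counts[(low, low + width - 1)] += 1
    return counts
```

Calling it again on just the first group with a smaller `width` gives the finer breakdown.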
I plan to do some more work on this, but I’m making the data available if you would like to play with it:
Newline-delimited JSON file (each line is its own JSON object)
Fields:
'title' : 'title of the book'
'isbn13' : 'isbn 13'
'year' : 'year published'
'isbn10' : 'isbn 10'
'oclc' : 'oclc number'
'authors' : 'array of authors'
'holdings': 'holdings count from oclc'
'oclcOWI' : 'oclc classify ID'
'google' : 'google books id'
'pages' : 'array of wiki article titles'
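Since each line is its own JSON object, the file can be read one record at a time without loading everything into memory. A minimal sketch, assuming UTF-8 text; `parse_ndjson` is a hypothetical helper and `books.ndjson` in the comment is a placeholder file name, not the actual name of the download.

```python
import json


def parse_ndjson(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)


# Usage with a file on disk (placeholder name):
#
#     with open("books.ndjson", encoding="utf-8") as fh:
#         for record in parse_ndjson(fh):
#             print(record["title"], len(record.get("pages", [])))
```

Because the function accepts any iterable of strings, it works equally well on an open file handle or on a list of lines in a test.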
There were around 20K records I could not find any metadata for. These could be resources that were self-published, or not in OCLC/Google’s book ecosystem, or just bad ISBN numbers.
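One way to separate plain bad ISBNs from books that are simply missing from the catalogs is to verify the check digit. This sketch uses the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) checksum rules; it was not part of the original analysis, and the function names are hypothetical. Inputs are assumed to be already stripped of hyphens and spaces.

```python
def is_valid_isbn10(s: str) -> bool:
    """Check the ISBN-10 mod-11 checksum; the last character may be X (= 10)."""
    s = s.upper()
    if len(s) != 10 or not s.isascii():
        return False
    if not s[:9].isdigit() or not (s[9].isdigit() or s[9] == "X"):
        return False
    # Weights run 10, 9, ..., 1 from left to right.
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s: str) -> bool:
    """Check the ISBN-13 mod-10 checksum with alternating weights 1 and 3."""
    if len(s) != 13 or not s.isascii() or not s.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0
```

A record whose ISBN fails both checks is most likely a typo in the citation; one that passes but returns no metadata is more likely self-published or outside the OCLC/Google ecosystem.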
I think some interesting future work could be using these resources as connectors between library classification systems, such as LCC or subject headings, and the Wikimedia ecosystem. The data could also be used to map out which areas of Wikipedia are underdeveloped by comparing coverage to LCC or other knowledge organization systems.
Beyond books, looking at the metadata behind the DOI and other citations could also be revealing.