Analyzing Books Cited in English Wikipedia

Matt Miller
Apr 9, 2018

Citation data from Wikipedia was recently released, connecting the identifiers of source materials to the Wikipedia articles that use them as references.

I was curious about the kinds of books being used in the Wikipedia ecosystem: when they were published, which authors are prevalent or influential, which subjects are most common, and so on. Using the released citation data and OCLC metadata APIs, I gathered some stats about books cited in English Wikipedia articles.

Numbers

Citations were released for each Wikipedia language site. To keep the scope manageable, I looked only at the English (en) Wikipedia articles. There were still quite a lot of citations:

3.79 million citations
1.7 million ISBN citations (books)
684,965 unique ISBNs

This means that, out of all the book citations on English Wikipedia, there were 684K unique books referenced (counting each ISBN as one book). I took these ISBNs and ran them through various APIs to gather metadata about each book.
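As a rough illustration of how the three numbers above relate, here is a minimal sketch that counts total citations, ISBN citations, and unique ISBNs. It assumes each citation has already been reduced to an `(id_type, identifier)` pair and that book citations carry the type label `"isbn"`. Both are assumptions about the input, not a description of the released files. It also normalizes ISBN-10s to ISBN-13s before deduplicating, which may or may not match how the count above was produced.

```python
from typing import Iterable, Optional, Tuple


def to_isbn13(raw: str) -> Optional[str]:
    """Return a bare 13-digit ISBN for an ISBN-10 or ISBN-13 string, or None if malformed."""
    s = raw.replace("-", "").replace(" ", "").upper()
    if len(s) == 13 and s.isdigit():
        return s
    if len(s) == 10 and s[:9].isdigit() and (s[9].isdigit() or s[9] == "X"):
        core = "978" + s[:9]
        # ISBN-13 check digit: weights alternate 1, 3 across the first 12 digits.
        total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return None


def citation_stats(rows: Iterable[Tuple[str, str]]) -> dict:
    """Count citations, ISBN citations, and unique ISBNs.

    `rows` yields (id_type, identifier) pairs; the "isbn" label is an assumption.
    """
    total = 0
    isbn_citations = 0
    unique = set()
    for id_type, identifier in rows:
        total += 1
        if id_type.lower() == "isbn":
            isbn_citations += 1
            isbn13 = to_isbn13(identifier)
            if isbn13 is not None:
                unique.add(isbn13)
    return {
        "citations": total,
        "isbn_citations": isbn_citations,
        "unique_isbns": len(unique),
    }
```

Normalizing first means the same book cited once by ISBN-10 and once by ISBN-13 is counted as a single unique ISBN.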

Year Published

When were most of the books being cited published?

Book count by year published

(see full graph)

The majority of the cited books were published between 2000 and 2013, peaking in 2007:

1999 19,379
2000 21,908
2001 22,393
2002 24,393
2003 26,782
2004 29,326
2005 30,283
2006 31,702
2007 33,039
2008 30,625
2009 29,421
2010 28,975
2011 25,856
2012 24,438
2013 24,111
2014 17,474
2015 12,012
2016 10,611
2017 6,974
2018 927

You can see the full range of data in this Google Sheet.

I made a list of pre-1900 books cited to see which early books were being used in which articles. There are some interesting examples, but also a lot of bad date metadata.

Authors

We can think about people or organizations who author cited books as influential in two different ways.

  1. A lot of their unique works are cited on Wikipedia (quantity, many works represented)
  2. A lot of different articles cite their works (maybe the same work cited in 1000 different articles)
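Both measures can be computed in one pass over per-book records. The sketch below assumes each record is a dict with an `authors` list and a `pages` list of article titles, matching the field names in the Data section further down. The function names are hypothetical, and treating each record (one ISBN) as one work is a simplification, since different editions of the same book have different ISBNs.

```python
from collections import defaultdict


def author_influence(records):
    """Map each author to (work_count, distinct_article_count)."""
    works = defaultdict(int)       # author -> number of records (ISBNs) credited to them
    articles = defaultdict(set)    # author -> set of article titles citing any of their works
    for rec in records:
        pages = rec.get("pages") or []
        # set() guards against the same author being listed twice on one record
        for author in set(rec.get("authors") or []):
            works[author] += 1
            articles[author].update(pages)
    return {author: (works[author], len(articles[author])) for author in works}


def top_by(stats, index, n=50):
    """Rank authors by work count (index=0) or by article count (index=1)."""
    return sorted(stats.items(), key=lambda kv: kv[1][index], reverse=True)[:n]
```

Using a set of article titles per author means an article that cites several works by the same author is counted only once in the second measure.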

The first:

Top 50 Authors by Most Article References

(open in new window)

In this graph we can see, for example, that R.L. Stine has 362 works cited on Wikipedia (or ISBNs included in the page), but in only 18 different articles. Compare that with the #1 spot, the American Council of Learned Societies, which has 1,250 works cited across 3,858 different article pages. You can see the top 10,000 authors by work count in this sheet.

The second approach:

Top 50 authors by most articles

(view in new window)

In this case an author is influential not because of the quantity of unique works cited but because of the number of articles citing their work. For example, Russell Holmesby has nine books cited, but in over 7,000 articles. He wrote The encyclopedia of AFL footballers: every AFL/VFL player since 1897, which is cited on every football player’s article.

You can view the top 10,000 authors by article count in this Google Sheet.

Holding Count

I was curious whether the unique books being cited are widely held at libraries and other institutions. I found this stat by using the Eholding and Holding counts from OCLC’s Classify for each work.

Books held by institution

We can see in this graph that the majority of books cited are held by 0–202 institutions. We can break down that 0–202 group even more:

Books held by institution 0–202 group

The result supports the idea that most books cited on EN Wikipedia are not widely held (based on OCLC’s data). One possible explanation is that articles require domain-specific literature, which would not necessarily be held by a large number of institutions.
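For anyone who wants to reproduce this kind of breakdown, here is a minimal sketch of bucketing holdings counts into fixed-width ranges. The default width of 203 is only inferred from the "0–202" label on the graph and is an assumption, as is the function name. Passing a smaller width gives a finer breakdown of the lowest group.

```python
from collections import Counter


def holdings_histogram(holdings, width=203):
    """Count books per holdings range; keys are inclusive (low, high) tuples."""
    counts = Counter()
    for h in holdings:
        if h is None:
            continue  # skip records with no holdings information
        low = (int(h) // width) * width
        counts[(low, low + width - 1)] += 1
    return dict(sorted(counts.items()))
```

Calling `holdings_histogram(values, width=10)` on just the values below 203 would give the more detailed view of the first bucket.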

Subject Headings

This is probably the least interesting aspect, but we can see which (FAST) subject headings are most prevalent among these books:

Top 50 subject headings used

You can see the top 10K list in this Google Sheet.

Data

I plan to do some more work on this, but I’m making the data available if you would like to play with it:

684,965 records
Newline-delimited JSON file (each line is its own JSON object)
Fields:
'title' : 'title of the book'
'isbn13' : 'isbn 13'
'year' : 'year published'
'isbn10' : 'isbn 10'
'oclc' : 'oclc number'
'authors' : 'array of authors'
'holdings': 'holdings count from oclc'
'oclcOWI' : 'oclc classify ID'
'google' : 'google books id'
'pages' : 'array of wiki article titles'

Download Data
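If you want a starting point, the file can be read one line at a time with the standard `json` module. This is a minimal sketch: the file name in the usage note is hypothetical, and because I am not assuming whether `year` is stored as a string or a number, the counter accepts both and skips anything that is not a four-digit year.

```python
import json
from collections import Counter


def read_records(path):
    """Yield one dict per non-empty line of a newline-delimited JSON file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def books_per_year(records):
    """Count records per publication year, ignoring missing or malformed years."""
    counts = Counter()
    for rec in records:
        year = str(rec.get("year") or "").strip()
        if len(year) == 4 and year.isdigit():
            counts[int(year)] += 1
    return counts
```

For example, `books_per_year(read_records("wikipedia_books.ndjson")).most_common(5)` would list the five most common publication years, with the path replaced by wherever you saved the download.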

There were around 20K records for which I could not find any metadata. These could be resources that were self-published or not in OCLC’s or Google’s book ecosystems, or the ISBNs could simply be bad.

I think some interesting future work could be using these resources as connectors between library classification systems (LCC or subject headings) and the Wikimedia ecosystem. The data could also be used to map out which areas of Wikipedia are underdeveloped by comparing coverage against LCC or other knowledge organization systems.

Beyond books, looking at the metadata behind the DOIs and other citations could also be revealing.
