What’s a citation good for, anyway?

We put a great deal of value on the citation in the academic and library worlds: we count them and collect them, teach people how to use and make them, use them to track down objects and judge quality with them. But we rarely actually articulate the many ways in which citations are used, or what makes one citation valuable and another poor quality, and how to tell.

We also rarely ask these questions on Wikipedia, although we spend a great deal of time (hundreds of thousands of person-hours) creating and verifying citations. We hang notability of entire topics on citations; we emphasize them as crucial to our success as a useful reference source. But we rarely interrogate them. While there’s much guidance on the kinds of things we’re hoping to cite, there’s little on the ideal citation model itself. Citation style on Wikipedia has haphazardly followed a few different academic models and, at least on the English Wikipedia, these have merged into a folk style all its own. Are the ways we cite sources sufficient for what we are trying to do with them? This is a crucial question that, to my mind, has not been answered.

So let’s start from first principles and articulate what a citation is used for, outside of Wikipedia: in an academic paper, in a book, in a court case or patent — anywhere they might show up.

1. First and foremost, a citation acts as an identifier of a unique work. A citation tells you that a certain thing — often a written thing, but also perhaps a visual work, or computer software, a statue or movie or something else in a fixed medium — is meant.

Mowat, Diane, Paul Fisher. Johnson, and Mark Twain. The Adventures of Huckleberry Finn. Oxford: Oxford UP, 2007. Print.
A citation in MLA format to a new edition of Huckleberry Finn by Mark Twain, first published 1885, with crowdsourced data and formatting by the popular Citation Machine website. Note the punctuation errors and that the first author is, confusingly, not Twain. (I have not checked this against the book itself.) What they seem to mean is this work, which is actually a simplified and retold version by Mowat for young readers, not the original.

Of course, a citation usually means (to follow the FRBR model) the work, not the instantiation of the work. When I cite a journal article, or Huckleberry Finn, I usually mean the general idea of that article, or the book written by Mark Twain, not a specific copy of the journal or the book. This isn’t always true though. Anyone who writes about rare books, or art, or webpages, often means to cite a particular unique thing — that copy of that painting in the Louvre, that particular post on Reddit — and extra information must be given in those cases, such as where that unique thing is located.

More commonly, as with the Twain example above, the specific instantiation — the exact book I have in my hands — might not matter, but the version of the work — the specific edition, translation, or printing — might matter a great deal: a page citation in the 2nd printing of the Indian paperback edition of a textbook will have an entirely different page number from the original UK hardback edition, despite citing the same material. One translation of Ovid is not the same as another. This is accounted for in citation styles that deal with literary works, but is treated haphazardly in Wikipedia.

Bear in mind, uniquely identifying a work can be easier said than done: the library and software communities are still trying to figure out how to appropriately cite computer software, or video games, such that the right version is identified. We also often confuse a work that does need a unique identifier (one that only has one location, such as a single webpage) with a work that doesn’t, but happens to have several locations where it might be found, including the one cited (the copy of Huckleberry Finn on Project Gutenberg).

2. A citation’s next job, nearly as important (and of particular interest to reference librarians), is to enable a person to locate the specific work that is meant.

Now that I know what it is, can I get my hands on it to read it? Where does a copy of this work exist? To answer these questions, it’s not enough to just give the author and title of a journal article — if you do, you are putting the burden of searching for extra information in a bibliographic database on the reader. Instead, we give the source title of the journal, and the volume and page — a crude but effective locator system for finding an article in a long run of print journals. Nowadays, unique online IDs serve that purpose, saving everyone’s time by taking you directly to the online location of the article. But, just having the ID is not enough in a citation: if there’s a typo in the number, as there may well be, you need some fall-back information to locate the article; and if there’s no online ID, you usually need the volume, page, etc. even for finding the online copy. Often, it’s only the combination of the author, title and journal that uniquely identifies an article, and makes it possible to find; more information is better.
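The fall-back idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ArticleCitation` record, the `locate` function, the two lookup tables and every value in them (including the DOIs and the journal name) are made up for the example and do not describe any real database or resolver.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ArticleCitation:
    """Hypothetical record holding both an online ID and print-style locators."""
    author: str
    title: str
    journal: str
    volume: str
    page: str
    doi: Optional[str] = None


# Made-up lookup tables standing in for a resolver and a library catalog.
BY_DOI: Dict[str, str] = {
    "10.9999/example.1": "online copy of the example article",
}
BY_LOCATOR: Dict[Tuple[str, str, str], str] = {
    ("Journal of Examples", "12", "34"): "print volume 12, shelved in the stacks",
}


def locate(citation: ArticleCitation) -> Optional[str]:
    """Try the unique ID first, then fall back to journal, volume and page."""
    if citation.doi is not None and citation.doi in BY_DOI:
        return BY_DOI[citation.doi]
    # The ID is missing or mistyped: the older locator fields still work.
    key = (citation.journal, citation.volume, citation.page)
    return BY_LOCATOR.get(key)


# A citation whose DOI contains a typo is still findable through the fall-back.
with_typo = ArticleCitation(
    author="A. Author",
    title="An Example Article",
    journal="Journal of Examples",
    volume="12",
    page="34",
    doi="10.9999/exmaple.1",
)
found = locate(with_typo)
```

A citation that carried only the mistyped ID, with no journal, volume or page, would come back as `None`, which is the situation the paragraph above warns about.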

Unique report numbers, journal volumes, encyclopedia titles: these are all used for location. Generally, these pieces of data can only physically locate the source if they are first mediated through a (sometimes arcane) search system — I use a library catalog to translate a journal title into possible holdings locations, for instance. To aid in all this, the precise form of the citation punctuation — the fact that we put issue numbers in parentheses and pages after that — does serve as a shorthand to these locator fields, and our academic training supports knowing which fields to search on in what search systems (book titles in catalogs, article titles in bibliographic databases, etc). So without these metadata fields, we’re lost.

But even this function, historically, is highly discipline-dependent in citation styles. See this citation:

Aguirre, J. E., Ginsburg, A. G., Dunham, M. K., et al. 2011, ApJS, 192, 4

which is formatted in the recommended format for the American Astronomical Society, a major publisher in the field of astronomy (and is in fact copied from their author instructions). How meaningful is this citation to a historian, a biologist, or a non-academic? As a science librarian, I can tell you off the top of my head that the article was published in the Astrophysical Journal Supplement Series, volume 192, page 4, and that furthermore it’s probably online (because it was published in 2011) but also that it is not included with the journal ApJ (because it’s a supplement). Everyone else who wasn’t trained as a physicist or a physics librarian, however, is left to search around for the wretched thing without an article title or a unique identifier to help them. How meaningful or useful is such a citation format outside of the astrophysical community? Tracking journal abbreviations, the vagaries of report numbers, and odd metadata formatting is the bread and butter of reference librarians, but is also necessary for a truly useful citation system in Wikipedia that draws on the academic literature, which is, without exception, inconsistent and field-specific.
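To make the librarian’s mental decoding explicit, here is a minimal Python sketch that splits a citation of this shape into labeled fields and expands the journal abbreviation. The pattern, the `parse_aas` name and the two-entry abbreviation table are my own assumptions for illustration; the pattern only handles this one layout (authors, year, journal, volume, page), not the full range of references the AAS style allows.

```python
import re

# Tiny illustrative table; a real system would need a far larger mapping.
JOURNAL_TITLES = {
    "ApJ": "The Astrophysical Journal",
    "ApJS": "The Astrophysical Journal Supplement Series",
}

# Assumed layout: "<authors> <year>, <journal>, <volume>, <page>"
AAS_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s(?P<year>\d{4}),\s(?P<journal>[^,]+),"
    r"\s(?P<volume>\d+),\s(?P<page>\S+)$"
)


def parse_aas(citation: str) -> dict:
    """Split an AAS-style reference into named fields."""
    match = AAS_PATTERN.match(citation.strip())
    if match is None:
        raise ValueError(f"not in the assumed layout: {citation!r}")
    fields = match.groupdict()
    # Keep the abbreviation and add the expanded title next to it.
    fields["journal_title"] = JOURNAL_TITLES.get(fields["journal"], fields["journal"])
    return fields


parsed = parse_aas(
    "Aguirre, J. E., Ginsburg, A. G., Dunham, M. K., et al. 2011, ApJS, 192, 4"
)
```

Keeping both `journal` and `journal_title` in the result is deliberate: the abbreviation is what insiders search on, while the expanded title is what everyone else needs.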

Another citation example:

Bennett, P., “Engine Oils and Engine Durability,” SAE Technical Paper 690767, 1969, doi:10.4271/690767.

What is it? Neither fish nor fowl, this is not an article or a book but an SAE technical paper, a kind of report published by the Society of Automotive Engineers and given a unique number. First these reports were published in print in big books with indexes; now they are online. The report number makes perfect sense, if you already know what it is. There is a DOI in this citation to the online (paywalled) version, but that’s a recent addition; for older citations, you’d just have to know to look the paper up by that number in a certain series book. (Plus, for actual findability, many academic engineering libraries have these older SAE papers in print but not online.) So identifying all parts of this citation is crucial for writing a correct citation. The lesson is that restrictive citation systems that don’t leave flexibility and room for things like unexpected and unique report numbers will not always produce citations that are effective identifiers or locators.
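One way to leave that room is to store identifiers as an open-ended set of scheme and value pairs rather than a fixed list of fields. The Python sketch below is an assumption of mine, not a description of any existing citation system: the `Citation` class, its field names and the scheme labels `"doi"` and `"report_number"` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Citation:
    """Hypothetical citation record with an open-ended identifier map."""
    authors: List[str]
    title: str
    year: int
    series: str = ""
    # Any scheme can be stored here without changing the class.
    identifiers: Dict[str, str] = field(default_factory=dict)

    def best_locator(
        self, preferred: Tuple[str, ...] = ("doi", "report_number")
    ) -> Optional[Tuple[str, str]]:
        """Return the first identifier found, in order of preference."""
        for scheme in preferred:
            if scheme in self.identifiers:
                return scheme, self.identifiers[scheme]
        return None


# The SAE paper as cited above, with both identifiers.
modern = Citation(
    authors=["Bennett, P."],
    title="Engine Oils and Engine Durability",
    year=1969,
    series="SAE Technical Paper",
    identifiers={"report_number": "690767", "doi": "10.4271/690767"},
)

# The same paper as an older citation would give it: report number only.
older = Citation(
    authors=["Bennett, P."],
    title="Engine Oils and Engine Durability",
    year=1969,
    series="SAE Technical Paper",
    identifiers={"report_number": "690767"},
)
```

Here `modern.best_locator()` yields the DOI and `older.best_locator()` falls back to the report number, so neither form of the citation loses its locator.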

3. Identifying a thing in the world: related to the above two functions, but slightly different in execution, a citation might be to a thing that exists in the world but is not a human-created work: a chemical, say, or a star, or a species. There are complex, conflicting, overlapping and occasionally proprietary identifying schemes for all of these types of things (and everything else in the world that humans have studied as well). Confusingly, these systems, like article citations, sometimes but not always conflate location and identification functions. ID systems for astronomical objects use international IDs and names which can be used against reference material to help you find the object in the sky, whereas zoological identifiers use a Latin name and the reference to the author/year of the first paper identifying the species, which acts as a kind of unique identifying system and a handy pointer to the literature in one.

Vanessa (Vanessa) Fabricius, 1807
— an example of a species reference, from the Wikipedia article about zoology author citation style

Chemical IDs, such as the proprietary but universally used CAS numbers, however, simply serve to disambiguate that particular chemical, while giving you no information about where the chemical might actually be procured from or found.

4. Credit where credit’s due: don’t plagiarize! We tell our students this, drilling this lesson into them practically from grade school on. Give credit to people whose ideas you are using! We use citations as a mechanism of acknowledgement — letting the world know who we are building on. How effective this is, of course, depends entirely on how clear and precise the citation is and what the text itself that holds the citation says. Citing an entire book, when what you’re quoting is a sentence on a particular page, doesn’t do much good for acknowledgment; neither does citing a whole paper when what you used is a specific figure, which was perhaps itself reused from another source. Here, granularity and specificity to the extent possible is key.

5. Getting credit: here is where academia comes to the fore. We hang entire careers on writing papers and citing them in academia, and having a highly-cited paper is a mark of prestige. We build elaborate systems for counting who cites whom just for this purpose (though all of them leave things out and only work for some fields). We want to know if others have used our work, and we want to trace where and why. Human pride in our own work powers much of scholarship. As an academic community, at least in the American system, citations and citation counting are so important that it is not yet clear how to get academic credit (i.e., tenure review) for things that don’t follow a traditional citation and review model, like blog posts: we only know how to deal with work that falls into particular molds. The burgeoning field of altmetrics seeks to change this by counting who cites whom for all sorts of online work, but it will be slow going. The issue of getting credit in citations is so important that entire disciplinary standards and ethical matters hinge on whose name is listed first in a list of authors, with it being understood that the first author (or in some fields, confusingly, the last author) is the most important and deserves the “most” credit, unless of course there’s another standard or it’s otherwise specified.

6. Educating others about the field and acknowledging our own roots: we cite to let others know we know the field, and to signpost it for them. If I write a paper about the history of encyclopedias, the chances of me citing Diderot are extremely high, whether or not I focus in on the Enlightenment. We like to acknowledge the preeminent works in the field, the first papers, the groundbreaking papers, the central pillars of thought in the community. This is especially true in things like textbooks and encyclopedia articles where the bibliography is meant to educate and guide readers. This can be a distorting factor, of course, in how much a given central paper is cited versus any other paper, whether it’s truly used or not. Of course, another factor that limits the usefulness of such an educational bibliography is how much information is included: one of the biographical dictionaries I use a lot (The Dictionary of Scientific Biography) has extensive and very good bibliographies to comprehensive works about the subject, but doesn’t note the language those works are written in, which isn’t very useful if you track something down and it turns out to be in a language you don’t speak.

7. Padding our resumes: not unrelated to the above two factors, authors are people too, and the urge to cite ourselves is strong, whether rightly (we are making mention of our previous work because we are building on it) or wrongly (we are unnecessarily making mention of our previous work). I once had a graduate student tell me he never used article bibliographies, because they were all just self-cites anyway. That may be a little extreme, but it’s not entirely untrue.

8. Judging the quality of a work: when academics review a paper, we also look at the citations. Is there a comprehensive survey of past work, have the central things in the field been cited, have the things cited been published by reputable publishers and in an appropriate time period? This review gets more stringent the more comprehensive and important the work; people’s dissertations do hang on having a good literature review. All of this of course is highly subjective, and is subject to all of the factors above.

9. Legal precedent: this is a special way in which citations function in the law and in some documents such as patents, where citations to past cases (or inventions) establish what precedent has been invoked. Despite (or perhaps because of?) having such a weighty and important purpose, legal citation is probably the most arcane and impenetrable citation style there is, almost useless to people without training in it. Law is also the only field I know of where there are actual classes — entire academic classes — on how to cite things, which may mean that the lawyers have gone a bit overboard in making it difficult.

Meier v. Said, 2007 ND 18, ¶ 22, 726 N.W.2d 852.
An example of a case citation, from the excellent Introduction to Basic Legal Citation. Though impeccably formatted, how is this useful for anyone who’s not trained in Bluebook?
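For those of us without that training, a citation like this can at least be unpacked mechanically. The Python sketch below is a rough illustration, not a Bluebook parser: the pattern handles only this one layout, the function names are mine, and the two lookup tables hold a single entry each. The readings of "ND" and "N.W.2d" are my understanding of North Dakota’s medium-neutral format and should be treated as assumptions.

```python
import re

# One-entry tables for illustration; the real lists are long.
COURTS = {"ND": "Supreme Court of North Dakota"}
REPORTERS = {"N.W.2d": "North Western Reporter, Second Series"}

# Assumed layout:
# "<case>, <year> <court> <number>, ¶ <paragraph>, <volume> <reporter> <page>."
CASE_PATTERN = re.compile(
    r"^(?P<case>.+?), (?P<year>\d{4}) (?P<court>[A-Z]+) (?P<number>\d+), "
    r"¶ (?P<paragraph>\d+), (?P<volume>\d+) (?P<reporter>\S+) (?P<page>\d+)\.?$"
)


def parse_case(citation: str) -> dict:
    """Split a medium-neutral case citation into named fields."""
    match = CASE_PATTERN.match(citation.strip())
    if match is None:
        raise ValueError(f"not in the assumed layout: {citation!r}")
    return match.groupdict()


def expand_case(citation: str) -> str:
    """Spell the citation out for readers who do not know the abbreviations."""
    f = parse_case(citation)
    court = COURTS.get(f["court"], f["court"])
    reporter = REPORTERS.get(f["reporter"], f["reporter"])
    return (
        f"{f['case']}: opinion number {f['number']} of {f['year']}, "
        f"{court}, at paragraph {f['paragraph']}; also printed in volume "
        f"{f['volume']} of the {reporter}, starting at page {f['page']}."
    )


expanded = expand_case("Meier v. Said, 2007 ND 18, ¶ 22, 726 N.W.2d 852.")
```

Unknown court or reporter abbreviations pass through unchanged rather than raising an error, so a partial table still gives a partial expansion.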

Identification, findability, acknowledgement, getting credit, showing off one’s knowledge and establishing precedent: that’s not bad for a one- or two-line code. But what else do citations do in Wikipedia?

1. Establish notability: we judge notability, at least in the English Wikipedia, based on a rather elaborate and not particularly scientific combination of the number of citations and their “quality”, by which we mean their pertinence to the topic — does the citation cover it in any depth — and the quality of the publisher of that citation.

This quality piece is tricky, because it is so very dependent on the subject being described. If we’re talking about astronomy, it’s not so hard, perhaps; I trust the reviewing standards of the American Astronomical Society, and the other publishers in the field, and I trust that an article published in one of their journals is likely truthful and about something new. (Though of course we also have the arXiv, where most astronomy papers are published today, and poor-quality journals that also don’t review or review badly). But what if our field is general news, or celebrity biography? Who’s to say that a particular newspaper or gossip site is or isn’t reputable, factual or neutral? At any rate, epistemological concerns aside, a citation should have enough information — date and sourcing and language and extent of coverage of the topic — to let us know if it really is useful for establishing notability or not. One line in the New York Times does not a full biography make, but can it be used for establishing notability? We can’t answer that question unless we first know the depth of our citations.

2. Establish quality: as per above, once the topic is written about, we often use the citations to judge the overall quality of the entry, and its approximate trustworthiness, for better or worse.

3. Provenance of facts: unlike most other types of technical or academic writing, Wikipedia is perhaps unique in requiring that everything come from an outside source. We struggle with this: does a footnote at the end of a paragraph mean that the whole paragraph came from that source, or just the last sentence? How can we be sure the nuance of that sentence is actually backed up by whatever the source says? Is the source itself trustworthy, and according to whom (see above)? Do we have any special guarantee the author looked at the source? Can we access it ourselves, and how? (Arguably, a citation to a rare manuscript or out-of-print book that’s only held in one country’s libraries is as useful, to most readers, as no citation at all.)

Where does this leave us?

Citations must be flexible, to deal with the wide variety of identifier schemes and odd citation structures that exist in the world. At the same time, they should respect historical and long-ingrained formatting that enables them to be human-readable, parsable, and useful for their locator duties. They should, ideally, indicate the relationship between the source and the new text. This is a job that has never been historically possible, but may be with new online identifying and annotation systems (I think about the now nearly 20-year-old experimental wiki system PurpleNumbers from Doug Engelbart, which as a proof of concept is still one of the best line and paragraph identifiers I’ve seen).

Citations that are dependent on points in time or unique and possibly ephemeral instances (a dynamic news webpage) should indicate that. Citations should have semantic data sufficient to allow both giving and getting credit. And they should be transferable between different systems: journal abbreviations should map to journal titles, and back again. Law citations should be expandable for the rest of us. And unique IDs should map to the content without losing the information that is encoded in the rest of the citation, however slight that is.

The citation is a small and underappreciated miracle of scholarship: an imprecise encoding device that nearly everyone gets slightly wrong (I’ve never yet seen a paper, published or not, that didn’t have some sort of formatting issue or typo in the citations). And yet, despite all this, citations are instantly recognizable to those trained in their ways, and they serve a multitude of purposes. Citations deserve better than we give them: they deserve to shine.

written for the first WikiCite hackathon, May 2016