The work of paper scoring in peer review
This post is about ‘paper scoring’ in review processes. The key message is that scoring is largely an internal organisational tool, yet one that we have made outward-facing, meaning that scores have also become partial accounts of the organisational process of peer review work (hence opening another front in the war against author disgruntlement). As its fodder, my writing here focusses on CHI, the largest international conference on human-computer interaction research, for which I served as an Associate Chair (AC) for the 2014 edition.
My experiences confirmed that the process is clunky but mostly ‘works’. Yet, one of the major frustrations for authors submitting papers to CHI is the lack of transparency over the program committee (PC) meeting, where decisions get made about papers. As such I hope this little note in some way helps with these frustrations.
The PC meeting resolves the fate of both 10-page and 4-page papers: which will be accepted to the conference, and which will be consigned to rejection (a better term would be ‘recycling bin’). Given that papers published at CHI are taken by some to contribute significantly to academic standing, both within the HCI community and in employment matters, it is unsurprising that the peer-review process attracts equal measures of time, effort and controversy. Furthermore, given the number of papers being submitted (2000+ as of 2014), and the requirement to obtain three anonymous reviews for each paper, the PC has expanded to the size of a conference in itself, with hundreds of attendees. Since you can’t realistically get this many people into a single room to discuss 1000+ papers (i.e., the number of papers that don’t get a straight rejection), the PC process has become organised into various ‘subcommittees’.
The various roles assumed by committee members are relevant here. Subcommittee Chairs (SCs) run each subcommittee and assign papers (10–12) to ACs, in attempted accord with the ACs’ expertise and interests. The critical feature of this organisation is that it is only really ACs who have a) the rights-to-talk vis-à-vis the content of the papers assigned to them and b), correspondingly, the rights-to-decision-making regarding those papers. So although SCs run the process and interrogate ACs about the papers, it is the ACs who, through this interaction with the SCs and other ACs, find that they have assumed significant responsibility for the decision-making process. More on this later.
Many of the complaints about the process centre, implicitly, on the role of scoring. Curiously, this seems to me one of the most overlooked features of reviewing; I’ve rarely heard anyone talk about it directly. Some examples:
- My paper’s score was X, yet another paper I know about with score Y (where Y < X) got in whereas mine was rejected;
- My carefully crafted rebuttal was ignored, and the ACs and/or reviewers just put “I’ve read the rebuttal but it didn’t change my score” to shut me up;
- The scoring doesn’t reflect the tenor of the reviews.
I’m sure I’ve used all of these at some point. The curious thing is that the persistent, almost ritualistic hand-wringing fret-fest indulged in by some members of the HCI community each year focusses almost entirely on the quality of reviews, with nothing said about the scoring procedures.
On scoring
What is this matter of scoring? In the process, each AC finds 3+ reviewers for every paper they are assigned and subsequently produces a meta-review. Each review contains a lump of text (the review itself), an expertise level, and an overall ‘score’ for the paper between 1 and 5 (in increments of 0.5). The score for a paper (as of CHI 2014) is presented to the reviewer with the following associated text:
1.0 Reject: I would argue for rejecting this paper
1.5 Between reject and possibly reject
2.0 Possibly Reject: The submission is weak and probably shouldn’t be accepted, but there is some chance it should get in
2.5 Between possibly reject and neutral
3.0 Neutral: Overall I would not argue for accepting or rejecting this paper
3.5 Between neutral and possibly accept
4.0 Possibly Accept: I would argue for accepting this paper
4.5 Between possibly accept and strong accept
5.0 Strong Accept: I would argue strongly for accepting this paper
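Purely as an illustration of the structure just described (a lump of review text, an expertise level, and a half-point score between 1 and 5), here is a minimal sketch in Python. It is not the actual submission system’s data model; the class, field names and expertise range are assumptions made for the example, though the score labels are taken from the list above.

```python
# Hypothetical sketch (not the real review-system data model) of a single
# CHI-style review: a lump of text, an expertise level, and a half-point score.
from dataclasses import dataclass

# Labels taken from the CHI 2014 review form quoted above (lightly abbreviated).
SCORE_LABELS = {
    1.0: "Reject: I would argue for rejecting this paper",
    1.5: "Between reject and possibly reject",
    2.0: "Possibly Reject: weak, probably shouldn't be accepted",
    2.5: "Between possibly reject and neutral",
    3.0: "Neutral: I would not argue for accepting or rejecting this paper",
    3.5: "Between neutral and possibly accept",
    4.0: "Possibly Accept: I would argue for accepting this paper",
    4.5: "Between possibly accept and strong accept",
    5.0: "Strong Accept: I would argue strongly for accepting this paper",
}

@dataclass
class Review:
    text: str        # the review itself
    expertise: int   # self-reported expertise; the exact range used is assumed
    score: float     # must be one of the half-point values in SCORE_LABELS

    def __post_init__(self) -> None:
        if self.score not in SCORE_LABELS:
            raise ValueError(f"score must be one of {sorted(SCORE_LABELS)}")

    @property
    def label(self) -> str:
        """The descriptive text associated with this review's score."""
        return SCORE_LABELS[self.score]
```

So, for instance, Review("Solid study, thin discussion", expertise=3, score=3.5).label would return the ‘between neutral and possibly accept’ text.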
In the review form, the score appears as a pure indicator on a (linear) sliding scale from reject to accept. We might naively think that the score for a paper is a general overall indicator of its quality; it is certainly presented to authors in that way, and it likely leads to comparisons between authors in which scores are read not only as an indicator of quality but also of the paper’s ‘proximity’ to acceptance into the conference proceedings, and thus as grounds for expectations about how decisions may play out.
Yet in reality scoring is far more complex than this and serves multiple purposes. Much of this complexity is entirely hidden from authors, particularly those who have not served on a committee before. Scores become resources for discussion and are employed by ACs for purposes of rhetoric. By this I mean not ‘merely rhetorical’ flourishes, but the work of convincing, countering, reminding, as well as indicating uncertainty or certainty, rebutting a point from a 2AC, confirming a point with a 2AC, and so on.
But it is uniformity and comparability which form the two chief practical achievements of scoring as a method for peer review:
- Scores give the appearance of uniformity but can have different meanings. For instance, the scores of the AC and the scores of the reviewers are two different things. Reviewers provide the approximate range and spread, and ACs tend to provide a value within that range, yet the AC score means something different even though it is drawn from the very same set of ranking descriptions. ACs can use a score to indicate a preferred stance towards the reviewers (e.g., siding strongly with one side in a split case), but a score can also be used in an attempt to elicit a thoughtful rebuttal (e.g., an AC scoring higher than the average). The uniformity of scoring tends to wear down any trace of these subtleties when it is used in aggregate ways, such as for comparability purposes (next).
- Yet at the same time scores provide a useful comparability ‘fudge’ across papers. Although most would probably accept that there is little to no relevance in comparing one paper presenting a Fitts’ Law study of mobile phones with another presenting a grounded theory study of MMORPGs, scoring provides a useful conceit for practically doing just that. Scoring enables committees to accountably ignore a significant number of submissions by deciding an approximate cutoff value, itself calibrated against ‘external factors’ including venue size, cost, number of days, number of rooms available, etc. (a toy illustration of this kind of triage follows this list).
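To make concrete what this comparability ‘fudge’ can look like in aggregate, here is the toy sketch referred to above. It is emphatically not the committee’s actual procedure: the papers, scores and cutoff value are all invented, and in practice the cutoff is calibrated against the ‘external factors’ just mentioned rather than fixed in advance.

```python
# Toy illustration (not CHI's actual procedure) of how averaged scores plus an
# approximate cutoff let a committee 'accountably ignore' many submissions.
# All papers, scores and the cutoff are invented for the example.
from statistics import mean

papers = {
    "fitts-law-on-mobile-phones": [3.5, 4.0, 3.0],
    "grounded-theory-of-mmorpgs": [3.5, 3.0, 4.5],
    "some-other-submission": [2.0, 2.5, 1.5],
}

CUTOFF = 2.5  # hypothetical value; in practice calibrated against venue size,
              # cost, number of days, rooms available, and so on

for title, scores in papers.items():
    avg = mean(scores)
    fate = "goes forward for discussion" if avg >= CUTOFF else "triaged out"
    print(f"{title}: mean score {avg:.2f} -> {fate}")
```

The point of the sketch is simply that the arithmetic is trivial; the interesting work lies in how the cutoff gets decided and accounted for, and in everything the averaging flattens out.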
Of course (assuming you, the reader, are a reviewer) we all know what the ‘real work’ of paper reviewing actually consists of. Each paper is, and can only be, evaluated on its own basis. Each paper to review is a particular task that is not easily made uniform, nor easily made comparable.
Scoring systems tend to obfuscate this actual nature of peer review for authors receiving reviews, even though these authors are probably reviewers themselves. Two apparently equally-scored papers are actually evaluated on a case-by-case basis, with the particular relevant issues ‘ad-hoced’ together by reviewers and ACs to construct these-particular-criteria, which are and can only be relevant to this-particular-paper at this-particular-moment. While there are ‘generic’ matters said to apply to all papers, such as ‘contribution’, exactly what constitutes a correct evaluation of the contribution for any given paper cannot be determined in the abstract beforehand, but rather is discovered in and through the discussions of that particular paper. The way scores change is indicative of this. The paper’s score is accountable to the discussion that is had, and scores are modified to preserve the congruity of the agreements reached by members of the committee who have the rights to make decisions about papers.
This kind of activity is reminiscent of that described by Zimmerman and Pollner (in their classic article “The everyday world as a phenomenon”) as ‘corpusing’. Their account explains how the methods we use to perform everyday practical actions (like ‘getting the bus’, ‘performing an experiment’ or ‘reviewing’) are at once generic and particular. The relevancies of the moment, the situation, shape which particular configuration and ‘version’ of these methods is to be deployed; this is the situated ‘corpusing’ work that competent members of society engage in as each setting is encountered, in and as it unfolds.
This perspective poses a challenge to other ways of understanding peer review. Typically it is positioned as a flawed process that needs ‘fixing’ via the introduction of stringent ‘scientific standards’, or words to that effect. Yet, as indicated above, the actual experience of peer review reveals that there are no particular a priori scientific standards which fit all cases. Furthermore, when we come to consider the very notion of ‘scientific standards’, the application of these ‘standards’ may not be located in any particular actual scientific practice that we can point at. Instead they belong to a ‘generic’, flattened, non-existent rendering of science, typically one that conflates the discovering sciences (the natural sciences) with the ‘non-discovering’ ones (to borrow Livingston’s amusing characterisation of the human sciences). Examples include the (undirected, unmotivated) drive to consider replicated studies a ‘good thing’ out of a fear of ‘not being scientific enough’, or the notion that the scientific report is a complete document of instruction that can be separated from the profoundly practical actions to which its production is incurably bound. In this sense, peer review is not, and cannot be considered, a ‘scientific process’. But neither is it a ‘literary’ enterprise.
References
Zimmerman, D. H. and Pollner, M. The everyday world as a phenomenon. In: Jack D. Douglas, ed., Understanding everyday life: towards a reconstruction of sociological knowledge. Routledge, 1970, pp. 80–103.
Livingston, E. Making sense of ethnomethodology. Routledge, 1987.
Postscript
Without realising it, I appear to have been (partly) ‘channelling’ Mark Rouncefield, who makes similar-ish comments on the CHI peer review process.
Originally published at notesonresearch.tumblr.com.