Measurement in Academia

On Monday there’ll be a one-day workshop at the University of East Anglia on “Ways of Being in the University”. (The title is purposefully ambiguous.) I’ve been asked to participate in a panel with the above title. I’ve posted here what I intend to say. tl;dr: metrics are often bad and suffer predictable biases, but we don’t know if they’re worse than the alternative.

I’ll begin by saying this: I think I know why I was invited.

I think I was invited because I’m one of two people in the school who regularly use statistics in their research.

As such, I might have been expected to be broadly favourable to measuring things, including things like the quality of research and the quality of teaching.

I think I would describe my position as mixed. I suspect, but cannot demonstrate, that the REF and its predecessors have probably increased the quality of research in British universities, though in doing so they have probably led to a greater reliance on imported students and staff, have definitely created deadweight losses resulting from universities’ need to ‘manage’ the REF process, and have exacerbated inequalities between universities. I think that the TEF is probably well-intentioned, but I have considerable reservations stemming from the lack of thought put into how to measure teaching quality.

I’ll make five main points, which concern academic privilege, the use of metrics as targets, actively biased measures, the unmeasured alternative, and the possibility of good measures supplanting awful measures.

Check your privilege

I think it’s important to recognize that many attempts at measurement are really attempts to manage us, and that not all attempts to manage academics result from an insidious neo-liberal drive.

Let me give you a recent example. Last month, Edinburgh University featured in the Times Higher because of a proposal to require academics to notify their head of department if they were going to be away from their office. Many people characterised the proposals as Orwellian.

I think that those people forgot that most people outside academia, if they are not regularly present at their workplace, end up losing their jobs. Academics have become used to a quite remarkable degree of autonomy, which allows us to turn up at specified times during the week and otherwise be where we want to be. You can argue that this autonomy is a good thing for the university, because we are more productive when we are autonomous. But although it might be self-defeating, it is not unreasonable for universities to keep tabs on who is regularly present in their office, and to require that people be present more often. If we complain about all attempts at measuring things, we’ll end up looking ridiculous. In other words, we need to check our privilege.

Goodhart’s law

My second point is that many measures which would be informative if they had no repercussions for the things being measured cease to be informative once benefits and sanctions are attached to them. This is a paraphrase of what in the UK is most often called Goodhart’s Law, and elsewhere Campbell’s law.

Consider, as an example, the use of contact hours as a metric of teaching quality. I think that, on average, in a system where there are no sanctions, a greater number of contact hours is probably an indicator of greater teaching quality. Unfortunately, when institutions are assessed on the basis of their contact hours, there are strong incentives to create new contact hours which do not respond to a pedagogical need, but exist solely to improve the institution’s performance in terms of certain metrics.

The easiest way to increase contact hours whilst keeping staff costs constant is to move from seminars to lectures. But research by Graham Gibbs has shown that “close contact that involves at least some interaction between teachers and students on a personal basis is associated with greater educational gains”. Adoption of contact hours as a metric would therefore likely drive down teaching quality. A more sophisticated metric, based on contact hours divided by the number of students present during those hours, would be better, but would require more effort to collect; a toy illustration follows.
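To make the contrast concrete, here is a minimal sketch of the two metrics. The session records and numbers are invented for illustration and are not drawn from any actual TEF specification:

```python
# Toy comparison of raw contact hours against a per-student-contact metric.
# The session records below are hypothetical illustrations.

sessions = [
    # (contact hours, students present)
    (1.0, 200),  # large lecture
    (1.0, 12),   # small seminar
    (2.0, 18),   # lab class
]

# Raw metric: total contact hours. A lecture to 200 students counts
# exactly as much as a seminar with 12.
raw_hours = sum(hours for hours, _ in sessions)

# Adjusted metric: each session's hours divided by the number of
# students present, so close personal contact counts for more.
adjusted = sum(hours / students for hours, students in sessions)

print(f"raw contact hours: {raw_hours:.1f}")
print(f"per-student contact: {adjusted:.3f}")
```

Under the raw metric, replacing the seminar with a second large lecture leaves the score unchanged; under the adjusted metric, it collapses it. That is the incentive difference at stake.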

Active bias

My third point is that there are some measures which we should actively resist. In particular, I think we should actively resist the use of teaching evaluations. Teaching evaluations have been shown to be biased, in that female instructors, or rather instructors whom students believe to be female, are evaluated more harshly than male instructors. I say “whom students believe to be female” because I am basing this claim on a study of teaching evaluations of an online course, where students had no face-to-face interaction with their instructors, and where instructors’ reported genders were randomly swapped.

If that weren’t enough, teaching evaluations don’t measure learning gain. Where students are randomly assigned to different instructors, and are subsequently assessed according to a common standard, students do better if they were taught by instructors who received worse evaluations ([1], [2]).

Does this mean we shouldn’t evaluate the quality of our teaching? No! But it means that we should do it by peer assessment, something that we say we do, but in practice do less than we ought, because it’s expensive.

Is the unmeasured alternative better?

My fourth point is that although it is in very many instances possible to identify biases in metrics, it is only possible to identify these biases because of the effort taken to collect information. This does not mean that there were no biases before this information was collected: it just means that we were less aware of them.

To give an example: it’s well known that departments do better in the REF if they have a staff member on the REF panel. This doesn’t necessarily mean that panel assessments are biased. It may be that panel members are drawn from departments which do better work. But the effect remains when controls are included for the quality of that department’s work.
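The logic of that last step can be made concrete with a toy regression. This is a minimal sketch, assuming invented data and variable names (ref_score, panel_member, quality); it is not the specification used in the actual studies of REF panel membership:

```python
# Toy regression: does panel membership predict a department's REF
# outcome once a proxy for research quality is controlled for?
# All data and variable names here are hypothetical illustrations.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "ref_score":    [3.1, 2.8, 3.4, 2.6, 3.0, 2.7],  # REF grade point average
    "panel_member": [1, 0, 1, 0, 1, 0],              # staff member on the panel?
    "quality":      [3.0, 2.9, 3.2, 2.7, 2.8, 2.8],  # prior quality proxy
})

# If the coefficient on panel_member stays positive after conditioning
# on quality, "panel members come from better departments" cannot be
# the whole story.
model = smf.ols("ref_score ~ panel_member + quality", data=df).fit()
print(model.params)
```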

There is, therefore, a possible bias in the allocation of QR funding across universities. But is the system of allocation on the basis of REF or RAE results more or less biased than the previous system, used in the seventies and eighties, of allocating funding through the University Grants Committee? We don’t know.

Crowding out

My final point is an optimistic one. It’s true that many metrics are being created by government as part of a drive to manage or steer universities more intensively. It is also true that many of these metrics are poor measures. However, even poor measures may represent an improvement on the status quo.

That’s because at the moment we have a number of private organizations that create metrics which are, quite frankly, abysmal. The Complete University Guide, the Times, and the Guardian all compile rankings of universities. All three of these rankings include, as a component, the average tariff score of students selected for different courses at different universities.

The tariff score is not a measure of the quality of teaching at a university. Rather, it measures inputs. I find it difficult to understand why these papers, if they are sincere in what they do, include a measure of inputs as an indicator of quality. We wouldn’t judge a hospital by whether its patients were healthier to begin with; we shouldn’t assess universities’ quality that way. If the development of the TEF leads to these measures being supplanted, it will be a positive development.