The GRIM test — further points, follow-ups, and future directions
The last 10 days have been interesting. To recap:
- Nick Brown and I published a pre-print of our paper on the GRIM test, a very simple technique for determining if means with small cell sizes can exist, and what it means if scientific papers are reporting ‘impossible’ data — which they often do.
- (Obviously we are now in the process of publishing a non-pre-print version, so that’s happening too.)
- The pre-print has been downloaded 400-odd times, the article has received around 2750 views, my Medium article explaining the test has had around 10K views, and my first Facebook post has around 17.5K views. This is all without university support, or a press release, or any official promo. We just lobbed the paper into the public consciousness like a damp firework with a lit fuse, and hoped it went off.
- An excellent online calculator appeared courtesy of Jordan Anaya. It is wonderfully straightforward, and a lot neater than my code or our spreadsheet. I’d recommend using that if you need to GRIM test some data.
- The first mainstream article has been written about the technique, which is good news. I hope there are more.
- A firehose of comments, links, and tweets (Twits? Twerts? Insert appropriate noun here, I still think Twitter is Satan Incarnate) have been left / swapped / conferred. Heartfelt thank yous to everyone who left them — I’ve seen some of them, but I’ve also been on holiday for Memorial Day weekend, so I’m still catching up.
Here I want first to clear up a distinction I think is important, and then to answer some of the questions raised so far.
This is not a fraud test, it is an inconsistency test…
…and yes, the framing is important.
- Inconsistency in numbers doesn’t mean anything more or less than that, it’s not a euphemism. It’s “not compatible or in keeping with its description”, full stop.
- The reasons for “honest mistakes” are many.
- Most researchers who bothered to engage with the process of identifying data errors were reasonable and straightforward about any inconsistencies.
And as I said before, what concerns me is the data we can’t see.
That is, we didn’t uncover any fraud. We found a reasonable list of people who a) published papers shotgunned with a worrying number of inconsistencies and b) promised to share data then disappeared, refused to share data, or neglected to talk to us at all.
What horrors await?
“If this test is used for fraud detection, won’t this only catch people who are crap at fraud? Won’t it fail to detect people who make up *data*, and only find people who make up *means*?”
Making up data is a pretty silly crime. It’s like plagiarism in that for you to derive any benefit from it, the deed must be done in full public view. And the more successful your crime is, the more attention your work will garner… that could well make you more likely to get caught.
It’s like stealing money from a casino — a casino is a fine source of money, but it’s a place of drastically increased scrutiny.
Also, “data manipulator” or “fraud” is the academic Mark of Cain, and once you’re bailed up for doing it, you’re forever off everyone’s Christmas card list until the end of time.
(And to extend the casino analogy, steal from the baccarat table and they’ll probably brickbat you in an alleyway.)
The crime graduates from ‘pretty silly’ to ‘borderline thick’ in our new bold era of open data policies (which, as someone who supports open science in general, I am very bullish on… and as someone who is partial to publishing methods papers using existing data for re-analysis, I am thoroughly delighted).
But, overall, yes. Let me be explicit about the capabilities here:
If you invent realistic data, it will pass the GRIM test.
If you p-hack or selectively report real data, it will pass the GRIM test.
If you hide mistakes, fail to report data errors or omissions, or misreport cell sizes or degrees of freedom (through oversight, sloppiness, or incompetence), failing the GRIM test becomes possible.
If you invent your means, failing the GRIM test becomes possible.
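To make those four cases concrete, here is a minimal sketch of the kind of check involved. It is not the code from the paper or the online calculator. It assumes the underlying responses are whole numbers (Likert items, counts and the like) and that the mean is reported to a fixed number of decimals. The function name grim_consistent, the choice to accept both common rounding conventions, and the example values are all my own assumptions for illustration.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_FLOOR,
                     ROUND_HALF_EVEN, ROUND_HALF_UP)


def grim_consistent(reported_mean: str, n: int) -> bool:
    """Return True if some whole-number total over n responses
    rounds to the reported mean (hypothetical helper, integer data assumed).

    The mean is passed as a string so trailing zeros ("3.50") tell us
    how many decimals were reported.
    """
    if n <= 0:
        raise ValueError("n must be a positive cell size")
    target = Decimal(reported_mean)
    # Rounding unit implied by the reported precision, e.g. 0.01 for "5.19".
    quantum = Decimal(1).scaleb(target.as_tuple().exponent)
    approx_total = target * n
    # Only whole-number totals next to mean * n can possibly round to the target.
    low = int(approx_total.to_integral_value(rounding=ROUND_FLOOR)) - 1
    high = int(approx_total.to_integral_value(rounding=ROUND_CEILING)) + 1
    for total in range(low, high + 1):
        mean = Decimal(total) / n
        # Be generous: accept either common rounding convention.
        for mode in (ROUND_HALF_UP, ROUND_HALF_EVEN):
            if mean.quantize(quantum, rounding=mode) == target:
                return True
    return False


# A mean computed from real (or realistically invented) integer data passes.
scores = [3, 5, 6, 7, 4, 5, 6]  # made-up responses, for illustration only
honest_mean = f"{sum(scores) / len(scores):.2f}"
print(grim_consistent(honest_mean, len(scores)))  # True

# A mean typed in without data behind it can fail: with n = 28 the
# nearest attainable means are 145/28 = 5.18 and 146/28 = 5.21.
print(grim_consistent("5.18", 28))  # True
print(grim_consistent("5.19", 28))  # False
```

Note what the sketch cannot do: any mean that comes from actual integers passes, whether those integers were collected, p-hacked, or invented. Only a mean that no set of n whole numbers could produce gets flagged.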
(Something of a side note: the thing I am most looking forward to is seeing how well classic and legacy results, often underpowered and frequently not replicable, will stand up to consistency testing. Of course, the raw data behind some psychology paper from the 60s, 70s or 80s will be long gone, but the presence of inconsistencies will be interesting regardless.)
“What are the dangers of ‘data vigilantism’?”
Glib answer: I don’t know. What are the dangers of inconsistent research?
Long answer: This question obviously means ‘are you encouraging people to be needlessly critical by publicising a method that encourages them to question published findings?’
Obviously, my answer to that would be ‘no’. I think that risk is pretty minor, for a few reasons.
Firstly, we’ve been very clear in the paper, and in anything we’ve written about it elsewhere, and here, that GRIM tests are evidence of inconsistency and not malfeasance.
Secondly, the GRIM test is absolute. It doesn’t return a probability that something funny is happening, it establishes that at the very least a result has been incorrectly reported. Vigilantism implies a loss of proportionality and standard of evidence. GRIM errors at least have the decency to be definitive.
Thirdly, when you try it out a few times, you’ll find that patterns of GRIM errors detect trouble a little like Potter Stewart detected pornography: you know it when you see it. We’ve seen a few papers where most of the results are wrong. Results from multiple groups, multiple cells, multiple experiments. More than that, there’s absolutely no hint whatsoever as to why this is the case. Something is very definitely ‘off’.
Fourthly, scrutiny subsequent to failing the GRIM test can always be addressed by releasing the data behind the figures. It’s a test of the nature of the data itself, it doesn’t require external proof or validation. So, questions the test raises can be put to bed with just the numbers.
Fifthly, the method involves no arcane or complicated mathematics, so you can’t blindside anyone with it. I remember a quote from a few years ago about Uri Simonsohn’s test for detecting data manipulation:
“In this case it has been used like a medieval instrument of torture: the accused is forced to confess by being subjected to an onslaught of vicious p-values which he does not understand.”
This is unlikely to happen here, unless the GRIM test is leveled at someone who can’t count their toes. The proverbial author who blinks in non-comprehension as their results get assaulted by a process they don’t understand is not invited to this party. The method is just too simple.
Speaking of which…
“It’s a bit simple, isn’t it?”
One of the reasons we pre-printed this paper was to save ourselves the embarrassment of having ‘re-discovered’ some old technique that had passed us by, and then having to edit a formal paper to say “Apparently, we never read widely enough to know we weren’t the first people to try this”. I’m still paranoid this will happen, and that some reference to the test will be unearthed in an obscure meta-science paper from the 70s.
(However, as we both read papers like that and haven’t found anything relevant, and neither has anyone else, we might be OK.)
“I did the GRIM test on a paper, and something is wrong — what should I do?”
Hard to say out of context. Here’s what I’d do. Mileage may vary.
First of all, re-check it and assume nothing. A few times we managed to misunderstand a paper completely, and found later that the inconsistencies we had identified were mistakes of our own.
Second of all, if you are certain you have uncovered inconsistencies, how many are there, and how much of the paper do they cover? Individual typos or a few poky values may be indicative of nothing but the fact that some people have forgotten how to round numbers properly or that numeric keypad keys are super close together.
Third, if you are satisfied there is a case to answer because of the extent or proportion of inconsistencies, write to the author(s). Be collegial and straightforward, ask for things nicely, demand nothing.
Priority through the whole process: improve science.
Not at all priorities: unpleasantness, imperiousness, high-handedness, or an outlet for your unresolved childhood issues with perceived authority.
“GRIM test? Really? Who came up with that? Is one of you into heavy metal or something?”
Answers respectively: Yes. Yes. Me. Yes, me.
“What are you doing next?”
Don’t know yet. Still finishing this, technically. The time period including “next” starts post-publication.
P.S. One last thing, if you’re writing about this in any popular or commercial publication, please refer to the pre-print as “Brown and Heathers (2016)”, not any other variant, or any other order of authorship. This sort of thing is important in academia.
P.P.S. Thank you for being part of the sustained and growing interest this paper is receiving. It is both gratifying and good fun.
P.P.P.S. As per usual:
My Medium (it has longer writing)
My Facebook (it has shorter writing and snark)
My Podcast (it has science… and snark)