Statistics 101: The MAGIC criteria
About 100 years ago, Ronald Fisher introduced statistical hypothesis testing. Fisher knew what he was doing and, in the situations he was involved in (testing fertilizers and such), what he was doing made sense. But those methods got applied much too widely and people quickly started complaining.
More recently, the complaining has started to take effect, with important groups noting the problems with significance testing and p values. But … if not that, what?
The MAGIC criteria are put forth in Statistics as Principled Argument by Robert Abelson. It’s an easy read, with few formulas but lots of wisdom. I urge those interested in this stuff to go buy a copy.
Abelson lists five criteria by which to judge a statistical argument. He calls them the MAGIC criteria:
1. Magnitude: How big is the effect?
2. Articulation: How precisely stated is it?
3. Generality: How widely does it apply?
4. Interestingness: How interesting is it?
5. Credibility: How believable is it?
We can tell how big an effect is through various measures of effect size. We will get into some of these in later diaries, but some of the common ones are correlation coefficients, the difference between two means, and regression coefficients. Big effects are impressive. Small effects are not. How big is big depends on context, and on what we already know. If we find, for example, that a new diet plan lets people lose (on average) 10 pounds in a month, that’s pretty big. 10 ounces in a month is pretty small. But if the diet were tested on rats, 10 ounces might be a lot.
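To make a couple of these measures concrete, here is a minimal sketch in Python using only the standard library. The numbers are made up for illustration (they are not from any real diet study), and the function names are my own. Besides the raw difference between two means and a correlation coefficient, it computes Cohen's d, a standardized difference that the text above does not mention; I include it only because it is one common way to put "how big is big" in context by scaling the difference by the spread of the data.

```python
from math import sqrt
from statistics import mean, stdev


def mean_difference(a, b):
    """Raw effect size: difference between two group means."""
    return mean(a) - mean(b)


def cohens_d(a, b):
    """Standardized mean difference, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical pounds lost in one month (invented data, for illustration only).
diet = [12.0, 8.0, 11.0, 9.0, 10.0]
control = [2.0, 0.0, 3.0, 1.0, -1.0]

print("difference in means (pounds):", mean_difference(diet, control))
print("Cohen's d:", round(cohens_d(diet, control), 2))

# Hypothetical weeks on the diet vs. total pounds lost.
weeks = [1, 2, 3, 4]
pounds = [2.0, 4.5, 6.0, 9.0]
print("correlation:", round(pearson_r(weeks, pounds), 3))
```

The raw difference is in pounds, so you still need context to judge it (people or rats?). The standardized version compares the difference to how much individuals vary, which is one reason the same 10 ounces can be small in one setting and large in another.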
Articulation is measured in what Abelson calls Ticks and Buts. A ‘tick’ is a statement, and a ‘but’ is an exception. The more ticks the better, the fewer buts the better. There are also blobs, which are masses of undifferentiated results. Blobs are, as you might have guessed, bad.
Generality refers to how general an effect is. Does it apply to all humans everywhere? That would be very general. Or does it apply only to people who have posted 50 or more diaries on dailyKos? That would be pretty specific. Usually, more general effects are of greater value than more specific ones, but you should be sure that the study states how general it is.
Interestingness is very hard to measure precisely, but one way is to say how different the reported effect size is from what we thought it would be. For example, I once read a study that showed that Black people, on average, earn less than White people. Upsetting, but not interesting. I knew that already, and the size of the difference was large (which I thought it would be) but not huge (which I also knew, because, after all, even the average White person doesn’t earn all that much). But then it went on to say that, while Black men earned a lot less than White men (more than I thought the difference would be), Black women and White women earned almost the same (that’s really interesting! I would have thought that Black women earned much less than White women!).
Finally, credibility. The harder a result is to believe, the more stringent you have to be about the evidence supporting it.