SPRITE Case Study #3: Soup Is Good (Albeit Extremely Confusing) Food
With apologies to Jello Biafra.
I know this one is a little late, but I have been by degrees [a] out of town, [b] unbelievably ill, and [c] very busy.
Also, as you’ll shortly see, this one is going to take a while, because we’re going to deal with some more involved SPRITE issues, specifically:
(1) assumptive calculation, and (2) a between-subjects 2x2 ANOVA.
I feel old just looking at these words, because they’re code for “this is going to get fiddly as all hell”, and I’m extremely conscious of boring you. Rest assured, I will cram as many unnecessary and oblique jokes into the following as possible.
And, of course, there will be soup jokes. Why soup jokes?
“Out of Sight, Out of Mind”: Pantry Stockpiling and Brand-Usage Frequency.
Wansink and Deshpande, 1994.
…“Each subject was then given a booklet with instructions that manipulated their usage salience of canned soup. Usage-related salience was manipulated by asking subjects in the high usage-related salience condition to describe the last time they ate or served canned soup… and the thoughts they had when doing so…”
This paper was published in 1994 in a marketing journal called Marketing Letters. It has 84 citations at present and, as you might have guessed by now, it is NOT in amazing shape.
Let’s start with a quick “carrots” test.
Consider the mean / SD in green. What is it?
“To make their inventory levels vivid, subjects in both stockpiling and the nonstockpiling conditions were asked to write down the number of cans of soup that they visualized having in inventory.”
This is a bit of a funny measure, as it isn’t 100% clear if this is a hypothetical pantry, an idealized pantry at home, or their actual real pantry that they’ll see when they walk through the door and kick their boots off. But it doesn’t matter to the numbers. It’s all soup to me.
We also don’t have a cell size. As there are 191 participants who report serving soup, and we split those into 4 groups, we can assume n ≈ 191/4, or 48 per cell. We’ll use this assumption the whole time, as fiddling with it won’t give us the latitude to change much.
So, let’s ask SPRITE to generate some possible distributions for the combination:
mean = 8.0
SD = 3.3
lbound = 1 (assumed; these are people who report serving it)
ubound = infinite (assumed; even though Infinite Soup sounds like a bad pop punk band from the embarrassing part of the 90s)
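For anyone who wants to see the moving parts, here is a minimal sketch of the kind of search being asked for. It is a simplified stand-in and not the actual SPRITE code: the names `sample_sd` and `sprite`, the pair-shuffling rule, the step limit and the seed are all illustrative assumptions. It starts with everyone as close to the mean as possible, then moves single cans between pairs of people (which leaves the mean untouched) until the SD rounds to the target.

```python
import math
import random

def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def sprite(mean, sd, n, lo, hi=math.inf, dp=1, max_steps=20_000, seed=0):
    """Look for ONE set of n integers in [lo, hi] whose mean and SD both
    round to the targets at dp decimals. Returns a sorted list, or None."""
    rng = random.Random(seed)
    total = round(mean * n)
    if round(total / n, dp) != round(mean, dp) or not lo * n <= total <= hi * n:
        return None  # no integer sample of this size has that mean
    base, extra = divmod(total, n)
    values = [base + 1] * extra + [base] * (n - extra)  # flattest possible start
    for _ in range(max_steps):
        current = sample_sd(values)
        if round(current, dp) == round(sd, dp):
            return sorted(values)
        i, j = rng.sample(range(n), 2)
        if values[i] >= hi or values[j] <= lo:
            continue  # the move would leave the allowed range
        # Moving one unit from j to i keeps the sum fixed. It raises the SD
        # when values[i] >= values[j]; otherwise it lowers it or changes nothing.
        widens = values[i] >= values[j]
        if widens == (current < sd):
            values[i] += 1
            values[j] -= 1
    return None  # gave up: the target may be impossible

# The combination above: n = 48 is the assumed cell size, not a reported one.
solution = sprite(mean=8.0, sd=3.3, n=48, lo=1)
print(solution)
```

A `None` result only means this particular search gave up; it hints that the combination may be impossible but does not prove it.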
What’s that look like?
You know what? That is 100% fine.
I mean, having more than 12 cans of soup seems a bit weird to me personally… but I might think differently if I had 5 kids and a walk-in pantry.
Or if I ran a smouldering-hot soup kitchen.
The only problem is…
3.3 isn’t the standard deviation, it’s the standard error. Whoops.
So, let’s get the SD, which will be sqrt(48)*3.3…
Which is 22.9.
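The arithmetic is just SD = SE × sqrt(n). As a quick sketch (the helper name `se_to_sd` is made up for illustration, and the cell size of 48 is the assumption from earlier, not a figure the paper reports):

```python
import math

def se_to_sd(se, n):
    """Convert a standard error of the mean back to a standard deviation."""
    return se * math.sqrt(n)

cell_n = round(191 / 4)  # assumed equal split of 191 participants over 4 cells
print(cell_n)                           # 48
print(round(se_to_sd(3.3, cell_n), 1))  # 22.9
```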
Now, don’t get me wrong. This is totally possible. We have an unrestricted upper bound, and someone might work for Campbell’s. The only problem is …
What’s described is some kind of post-Soviet nightmare, where most people get a month’s soup ration (“ONE CAN ONLY. END OF THE LINE.”) but party insiders get to gorge on enough to give Andy Warhol a headache. Specifically, 40-something people all reporting having ONE can, and then everyone else owns somewhere from a cupboard full of Campbell’s to having a one-bedroom bungalow complete with indoor plumbing literally made from tins of Cream of Mushroom.
Similar solutions? Well, they’re similarly ridiculous.
So, why did I lead you astray with the SD thing?
Because we make some very charitable assumptions when we check papers for inconsistencies, and it’s appropriate to do so. One of the most common ones is that the authors have confused SD and SEM — quite often, just this simple typo can explain why all observable statistics in a whole paper are irreparably busted.
In other words, we choose between:
A) the paper says SEM correctly, data is seven shades of wacky
B) the paper says SEM incorrectly, is actually SD, data is reasonable
It is responsible to check both of these, so we do that. This is assumptive calculation: the Steel Man version of checking statistics for veracity. As far as I’m concerned, this is the only fair way to do business here.
So, there’s still lots more values available in the paper. Let’s test some more and see if we can resolve anything else.
The ‘soup attitude’ measure in green is a neat value pair. Let’s test that one.
On the top here is the SD assumption, on the bottom is the SE assumption:
So, the top panel (m=5.5, sd=1.5) looks pretty damned normal, but the bottom panel (m=5.5, sd=10.4!) has what I call the ‘horns of no confidence’, which is where an incorrect, impossible or unlikely value set has all its constituents stacked into its highest or lowest bins to try to meet a ludicrously high SD. In fact, SPRITE doesn’t even get halfway to a solution. I asked for SD = 10.4, but it topped out entirely at SD = 4.01… there’s no more deviation left to get.
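That ceiling can be checked directly, without any searching. For a fixed mean on a bounded integer scale, the SD is largest when everyone sits at one end or the other (with at most one person in between to make the sum work). The sketch below assumes a 1 to 9 response scale and n = 48; neither is stated above, and the scale in particular is a guess that happens to reproduce the 4.01 figure. The function name `max_sd` is hypothetical.

```python
import math

def max_sd(mean, n, lo, hi):
    """Largest possible sample SD for n integers in [lo, hi] with the given mean."""
    total = round(mean * n)
    # Put as many people as possible at the top, the rest at the bottom,
    # and let one person absorb any leftover.
    k, rem = divmod(total - n * lo, hi - lo)
    if k >= n:
        values = [hi] * n
    else:
        values = [hi] * k + [lo + rem] + [lo] * (n - k - 1)
    m = sum(values) / n
    return math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))

# Assumed: 1-9 scale, 48 people per cell.
print(round(max_sd(5.5, 48, 1, 9), 2))  # 4.01
```

Under those assumptions the most extreme arrangement is 27 people answering 9 and 21 answering 1, and even that only reaches 4.01, nowhere near 10.4.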
Anyway, sod this — let’s get real and do the lot. In other words, we jam SPRITE on ALL the 20 values I’ve listed in the table above, and see what the distributions look like. Any which can exist I’ll mark blue, any which can’t I’ll mark in red.
And here we go with…
The horns! The horns!
The SEs just aren’t possible.
And that brings us to the really big problems.
1. The SDs aren’t that likely either
Some of those distributions, possible as they may be, are verging on a little silly.
2. The ANOVA from the SDs doesn’t return the right F values
It’s easy to run an ANOVA from summary statistics like this; you can do it in Excel or on the back of an envelope. Because it isn’t 1965, I have a little scriptlet which does it for me. Let’s just take the first line of these values, which I’ve highlighted.
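For the curious, a summary-statistics 2x2 ANOVA can be sketched in a few lines. This is not the scriptlet mentioned above, just an illustration under stated assumptions: a balanced between-subjects design with the same n in every cell, and the hypothetical name `anova_2x2`. It returns F-values only; turning them into p-values needs an F distribution, which the standard library does not provide. The cell values in the usage example are made-up placeholders, not figures from the paper.

```python
def anova_2x2(means, sds, n):
    """Balanced between-subjects 2x2 ANOVA from cell means and SDs.

    means, sds: dicts keyed by (a, b) with a, b in {0, 1}
    n: participants per cell (assumed equal in all four cells)
    Returns (F_A, F_B, F_AxB), each on 1 and 4 * (n - 1) degrees of freedom.
    """
    cells = [(a, b) for a in (0, 1) for b in (0, 1)]
    grand = sum(means[c] for c in cells) / 4
    a_marg = [(means[(a, 0)] + means[(a, 1)]) / 2 for a in (0, 1)]
    b_marg = [(means[(0, b)] + means[(1, b)]) / 2 for b in (0, 1)]
    # Each effect has 1 df, so each mean square equals its sum of squares.
    ss_a = 2 * n * sum((m - grand) ** 2 for m in a_marg)
    ss_b = 2 * n * sum((m - grand) ** 2 for m in b_marg)
    ss_ab = n * sum(
        (means[(a, b)] - a_marg[a] - b_marg[b] + grand) ** 2 for a, b in cells
    )
    # With equal cell sizes the error mean square is the average cell variance.
    ms_error = sum(sds[c] ** 2 for c in cells) / 4
    return ss_a / ms_error, ss_b / ms_error, ss_ab / ms_error

# Placeholder numbers purely for illustration.
example_means = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 5.0}
example_sds = {cell: 2.0 for cell in example_means}
print(anova_2x2(example_means, example_sds, n=10))  # (10.0, 10.0, 0.0)
```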
Now, if I hoover up the means and SDs from the left in red, put them in my summary 2x2 ANOVA calculator, I get for the green values:
F=16.4, p=7e-5; F=1.4, p=0.23; F=2.8, p=0.1
In other words, the pattern of significance is the same… but the values are WAY off. In fact, if we divide all those numbers by three…
F=5.5, F=0.5, F=0.9
Wait a minute — that’s really close! Close enough that if we fiddled with our values a bit, we might actually get the right answer?
Let’s try the next line:
F=13.7, p=3e-4; F=0.7, p=0.4; F=4.2, p=0.04
Divided by three for no reason, though:
F=4.6, F=0.2, F=1.4
Doesn’t quite work. Am I even calculating the right things??
3. The rest of the text contradicts the table.
I didn’t read the text carefully enough. Nick Brown, who kindly agreed to read this for me to make sure I wasn’t going mad, pointed out:
“If we focus only on the two columns in Table I where usage-related salience is high (columns 2 and 4), we see that subjects in the stockpiling condition more strongly believed (6.4 versus 5.2) that “canned soup goes well with other foods” F(1,94) = 5.6; p < .05) and more strongly believed (7.5 versus 6.5) that “canned soup can be eaten at any time” F(1,94) = 5.7; p < .05).”
While the table describes the right-hand side columns as ‘ANOVA RESULTS’ listed overall for ‘stockpiling effect’, the text contradicts the table as it implies that the ANOVA is restricted to JUST the 2nd and 4th values. **Footnote 1**
The only problem is, if you run that ANOVA (which is a t-test in a costume mask, really), you get F(1,94)=18.1… which is an even worse solution!
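The costume-mask remark is literal: with two groups, the one-way ANOVA F is exactly the square of the independent-samples t. A small sketch, assuming equal group sizes (F(1,94) is consistent with two cells of 48, since 2 × 48 − 2 = 94). The function names are hypothetical, the means are the 6.4 and 5.2 quoted above, and the SDs of 1.5 are placeholders rather than values taken from the table.

```python
import math

def two_group_f(m1, sd1, m2, sd2, n):
    """One-way ANOVA F on (1, 2n - 2) df for two groups of equal size n."""
    ms_between = n * (m1 - m2) ** 2 / 2
    ms_within = (sd1 ** 2 + sd2 ** 2) / 2  # pooled variance when sizes are equal
    return ms_between / ms_within

def two_group_t(m1, sd1, m2, sd2, n):
    """Independent-samples t (pooled variance) for two groups of equal size n."""
    return (m1 - m2) / math.sqrt((sd1 ** 2 + sd2 ** 2) / n)

# Means from the quoted text; the SDs are placeholders for illustration only.
f_value = two_group_f(6.4, 1.5, 5.2, 1.5, n=48)
t_value = two_group_t(6.4, 1.5, 5.2, 1.5, n=48)
print(f_value, t_value ** 2)  # the two numbers agree
```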
However, the above descriptions just cover the lower half of the table — what about the figures in the upper half, the ones we started with?
Subjects who visualized a stockpiled pantry and who had high usage related salience estimated they would use approximately twice as many cans each month (X = 8.0 cans) when compared with subjects with low levels of usage related salience (X = 3.7 cans) or when compared with subjects who had nonstockpiled levels of inventory (X = 4.7 cans). This interaction was significant for both the number of cans they expected to use in the upcoming month F(1,187) = 6.3; p < .01, as well as for the number of occasions in which they estimated they would eat soup F(1,187)= 6.0; p < .01).
Now, that IS what’s described in the relevant section of the table. So, the text and table agree with each other. However, instead of the F-values reported there, I get:
F=10.0, F=33.4, F=30.4
F=3.5, F=8.2, F=8.3
So, it’s consistent from text to table… but I can’t see how it’s correct.
Something went badly, badly wrong here.
- The numbers stated to be standard errors are either unlikely or, more commonly, impossible.
- Even if we charitably assume the authors meant SDs, a few of the distributions look unusual and concerning.
- The F-values don’t match the descriptive statistics.
- The descriptions of the values in text don’t match the table.
Do I know what the hell is going on here? Not really.
At some point, an investigation like this gets bogged down by all the interlocking pieces not fitting together, so much so that I run the very real risk of saying something silly myself. There’s no solid ground to stand on.
So, that being said, I am 100% happy to admit I’m wrong about some of the details. But I’m also 100% certain I’m not wrong about all of them.
Finally, a truly sobering note: this is obviously part of a larger body of work which is gathering serious momentum.
34 publications from Wansink which are alleged to contain minor to very serious issues,
which have been cited over 3300 times,
are published in over 20 different journals, and in 8 books,
spanning over 19 years of research.
And those numbers all need updating.
**Footnote 1**: This is a patently ridiculous situation. I try to keep anything which (a) interferes with the narrative of checking a paper, which is already a complex pain in the arse, or (b) has nothing to do with SPRITE out of posts like this, but not this time. The authors have reported this ANOVA so badly it’s almost impressive. The main effects are confused with the in-group effects. There are no planned contrasts. There is no mention of what happens to the error rate. All of this is rote in second-year undergraduate statistics, not mystical secrets conferred on a mountaintop by a chap with big eyebrows in a loincloth.
More embarrassing still is the failure of the reviewers to notice anything untoward, and on the very summit of Mount Embarrassment, the fact that this has been in print since I was in primary school, cited almost 90 times, and no-one ever managed to point out the very obvious fact that it was stuffed. This is exactly the kind of example I think of when someone tells me, smugly, ‘science is self-correcting’. It bloody well is not, unless we correct it.