Why am I doing this? Don’t I have better things to do?
The whole thing is steeped in irony.
It wasn’t that long ago I was a PhD student sitting in a room full of inept professors with pompous, superfluous titles who ended my academic career, and now here I am returning the favor.
If our current political climate has taught us anything, it is that if you have the ability and platform to speak out, you should.
When I was a student there was nothing I could do about all the terrible science around me, or the mistreatment of trainees. But now I can.
I don’t go out looking for bad science, but if it finds me should I ignore it and hope someone else deals with it?
People talk about the self-correcting nature of science as if it is some sort of natural law.
No, science corrects because scientists correct it.
Have you ever thought about how the smallest decisions can affect your life?
In this case a famous researcher wrote what he thought was an innocent blog post providing advice on how to be successful in academia.
The post went unnoticed for a month until one of his colleagues found it and emailed it around.
The post then got shared on Twitter and I noticed it.
I then had lunch with my friend and told him about this crazy post I had just read. He asked if I had checked the papers with granularity testing. I hadn’t.
As soon as I got home I looked at the papers and sent out this tweet:
The rest is history.
Along with Tim van der Zee and Nick Brown I wrote a preprint detailing over 150 inconsistencies in the papers mentioned in the blog post.
The preprint was downloaded 3,000 times and has an Altmetric of 160.
It inspired four posts from the prominent statistician Andrew Gelman.
The researcher claims corrections to the papers will be issued.
I don’t know Brian Wansink, and was unfamiliar with his work until I saw his blog post. I have no interest in food psychology, or psychology in general.
But I am interested in how academia selects for bad science, how it is free from any outside regulations that might prevent a crisis like the housing bubble, and how its power structure allows senior members to behave like dictators.
Brian’s blog post somehow managed to touch on all of these subjects. I don’t know how Brian runs his lab, or how carefully his work is done; all I know is what I can see.
And what I see I’ve seen before, only close up. And others see it, have seen it, or will see it. Science in academia is not about performing science, it is about your brand.
The Cornell Food and Brand Lab could not be more aptly named.
Actually this whole story is too good to be true.
We are in the midst of a reproducibility crisis in science, and Brian writes a post not just admitting to questionable research practices, but presenting them as the ideal way to perform science and bragging about how many publications they led to.
It was so unbelievable the first comment on the blog post asked if it was satire. This isn’t the first time someone thought Brian’s work was satire. His work was once mistaken for an April Fools’ joke.
However, I knew it wasn’t satire because as I said, I’ve seen this before. In fact, I immediately went to archive the page in case it got taken down, but someone beat me to it:
If you were to go into the lab and create someone who perfectly embodied all the problems science is currently facing, you couldn’t do better than Brian Wansink. It’s like how talk show hosts couldn’t dream up a better presidential candidate than Donald Trump.
Actually, the parallels with Trump are striking.
Just as Trump has the best words and huge ideas, Wansink has “cool data” that is “tremendously proprietary”.
Trump’s inauguration was the most viewed in history, period, and Wansink doesn’t p-hack, he performs “deep data dives”.
Trump’s policies may be influenced by financial interests. Wansink has done work for McDonald’s, and here he is on Twitter:
Trump doesn’t let facts or data get in his way, and neither does Wansink. When Plan A fails, he just moves on to Plan B, Plan C…Plan ? And when it is discovered his papers contain dozens of errors he doesn’t think “the significance levels will be different at all” once they are fixed.
The appearance of being scientific can increase persuasiveness.
Brian Wansink is known as the “Sherlock Holmes of food”. I like the BBC show so I take offense to this. From where I’m standing it seems the “Donald Trump of food” would be a more appropriate moniker. And hey, I needed a title for this post.
This post isn’t about Brian Wansink — well, technically it is — it is about what he represents, or at least appears to represent. He is not the only researcher to remind me of Trump, and I hope this post helps others to recognize the problems around them or encourages others to speak out.
I will say one thing: unlike Trump and other researchers I know, Wansink has been cordial in his discussions, and as far as I can tell seems like a nice guy. Hopefully he learns from this experience and uses his influence to spread the manifesto of reproducible science.
As it stands now, the wrong papers get published, the wrong researchers get funded. There is no incentive to share data or perform careful science. The only thing that matters is your brand, and your ability to leverage that brand into publications and grants, which circle back to feed the brand. If that means performing sloppy research, exaggerating results, and then refusing to acknowledge any errors, so be it.
But this isn’t how science should be done.
Brian Wansink admitted to taking a study that got “null results” and exploring the data until he got four papers out of it. The papers were published in obscure journals, and were peppered with self-citations, but conspicuously did not cite each other.
When readers of his blog raised concerns about both the questionable research practices employed and work environment described, Wansink agreed with the commenters and posted an addendum. Despite acknowledging the readers’ concerns, he somehow performed enough mental gymnastics to convince himself the problems he agreed exist didn’t apply to him.
I imagine he thought the storm had passed at that point, but he didn’t plan on someone like me coming along.
After my collaborators and I discovered the unprecedented levels of inconsistencies in his papers, we emailed two of the corresponding authors to see if we could get access to the data. We received no reply.
When I get questions about my research I respond immediately, which should be the norm.
We then emailed the Cornell Food and Brand Lab directly and finally did get a reply, which explained that because of the IRB we would have to get approval to see the data (I think they assumed this collaboration would lead to a fifth or sixth publication with this data set). Umm, I’m not sure why the data can’t be anonymized, but whatever, we were willing to go through the IRB approval process. However, when we replied that we had identified some problems with their papers and were hoping to see what had happened, we received no response. After posting our preprint we sent them a courtesy email to let them know about it. Again, no response.
The preprint was downloaded 2,000 times in the first day, and researchers posted to Wansink’s blog, PubPeer, and tried to engage him on Twitter.
Wansink continued to tweet and write blog posts like nothing had happened.
An entire week after our preprint was posted Wansink suddenly commented on his blog and PubPeer. He said that once he learned of the inconsistencies he contacted all of the editors to see if he could make the necessary corrections. This simply isn’t true, since we notified his lab there were problems with the papers two weeks before posting our preprint, and he was made aware of the preprint as soon as it was posted.
Some may argue that when we contacted the lab we should have been more forthcoming with what we had found or provided them a draft of our preprint. However, the errors were so clear cut we didn’t feel we needed their input, and besides, for all we knew they weren’t even reading our emails.
My feeling is that Wansink was hoping the situation would blow over, and only decided to respond once he was contacted by journalists. If someone emailed you notifying you of errors in your papers wouldn’t you be interested to know what those errors were? Journalists shouldn’t have to get involved to get a response.
And when he finally did acknowledge there were problems he referred to them as “minor inconsistencies”. There are over 150 inconsistencies in these four papers, and I honestly have no idea if any of the numbers are correct. If these are minor problems I would hate to see what large problems look like.
Perhaps Andrew Gelman articulated it best:
Let me put it this way. At some point, there must be some threshold where even Brian Wansink might think that a published paper of his might be in error — by which I mean wrong, really wrong, not science, data not providing evidence for the conclusions. What I want to know is, what is this threshold? We already know that it’s not enough to have 15 or 20 comments on Wansink’s own blog slamming him for using bad methods, and that it’s not enough when a careful outside research team finds 150 errors in the papers. So what would it take? 50 negative blog comments? An outside team finding 300 errors? What about 400? Would that be enough? If the outsiders had found 400 errors in Wansink’s papers, then would he think that maybe he’d made some serious errors.
Like Gelman, I was curious to find out, so I took a look at some more papers. I looked at papers by Wansink with the most citations and which appeared the easiest to check for granularity inconsistencies and test statistic errors. My analysis is by no means comprehensive.
I am aware that Wansink has updated his Addendum II (his third update to his blog post), and he seems to recognize there are problems with how his lab has performed research and to express a desire to do better. I hope this is the case.
It’s just hard to know what to believe. After being confronted with a litany of errors in his papers he claims to run a “group that’s accurate to the 3rd decimal point”. It’s also hard to hear what he’s saying over all this quacking.
The point of exposing these further errors is not to get these papers retracted. I don’t know if these papers are actually wrong, or if the inconsistencies I’ve found are even errors rather than simply typos or a misunderstanding on my part.
Most of the literature is wrong; this is just a reminder that we need to be vigilant. It is also your daily reminder that peer review is useless and everyone should instead be preprinting their work.
I am choosing to share this small sample of inconsistencies in a blog post since that is the fastest means of scientific communication. But, as you will see, I already have the data nicely formatted in LaTeX and can share the results in a more formal medium if necessary. I actually wanted to share my pizza error findings as quickly as possible, but my collaborators wanted to practice restraint, contact the lab to try and get the data, submit a formal publication, etc., etc. But now that the preprint is out the shackles are off.
LET’S DO THIS!
“The office candy dish: proximity’s influence on estimated and actual consumption”
Google Scholar citations: 203
This study design was a little complex. There were 40 secretaries, but they were divided into 4 groups of 10, and over a period of 4 weeks the 4 groups were rotated through different conditions. Long story short, the sample size is 40, they were asked Likert style questions, so this paper is suitable for granularity testing.
When you have data reported to whole numbers and a sample size of 40, the only possible decimal fractions a mean can end in are multiples of 1/40 = .025: .000, .025, .050, .075, and so on.
As a result, if you round up the only possible numbers in the second decimal place are 0, 3, 5, and 8. If instead you perform bankers’ rounding you can get 0, 2, 5, and 8 in the second place. And if you perform random rounding you can get 0, 2, 3, 5, 7, and 8 in the second decimal place.
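The digit lists above can be reproduced with a short Python sketch. The function name is made up for illustration, and "random rounding" is modeled here as letting each exact half go either up or down, which is an assumption about what that term means.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN, ROUND_HALF_DOWN

def second_decimal_digits(n, rounding):
    """Digits that can show up in the second decimal place of a mean of n whole numbers."""
    digits = set()
    for k in range(n):
        # The fractional part of a mean of n integers is always k/n for some k.
        frac = Decimal(k) / Decimal(n)
        rounded = frac.quantize(Decimal("0.01"), rounding=rounding)
        digits.add(int(rounded * 100) % 10)
    return sorted(digits)

half_up = second_decimal_digits(40, ROUND_HALF_UP)    # always rounding halves up
bankers = second_decimal_digits(40, ROUND_HALF_EVEN)  # bankers' rounding
half_down = second_decimal_digits(40, ROUND_HALF_DOWN)
# Assumption: "random rounding" means a half may go either way.
either_way = sorted(set(half_up) | set(half_down))

print(half_up, bankers, either_way)
```

Any reported mean whose second decimal digit falls outside the relevant list cannot come from 40 whole-number responses.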
Below I reproduced Table 1 for your viewing pleasure, and allowed random rounding.
Even allowing for random rounding there are a large number of impossible means. If you are rounding with a computer program, rounding should not be random, so are they not using a computer? (That would explain a lot.) As an extra treat, there are a few impossible standard deviations as well.
I don’t know what happened to cause these errors. Maybe they lost a couple of responses, or maybe they are doing calculations by hand.
“Eating Behavior and Obesity at Chinese Buffets”
Google Scholar citations: 58
This study is pretty simple. They observed the habits of diners at various buffets and categorized the diners by their BMI. One thing people might not realize is when you provide a percent such as 71.0% that is actually the fraction .710, which makes granularity testing extremely effective.
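To make the percentage point concrete, here is a minimal sketch of a granularity check for percentages. The function name and the group sizes of 30 and 31 are hypothetical, chosen only for illustration; they are not taken from the paper. The sketch also assumes halves are rounded up.

```python
from decimal import Decimal, ROUND_HALF_UP

def percent_is_possible(reported_percent, n, decimals=1):
    """True if some whole count k out of n rounds to the reported percentage."""
    step = Decimal(10) ** -decimals          # e.g. 0.1 for one decimal place
    target = Decimal(reported_percent).quantize(step)
    for k in range(n + 1):
        pct = (Decimal(100 * k) / Decimal(n)).quantize(step, rounding=ROUND_HALF_UP)
        if pct == target:
            return True
    return False

# Hypothetical group sizes, not values from the paper.
print(percent_is_possible("71.0", 31))  # 22 of 31 is 70.97%, which rounds to 71.0
print(percent_is_possible("71.0", 30))  # no count out of 30 gives 71.0
```

Because a percentage reported to one decimal place carries three significant digits of the underlying fraction, small groups leave very few attainable values.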
“The Flat-Rate Pricing Paradox: Conflicting Effects of ‘All-You-Can-Eat’ Buffet Pricing”
Google Scholar citations: 57
Let’s do an interesting one. No, I’m not talking about the paper, I’m talking about the errors ;)
This table’s actually not that bad, only a few granularity errors, and they actually got the degrees of freedom correct on the ANOVA test, and the only incorrect F statistic is barely wrong.
Can you see where the error is going to be?
If you multiply the “Actual number of pizza consumed” by “Dollars paid per slice of pizza consumed” you should get the price of the buffet, but you don’t.
The half-price buffet cost $2.99, while the regular price buffet was $5.98.
2.95 * 1.33 = 3.92, not 2.99
4.09 * 1.98 = 8.10, not 5.98
For whatever reason, these types of internal inconsistencies are very common in work from this group.
James Lawrence points out in the comments that these inconsistencies can be explained mathematically. This is correct, but with that strategy if a diner consumes a very small amount of pizza they will dramatically skew the average value. I’m not sure if standard statistical tests should be performed on data after a nonlinear transform.
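One reading of that explanation is that the price per slice was computed for each diner and then averaged. The sketch below uses made-up slice counts (not data from the paper) to show that the mean of per-diner ratios need not match the price divided by the mean number of slices, and that light eaters dominate the average.

```python
from statistics import mean

BUFFET_PRICE = 5.98  # regular buffet price from the post

# Hypothetical slice counts for five diners; NOT data from the paper.
slices = [1, 2, 4, 6, 7]

mean_slices = mean(slices)

# Per-diner dollars per slice, then averaged (the nonlinear transform).
price_per_slice_each = [BUFFET_PRICE / s for s in slices]
mean_of_ratios = mean(price_per_slice_each)

# Price divided by the average number of slices.
ratio_of_means = BUFFET_PRICE / mean_slices

print(round(mean_of_ratios, 2), round(ratio_of_means, 2))
# Only the ratio of means multiplies back to the buffet price.
print(round(mean_slices * mean_of_ratios, 2), round(mean_slices * ratio_of_means, 2))
```

In this toy example the single diner who ate one slice contributes more to the averaged ratio than the two heaviest eaters combined, which is the skew described above.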
“Bad Popcorn in Big Buckets: Portion Size Can Influence Intake as Much as Taste”
Google Scholar citations: 335
In the words of Andrew Gelman, this was a “barbaric” study. They fed moviegoers either fresh or 14-day-old popcorn in two different container sizes.
Below is the main table from the paper:
The total number of moviegoers is 157, so the df should be 157 – 4 = 153, not 154. There is also an entire row with incorrect ANOVA values.
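The degrees-of-freedom arithmetic can be written as a tiny sketch. It assumes a fully crossed design in which every cell mean is estimated (here 2 freshness levels by 2 container sizes); the function name is hypothetical.

```python
def residual_df(n_total, n_cells):
    """Error degrees of freedom when one mean is estimated per cell."""
    return n_total - n_cells

# 2 freshness levels x 2 container sizes = 4 cells (assumed full factorial).
cells = 2 * 2
print(residual_df(157, cells))
```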
More — or less, depending on your tastes — concerning is that an entire column is mislabeled and the column label “Freshness” is missing. I assume this was a copy editing error by the journal, but considering how short this paper is and the importance of this table, you would think the authors would have caught it in the proof.
“Ice Cream Illusions: Bowls, Spoons, and Self-Served Portion Sizes”
Google Scholar citations: 259
Here we have the famous ice cream study. I think you can guess what we’re going to find.
Yep, our old friends granularity errors and incorrect test statistics.
You would think that if they got a 0.00 for an effect they might do a double take.
We also have internal inconsistencies again!
The “Average ice cream per spoonful” * “Number (of spoonfuls)” should equal “Actual volume served”.
Let’s take a look.
Small bowl, small spoon:
2.00 * 2.22 = 4.44, not 4.38
Small bowl, large spoon:
2.79 * 1.90 = 5.30, not 5.07
Large bowl, small spoon:
2.04 * 2.94 = 6.00, not 5.81
Large bowl, large spoon:
3.35 * 2.09 = 7.00, not 6.58
Rounding uncertainty cannot account for these differences.
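The rounding claim can be checked with a minimal sketch. It assumes every reported value was rounded to two decimals, so the true value lies within 0.005 of what was printed, and it tests only whether rounding alone could reconcile the product with the reported volume. The function names are made up for illustration.

```python
def product_bounds(a, b, half_unit=0.005):
    """Smallest and largest product consistent with two rounded positive values."""
    low = (a - half_unit) * (b - half_unit)
    high = (a + half_unit) * (b + half_unit)
    return low, high

def consistent(a, b, reported, half_unit=0.005):
    """True if the reported value's rounding interval overlaps the product's bounds."""
    low, high = product_bounds(a, b, half_unit)
    return low <= reported + half_unit and reported - half_unit <= high

# (per-spoonful amount, number of spoonfuls, reported volume) as quoted above
rows = {
    "small bowl, small spoon": (2.00, 2.22, 4.38),
    "small bowl, large spoon": (2.79, 1.90, 5.07),
    "large bowl, small spoon": (2.04, 2.94, 5.81),
    "large bowl, large spoon": (3.35, 2.09, 6.58),
}

inconsistent = [name for name, (a, b, rep) in rows.items() if not consistent(a, b, rep)]
print(inconsistent)
```

All four rows fall outside the bounds, so rounding by itself does not close the gap.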
As mentioned in the addendum above, there is a method for reconstructing these values.
“How descriptive food names bias sensory perceptions in restaurants”
Google Scholar citations: 198
Let’s do one more.
Same old, same old. The total number of diners is 140, and yet they somehow report the degrees of freedom as 133 for a simple two-sample ANOVA.
It has been brought to my attention that the ANOVAs in Table 1 are not simple two-sample ANOVAs. As a result, I am unsure what the values should be and have unmarked them. Interestingly, in this draft version of the article the degrees of freedom are 131 instead of 133, and all the F statistics are listed as 5.92. In addition, my colleague tells me the Chi-square values are wrong, so I marked those red.
Concerns have been raised that the statistics in Tables 1 and 2 are derived from models, and therefore are not suitable for granularity testing. This type of concern has been raised before, and from my research into Dr. Wansink’s work I have noticed that he typically notes when statistics are derived from models, and as a result I assumed these values were arithmetic values.
It has also been suggested that the models in Table 2 may contain dummy variables for the different foods. It is not clear from the paper if this is the case, and the preprint of the paper does not have a Table 2. The degrees of freedom in Table 2 do suggest a complicated model was used, but I suspect the degrees of freedom in Table 2 were simply copied from Table 1.
As a result, many of these reported inconsistencies could be explained away if you want to give the authors the benefit of the doubt. However, because there are dozens of papers by Wansink under investigation, I am finding this difficult to do. If it turns out my assumptions for this paper were wrong I apologize.
To be fair to Wansink, once in a while I do come across a paper that seems to be accurate to the “3rd decimal point”. I say “seems” because without access to the data sets it is impossible to say for sure.
I would request some of these data sets to try and figure out what happened, but I can only assume they are also “tremendously proprietary”.
Susan Fiske recently wrote:
Psychology is not in crisis, contrary to popular rumor.
Perhaps the errors reported in this blog post aren’t enough to change her mind. But these were only a fraction of the errors found in one researcher’s work.