Another pizzagate correction has been posted that I think is worth going over.
Here is a link to the PDF of the correction.
Essentially the correction just adds clarifications to a table which I had flagged here for granularity problems:
All of the sample sizes in the correction are now consistent with the percentages, so presumably they are correct, but without the data set it is hard to say for sure.
Missing data is likely one of the more common causes of granularity inconsistencies, and it isn’t clear how serious these reporting omissions are. I guess it depends on the implications.
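For readers who want to run this kind of granularity check themselves, here is a minimal sketch. It tests whether a percentage reported to one decimal place could have come from an integer count out of a given sample size (the function name is mine, not anything from the correction):

```python
def percent_is_consistent(reported_pct, n, decimals=1):
    """Check whether a reported percentage could arise from some
    integer count out of n -- a simple granularity test."""
    for k in range(n + 1):
        if round(100 * k / n, decimals) == round(reported_pct, decimals):
            return True
    return False

# 34.5% is achievable with n = 29 (10/29 = 34.48...%),
# but no integer count out of 10 rounds to 34.5%.
print(percent_is_consistent(34.5, 29))  # True
print(percent_is_consistent(34.5, 10))  # False
```

A failure doesn't tell you *why* the numbers are inconsistent, only that the reported sample size, count, and percentage can't all be right at once.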
Take for example a survey. Inevitably some people will skip questions. Does that matter? Well, maybe. Let’s say we survey scientists, and one of the questions is whether they committed fraud. It just so happens that the question about fraud is the only one people skip. That’s interesting, right? Shouldn’t that be indicated somehow in a report on the survey?
And what if the scientists who skip the question about fraud are more likely to skip another question? That could be a hidden moderator, and it would only be identified by complete reporting (or open data).
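With open data, this kind of pattern is trivial to probe. A toy sketch with invented responses (`None` marks a skipped item; the variable names are hypothetical, not from any real survey):

```python
# Toy survey data: None marks a skipped question (all values invented).
responses = [
    {"fraud": None, "funding": None},
    {"fraud": "no", "funding": "federal"},
    {"fraud": None, "funding": None},
    {"fraud": "no", "funding": "private"},
    {"fraud": "no", "funding": "federal"},
]

# Cross-tabulate missingness: do respondents who skip the fraud item
# also tend to skip the funding item?
fraud_missing = sum(r["fraud"] is None for r in responses)
both_missing = sum(r["fraud"] is None and r["funding"] is None
                   for r in responses)
print(f"{both_missing} of {fraud_missing} fraud-skippers also skipped funding")
```

None of this is possible from a published table of percentages alone, which is the point: the moderator is invisible without the raw data.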
In contrast to survey questions, there isn’t really a good reason why data collected by the researchers themselves should be incomplete, which is the case with the present article. Why are observations missing? Is it really that hard to mark whether someone is using a fork or chopsticks? When collected data is missing, it raises questions about the collection procedure.
Although it appears the issues with this table turned out to be minor, as you might expect if you are familiar with this lab’s work, the granularity problems led me to take a closer look at the study and identify other much more serious issues.
I just happened to watch this video while hanging out with a friend:
At around 22:40 there is a slide about Chinese buffets:
One thing that jumps out is that they looked at 70 variables! But hey, we already knew that p-hacking was part of this lab’s standard operating procedures.
Just below the p-hacking is the mention of “secret agent tools” such as weight mats and lasers that were used to estimate the BMI of the diners. I found that interesting, because that description of the methods directly contradicts what was written in the paper, namely that BMI was estimated by a visual method.
But let’s give the researchers the benefit of the doubt. The video states there were 12 coders and 370 diners, while the paper claims there were 22 observers and 213 diners. So maybe there were multiple Chinese buffet studies that gave the same results, and one time they used “secret agent tools”, but didn’t publish that one.
So…I’m not proud of this, but I’ve read Wansink’s books, and happen to remember reading about these tools in Slim by Design:
How did we get people to reveal their BMI stats and all that other info? If James Bond had been running this study, he’d have placed pressure-sensitive mats by the door so that when people walked in, their weight would be silently recorded. To get their heights, he would aim a laser beam grid at the buffet to give a reading to the nearest eighth of an inch. To time their behavior, he would have multiple Swiss stopwatches with lots of clicky buttons on them. To count how many buffet trips they made or how many times they chewed, he’d use a bunch of super-expensive precision German stadium tally counters by Zeiss or Leica.
That’s how James Bond would have done it. But we didn’t have a genius inventor named “Q” or a limitless high-tech British budget. So instead of being James Bond, we ended up channeling Wile E. Coyote. Remember the Road Runner cartoons? Wile E. was the somewhat dopey coyote who was always trying to catch the Road Runner with the worthless gadgets he bought from the Acme Corporation. Nothing ever, ever, ever worked — every one of them was a catastrophic failure at the worst possible time. Acme Corporation — that’s where we shopped.
That pressure-sensitive mat? It might have been able to detect a car that rolled on it and parked overnight, but it couldn’t detect a person walking over it at one mile per hour. That cool laser beam grid? After three consecutive nightmares that we were blinding people as they served themselves General Tso’s chicken, I set up a sayonara eBay account to sell it. Swiss stopwatches? Neither Swiss, nor did they usually stop when necessary. German stadium tally counters? Couldn’t afford anything German, so we bought one from a country I don’t think had been invented when most of us were born. Worked perfectly.
Despite a bumpy first week, we finally settled into a two-month groove. We developed a cool way to match people’s body shape to standardized charts that helped us classify their weight and height, and we developed detailed coding sheets to track 103 different visible characteristics and behaviors of each diner. Then we crunched the data to see what slim diners did differently from heavy ones.
Hmm, this passage mentions 103 variables instead of 70. Is there a prize for p-hacking that reaches triple digits?
So they clearly didn’t use “secret agent tools”. I’m actually not even sure how seriously to take this passage; it might just be for comedic effect. Surely they didn’t actually buy lasers to shine at people, right? Right?
So what’s up with the “secret agent tools” mentioned in his talk? Was that also for entertainment? If it was a joke, these people didn’t pick up on it and believe that’s exactly how the study was done. Surely misrepresenting methods in a paper is scientific misconduct (Todd Shackelford would disagree). But is deliberately misrepresenting a study during a talk misconduct? I’m not sure, but I think we can assume where Cornell will fall on the issue.
These revelations bring us back to the titular question: how serious was this correction? The granularity problems could have had a number of innocent explanations, such as incorrect rounding, typos, or missing data. But regardless of the cause, the problems suggest some degree of sloppiness in the work.
We have to keep in mind that the numbers in a paper are the very, very tip of an iceberg. We aren’t seeing how they were calculated, we aren’t seeing the data, we aren’t seeing how the data was collected, we aren’t seeing how the study was designed, we aren’t seeing how the study was conceived.
Problems in the top level could be contained there, or they could be the result of a cascade of problems from lower levels. In this case the issue was missing data, and the missing data suggests problems with the data collection.
What is the source of this problem with the data collection? Perhaps the observers were so busy coding 103 variables that they missed a few.
When there’s smoke, firefighters know to take it seriously. Reporting errors are smoke, and we need more firefighters.
While performing research for this blog post I came across this paper.
That paper states:
Thirty trained observers recorded the demographic characteristics and behavior of 303 diners (165 men, 158 women)
Those numbers add up to 323, not 303.
Normally this is a type of error that I wouldn’t view as too serious, and I would just look through the paper to figure out what the actual sample size is. But of course these authors don’t list sample sizes anywhere else, and they don’t even provide degrees of freedom.
As a result, I have no idea what the sample size is. One thing I can do is infer the degrees of freedom, and thus the sample size, from the F-statistics and p-values they provide. They report one p-value to three decimals, .006, alongside an F-value of 7.74. Assuming one numerator degree of freedom, this implies the sample size is somewhere between 99 and 1029. That, of course, assumes the F-value and p-value are themselves reported correctly.
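This back-of-the-envelope inference can be reproduced with a short scan. A sketch using SciPy, under the assumption of a two-group comparison (one numerator degree of freedom, so N = df2 + 2); change the design assumptions and the range changes:

```python
from scipy import stats

F, reported_p = 7.74, 0.006

# Scan error degrees of freedom and keep those for which the exact
# p-value of F(1, df2) rounds to the reported three decimals.
consistent_df2 = [df2 for df2 in range(2, 5000)
                  if round(stats.f.sf(F, 1, df2), 3) == reported_p]

# Under a two-group design, N = df2 + 2.
print(min(consistent_df2) + 2, max(consistent_df2) + 2)
```

The scan is crude (it ignores rounding conventions and truncation in the original report), but it shows just how little a lone F and p pin down when the degrees of freedom go unreported.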
Literally the only number in the paper that I can check is wrong, so I have no reason to believe the rest of the numbers are correct.