The current flavor of the week for bad science is the work of the Cornell Food and Brand Lab.
Their research head freely admitted to questionable research practices in a blog post. My colleagues and I found an unbelievable number of errors in 4 of their publications, and I found errors in 6 more of their papers. Andrew Gelman detailed problems with their methodology, and we learned the lab has been using incorrect statistics, making errors, and sweeping them under the rug for years. The story has been covered by Retraction Watch, Slate, and New York Magazine, with numerous other media organizations champing at the bit to get a story out.
I’m tired of talking about this story, as is Gelman. I assure you I don’t enjoy going through their research or reading their papers. But my colleagues and I were asked if we found any problems with work by the lab that impacted public policy. Because you see, the lab’s Center for Behavioral Economics in Child Nutrition Programs division has had an impact on school lunchrooms.
I really didn’t want to find any more errors, but as a public service my colleagues and I took a look at some of their work funded by the USDA. Many studies either didn’t contain any means or SDs for us to check, or contained sample sizes too large for us to apply our methods. However, my colleagues did flag this paper:
“Attractive names sustain increased vegetable intake in schools”
Google scholar citations: 97
Below I reproduced some of the key information in the first table.
The table seems innocent enough. There aren’t enough decimal places to apply granularity testing. But my colleagues noticed that the text claims this study involved 113 students.
32+38+45 = 115
Here we go again…
Take a good look at that table. Take it in like a tall glass of water.
Number eaten + number uneaten != number taken
11.3 + 6.7 = 18.0, they reported 17.1
4.7 + 10.3 = 15.0, they reported 14.6
6.8 + 13.2 = 20.0, they reported 19.4
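The arithmetic is simple enough to script. A quick sketch, with the three rows copied from the table as reported:

```python
# (number eaten, number uneaten, reported number taken) from the first table
rows = [
    (11.3, 6.7, 17.1),
    (4.7, 10.3, 14.6),
    (6.8, 13.2, 19.4),
]

for eaten, uneaten, reported_taken in rows:
    total = round(eaten + uneaten, 1)
    print(f"{eaten} + {uneaten} = {total}, they reported {reported_taken}")
```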
I’m afraid to do it, but let’s move on to the second table.
This table is a complete disaster.
Just a quick glance reveals that 7/8 of the % changes don’t make any sense.
(.054-.018)/.018*100 = 200.0%, they report 99.0%
(.073-.021)/.021*100 = 247.6%, they report 109.4%
(.033-.002)/.002*100 = 1550.0%, they report 176.9%
(.062-.086)/.086*100 = -27.9%, they report -16.2%
(.018-.120)/.120*100 = -85.0%, they report -73.3%
(.099-.047)/.047*100 = 110.6%, they report 35.7%
(.046-.030)/.030*100 = 53.3%, they report 41.5%
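For anyone who wants to check my arithmetic, here is the percent change calculation in a few lines of Python, with the before/after fractions and reported values copied from the table:

```python
def pct_change(before, after):
    """Standard percent change, relative to the starting value."""
    return (after - before) / before * 100

# (before, after, reported "% change") for the cells above
cells = [
    (.018, .054, 99.0),
    (.021, .073, 109.4),
    (.002, .033, 176.9),
    (.086, .062, -16.2),
    (.120, .018, -73.3),
    (.047, .099, 35.7),
    (.030, .046, 41.5),
]

for before, after, reported in cells:
    print(f"computed {pct_change(before, after):.1f}%, they report {reported}%")
```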
Addendum to the Addendum of the Addendum 20170216
Through email discussions and a lively discussion on Facebook, I am now able to determine how the percent changes in this table were calculated.
It is a clusterfuck. Here we go.
The authors are not using the percentage change formula, they are using the percentage difference formula.
The percentage difference formula will reproduce the first column of values, but not the second column of values. It is impossible to get negative results with the percentage difference formula if both of your percents are positive.
The first two rows of the second column of percent changes are calculated with this formula:
% difference * -0.5
The third row of the second column of percent changes is calculated with:
% difference * 0.5
The fourth row of the second column of percent changes is just % difference.
Therefore, the first “% change” column label should say “% difference”, and the second “% change” column label should say “WTF”. I also have concerns about whether percentage difference is what should be used here. My other criticisms still stand.
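To spell out the two formulas (the ±0.5 fudge factors above are reverse-engineered from the numbers, not anything the paper documents, so I won't dignify them with code):

```python
def pct_change(before, after):
    # Percent change: signed, relative to the starting value
    return (after - before) / before * 100

def pct_difference(a, b):
    # Percent difference: absolute gap relative to the average of the two
    # values -- it can never be negative when both inputs are positive
    return abs(a - b) / ((a + b) / 2) * 100

# First row of the table: .018 -> .054
print(round(pct_change(.018, .054), 1))      # 200.0
print(round(pct_difference(.018, .054), 1))  # 100.0, close to their 99.0
```

The small gap between 100.0 here and their reported 99.0 is presumably because they computed from unrounded fractions.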
And the problems with this table are far from over.
One thing that stands out is the fraction for “All hot vegetables” is not the max value in each column. This is actually explainable if you assume that each vegetable is not served on each day. For example, if a popular vegetable such as broccoli is only served 70% of the time you could easily get the numbers they report.
What is not explainable are the standard deviations they report.
I previously made a historical discovery for variances and standard deviations, so I’m something of a standard deviation connoisseur.
The first thing that jumps out is that the standard deviations are far larger than the fractions. At first I thought maybe they took the fraction for each day (there were 20 days per cell) and found the standard deviations with those 20 values. But below the table it clearly states: “Each child-day is treated as a single observation.”
Hmm, okay, so how do you get a standard deviation for a fraction? When you have a fraction, i.e. a count of one of two possible outcomes divided by the number of trials, you are basically looking at a binomial distribution.
Wikipedia tells us the variance of a binomial distribution is:
Var(X) = n*p*(1-p)
where n is the number of trials and p is the probability.
But this is the variance for the counts. We are interested in the variance for the fractions. To turn a count into a fraction you just divide by n.
What we are looking for is Var(X/n). Wikipedia tells us:
Var(a*X) = a² * Var(X)
Sweet. So all we have to do is divide the variance by n². And to get the standard deviation we just take the square root of the variance.
Okay, so what’s the n? Of course they don’t tell us the n for each cell.
In the text they say the “study included 40,778 total child-day observations, with roughly half in the treatment group”. So I guess we can suppose each cell has around 10,000 observations, since there are two months in each group. Although, as I mentioned above, it seems not every vegetable was served every day, so the number of observations for the bottom 3 rows could be less than 10,000, and indeed some rows must have fewer observations than the first row for the fractions to make any sense.
Using this formula:
SD = root(Var(X/n)) = root(n*p*(1-p)/n²) = root(p*(1-p)/n)
with n=10,000 we can show that the standard deviations reported are off by a factor of 100!
Let me say that again. Off by 100X!
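Here is that calculation, assuming n = 10,000 per cell (the paper never reports per-cell n's) and using the SD of 0.045 they report for a fraction of .002:

```python
import math

def binom_fraction_sd(p, n):
    # SD of an observed fraction under a binomial model: root(p*(1-p)/n)
    return math.sqrt(p * (1 - p) / n)

n = 10_000           # assumed observations per cell
p = 0.002            # fraction from the table
reported_sd = 0.045  # the SD they report for that cell

correct_sd = binom_fraction_sd(p, n)
print(correct_sd)                # ~0.000447
print(reported_sd / correct_sd)  # ~100 -- right around root(n)
```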
The fact that our assumed n of 10,000 produced numbers 100X smaller than theirs makes it easy to see the mistake they made. It is clear they are using this formula for their standard deviations:

SD = root(p*(1-p))

In other words, they never divided by n.
That formula reproduces all of their gobbledygook, except for Control group, Month 2, Broccoli. The fraction reported there is .018, and the same fraction is reported in the first column and first row, and yet the SDs are different. They can’t even consistently report incorrectly calculated values.
Perhaps convenient for them, their values are off by around 100, which could theoretically allow them to claim their standard deviations are standard deviations for percents instead of fractions. However, as I said before, the rows must have different sample sizes for the fractions to make sense, and as a result the SDs should not be consistently 100X larger than the SD obtained for a n of 10,000.
Unfortunately, we’re still not done. Interestingly they mark all the percent changes as statistically significant except for the last row. They state: “Significance based on an F-statistic of differences in percent”.
I’m not quite sure what statistical test they are using, but you would think that with sample sizes around 10,000 per group any difference would be statistically significant. I ran some simulations, and it seems pretty clear the changes in the last row should be statistically significant regardless of what test they are using. The only way they might not be significant is if the sample sizes actually aren’t that large, which could only occur if carrots were rarely served.
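As one rough check (not necessarily the test they used, and again assuming n = 10,000 per group), a plain two-proportion z-test on the last row's fractions of .030 and .046 gives a z around 6:

```python
import math

def two_prop_z(p1, n1, p2, n2):
    # Pooled two-proportion z-test with a normal-approximation p-value
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return z, p_value

# Last row of the table: .030 vs .046, assumed n = 10,000 per group
z, p_value = two_prop_z(0.046, 10_000, 0.030, 10_000)
print(z, p_value)  # z ~ 5.9, p far below .001
```

The only way that p-value climbs above .05 is if the actual n's are dramatically smaller.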
After an email discussion I am now aware of another possible explanation for the standard deviations reported. Because the explanation is long enough to merit its own blog post, I have posted it at the end of this post.
In this post I only focused on the mathematical impossibilities in the two tables of an important paper funded by the USDA. Not surprisingly, the text of the paper contains numerous other inconsistencies. For example, in the abstract it is stated that the number of children in Study 2 is 1,017, but in the text the number changes to 1,552.
In addition to these errors, an entire post could be written just on the inappropriate methodology and statistical tests used in this paper, such as assuming independent observations when in fact the same students are having their choices recorded each day, or employing a high school student to carry out Study 2, who presumably was not blinded to the expected outcome of the study. (I can’t help but wonder if this lab also gets high school volunteers to do their stats for them.)
This is the 11th paper from this group that we have found with mathematical inconsistencies. Many errors appear to be caused by incompetence. But can incompetence really explain all of the errors? Let’s take another look at the first table.
Does anyone else find it strange “number eaten” and “number uneaten” always add up to whole numbers?
Of course it’s impossible to know how this happened, because they never provide any raw data or code, and when you request data they deny your request.
How many more papers with problems do we have to find before something is done? If this is the type of work Cornell endorses at a minimum I suggest we take any work that comes out of Cornell with a grain of salt. Even Cornell News agrees:
P.S. I have to acknowledge Nicholas Brown, Eric Robinson, Tim van der Zee, and James Heathers for their contributions to the presented investigation.
Hopefully this is the last paper from this group I have to look at; consuming these papers gives me indigestion.
Addendum 20170220 continued…
When my colleagues and I critically read this paper, every single one of us flagged the standard deviations as unusual. If you are familiar with basic statistics, it is easy to understand why. When you have a normal distribution, the mean ± 2 SDs will capture 95% of the data. However, the standard deviations reported in this paper were many times greater than the means. For example, they report a SD of 0.045 for a mean of 0.002, which is 22.5 times the mean.
To show you just how ridiculous this number is, I plotted an arrow at .002 and a bar showing ± 1 SD.
Assuming a normal distribution, about half the values would be negative fractions, which is impossible. Hence the confusion.
Just that statement, “half the values”, is in itself confusing, because we really only have one value, the mean/fraction. What is a standard deviation even telling us when we only have one value?
That is why, when my colleagues were asked if they understood the standard deviations reported, I received nothing but *shrugs*.
I then thought about how you can possibly have a standard deviation for a fraction, and decided to assume a binomial distribution and calculate the standard deviation from the variance of a binomial distribution.
In doing these calculations, I thought I had discovered the error the authors had made in their standard deviations. Through a recent email conversation I am now aware of another way to interpret their numbers.
First, I would like to show my interpretation of a standard deviation for a fraction, which I believe is the correct interpretation, assuming a SD for a fraction is even a thing.
Below are the results from 100,000 simulations where a binomial distribution was assumed, 10,000 draws were made, and the probability was set to 0.002.
Despite having a skewed population (p=0.002), we get something looking like the normal distribution — the central limit theorem at work.
For complete transparency, here is the code:
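In outline, the simulation looks like this (here a stdlib-only version with 1,000 runs instead of 100,000 so it finishes quickly, and a hand-rolled text histogram in place of the matplotlib plot):

```python
import random
import statistics

# Simulate drawing fractions from a binomial: p = 0.002, with n = 10,000
# draws per run; 1,000 runs here (the full version used 100,000)
P, N, RUNS = 0.002, 10_000, 1_000

random.seed(0)
fractions = []
for _ in range(RUNS):
    count = sum(random.random() < P for _ in range(N))
    fractions.append(count / N)

print("mean:", statistics.mean(fractions))   # ~0.002
print("sd:  ", statistics.stdev(fractions))  # ~0.00045, nowhere near their 0.045

# Crude text histogram of the simulated fractions
lo, hi, bins = min(fractions), max(fractions), 12
width = (hi - lo) / bins
for i in range(bins):
    in_bin = sum(lo + i * width <= f < lo + (i + 1) * width for f in fractions)
    print(f"{lo + i * width:.4f} {'#' * (in_bin // 10)}")
```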
Yes, I know matplotlib has a built-in histogram function, but whenever I use it the plots look like shit, so I prefer to write my own.
Now, given this distribution, I would expect ± 1 SD to cover approximately 68% of the data. Here is the exact same distribution plotted with ± 1 of their SD of 0.045.
There isn’t a world where the reported standard deviation matches this distribution.
Here is the same distribution with the standard deviation calculated from the variance of a binomial distribution, which is what I did in this blog post.
Seems just a bit more reasonable.
But this gets back to the original problem; we don’t have a distribution of fractions, we have a single fraction, so is the term “standard deviation” even correct?
I went back to Wikipedia, and it seems the preferred term is a “binomial proportion confidence interval”. So what I derived in this blog post should be interpreted as an approximation of this interval. It is very similar to a standard error.
In my email discussion, the other party was adamant that the standard deviations were calculated from a distribution of values.
Okay, again, what values, all we have is one, the mean/fraction.
If you literally code the draws as 0 and 1, and then apply the typical standard deviation formula to these values, you can reproduce the standard deviations reported.
Assuming a sample size of 10,000 and their provided fraction of .002:

SD = root(n*p*(1-p)/(n-1)) = root(10000*.002*.998/9999) ≈ .0447

which rounds to .045.
Now, here’s where things get interesting. Let’s assume a sample size of 100,000:

SD = root(100000*.002*.998/99999) ≈ .0447

We get the same number.
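To make that concrete, code the draws as 0's and 1's and apply Python's stock sample standard deviation, assuming n's of 10,000 and 100,000 with their fraction of .002:

```python
import statistics

def sd_of_coded_draws(p, n):
    # Literally code the draws as 0's and 1's, then apply the usual
    # sample standard deviation formula
    ones = round(p * n)
    data = [1] * ones + [0] * (n - ones)
    return statistics.stdev(data)

print(round(sd_of_coded_draws(0.002, 10_000), 3))   # 0.045
print(round(sd_of_coded_draws(0.002, 100_000), 3))  # 0.045 -- same number
```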
So, just like the formula I provided earlier in this post, which reproduces their values and is unaffected by sample size, this method also reproduces their values and is unaffected by sample size.
But what exactly are these standard deviations telling us? The standard deviation is supposed to tell you the variation in your data. In this case, the data is only 0’s and 1’s. What variation are you describing when you provide a standard deviation? All the information we need is the mean/fraction, which tells us how the data is weighted.
Many statistical tests use standard deviations to calculate F-statistics, and indeed this paper claims an F-statistic was calculated. So does that mean they used these values? Or did they first divide their standard deviations by the root of the sample size to get a standard error/“binomial proportion confidence interval”? Or did they just force-feed their data into some statistical program and report whatever came out?
Technically the numbers they report can be explained by using the standard deviation formula on a distribution of 0’s and 1’s, but just because a formula can be applied doesn’t mean it should be. You can put lipstick on a pig but it’s still a pig. Reporting a standard deviation for 0’s and 1’s is completely pointless.
As a result, we now have a couple of options to explain the standard deviations in this table.

Door A: The authors had their data coded as 0’s and 1’s, just hit the “calculate mean and SD” buttons, and reported those values, without thinking about how silly the values would look. The SD should give the reader an idea of the spread of the data, but their values do absolutely nothing the mean doesn’t do.

Door B: The authors tried to calculate the standard deviations using the variance formula for a binomial distribution, but made a mistake.
After thinking about this, Door A is probably more likely. Door B requires the authors to make a mistake, which I was tempted to assume because of the authors’ track record of incompetence. Door B also assumes a binomial distribution, which may not be appropriate for this data — subpar statistical models haven’t stopped the authors before.
At best, the standard deviations reported by the authors are silly, at worst they were calculated in error and were used in subsequent statistical tests. But we’ll never know, because the authors do not provide any methods, and even when interviewed by Retraction Watch do not provide any form of rebuttal.
In total, I received feedback on the reported standard deviations from 4 other scientists well-versed in statistics, and every one of them had their own theory about the numbers. When reading a paper you shouldn’t have to theorize about how a number as simple as a standard deviation was calculated and what it means.
But this is par for the course for these authors. Just look at the methods of this paper. They claim the “choices at each meal were unobtrusively recorded”, then immediately go on to say “the weight of any remaining carrots was subtracted from their starting weight”. How do you get the weight of the initial carrots “unobtrusively”?
The vague methods described by the authors, combined with sample sizes that change as often as the wind, leave readers wondering how the study could have possibly taken place as described. If you don’t believe me, here is another good example from this lab. And that is all before you even get to the statistical methodology, if you can call it a methodology.
Everyone involved in this kerfuffle should be embarrassed. The researchers who let blatant errors slip into their work, the journals who sent these papers out for review, the peer reviewers who can’t even notice when sample sizes change, the readers who either didn’t notice these blatant problems or didn’t say anything, the media who reported on these sexy stories without doing any checks, and last but not least, Cornell University, for supporting, endorsing, and publicizing this work.