Sorcery or science — the world of cherry-picked data and sloppy statistics

Yours Truly
Healthcare in America
Oct 13, 2016

I am a naïve PhD student in developmental biology, just over the first-year mark. I work with mice. I cry when I cull them for experiments. But at least I liked to think it was… FOR SCIENCE! Unfortunately, I am long past that dream: this text is an anonymous attempt to find some scientific solace, to confirm or dispel the doubts voiced below and, through that, to attain some peace of mind.

Context

The problems in my lab are multifaceted. Context first, though: our lab uses a certain transgenic mouse line, which means that we can delete a gene in a specific population of cells. To do that, you must first introduce into the genome (DNA) of the mouse the code for a protein that is expressed only in your tissue of interest. The second step is to mutate the gene that you want to delete so that it becomes responsive to the aforementioned protein. This second gene is mutated in every cell of the body; however, the protein that can delete it is expressed only in the tissue that we care about, at a specific time. This model yields three possible situations, depending on the resulting genotype: total deletion of the gene (homozygous mutant), partial deletion (heterozygous mutant) or no deletion (wild-type). Studies of this kind typically genotype the mouse being analysed and then look at the differences between the control (wild-type) and the two mutants (homozygous and heterozygous). In our case, the difference measured is the number of cells in a certain region. We call that the phenotype. So far, so good.

Phenotype, Genotype Mumbo Jumbo

Unfortunately, a current practice in our lab is not to trust the genotype. Therefore, we use the mice for the phenotypic analysis (counting cell numbers) before we confirm the genotype of the mouse. We analyse them and classify them based on the observed effect: samples that ‘look normal’ are called wild-types, the ones that ‘look a bit bad’ heterozygous mutants, and the ones that ‘look really bad’ homozygous mutants. Before I joined the lab, this is where the experiment would have stopped: no genotype evidence for the classification.

As an enthusiastic first-year PhD student, I attempted to genotype the mice that we were using for the analysis, only to be told that this was excessive on my part. However, I persisted in my folly, as I considered that you need to properly identify your samples before drawing any conclusions. This ensures that you classify your samples correctly and therefore detect the real effect of the gene deletion. Am I wrong to think that? If you do not genotype, do you not fall into the trap of seeing what you want to see? I understand that a common assumption is that the total deletion will lead to a greater effect than the partial deletion. But when you fail to test this assumption, is this not sorcery instead of science?

That was problem number one, and the first alarm bell: phenotype without genotype.

Controls — Who Needs Them?

Problem number two derives from problem number one: the lack of wild-type controls. It took me a few months to convince my lab that, in order to ascertain any effect, significant or not, we would need to compare the mutant mice to the wild-type ones. I thought that using a control would be a minimum requirement for showing an effect in any experiment. Boy, was I wrong in my assumption… My happiest day in the lab was when I managed to convince our group of the importance of controls: in my humble opinion, a basic Science 101 fact-check.

Dreams of Consistency

Onwards… I keep trying to dream of consistency. It kind of works — unless I take into account that we might be comparing the number of cells in different regions of our tissue! Let’s say, for argument’s sake, that our tissue has two distinct regions, A and B. In each region, ideally, you could take two adjacent measurements of cell numbers. Unfortunately, the dissection is not easy, so you can end up with only one region — A. It matters not — in our lab it is suggested that you can easily identify which region you are left with, although there are no good reference points after the dissection. In effect, it is left entirely to your subjective judgement (or expertise) to decide which region you are analysing — A or B. You just hope you are consistent… Until you find out that the inter-observer variability is a bit higher than expected, and that ‘the experts’ are just as confused as a silly first-year PhD student when it comes to identifying region A… If we are not fully befuddled, I believe that, at a minimum, we are opening ourselves up to some easy confirmation bias.
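I am not aware of our lab ever quantifying this, but one way to put a number on that inter-observer variability would be Cohen’s kappa: two people independently label the same dissected samples as region A or B, and kappa tells you how much they agree beyond chance. A minimal sketch with made-up labels (assuming scikit-learn is available, and that the labels below are purely hypothetical):

```python
# Minimal sketch: quantify agreement between two observers who label the
# same dissected samples as region "A" or "B". The labels are invented.
from sklearn.metrics import cohen_kappa_score

# Hypothetical calls by two observers on the same ten samples.
observer_1 = ["A", "A", "B", "A", "B", "A", "A", "B", "A", "B"]
observer_2 = ["A", "B", "B", "A", "A", "A", "B", "B", "A", "B"]

kappa = cohen_kappa_score(observer_1, observer_2)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement, ~0 = chance

# If the observers cannot agree on which region is which, any downstream
# difference between regions is hard to separate from labelling noise.
```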

Statistical Balderdash

The way we use statistics could also add to the aforementioned confirmation bias. I am no expert; however, I smell something very unholy in the statistical method our lab advocates (summing rather than averaging your results). When counting cells in a certain region and comparing the mutants to the controls, my instinct says ‘average and standard deviation or standard error’. This way, you can look at the effect (the difference between mutants and controls) and at the distribution (variation) of your sample. From your preliminary data, you can then calculate the sample size needed to detect your expected effect. If the effect found in the preliminary experiments is too small and would need too many mice to detect, this calculation would save furry rodent lives, time and lab resources. However, in our group effect size and sample size are not considered important, leading us ever further into our Sisyphean research.
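To make that concrete, here is a minimal sketch of the workflow I have in mind (not what our lab does), with entirely made-up cell counts: summarise each group by mean and standard deviation, estimate a standardised effect size, and ask how many mice per group a properly powered follow-up would need. It assumes NumPy, SciPy and statsmodels are available.

```python
# Sketch with invented preliminary data: summarise each genotype group,
# estimate the effect size, and compute the sample size a follow-up needs.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# Hypothetical preliminary cell counts per mouse (not real data).
wild_type = np.array([412, 398, 430, 405])
homozygous = np.array([380, 355, 410, 342])

# Effect and spread: per-group mean and standard deviation.
for name, counts in [("wild-type", wild_type), ("homozygous", homozygous)]:
    print(f"{name}: mean={counts.mean():.1f}, sd={counts.std(ddof=1):.1f}")

# Standardised effect size (Cohen's d with a pooled standard deviation).
n1, n2 = len(wild_type), len(homozygous)
pooled_sd = np.sqrt(((n1 - 1) * wild_type.var(ddof=1) +
                     (n2 - 1) * homozygous.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (wild_type.mean() - homozygous.mean()) / pooled_sd

# Mice needed per group for 80% power at alpha = 0.05 (two-sided t-test).
n_per_group = TTestIndPower().solve_power(effect_size=cohens_d,
                                          alpha=0.05, power=0.8)
print(f"Cohen's d = {cohens_d:.2f}, ~{np.ceil(n_per_group):.0f} mice per group")

# And an ordinary comparison of the two groups, rather than summing counts.
print(stats.ttest_ind(wild_type, homozygous, equal_var=False))
```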

Well, the method our lab employs is summing up the data (the counted cells) in each group, as opposed to averaging it. This strikes me as wrong: you cannot explore the variation in your sample! A single outlier mouse could account for the entire effect. Not to mention that, in animal research, there is a very high chance that you will not have the same number of samples in each category, and then the sums are not even comparable! So you are left with restricting the analysis to the ‘representative’ samples in order to compare the same number of mice in each category. Doesn’t this sound like an invitation to cherry-picking?
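A toy illustration of the problem, with invented numbers: if one group simply has more mice, its sum wins regardless of the biology, and a single odd sample can swallow the whole comparison, while the means (with their spread) tell a far more honest story.

```python
# Toy illustration (invented numbers) of why summing misleads:
# unequal group sizes and one odd sample dominate the summed comparison.
wild_type = [400, 410, 395, 405, 390]   # five control mice
homozygous = [400, 405, 120]            # three mutants, one odd sample

print("sums: ", sum(wild_type), "vs", sum(homozygous))      # 2000 vs 925
print("means:", sum(wild_type) / len(wild_type),
      "vs", round(sum(homozygous) / len(homozygous), 1))    # 400.0 vs 308.3

# The summed comparison suggests a dramatic loss of cells, but most of the
# gap is just the two missing mice plus one outlier, not the gene deletion.
```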

Heaven, I’m in heaven

These are just a few of the phenomena in our lab that I’ve felt the need to rant out loud about, as I’ve been repeatedly stonewalled in trying to discuss them internally. Nonetheless, I’m trying to keep the battle alive.

As a postscript: some of the people working on those experiments are my friends, and I don’t want to point fingers or disrupt science, reputations and livelihoods (my own included) — as such, please see this short text as an anonymous cry for help, not an accusatory pamphlet or a dagger in the dark. Is all this just the rigidity of an inexperienced researcher, or is there cause for concern?

Yes, heaven, I’m in heaven

And the cares that hung around me through the week

Seem to vanish like a gambler’s lucky streak

When I’m learning how to cherry pick!
