The Error Which Statistics Can’t Save You From: Researcher Bias

Human Irrationality Can Trump Every Generalization Method

Uluç Şengil
4 min read · Aug 15, 2018

When you set out to do Data Science, whatever your aim is, you follow the steps of the scientific method closely:

  1. Ask a question
  2. Do background research
  3. Construct a model
  4. Validate the model using data
  5. Successful? Great! Communicate your results. Not? Bummer. Go back to step 3.

If you get your desired result on your first try, everything is nice and great. Significance and p-values were invented for this, after all. Feel free to boast about how you did everything right and your hypothesis turned out to be correct, then just put your results into the next step of your work. You’re free from this type of model error.

On the other hand, if you get a negative result at step 5 (as with almost any data experiment) and enter the 3–5 loop, you should beware. Like I said, p-values and significance are tools for the loop-less kind of science. Readjusting your model over and over is the perfect opportunity for any kind of bias to slip into your results. The culprit is one of the fundamental truths of doing statistics/data science/econometrics:

Everything is a random variable.

Your coefficients? Of course. Their variances? Naturally. Significance is a function of those two, so you’d better hope your variances’ variances are not too high; otherwise you might find large discrepancies between minor adjustments of your model.
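To see this on a toy scale, here’s a minimal numpy sketch (the sample size, effect size, and noise level are just numbers I picked for illustration): the same data-generating process, resampled over and over, produces t-statistics that bounce around the significance threshold.

```python
# The coefficient, its standard error, and therefore the t-statistic all vary
# from sample to sample, so "significant" vs. "not significant" can flip
# between nearly identical runs of the same experiment.
import numpy as np

rng = np.random.default_rng(0)
n, true_beta = 40, 0.3  # small sample, weak effect (illustrative values)

t_stats = []
for _ in range(1000):
    x = rng.normal(size=n)
    y = true_beta * x + rng.normal(size=n)          # y = beta*x + noise
    sxx = np.sum((x - x.mean()) ** 2)
    beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    resid = (y - y.mean()) - beta_hat * (x - x.mean())
    se = np.sqrt(resid.var(ddof=2) / sxx)           # se of the slope estimate
    t_stats.append(beta_hat / se)                   # significance depends on both

t_stats = np.array(t_stats)
print("share of runs with |t| > 1.96:", np.mean(np.abs(t_stats) > 1.96))
```

With these numbers, roughly half of the runs cross the 1.96 line and half don’t, even though the underlying truth never changes.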

Over time, statistics has evolved several methods to protect itself from over-specified and biased models: separating test and training data, cross-validation, and bootstrapping, to name a few. We know very well that our training error is pretty much a random variable, so using these methods should bring us some rigor, right? Of course. Until you try to optimize your model against your “rigorous” error number. This is where Goodhart’s Law comes into effect: when a measure becomes a target, it ceases to be a good measure. In this particular context, when you make countless readjustments, you’re letting your personal opinion rule over whatever statistical precautions you have taken.
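Here’s a rough sketch of how that plays out, under a deliberately artificial setup I made up for illustration: the target is pure noise, every “readjustment” is just a random feature subset, and yet the variant you’d pick by its validation score looks like it found real signal.

```python
# Goodhart's Law in model selection: all candidates are fit on noise,
# yet the best-scoring one on the validation split looks impressive.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_val, n_feat = 200, 50, 30
X = rng.normal(size=(n_train + n_val, n_feat))
y = rng.normal(size=n_train + n_val)                # no real relationship at all

def fit_and_score(cols):
    """OLS on a feature subset, scored by correlation on the validation split."""
    Xs = X[:, cols]
    beta, *_ = np.linalg.lstsq(Xs[:n_train], y[:n_train], rcond=None)
    pred = Xs[n_train:] @ beta
    return np.corrcoef(pred, y[n_train:])[0, 1]

scores = [fit_and_score(rng.choice(n_feat, size=5, replace=False))
          for _ in range(200)]                      # 200 "readjustments"
print("typical validation score:", round(float(np.mean(scores)), 3))
print("score of the chosen model:", round(float(np.max(scores)), 3))
```

The typical score hovers around zero, as it should; the score of the model you would have kept does not.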

The Ouroboros of Scientific Evidence (from Slate Star Codex)

Unfortunately, any kind of error or precision number you can find is also a random variable. And random variables are surprisingly easy to game without knowing it. If you keep trying to beat a predefined model by making adjustment after adjustment to yours, of course you’ll succeed at some point. Hell, you can do the experiment with a calculator: let the threshold be 0.025 and keep pressing the random button until you get a number lower than that. The chance on each press is low, but eventually you will get such a number. The only difference between doing this and continuously adjusting a model to hit an accuracy/error target is that on the calculator you know full well that what you’re doing is wrong.
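If you don’t have a calculator handy, here is the same experiment in a few lines (the threshold is the one from the text; the number of repetitions is arbitrary):

```python
# Keep "pressing random" until a value falls below the 0.025 threshold.
# It always succeeds eventually; on average it takes about 1 / 0.025 = 40 presses.
import random

def presses_until_below(threshold=0.025):
    presses = 0
    while True:
        presses += 1
        if random.random() < threshold:
            return presses

trials = [presses_until_below() for _ in range(10_000)]
print("average presses needed:", sum(trials) / len(trials))   # roughly 40
```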

So, what’s the solution?

The answer to this is both exceptionally easy and exceptionally hard: Pre-commit.

Write down the models and any possible adjustments you may have for them. Count them, get your adjusted p-values, and conduct your hypothesis tests according to the adjusted values. If you are using error or accuracy values, then treat them as random variables: calculate their mean and variance and compare those to your benchmark value. If you have a huge outlier with much greater accuracy, that’s your successful model. If it’s blended into the other models’ performance distribution and only slightly surpasses the benchmark, it’s not.
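As a concrete sketch of that bookkeeping (the model names, accuracy numbers, and the Bonferroni-style alpha adjustment are illustrative assumptions on my part, not a prescription):

```python
# Pre-commit to a fixed list of models, tighten the significance level
# accordingly, and check whether any candidate is a genuine outlier
# rather than just the luckiest member of the pack.
import statistics

ALPHA = 0.05
BENCHMARK_ACC = 0.71                      # hypothetical benchmark accuracy

# Step 1: write every candidate down *before* looking at results.
accuracies = {                            # pre-committed model -> observed accuracy
    "baseline": 0.70,
    "baseline+feature_A": 0.71,
    "baseline+feature_B": 0.69,
    "baseline+interaction": 0.72,
    "baseline+poly_terms": 0.70,
    "baseline+feature_A+B": 0.81,
}
adjusted_alpha = ALPHA / len(accuracies)  # Bonferroni: one common adjustment
print(f"test each pre-committed model at alpha = {adjusted_alpha:.4f}")

# Step 2: treat accuracy as a random variable. A real winner should beat the
# benchmark AND sit far outside the spread of the other attempts.
for name, acc in accuracies.items():
    others = [a for other, a in accuracies.items() if other != name]
    z = (acc - statistics.mean(others)) / statistics.stdev(others)
    verdict = "clear outlier" if acc > BENCHMARK_ACC and z > 2 else "blended in"
    print(f"{name:24s} acc={acc:.2f}  z={z:+.2f}  -> {verdict}")
```

Only the model that towers over the rest of its own pre-committed family gets called a success; a near-miss that barely edges past the benchmark does not.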

And the most important rule you should abide by is: Do not perform any additional adjustments. Do not perform any additional adjustments. Do not perform any additional adjustments.

If you do, remember the calculator. You’re just pressing the random button again. You’re drifting into territory where statistics can’t protect you from yourself. I don’t underestimate how hard it is to acknowledge that you were wrong, but any further adjustments won’t be right. They’re just another press of “random” on your calculator.

Instead, take a step back. Maybe take a break. Surf the web and maybe read some of the Sequences (whatever you think of LessWrong, you can’t deny that its principles are pretty handy when doing actual science). Appreciate the opportunity you’ve just had to defeat yourself, and feel the possibility of growth. Do anything that might take your mind off the research and let your brain recover. Another time, you can draft new models, pre-commit, and run your hypotheses. Just not now. You need time to formulate new models, after all.

Further Reading

I’ve been largely inspired by Scott Alexander’s article on the rigor of parapsychology and other disciplines, The Control Group Is Out Of Control. His commentary and prose are quite nice, and the Ouroboros image (which I’ve borrowed) really drives the point home.

For more on-the-surface errors in doing Data Science, I’m currently writing several posts on Medium. Last time I posted on Multicollinearity, and I intend to cover several more deviations from the model, each requiring a bit more mathematics.
