The Biggest Misunderstanding about Behavioural Insights

Kristian Sørensen
12 min read · May 26, 2022

Behavioural insights come from experiments. Experiments are done by behavioural scientists and — to an increasing extent — behavioural science practitioners.

The standard method for conducting behavioural experiments is called null-hypothesis significance testing (NHST): you state a hypothesis, collect data, and run a statistical analysis. If that analysis yields what is called a p-value of less than 0.05, you have “discovered” a behavioural insight.

Unfortunately, NHST was never supposed to be applied this way and very few people, researchers and practitioners alike, understand the basic concepts of the procedure.

Feeling confident that you know how NHST works? What a p-value means? As you will see shortly, you have probably fallen victim to your own overconfidence.

If you read this article, you will learn what a p-value means, but there’s a high probability that you won’t want to be anywhere near p-values afterwards.

But why are discussions about null-hypothesis testing and p-values important?

Multiple crises of replication

In 2015, Brian Nosek and 270 peers attempted to replicate 100 published psychological experiments. Although 97 percent of the original studies reported statistically significant results (i.e., p-values of less than 0.05), just 36 percent of the replications did.

Behavioural science fan favourites such as social priming, ego depletion, and the facial-feedback hypothesis seemingly do not pass the ultimate test of scientific credibility: replication.

But failure to replicate isn’t unique to behavioural science and psychology.

Medicine, bio-medical research, economics, neuro-imaging, cognitive neuroscience, environmental and resource economics. The list of research fields experiencing some degree of replication crisis grows longer and longer.

This is nothing new. The replication crises actually began in the 1950s.

But what if a major contributor to the crises is the way experiments are taught, conducted, and assessed in the first place?

According to the late Jum Nunnally from the University of Illinois, even the publish-or-perish problem was well in place in the 1960s: “The ‘reprint race’ in our universities induces us to publish hastily-done, small studies and to be content with inexact estimates of relationships.”

Let’s take a step back and go through a concrete example of setting up a null-hypothesis significance test.

Does branding influence children’s food choices?

Let’s say that we are interested in making children eat healthier foods. We hypothesise that using branding can influence children’s food choices. This hypothesis was tested in a 2012 study by the well-known eating-behaviour researcher Brian Wansink and his colleagues.

The standard — and, as you will see, highly misunderstood — approach goes something like this:

  1. State your hypothesis: Branding with popular characters should cause children to choose “healthy” food more often
  2. Collect some data: Offer children the choice between a cookie and an apple with either an Elmo-branded sticker or a control sticker and record what they choose
  3. Do a statistical test on the data: “The preplanned comparison shows Elmo-branded apples were associated with an increase in a child’s selection of an apple over a cookie, from 20.7% to 33.8% (χ2 = 5.158; P = 0.02)”

Note the last number in the parentheses above. It’s lower than 0.05, which by convention is the threshold for concluding that the difference between the two groups is statistically significant (we’ll return to what a p-value is and what ‘statistically significant’ actually means later).
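To make step three concrete, here is a minimal sketch of that kind of chi-squared test in Python. The counts below are invented to roughly match the reported proportions (20.7% vs. 33.8% choosing the apple); they are not the study’s actual data, so the printed statistic will not exactly reproduce the paper’s χ² = 5.158.

```python
# A hedged sketch of the chi-squared test in step three. The counts are
# invented to roughly match the reported proportions, not taken from the study.
from scipy.stats import chi2_contingency

#                apple  cookie
control_group = [30, 115]   # ~20.7% chose the apple
elmo_group    = [49,  96]   # ~33.8% chose the apple

chi2, p, dof, expected = chi2_contingency([control_group, elmo_group])
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")  # p < 0.05 for these illustrative counts
```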

Lastly, you write up your academic article and end it with a conclusion. In the authors’ words: “… Just as attractive names have been shown to increase the selection of healthier foods in school lunchrooms, brands and cartoon characters could do the same with young children.”

But there’s a very counterintuitive leap from step three to the conclusion that we need to address.

Significance testing is all about how the world isn’t

Our hypothesis is that use of branding increases healthy food choices. But there’s a reason why the experimental approach described here is called null-hypothesis testing.

Instead of testing our hypothesis, NHST flips the hypothesis on its head (a valid thing to do following the shift from classical positivism to falsification, but I’ll leave this point for another article). In this case, the null-hypothesis is that branding does not increase healthy food choices.

So before doing our statistical test, we assume that the null-hypothesis is true; we live in a world where children’s choice of food is not influenced by branding.

We then look at the data and determine whether the data are sufficiently unlikely under the null hypothesis that we can reject the null in favour of the alternative hypothesis, which is our hypothesis of interest. Or, as Google’s Cassie Kozyrkow asks: “Does the evidence that we collected make our null hypothesis look ridiculous?”.
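One way to make that question tangible is to simulate the null world directly. The sketch below reuses the made-up counts from the chi-squared example and asks how often a world in which branding does nothing produces a gap between the groups at least as large as the one observed; that proportion is a Monte Carlo p-value.

```python
# A minimal simulation of "how unlikely are data like ours if the null is true?"
# Uses the same illustrative, made-up counts as the chi-squared sketch above.
import numpy as np

rng = np.random.default_rng(42)

n_control, apples_control = 145, 30
n_elmo, apples_elmo = 145, 49
observed_gap = apples_elmo / n_elmo - apples_control / n_control

# Null world: branding does nothing, so both groups share one apple-choice rate.
pooled_rate = (apples_control + apples_elmo) / (n_control + n_elmo)

n_sims = 100_000
sim_control = rng.binomial(n_control, pooled_rate, n_sims) / n_control
sim_elmo = rng.binomial(n_elmo, pooled_rate, n_sims) / n_elmo

# How often does the null world produce a gap at least as large as the observed one?
p_value = np.mean(np.abs(sim_elmo - sim_control) >= observed_gap)
print(f"Simulated two-sided p-value: {p_value:.3f}")
```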

Even though this seems like a relatively well-designed experiment, Wansink and colleagues’ article was retracted in 2017. By 2018, Wansink had had 15 studies retracted.

“One of the worst things that ever happened in the history of psychology”

An even simpler ‘recipe’ for NHST, though stated in technical terms, is the following (a code sketch of the ritual appears after the list):

  1. Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.
  2. Use 5% (i.e., p < 0.05) as a convention for rejecting the null hypothesis. If significant, accept your research hypothesis. Report the result as p < 0.05, p < 0.01, or p < 0.001 (whichever comes next to the obtained p-value).
  3. Always perform this procedure.
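Written as code, the ritual amounts to nothing more than a mechanical threshold check. The function below is a deliberately crude sketch of that procedure, shown to illustrate the criticism rather than to recommend it:

```python
# The "null ritual" as a mechanical procedure: a deliberately crude sketch,
# shown to illustrate the criticism, not as a recommendation.
def null_ritual(p_value: float) -> str:
    if p_value >= 0.05:
        return "report 'not significant'"
    # Significant: accept the research hypothesis and report the nearest threshold.
    if p_value < 0.001:
        return "accept the research hypothesis; report p < 0.001"
    if p_value < 0.01:
        return "accept the research hypothesis; report p < 0.01"
    return "accept the research hypothesis; report p < 0.05"

print(null_ritual(0.049))  # "significant"
print(null_ritual(0.051))  # "not significant", a hair's breadth away
```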

This recipe is what Gerd Gigerenzer aptly called the “null ritual”. It has been widely used across multiple scientific domains. Unfortunately, it’s really bad science.

The p-value and the “less than 0.05” criterion come from Ronald Fisher, arguably one of the most influential statisticians ever. He pioneered the modern design of experiments, originally to figure out which kind of fertiliser led to the best crops. Brilliant as he undoubtedly was, he has received his fair share of criticism.

Paul E. Meehl, an American clinical psychologist, wrote in 1978:

“Sir Ronald has befuddled us, mesmerized us, and led us down the primrose path. I believe the almost universal reliance on merely refuting the null hypothesis … is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology.”

Robert Matthews, a British physicist, stated in 1998:

“The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug.”

But what did Fisher actually say about null-hypothesis testing?

What Fisher actually said

Fisher’s original null-hypothesis testing consists of the following steps (emphasis added):

1. Set up a statistical null-hypothesis. The null need not be a nil hypothesis (i.e., zero difference).

2. Report the exact level of significance (e.g., p=0.051 or p=0.049). Do not use a conventional 5% level, and do not talk about accepting or rejecting hypotheses.

3. Use this procedure only if you know very little about the problem at hand.

Fisher, and many statisticians after him, argued that scientists should communicate the exact level of significance and never base the results of an experiment on a conventional level of significance:

“. . . no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”
— Ronald Fisher

In other words, Fisher’s original null-hypothesis testing is widely misunderstood, and the criticism aimed at him seems somewhat misplaced.

(The “null-ritual” is actually a hybrid of two separate approaches to hypothesis testing: Ronald Fisher’s original null hypothesis testing, and Jerzy Neyman and Egon Pearson’s decision theory. Allegedly, if Fisher and Neyman-Pearson agreed on anything, it was that statistics should never be used mechanistically by following the same ‘recipe’ for every experiment).

The (low) likelihood that you know what p-values actually mean

There’s been a lot of debate about ritualistic use of hypothesis testing and p-values. As early as 1966, David Bakan wrote that: “The psychological literature is filled with misinterpretations of the nature of the test of significance.”

He even caveated his article noting that “What will be said in this paper is hardly original” — in 1966!

If you work with behavioural science in academia, you know what p < 0.05 means.

But how confident are you, really, that you know what a p-value is?

The p-value is a statistical measure that has been (mis)used to evaluate experimental results ever since Fisher introduced it in 1926. He suggested drawing the line roughly where a result as extreme as the one observed would occur by chance no more than once in twenty trials (P < 0.05).

Surely, 50-plus years of rigorous scientific work and university teaching must have reduced these “misinterpretations of the nature of the test of significance” that Bakan wrote about more than 50 years ago?

(The next passage is a bit more technical and can be skipped without missing the most important arguments of this article.)

Even statistics professors get it wrong

Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say, 20 subjects in each sample). Further, suppose you use a simple independent means t-test (a very simple statistical test) and your result is significant (t = 2.7, d.f. = 18, p = 0.01).

Please mark each of the statements below as “true” or “false.” “False” means that the statement does not follow logically from the above premises. Also note that several or none of the statements may be correct.

  1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means). [true / false]
  2. You have found the probability of the null hypothesis being true. [true / false]
  3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means). [true / false]
  4. You can deduce the probability of the experimental hypothesis being true. [true / false]
  5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision. [true / false]
  6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions. [true / false]

In 2002, Heiko Haller from the Department of Educational Science and Psychology, Free University of Berlin and Stefan Krauss from the Max Planck Institute for Human Development, posed the above question to 44 psychology students, 39 professors and lecturers of psychology (not teaching statistics), and 30 statistics teachers, including professors of psychology, lecturers, and teaching assistants.

Teachers and students were from the psychology departments at six German universities. Each statistics teacher taught null-hypothesis testing, and each student had successfully passed one or more statistics courses in which it was taught.

[Figure: Percentage of participants in each group who endorsed one or more of the six false statements about the meaning of “p = 0.01” (Gigerenzer et al. 2004; Haller and Krauss 2002)]

80% of the professors and teachers teaching statistics got at least one of the statements wrong.

But a lot has changed since 2002. Clearly, in the age of online courses and easy access to top-quality university teaching, researchers must have learned what p-values mean!

A study from February 2020 suggests that statistical literacy hasn’t really improved.

Regardless of academic degree, research field, or career stage, most of the 1,479 respondents could not interpret p-values and confidence intervals correctly.

And perhaps more importantly, most of the participants were very confident about their (inaccurate) judgements, i.e., they were overconfident.

But how can such widely used statistical measures, p-values and confidence intervals, be so misunderstood?

Misconceptions, apparently, are easy to reproduce

In an analysis of 30 introductory psychology textbooks, including some of the best-selling volumes in North America, scientists based at the University of Guelph found that the vast majority defined or explained statistical significance inaccurately.

For instance, within three pages of text, Nunnally (1975, pp. 194–196; italics in the original) used all the following statements to explain what a significant result such as 5% (p < 0.05) actually means:

  • “the probability that an observed difference is real”
  • “the improbability of observed results being due to error”
  • “the statistical confidence… with odds of 95 out of 100 that the observed difference will hold up in investigations”
  • “the danger of accepting a statistical result as real when it is actually due only to error”
  • the degree to which experimental results are taken “seriously”
  • the degree of “faith [that] can be placed in the reality of the finding”
  • “the investigator can have 95% confidence that the sample mean actually differs from the population mean”
  • “if the probability is low, the null hypothesis is improbable”
  • “all of these are different ways to say the same thing”

Easy, right?

Apparently, a finding that is highly reproducible is that very few people understand p-values and confidence intervals.

Now, what?!

Statisticians have issued warnings over the misuse of p-values, and some journals even refuse to publish papers that contain p-values.

Jeffrey Spence and David Stanley have tried a more pedagogical approach. They propose the following way of explaining what statistical significance means to a non-technical audience:

“In other words, concluding that something is “statistically significant” is not dissimilar from saying, there is now some reason to believe that the effect is non-zero. I cannot say what it is, it just may not be zero. Effect sizes and confidence intervals can give information about what the effect may be, but statistical significance alone does not provide information about how large an effect may be — it just MAY not be zero.”

Or, we should just use p-values as Fisher originally intended: “… the operational meaning of a P value less than .05 was merely that one should repeat the experiment. If subsequent studies also yielded significant P values, one could conclude that the observed effects were unlikely to be the result of chance alone. So “significance” is merely that: worthy of attention in the form of meriting more experimentation, but not proof in itself.”

Think about that for a moment. The basic criterion for getting the results of an experiment published in top academic journals has actually been “this result is worthy of attention in the form of repeating the experiment.” This is far from how the public and communicators of pop-behavioural science think and talk about “behavioural insights”: as if they were universal laws of human behaviour.

Here, we get at a more fundamental reason why statistical significance as a concept is easy to misunderstand.

There are no universal laws in (behavioural) science

The problem is more human and cognitive than it is statistical. Bucketing results into ‘statistically significant’ and ‘statistically insignificant’ makes people think that the items assigned in that way are categorically different: the result of an experiment is either significant or not; a behavioural insight is either there or not.

One of our field’s top communicators, Koen Smets, stated this argument very clearly: “… for some reason, behavioural science seems to produce, in some people, a naïve belief that a single experimental finding at once gives rise to a universal law. This phenomenon appears to afflict novices as well as more experienced people, impairing their ability to think critically and question their hasty conclusions.”

The establishment of The Global Association of Applied Behavioural Scientists is undoubtedly a good development for our field. On its website it states that “[a] need exists to safeguard and maintain the quality and standards of applied behavioural scientists”.

I hope that “increasing statistical literacy” is to be found somewhere on their agenda.

Improving your own statistical literacy

I fully admit to having been influenced by my own overconfidence when it comes to statistics. I passed a course during my master’s where we even calculated p-values by hand.

However, having dug into the history and research on statistical literacy, I have updated my own level of confidence: I know much less than I thought I did. I am already looking forward to comments from much smarter people on what I’ve misunderstood about the criticisms of NHST and p-values presented here, and about those I haven’t covered (e.g., effect sizes).

During this journey, I also discovered that there are many brilliant people out there who are actually trying to change the paradigm.

I’ve become a true believer in the Bayesian approach to inference. Instead of only asking how surprising the data are if our null hypothesis is true, the Bayes factor tells us how surprising the data are under the null hypothesis relative to how surprising they are under the alternative hypothesis. In other words, it is a measure that actually says something about the hypothesis we’re interested in: our behavioural insight.
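As a toy illustration of the idea (not an analysis of any study discussed above), consider a one-sample binomial example: a point null of a 50/50 choice rate against an alternative that spreads its bets over all possible rates via a uniform prior. The Bayes factor is simply the ratio of how well each hypothesis predicted the observed count.

```python
# Toy Bayes factor for a binomial outcome: H0: theta = 0.5 versus
# H1: theta ~ Uniform(0, 1). Illustrative numbers only.
from scipy.stats import binom

k, n = 60, 100  # e.g., 60 of 100 "successes"

# How well did each hypothesis predict the observed count?
marginal_h0 = binom.pmf(k, n, 0.5)
# Under a uniform prior, the marginal likelihood is 1 / (n + 1) for every k
# (a standard Beta-Binomial result).
marginal_h1 = 1 / (n + 1)

bf10 = marginal_h1 / marginal_h0
p_two_sided = 2 * binom.sf(k - 1, n, 0.5)  # two-sided p by doubling the upper tail

print(f"Two-sided p-value ~ {p_two_sided:.3f}")
print(f"Bayes factor BF10 ~ {bf10:.2f}")
```

For these made-up numbers the p-value sits right around the conventional threshold while the Bayes factor is close to 1: the data are roughly equally surprising under both hypotheses, which is a very different message from ‘significant or not’.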

If you want to improve your own statistical literacy, here are some resources that I highly recommend.

Articles

  • Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606.
  • Spence, J. R., & Stanley, D. J. (2018). Concise, simple, and not wrong: In search of a short-hand interpretation of statistical significance. Frontiers in Psychology, 9, 2185.

Books

  • McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC press.
  • Poldrack, R.A. (2018). Statistical Thinking for the 21st Century. (Online book with examples in R)
  • Ziliak, S., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press.

