Your data processing choices will influence measurement reliability

But by how much? And in a predictable way? Is measurement broken?

TL;DR: When processing data you make many choices. These choices can have huge effects on the reliability of your measurement, but I’m not convinced that these influences are systematic.

Over the weekend, 102 stunningly amazing people responded to a Twitter poll. I asked how much influence data-processing steps - such as removing outliers and RT trimming - would have on the reliability of the outcome measure. The responses largely reflected the belief that the way we process data can have massive implications for the reliability of our measurements.

I expected that the modal response would be for the .3 range. I did not expect that so many would take the “measurement is broken” option. There go my misconceptions about being a special little boy with a uniquely pessimistic view on measurement. I mean, thinking it’s likely that reliability estimates could vary from, say, .1 to .8 in a single dataset (depending on data-processing decisions) feels very “everything is on fire”. It’s nice having company though ;)

Let’s dive into the results

The data were taken from Hedge, Powell, and Sumner’s (2018) awesome “reliability paradox” paper. Here, I’ll show results from the Stroop task — watch out for my preprint for other tasks, more detail, and deeper thought. Also, I’ll be sharing code. So you, the kind reader, will be able to run and visualise these multiverses with minimal coding.

In total I visualised 256 possible data-processing specifications. Many more are possible, and some of those included are unlikely combinations in practice. All possible combinations of the following were used:

  • Total accuracy cutoffs (no cutoff, or excluding participants below 50%, 80%, or 90% accuracy)
  • Minimum RT cutoffs (100 ms or 200 ms)
  • Maximum RT cutoffs (2000 ms or 3000 ms)
  • Relative RT cutoffs (none, or 1, 2, or 3 standard deviations from the mean)
  • Where the relative cutoff is applied (within trial types, or at the subject level)
  • Averaging method (mean or median)

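These options multiply out as a simple Cartesian product. A minimal Python sketch of the enumeration (the original analysis used R; the variable names here are my own, purely illustrative):

```python
from itertools import product

# One value list per data-processing decision (matching the list above)
accuracy_cutoffs = [None, 0.50, 0.80, 0.90]   # exclude participants below this overall accuracy
min_rts = [100, 200]                          # ms; trials faster than this are dropped
max_rts = [2000, 3000]                        # ms; trials slower than this are dropped
sd_cutoffs = [None, 1, 2, 3]                  # relative cutoff, in SDs from the mean
sd_levels = ["trial_type", "subject"]         # where the relative cutoff is applied
averaging = ["mean", "median"]                # how trial RTs are aggregated

specifications = list(product(
    accuracy_cutoffs, min_rts, max_rts, sd_cutoffs, sd_levels, averaging
))
print(len(specifications))  # 4 * 2 * 2 * 4 * 2 * 2 = 256
```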
Once we have the specifications, things get schwifty¹. We process the raw data according to each one. Then the data is passed to the splithalf function, which estimates the internal consistency of the outcome measure as the average of a large number of random split halves (only 50 in this example to keep processing time down).
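The split-half logic itself can be sketched as follows. This is a simplified Python stand-in for the splithalf R package, under my own assumptions: randomly split each participant’s trials in half, correlate the half-scores across participants, apply the Spearman-Brown correction, and average over many random splits. All function names and the data format are illustrative.

```python
import random
import statistics

def correlation(xs, ys):
    # Pearson correlation between two equal-length lists
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def splithalf_estimate(data, n_splits=50, seed=1):
    """data: {participant: [trial scores]}. Returns the mean Spearman-Brown
    corrected split-half correlation over n_splits random splits."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_splits):
        half1, half2 = [], []
        for trials in data.values():
            shuffled = trials[:]
            rng.shuffle(shuffled)
            mid = len(shuffled) // 2
            half1.append(statistics.mean(shuffled[:mid]))
            half2.append(statistics.mean(shuffled[mid:]))
        r = correlation(half1, half2)
        estimates.append(2 * r / (1 + r))  # Spearman-Brown correction
    return statistics.mean(estimates)
```

With toy data where between-participant differences dwarf trial noise, the estimate lands near 1; real task data, as we’ll see, behaves far less kindly.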

A quick guide to the multiverse plot: The colourful bottom panel indicates the combination of decisions used to yield the reliability estimate plotted on the top panel. The higher the dot, the more reliable the task (the grey shaded area is a 95% confidence interval around the estimate).

The range of estimates runs from .67 on the far left to .92 on the far right. The 38% that voted for a middling range of .3 win the ultimate prize: bragging rights in a Twitter poll.

I’m still processing my own thoughts on how impactful and/or terrible this is. There are some minor patterns in the results: relative RT trimming at the trial level seems to be more appropriate. At the same time, though, a cutoff of one standard deviation from the mean tended to be the best option. Have you ever seen research use such a conservative cutoff? The other options seem to have no consistent influence.

Obviously the data processing has a decent impact; a range of ~.3 in reliability is not trivial. So the decisions must be important. But this importance goes hand in hand with some degree of unpredictable arbitrariness.

This troubles me; sometimes I can’t sleep. What do our numbers even mean?

Hedge and colleagues collected data at two timepoints; this might save us! If the decisions have a consistent impact on reliability, I might be able to hope again. Maybe the 26% of respondents who think “measurement is broken” are wrong, and we are safe and sound.

Red = time 1. Blue = time 2

Or, maybe not? There seems to be a rough trend, but hardly a reassuring coherence between reliability estimates at both times.

Measurement matters. The reliability of our measures matters. Without reliable measures, individual differences research is difficult, if not impossible (check out this awesome preprint from Rouder, Kumar, & Haaf).

I don’t have any conclusions yet. I mainly have a kind of dull dread. I do have open questions I find myself muddling through:

  • Measurement heterogeneity is an issue. Could data-processing heterogeneity across studies also be a big problem? Not only because different processing choices might yield different reliabilities, but also because the same processing specification might yield very different reliabilities in different datasets?
  • How bad is this for other tasks?
  • How much of an impact will we see on analyses using these measures?
  • Will hierarchical models save the day? Or at least help account for this variability?

If ignoring measurement reliability is a real-life horror story, realising that data processing decisions can have such large, unpredictable, almost arbitrary influences on reliability might just be the horror story sequel nobody was waiting for.

Horror stories always have a sequel…

Epilogue: Getting scarier with the dot-probe

Was that too positive for you? Well, let’s quickly dive into some dot-probe data. I’ve said and written horrible things about the continued use of the dot-probe task. I have half-finished blog posts about why we should move on from using it in most research, usually written in a rage haze with titles like ‘the dot-probe is dead’ and ‘you can’t polish a turd’. No, it should not be described as the gold standard in assessing selective attentional bias (yes, I have seen it described like this in paper abstracts). This is part of the reason.

Applying the same specifications as above to dot-probe data (from a sample of around 500 adolescents), reliability estimates ranged from -.2 to .66, with a median around .05. Let me repeat: a range of .86.

Let that sink in.

A range of Zero. Point. Eight. Six.

I don’t even have a real conclusion here. Some measures are just so unreliable that even the reliability is unreliable.

[1] Unnecessary Rick and Morty reference
