A Primer on Kok et al. 2013. Part 1: The Paper.

So, a week ago, I wrote a general introduction to writing a Letter to the Editor. But what actually HAPPENS in one?

Here’s how the sausage is made.

******************************************

After a year, a series of reviews which ranged from positive to indifferent to hilariously negative, far more correspondence than a sane person would admit is healthy, and then some extra waiting around, our commentary on Kok et al. (2013) was recently published by Psych Science.

(When I say “our”, I mean myself, Nick, Harris, and Jim. But any mistakes here — pronouns, distortions or vitriol — are all on me.)

You can read it here if you have a subscription, or a draft here if not.

It is a critical commentary, and as a consequence it’s something about which my motives might be interrogated. I’m faintly amused that this is the case… but never mind that for now.

For the record, I’ll outline here what happened, what my problems are with this paper, and my motivation for writing a commentary in the first place. Letters to the Editor are often terse to the point of being grotesque due to word limits, so it’s good to explain things fully.

The original paper the letter concerns is HERE. The abstract reads:

The mechanisms underlying the association between positive emotions and physical health remain a mystery. We hypothesize that an upward-spiral dynamic continually reinforces the tie between positive emotions and physical health and that this spiral is mediated by people’s perceptions of their positive social connections. We tested this overarching hypothesis in a longitudinal field experiment in which participants were randomly assigned to an intervention group that self-generated positive emotions via loving-kindness meditation or to a waiting-list control group. Participants in the intervention group increased in positive emotions relative to those in the control group, an effect moderated by baseline vagal tone, a proxy index of physical health. Increased positive emotions, in turn, produced increases in vagal tone, an effect mediated by increased perceptions of social connections. This experimental evidence identifies one mechanism—perceptions of social connections—through which positive emotions build physical health, indexed as vagal tone. Results suggest that positive emotions, positive social connections, and physical health influence one another in a self-sustaining upward-spiral dynamic.

I saw this paper in October 2013. It was immediately obvious from a superficial reading that it had a few problems.

Background 1 — HRV, what is it?

If you’re familiar with HRV, skip ahead.

If you’re not familiar with HRV (Heart Rate Variability), the Wiki is not too bad.

If you’re lazy, here’s the ten-second explanation: there are patterns present in the heart beat over time, generally quasi-regular cycles which are driven primarily by respiration and blood pressure — all these systems coordinate into what can be thought of as ‘general autonomic management’.

We can make some reasonable estimates about someone’s overall autonomic state by analysing these patterns. For instance, we see altered HRV during exercise, during stress, during anxiety or panic, during close attention or concentration, and so on. HRV is used widely in electrocardiology (where I work now), exercise physiology, psychophysiology, medical science, etc.

It’s as common as dirt — there are more than a thousand papers a year published using HRV.

My entire PhD was on HRV. It took hours.

Background 2 — HRV, does it work?

That is, does it tell us anything meaningful?

Sometimes.

HRV is easy, but not simple.

Let’s not pretend that it’s hard to record normal heart rate data and process it into simple rhythms. The technique is more than 100 years old, and the algorithm we use for processing the raw electrical signal of the heart into beats is 30 years old. The challenges in working with HRV are ALL in the control and interpretation of the data.
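
If you want to see just how simple, here’s a minimal sketch (in Python) of two bog-standard time-domain measures, assuming you already have the beat-to-beat intervals. The function and the made-up intervals are mine, purely for illustration:

```python
import numpy as np

def basic_hrv(rr_ms):
    """Two common time-domain HRV measures from RR intervals (milliseconds)."""
    rr = np.asarray(rr_ms, dtype=float)
    sdnn = rr.std(ddof=1)                       # SDNN: overall variability
    rmssd = np.sqrt(np.mean(np.diff(rr) ** 2))  # RMSSD: beat-to-beat variability
    return sdnn, rmssd

# Hypothetical resting recording: ~900 ms beats with a respiratory oscillation
rr = 900 + 50 * np.sin(np.linspace(0, 20 * np.pi, 300))
print(basic_hrv(rr))
```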

Specifically, the information we get from HRV can be very, very messy. There is a catalogue of reasons for this, and if you want to take a few years off your life with the details, read my papers on it here and here.

There are four situations where I think HRV excels:

  1. when you have massive samples to draw on (hundreds or thousands of participants make it easier to observe weak or messy effects), and…
  2. when your experimental variables are VERY carefully controlled, as they often are in respiratory and circulatory physiology, and…
  3. when you take a very large number of measurements within the same participant (for instance, taking heart rate every morning and evening for 30 days, or measuring 8 conditions within the same experiment), and finally…
  4. when you have very strong effects to observe, like drugs which progressively destroy HRV until it is essentially zero (e.g. atropine), or heavy exercise with insufficient recovery (HRV trends to zero above a heart rate of about 110BPM).

In laboratory work, I shoot for a combination of 2 and 3 — a fairly small number of participants, controlled as rigidly as humanly possible, taking a very large number of measurements for each person. In general population studies, I think the best solution is to double down on 1 — which means using portable devices to gather MASSIVE samples of data (hundreds or thousands of participants instead of dozens).

In this work, you try to control whatever variables you can, and then:

a) try to get as many people as conceivably possible (and pay particular attention to problems of over-powering and false ‘significance’) and,

b) make sure you have a meaningful experimental intervention.

If there’s an overarching problem with work in this field, it’s that while people don’t do a), they REALLY don’t do b).

Working with HRV encourages lazy thinking — people smash together any old scenario with ‘physiological measurement’, cross their fingers and hope.

Consequently, as a technique HRV is absolutely brilliant at producing astounding amounts of data for very little input. If you want a garden-of-forking-paths situation, well, HRV is a landscaping company that delivers projects on time and under budget.

Specific problems with Kok et al. (2013).

Problem #1: Bad experimental control

To cut a long story short, there is a laundry list of factors which affect the magnitude of any given type of HRV over time. We try to control or eliminate these factors by selecting the right sample, and by paying attention to variables like time of day, food and water consumption, medication, co-morbid conditions, and so on.

You’d think this sort of basic experimental control would be well established in the field, but it really isn’t. HRV is used primarily in the ‘softer’ ends of the psychological and medical sciences by researchers who do not consider themselves biologists. So biologically meaningful covariates are often overlooked.

(I should clarify the above — there are good reasons to control experimental variables as per my laundry list, but sometimes it is impossible. In massive samples especially, it is totally impractical. If you want to take heart rate recordings from hundreds or thousands of people, you are forced to accept that your degree of experimental control is diminished and, like Donald Rumsfeld, you go to war with the data you have, not the data you want.)

Kok et al. didn’t get a massive sample; it was fewer than 80 participants in 2 groups. This is precisely the type of experiment where basic experimental controls are perfectly possible, and also a small enough sample that such control is necessary. HRV is not so robust a technique that a small, uncontrolled sample will just work by magic — quite the opposite.

It’s a continual disappointment to find papers which lack basic standards of measurement rigour. It’s a common occurrence… but also not the end of the world. I try not to worry about it too much overall.

Problem #2: Where the hell are the means?

It is self-evident why you’d need sample means to interpret an experiment. Kok et al. don’t report any.

This is a problem not just because means are the most basic descriptive feature of data, but because they allow you to reference work to an external standard.

For instance, most standard methods of turning heartbeats over time into HRV (i.e. natural logarithm of power spectral density between 0.15–0.4Hz, FFT, Welch’s periodogram, 4Hz resampling, etc.) would give values very close to 6.5, with a standard deviation of 1. I’ve pulled values like this out of literally dozens of separate experimental groups. For low-frequency power during slow breathing, the value can be much higher (a mean of up to 8, values up to 10). This might seem close to 6.5, but remember these are LOG units — in raw units, the difference is colossal.
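
To make that recipe concrete, here’s a minimal sketch of it in Python, assuming you start from RR intervals in milliseconds. The function name and parameter choices are mine, for illustration only:

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import welch

def ln_hf_power(rr_ms, fs=4.0, band=(0.15, 0.40)):
    """Natural log of high-frequency spectral power from RR intervals (ms)."""
    t = np.cumsum(rr_ms) / 1000.0                     # beat times, in seconds
    t -= t[0]
    grid = np.arange(0.0, t[-1], 1.0 / fs)            # evenly spaced 4 Hz grid
    rr_even = interp1d(t, rr_ms, kind='cubic')(grid)  # resample the RR series
    rr_even -= rr_even.mean()                         # remove DC before the PSD
    f, psd = welch(rr_even, fs=fs, nperseg=256)       # Welch's periodogram (FFT)
    mask = (f >= band[0]) & (f < band[1])             # 0.15-0.4 Hz band
    return np.log(np.trapz(psd[mask], f[mask]))       # integrate, then log
```

On clean resting data, this is the kind of calculation that should land near the 6.5 mentioned above.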

Means tell you what you’re dealing with, and reflect on the accuracy of the method. There is no reason not to include them. This paper doesn’t.

What is heavily implied in the paper is that their chosen form of meditation increased mean HRV. Certainly this is how the paper was widely represented. But curiously, once I got past a first glance, I realised this wasn’t explicitly stated. Tucked away in the supplementary materials, there was a quick mention of the overall regression model changing… but absolutely nowhere was it mentioned what the values in that regression actually were.

Surely if you performed an intervention, and the intervention changed an independent variable you were measuring straightforwardly, you’d say so. But no: not so much as a hint of the values (or any measure of variance) can be found. And this obviously isn’t due to space — not only was there a supplementary section, but a large chunk of the paper was taken up with graphs which flirt with being nonsensical.

(That isn’t hyperbole: there’s almost two pages of the paper taken up with these graphs of quartile comparisons. In a heavily space-limited paper, this is a really strange decision. We never figured out why they were necessary. If you have the original paper, see pages 1129 and 1130.)

Regardless of circumstances and formatting decisions, leaving out basic descriptive statistics will always, always, always make me suspicious.

Problem #3: HRV values were square-root transformed

This was a real puzzler. Populations of HRV frequency values are always skewed, usually quite heavily — roughly, you get a mean of about 1000 and an SD of about 1000 (often a bit more). That’s quite a lot when you keep in mind that HRV measures are almost always some transform of variance, so the lowest possible value is zero. So, this paper square-root transformed the data to remove the obvious skew.

The problem with this is simple: it doesn’t work.

Instead, a square-root transform just sort of squishes the skew a bit without removing it. I could only think of one paper that ever tried this (Kuo et al. 1999), and it says:

Most current applications analyze absolute measurements of HRV without any mathematical (log, square root) transform. However, it had long been noted that HRV measurements seem to distribute in a nonnormal pattern, most likely logarithmic normal (5)

I racked my brains trying to think of another paper with a square-root transform (and then gave up, Googled around a lot, and found a grand total of one more: Hayano et al. 1991). As far as I can tell, that’s the lot — there is no such thing as a square-root transform to remove skew in this context. Everyone, and I do mean everyone, uses a log transform to do this, the same as for a raft of other biological variables. We ended up putting a Q-Q plot of this in our article, and the wrong transform just looks bizarre.
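
If you want to see the difference for yourself, here’s a quick simulation assuming the log-normal shape and the rough mean/SD figures quoted above (the numbers are invented, not Kok et al.’s data):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Simulated HF power values: log-normal, roughly mean ~1000 ms^2, SD ~1400
hf = rng.lognormal(mean=6.5, sigma=1.0, size=5000)

print(f"raw:  skew = {skew(hf):.2f}")           # heavily right-skewed
print(f"sqrt: skew = {skew(np.sqrt(hf)):.2f}")  # squished a bit, still skewed
print(f"log:  skew = {skew(np.log(hf)):.2f}")   # approximately symmetric
```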

This may or may not be a problem for an analysis; it’s hard to say.

However, to me it’s a little bit like Van Halen’s “no brown M&Ms” contract stipulation.

That is, it’s one of those little details which — to me — says very clearly: insufficient attention to detail. Every field has something equivalent — a wrong reagent, a reference which is misinterpreted, a central misconception — some commonly overlooked detail. Generally these make you interrogate the paper more closely, because they shouldn’t be there.

It isn’t, of course, a high crime. Or even really a misdemeanour.

Problem #4: High frequency HRV down to 0.12Hz is a real problem in relaxed people

Bear with me, this next part is boring but necessary.

Most people breathe somewhere between 0.2Hz and 0.33Hz (i.e. on about a 3–5 second cycle). As a consequence, to measure their HRV due to respiration (respiratory sinus arrhythmia; RSA), we most commonly measure the spectral power present between 0.15Hz and 0.4Hz — and most people’s respiratory cycle affects their heart rate in that range most of the time.

However, this paper uses a method which extends that lower boundary down to 0.12Hz. I’ve adapted a graph from one of the classic papers in cardiorespiratory activity below so you can see what I mean.

Gleefully adapted from Brown et al. 1991

The x-axis is Hz, the y-axis is power spectral density. I’ve labelled the individual points with their values in Hz (so, the x-axis values). The line sliding downwards means that as respiration gets faster, HRV due to breathing gets smaller. Let’s not dive any further into the physiological evidence bin than that.

The red line is the typical breathing range, the green line is the frequencies we typically measure to capture that range, and the blue line is the expanded range used here.

The two lines (closed vs. open) are two different tidal volumes (i.e. how much air goes in), and they’re very similar. What you can see is that at the same volume, slower breathing produces bigger respiratory fluctuations (and more spectral power), and the cut-off for this increase can be estimated at about 0.15–0.18Hz.

So we might assume, on this basis, that if breathing patterns don’t change then we don’t have much of a statistical problem. Non-parametric methods are OK at handling a few crazy values if they turn up, and likewise we can always throw the datapoints away because they’re not measuring what we want to measure (naturally, you’d want to state you were doing this in the paper, and not just “disappear” them like some kind of cardiac Mafia).

(We have to do this all the time in HRV, because a lot of people don’t have normal sinus rhythm — in this context, that means their heart beat can’t be considered a process exclusively managed by the autonomic nervous system in the normal way. Idiopathically, this is common in both fit people and young people, both of whom turn up all the time in university-based samples.)

The real problem comes when you ask your study participants to relax.

Relaxation changes the way people breathe. Even someone who’s never meditated before, when they walk into your experiment and you say “we’re going to get a relaxed measurement / measurement at rest / baseline etc. etc.”, will occasionally alter their breathing without being asked. Here’s a really good recent example of someone who did this — they were already on the ECG before the baseline started, so you can see the comparison:

Let’s go through this backwards.

We see the heartbeat in the lower panel C — can you see a totally different pattern before and after the dotted lines? Of course. Well, that’s a normal experimental baseline where someone’s been told ‘just sit still so we can get a baseline’. That’s it. No further instructions. Nothing mentioned about breathing.

In the middle panel B, you can see what is making the heart behave differently: the respiratory cycle. The speed of each cycle slows down drastically as our participant is told to ‘relax’ for a baseline. The critical threshold, where we know the breathing will start to heavily distort the normal breath-to-heart-rate relationship, is the horizontal dotted line.

Finally, the top panel A shows the resultant changes in HRV.

Well, if this is such a problem, surely it’s been investigated… right?

You betcha. The first proposals for using an ‘active’ baseline where participants perform a really easy task instead of just relaxing are at least 20 years old, like this one. We still use this task in pediatric and developmental samples. The point of giving subjects (especially children, who fidget on a professional level) a trivially easy task is to have people minimally engaged, doing something incredibly straightforward, instead of forcing them to relax. A lot of people, of course, use paced breathing at baseline as well.

So, what do we have all up?

  1. We are measuring HRV down to 0.12Hz, below normal respiration range.
  2. People who are sat down and told to ‘relax’ behave differently.
  3. The experimental group have just spent several weeks in some kind of meditative practice.

Altogether, that’s something of a minefield — measuring to a lower bound of 0.12Hz means slow breathing in the experimental group is a lot more likely to get measured and presented as evidence of ‘changed autonomic activity’.
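
A toy demonstration of the point, with everything invented for illustration: a simulated participant breathing slowly at ~0.13 Hz, which the standard 0.15Hz lower bound (correctly) excludes, and the extended 0.12Hz bound sweeps in:

```python
import numpy as np
from scipy.signal import welch

fs = 4.0
t = np.arange(0, 300, 1 / fs)                 # 5 minutes of a 4 Hz RR series
# Hypothetical relaxed participant breathing slowly, at ~0.13 Hz
rr = 900 + 80 * np.sin(2 * np.pi * 0.13 * t)  # RSA-like oscillation, in ms

f, psd = welch(rr - rr.mean(), fs=fs, nperseg=256)

for lo in (0.15, 0.12):                       # standard vs. extended lower bound
    mask = (f >= lo) & (f < 0.40)
    print(f"lower bound {lo} Hz: 'HF' power = {np.trapz(psd[mask], f[mask]):.0f} ms^2")
```

Same person, same breathing; very different ‘high-frequency’ power depending on where you draw the line.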

The issues go a lot deeper than this… I’d refer anyone still interested to Teodor Buchner’s paper from 2011 for background on how these issues fit together.

SEP Field — Someone Else’s Problem

Having formed the opinions outlined above, probably fairly quickly because they’re by no means unique to this paper, I promptly forgot about the whole thing.

Why? SEP field.

I’m not a social psychologist, and if I tried to correct every bad HRV paper I saw, I’d wear my fingers down to little nubby ends. HRV is very popular in a number of scientific areas, and very popular means lots of papers, and lots of papers means a subset of those will be terrible.

However, purely by coincidence, I was talking to Nick Brown about HRV — that Nick Brown — and he gave me a chance to look at the dataset.

And that’s where the trouble started.

Serious Problem #1: Data quality

The HRV data was full of numbers which made no sense. A real garden of rocks and weeds.

One participant had a resting HRV of literally zero (wildly implausible at rest, as even a heart transplant patient with no cardiac autonomic input has HRV due to the mechanical compression of the heart during the respiratory cycle). Another participant had a value less than 1. This is extremely unlikely unless something is badly wrong: say, accidentally running at full speed just prior to measurement, or a severe anxiety disorder.

A few participants had the other problem entirely: values between about 8000 and 24000. These are probably evidence of breathing changes as outlined in Problem #4 above; less likely, but possible, are plain data-quality errors, such as uncorrected ectopic beats.

One participant managed to go from an HF-HRV of a few hundred to over ten thousand during the experimental period.

This is most certainly and assuredly not some mystical force of positivity improving the ‘nervous system’. This is some kind of mistake, and one which adds an outlier to the group.

There were also a lot of dropouts, to the extent that of the 65 participants, only a hair over 50 had trustworthy before and after measurements. I think there was even a participant or two who only had an “after” measurement…?

(And bear in mind, this is the summary data. The raw data I never saw. Now, it might be fine… but, given the state of the summary data, I would guess that it is dark and full of terrors.)

All told, pretty woolly stuff. If someone in my old lab had brought me this dataset, the riot act would have been read.

As in, I would seriously question the dataset on the nature of the numbers alone.

Now, I’m fully aware that I’m a pedant, but even so…

Serious Problem #2: The magnitude of the increase in HRV in the meditation group is the same size as the magnitude of the decrease in the control group.

Right.

Despite the length and detail of the criticisms above, it’s possible that everything — including the data full of broken glass and bottle-caps and burning tyres — up until this point might be shrugged off on the basis of:

a) again, it’s not my problem, and
b) well, research isn’t perfect, so cheer up.

Certainly Kok et al. is not alone in using HRV methods badly; it has an extraordinary amount of company in that respect.

On a generous day, all of the above might just possibly be forgivable. Iffy, but hey — iffy happens. We often make our peace with iffy.

But not the next part.

I have graphed the basic descriptive statistics that are in the paper below, using the non-insane log-transformation as appropriate. The differences on the graph are simple paired t-tests (obviously this is not a sophisticated analysis, this is for illustrative purposes only).

Remember, there is a meditation group and a control group, and they are being measured before and after an intervention.

What do you notice? Well, the meditation group is going up — a little, not a lot. And for some unknown reason, the control group is going down.

Now, when you hide all this in a regression, the difference between the intervention group and the control group (regressed against baseline) limps up into some kind of difference (p=0.06).

But the intervention group itself is doing nothing.

HRV does not change over time; that is, meditation appears to have no strong effect. If we look at the extraordinarily basic question of “How many people actually increased their HRV over time in the meditation group?”, the answer is just over half: 14/26. Not much of an intervention.
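
If you want to check that figure yourself, the sign test is one line (under the null assumption that increases and decreases are equally likely):

```python
from scipy.stats import binomtest

# 14 of 26 meditators increased in HF-HRV: indistinguishable from a coin flip
print(binomtest(14, n=26, p=0.5).pvalue)  # ~0.85
```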

I should add here that this result is inviolate. You can remove or include any of the outliers you like, do the right or wrong adjustment for skew, use simple difference scores or regression, use parametric or non-parametric methods: the intervention did not reliably increase HRV. The p-value skitters around a bit (they always do), but under no circumstances does anything happen.

And while we’re at it, some mention and explanation of why the control group went down significantly is necessary — these people did nothing whatsoever in the experiment but show up at the start, live their lives as normal, and then show up at the endpoint. If your control group is decreasing significantly on your primary dependent variable, absent any hypothesis of why it might, you are 100% required to a) notice this, b) mention this, and c) attempt to explain it.

When you have a complicated bouillabaisse of a mediation model and a fancy explanation of an upward spiral of feelings vs. meditation vs. autonomic modulation etc. over time, and everything turns out to be wonderfully ‘significant’, but you fail to mention that:

a) your intervention to increase the primary outcome of interest didn’t do a damn thing, and also

b) your control group went down significantly on that outcome…

… then I have a problem.

There are lots more minor issues I would prefer fixed, but I’m sure you get the idea by now. These data-based observations were the bridge from ‘dear me, this is a bit slippery, oh well it’s someone else’s problem’ to ‘we should say something’.

It didn’t help that there was a substantial barrel of exclusively positive and uncritical reporting of this article, a large part of which glossed over the finer details and reported something along the lines of ‘positive-y nice-y things make your heart go all healthy’.

Now, I know what it’s like to have my work distorted by journalists, but this was insane:

In particular, the two researchers found, during a preliminary study they carried out in 2010, that the vagal-tone values of those who experience positive emotions over a period of time go up.
Dr Fredrickson and Dr Kok discovered that vagal tone increased significantly in people who meditated, and hardly at all in those who did not.
The Economist (hotlink might not work)
In contrast, participants in the waiting-list group showed virtually no change in vagal tone over the course of the study.
The Daily Mail (obviously no-one expects this to be accurate)

Let me be completely explicit about this: neither of those things happened. Demonstrably, absolutely, totally didn’t happen. But this is how it was sold.

And so it goes.

One final thing to address: some responsibility here rests with the journal. Absent the ability to see the raw data, which they probably didn’t have or ask for, if you’re going to provide peer review then occasionally authors who are applying specific techniques they’re not technically proficient in need protection from themselves. Of course, you can’t divest the responsibility completely (“Someone should have stopped me!”) but you can regard that responsibility as shared.

My first thought after ‘wow, the authors really goofed here’ is always ‘this paper went past two or more reviewers and an editor…?’

Next time in Part 2: Reactions to the letter, the present day, and how nothing whatsoever changes.

And, as always, more scallywag behaviour here.