The need for data quality testing at the analysis level — A case study using neural data.

Abuzar Mahmood
Mar 7, 2023 · 12 min read


When analyzing data, we usually have a battery of tests to make sure that the fundamental properties of the data are sound: checking that the general statistics of the variables are within range, that the structure of the data looks right, that missing values have been dealt with, and so on. This is slightly more challenging when working with time-series data, since imputing missing values and detecting outliers depends on what came before and after a particular data point; nonetheless, there are adequate ways to handle each of these issues and end up with structurally sound data to analyze.
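As a toy illustration of such structural checks (the table, column names, and thresholds below are hypothetical, purely for the example):

import pandas as pd

# Hypothetical trial metadata; column names and limits are made up for illustration
df = pd.DataFrame({
    "neuron_id": [0, 0, 1, 1],
    "trial": [0, 1, 0, 1],
    "firing_rate_hz": [12.5, 8.0, 3.2, 4.1],
})

# Values within a plausible range
assert df["firing_rate_hz"].between(0, 200).all(), "firing rates out of range"
# No missing values
assert df.notna().all().all(), "missing values present"
# Expected structure: one row per neuron per trial
assert not df.duplicated(subset=["neuron_id", "trial"]).any(), "duplicate rows"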

However, when dealing with rich datasets, making sure your data look “structurally” sound might not be enough. The first-order statistics of the data might be perfectly within range while the higher-order structure is off. This leads to another issue: if you are trying to investigate the data in a particular way, then without a test specific to your analysis you will not be able to say whether a negative result is due to the data or the concept. Certainly, all datasets are not created equal, and in certain situations it can be paramount to know which datasets are helpful and which are not. This is explained more below.

A quick aside: this might seem like I am advocating for cherry-picking data. What I am actually advocating for is making sure that the data you have even allow asking the question at hand, i.e. that the necessary conditions for your analyses are fulfilled, which is not a given, particularly for sparsely sampled data. This is certainly a grey area, and domain expertise is required to make sure that one is fulfilling a necessary condition rather than cherry-picking, a difference I hope to illustrate through this case study.

A (very) quick introduction to neural dynamics

Before we dive into the case study itself, I will briefly describe the type of data it uses.

To start at the basics, neurons in the brain respond to different stimuli by “spiking”. This is shown below for a touch-responsive neuron in the cortex, which responds strongly to touch at a specific point on the arm. The plots on the right of the figure are voltage recordings from a single cortical neuron, showing large deflections in its voltage when the receptive field is touched. Hence, the measure of a neuron’s activity is its “firing rate”, usually given in spikes/second, or Hz.

However, neurons are more than just biological transistors, and rather than simply turning on and off, they show rich dynamics in their firing rates (or neural activity, which I will use interchangeably), which is exactly what my lab studies in the context of how the brain processes taste.

The figure below shows the response of simultaneously recorded neurons from the gustatory (taste) cortex of a rat immediately after the rat was given some sugar solution (the stimulus). Each row in a single panel represents a different neuron, each tick mark represents a spike from that neuron at the given time, and each panel represents a repeated taste delivery (a trial). We can clearly see that the firing rates of the neurons are changing over time. We can describe such changes in many ways; the one shown here uses a Hidden Markov Model (HMM) to say when the population of simultaneously recorded neurons transitions from one set of firing rates (a state) to another, where the colors denote different states and the lines mark the probability of a given state at any time.

Figure from: Sadacca BF, Mukherjee N, Vladusich T, Li JX, Katz DB, Miller P (2016) The Behavioral Relevance of Cortical Neural Ensemble Responses Emerges Suddenly. J Neurosci 36:655–669.

However, as is common in polished publications and reports, the figures shown are generally the prettiest/cleanest version of the data. The neurons above clearly demonstrate the dynamics, but all datasets are not created equal. For someone investigating neural dynamics, the job becomes much harder if an (otherwise perfectly valid) dataset simply does not contain neurons that are taste-responsive or that show dynamics in their activity (an outcome I have encountered multiple times before). Hence, we come to the meat of this article.

Analysis specific testing for neural dynamics

My lab studies how neural dynamics are related to taste processing and how such dynamics change under different contexts (e.g. as the animal develops an aversion to a given tastant). However, as alluded to above, not all neurons respond to tastes, and not all neurons show dynamics in their activity: there is a fairly broad distribution for both the responsiveness and dynamicity of neurons, and one can always draw a biased sample in which most of the recorded neurons are neither responsive nor dynamic. Such a biased set of neurons would make it painfully hard to actually study dynamics.

However, maximum-likelihood models of dynamics like the Hidden Markov Model do not explicitly tell you when they have poor signal; that is, if an HMM shows you state transitions, it is unable to give you the confidence of its predictions. Hence, it might still tell you there are transitions in your biased “undynamic” dataset, even if that is a poorly supported inference. This means that unless you can pre-screen the data to match the requirements of an HMM, you cannot be confident in the outcomes of your inference. (Bayesian models do provide an uncertainty estimate, which we will make use of later, but there is one more issue we have to account for.)

Another “problem” with a neural population that could hamper inference of dynamics is inconsistency across multiple trials, i.e. neurons that do not show a consistent pattern of firing dynamics across multiple repetitions. This is a widely observed phenomenon in neural activity; however, it is virtually a deal-breaker for models like HMMs, which assume consistency across the multiple timeseries they are trained on. The reason is that, without explicitly accounting for inhomogeneity in the training data, many models regress to the mean, and, as shown in the figure below, if the clusters of data points (in our case, separate trials) are different enough, the mean does not reflect most of the dataset.

For multi-modal data (in other words, data with clusters), the mean of the distribution is unable to reflect the actual data well (Image from: https://www.statology.org/bimodal-distribution/)
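A toy numerical example of this (with made-up numbers): if half of our trials cluster around one value and half around another, the grand mean lands in between and describes almost none of the actual trials.

import numpy as np

# Two well-separated clusters of "trials" (illustrative numbers only)
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=2.0, scale=0.3, size=500)
cluster_b = rng.normal(loc=10.0, scale=0.3, size=500)
data = np.concatenate([cluster_a, cluster_b])

# The mean (~6.0) sits between the clusters, far from both 2.0 and 10.0
print(data.mean())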

In summary, our two conditions for performing HMM modelling are:

  1. Making sure the neural population actually has dynamics to model
  2. Making sure the neural population has consistent activity across trials

Both of these are addressed by regressing the neural activity to a template of the dynamics we expect to see.

Again, I note that selecting datasets with “good” dynamics is not cherry-picking in this case: we are not trying to prove the existence of these dynamics (or even quantify the extent to which they exist); rather, our analyses already assume these dynamics exist, and without them the analysis would fall apart.

Template Similarity as a metric for neural dynamics

The HMM figure above shows that state transitions happen at different times on different trials. While that is certainly true, the range in which these transitions can happen is reliably constrained. This means that we can represent an average set of states, or an idealized trial, as below. We will try to see how well the average firing rates for a neural population match these templates.
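For concreteness, here is a minimal sketch of one way such templates could be built: piecewise-constant (boxcar) indicators, one per state, laid out along the trial time axis. The number of time bins and the epoch boundaries below are made up for illustration, not the values used in the actual analysis.

import numpy as np

# Hypothetical epoch boundaries (in time bins); illustrative only
n_time = 200
epoch_edges = [0, 40, 90, 150, 200]          # 4 states across the trial
n_templates = len(epoch_edges) - 1

# template ==> (# of templates) x time, each row a boxcar for one state
template = np.zeros((n_templates, n_time))
for i, (start, stop) in enumerate(zip(epoch_edges[:-1], epoch_edges[1:])):
    template[i, start:stop] = 1.0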

Below are the average firing rates for two example neural populations whose dynamicity we want to determine. At first glance, we can see that the firing rates of Population A vary much less strongly (i.e. are less dynamic) than those of Population B. Looking at the z-scored firing rates, Population A certainly has changes in firing rate over time, but much weaker ones than Population B. Visual observation is great, but how do we quantify this?

Top row: Raw firing rates. Bottom row: Z-scored firing rates for the same populations as the top row.

A rather straightforward (if perhaps TOO simple) way would be to regress the template against the population activity and see how well they match. This can be done easily using some linear algebra. We will regress the template to all our trials simultaneously (rather than doing something like regressing to the trial-averaged firing rates); in this manner, a good regression will mean that our data both show dynamics and are consistent across trials. The figure below shows the complete data being regressed.

15 concatenated trials of template dynamics (top) and actual data (bottom)
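To make the shapes concrete, here is a minimal sketch of that concatenation step; the simulated spike counts and array names are placeholders, not the actual dataset.

import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_trials, n_time = 12, 15, 200

# Placeholder data standing in for the recorded firing rates: (# of neurons) x (# of trials) x time
firing_per_trial = rng.poisson(3.0, size=(n_neurons, n_trials, n_time)).astype(float)

# Concatenate trials along the time axis so one regression covers all of them
firing = firing_per_trial.reshape(n_neurons, n_trials * n_time)

# Tile the single-trial template (from the sketch above) to match the concatenated data
template = np.tile(template, (1, n_trials))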

We can perform this regression as follows:

import numpy as np

# template ==> (# of templates) x time, with trials concatenated along the time axis
# firing   ==> (# of neurons) x time, concatenated the same way
#
# We want the projection (regression) matrix Beta such that:
#     firing = Beta @ template
# so Beta has dimensions (# of neurons) x (# of templates).

# Multivariate linear regression via the normal equations
# (firing . template' . inv(template . template')):
Beta = firing @ template.T @ np.linalg.inv(template @ template.T)

Once we have our projection (regression) matrix, we can see how well our data match the expected template.

At this juncture, we could have certainly used non-linear regression to potentially improve our fit, but as we will see, the linear version gets the job done.

The most straightforward way to check this match would be to look at the R² value for the fit. While that approach is perfectly valid, here I will opt to instead look at the reconstruction accuracy of the template given our data — which is a direct way of investigating how much of our desired dynamics are present in the neural population activity. The advantage of this approach is that it will tell us how much each individual template (and hence each period of activity) is represented in the neural activity. We will use this information to give a higher “score” to neural populations which represent all templates well, rather than neural populations which very strongly recapture a single template but only poorly recapture the rest of them.

Once we reconstruct the template dynamics following our regression, we can score the reconstruction by taking the dot product between our original template and the reconstruction (which, with both normalized, gives us a metric between 0 and 1). These operations can be performed as follows:

# Given:
#     firing = Beta @ template
# we can reconstruct the template from the data by inverting Beta.
# Beta is (# of neurons) x (# of templates), i.e. non-square, so we use the
# Moore-Penrose pseudo-inverse rather than a plain inverse.
template_reconstruction = np.linalg.pinv(Beta) @ firing

# Score each template by the dot product between the unit-normalized original
# template and its unit-normalized reconstruction (one score per template;
# well-reconstructed templates score close to 1).
def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

reconstruction_score = np.sum(unit(template) * unit(template_reconstruction), axis=-1)

The results of performing these operations on our populations of interest are as follows. We can see that both populations were able to reconstruct the templates, albeit to different degrees. While neither reconstruction was perfect, that of Population A was particularly poor and dominated by noise. This leads to much lower reconstruction scores for Population A compared to Population B.

1st Row: Raw Firing Rates, 2nd Row: Z-scored Firing Rates, 3rd Row: Heatmap of template reconstruction, 4th Row: Line plots of template reconstruction, 5th Row: Reconstruction Scores per template

As I mentioned earlier, we want to aggregate the reconstruction scores for each template into a single metric such that we favor populations with high scores across all templates as opposed to populations which have skewed scores (high for some templates and low for others). One such metric would be the magnitude of the score vector for all templates i.e. || (s_1, s_2, s_3, s_4) ||. As an illustrative example, a population with perfect reconstruction for a single template but pathetic reconstruction for the others — (1,0,0,0) — would have a magnitude of 1, whereas to reach a similar magnitude with evenly distributed scores across all templates, one would only need each template to have a similarity of 0.5, i.e. (0.5,0.5,0.5,0.5).
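A quick sketch of this aggregation, using the example score vectors from the paragraph above:

import numpy as np

# Aggregate per-template scores into a single dynamicity metric: the vector magnitude
def aggregate_score(scores):
    return np.linalg.norm(scores)

print(aggregate_score([1.0, 0.0, 0.0, 0.0]))    # 1.0: a single dominant template
print(aggregate_score([0.5, 0.5, 0.5, 0.5]))    # 1.0: the same magnitude with evenly spread scores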

All this being said, we want more than plots to make sure we’re on the right track. So…

How are we sure this is working (Intro)

This question can be further broken down into two parts:

  1. How are we sure this method is really able to detect the dynamics like we expect it to?
  2. How are we sure this metric is actually useful?

The first question asks whether the metric works the way we expect it to; the second asks whether having this working metric is actually useful to us in our downstream analyses.

How are we sure this is working (Part 1)

The first question is partially answered with the plot above which shows us that the population we can visually see dynamics in (population B) performs significantly better than its counterpart. When we perform this analysis for multiple datasets, we obtain the following plot:

The plot above shows that a single session (a single population) can exhibit different levels of dynamics depending on the taste that was given (each population is recorded responding to 4 different tastes). Hence, it might be prudent to filter data (even for the same population responding to different tastes) if we’re interested in population dynamics.

Another interesting observation is that the reconstruction similarity (i.e. the “dynamicity” of the population, since strong similarity to the templates means that the population activity is highly dynamic) is largely independent of neuron number. This is important, as it repudiates the idea that we’re simply picking up on noise: recording more neurons does not, by itself, give us more information about the dynamics of the population. The figure below plots dynamicity (reconstruction similarity) against neuron count and highlights the highest and lowest values. We can see that the population with the highest dynamicity actually has fewer neurons than the population with the lowest dynamicity.

Another visual we can generate uses the similarity scores to pull out single neurons whose activity patterns match the templates. The plots below show the activity of different neurons, with each row containing the neurons that match a given template best. Clearly, this tells us our analysis is doing what we want it to, but is it actually useful for HMM modelling, as we hope?

Timeseries of single neuron responses. Each row contains neurons best matching with a given template. Yellow rectangles indicate the time period which the template represents.

How are we sure this is working (Part 2)

Now that we’re confident this metric actually tells us when a population is more dynamic, does it actually help us with our modelling? This question would actually be difficult to answer using a maximum-likelihood HMM (because the model itself is not explicitly sure how well it is doing), but we can instead use a Bayesian changepoint model to see whether the model is more confident detecting changes in a more dynamic population. This outcome would hold equally true for an HMM.

This is indeed what we find. When we compare the uncertainty of the changepoint model (variance of the posterior distribution for each changepoint) to the dynamicity of the population, we find a strong and statistically significant negative relationship, i.e. the model becomes less uncertain about where changes are happening when performing inference on a more dynamic population.
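As a rough illustration of where that uncertainty estimate comes from, here is a minimal single-neuron, single-changepoint sketch in PyMC. This is not the actual multi-neuron model used in the analysis; the simulated data, priors, and variable names are all assumptions made for the example.

import numpy as np
import pymc as pm

# Simulated spike counts for one neuron on one trial (stand-in for real data)
rng = np.random.default_rng(0)
spike_counts = np.concatenate([rng.poisson(2.0, 40), rng.poisson(8.0, 60)])
n_bins = len(spike_counts)

with pm.Model():
    # Changepoint location and the firing rates before/after it
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_bins - 1)
    rate_early = pm.Exponential("rate_early", 1.0)
    rate_late = pm.Exponential("rate_late", 1.0)
    rate = pm.math.switch(np.arange(n_bins) < tau, rate_early, rate_late)
    pm.Poisson("obs", mu=rate, observed=spike_counts)
    trace = pm.sample(1000, tune=1000)

# Spread of the posterior over the changepoint location: a direct measure of how
# confident the model is about where the transition happens
changepoint_uncertainty = trace.posterior["tau"].std().item()

The more dynamic the population, the tighter this posterior should be, which is exactly the negative relationship described above.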

These changes are visually observable when comparing the most vs. least dynamic populations and the neural activity across their inferred transitions. Below, comparing these two populations, we see that for the “Most Dynamic” population, there are multiple neurons which show a sudden increase or decrease in their firing rate across the changepoint (highlighted in yellow) whereas no such changes are clearly visible in the “Least Dynamic” population.

Conclusion

Through this article, I hope I have been able to make the case for analysis-level testing of data. While it is indispensable to sanity-check our data (make sure that values are within range, missing values are dealt with, and outliers are handled), that may not be enough, especially when data collection is costly and you may run into sparse samples that are heavily biased. In these instances, it is imperative to make sure that the data can actually answer the question you are asking. Domain knowledge is required to develop tests which perform such filtering without biasing the outcome of the questions being asked. However, when such filtering is possible, as demonstrated here, it can significantly improve the quality of the data being used and the inference in downstream analyses.
