Weeding out noise with ML

Abuzar Mahmood
8 min read · Jul 29, 2022


Using XGBoost for improving data preprocessing and semi-automating user data selection

Dealing with any kind of “real-world” data requires extracting signal from noise, sometimes from a lot of noise. As the adage goes, “Garbage in, garbage out.” Any experienced data scientist will tell you that it is better to use a “bad” model with good data than a “great” model with bad data. But the time and effort required for this preprocessing and data cleaning step can be significant, and the cost is even worse when the cleaning has to be repeated for every new dataset.

An example of such a scenario comes from neuroscience research, where performing electrical recordings of neurons in the brain generates noisy data. Neurons (when active) produce exquisitely stereotypical responses that look like a sudden spike from low amplitude activity, known as an action potential (see figure below). To know when neurons are active, we must extract when these spikes happen in a process called spike-sorting (Ref. 1). However, although the figure below shows very “clean” spikes, this is rarely the case, and usually such data is riddled with noise of amplitudes similar to spikes, but with a different waveform shape (more on this later). Therefore, part of the process of spike-sorting is to also separate signal from noise.

A) Example electrophysiological voltage recordings showing spikes (action potentials) from 2 different neurons (blue and black) B) Example single spikes from corresponding colored traces in A (Image from Ref. 2)

This can be a laborious process, made worse if significant noise is present. An example schematic of this pipeline (the one we use in my lab) is given below. For the purposes of this article, we are primarily interested in steps 1, 2, and 3 — processing the collected data (but I encourage you to learn more about my lab’s research in Ref. 3).

Spike-sorting pipeline. Figure from Ref. 3

To briefly go over these steps:

  1. The electrophysiological signal is high-pass filtered to retain frequencies in the spike range and a threshold is placed to pull out “putative” spikes.
  2. The putative spikes are clustered automatically (we use Gaussian Mixture Models) using models with a range of clusters.
  3. Clustered spikes are selected manually, or further processed before selection by a user, as needed.
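As a rough sketch, the filtering, thresholding, and clustering in steps 1 and 2 might look like the following. Note that the cutoff frequency, threshold multiplier, window sizes, and BIC-based model selection below are illustrative assumptions, not the exact settings of our pipeline:

```python
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.mixture import GaussianMixture

def highpass_filter(signal, fs, cutoff=300.0, order=3):
    """Step 1a: retain spike-band frequencies (cutoff value is an assumption)."""
    b, a = butter(order, cutoff / (fs / 2), btype="high")
    return filtfilt(b, a, signal)

def extract_putative_spikes(filtered, n_std=5.0, win=(15, 60)):
    """Step 1b: pull out windows around negative threshold crossings."""
    thresh = n_std * np.std(filtered)
    crossings = np.where(filtered < -thresh)[0]
    if len(crossings) == 0:
        return np.empty((0, sum(win)))
    # Keep only the first sample of each crossing event
    keep = np.insert(np.diff(crossings) > win[1], 0, True)
    crossings = crossings[keep]
    return np.array([filtered[i - win[0]:i + win[1]] for i in crossings
                     if win[0] <= i <= len(filtered) - win[1]])

def cluster_waveforms(features, max_clusters=7):
    """Step 2: fit GMMs over a range of cluster counts, keep the lowest-BIC model."""
    models = [GaussianMixture(n_components=k, random_state=0).fit(features)
              for k in range(2, max_clusters + 1)]
    return min(models, key=lambda m: m.bic(features)).predict(features)
```

With a 15-sample pre-crossing and 60-sample post-crossing window, each putative spike is a 75-timepoint snapshot, matching the waveform dimensionality discussed later.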

Visually, this process is shown below. In Step 1, putative spikes (deflections which cross the threshold) are marked with red dots. Clustering of these spikes using a 3-component Gaussian Mixture Model is shown in Step 2. Finally, in Step 3, the red-outlined clusters are discarded, and the green-outlined cluster is retained as waveforms from a neuron.

HOWEVER, this process of manual validation and selection needs to be repeated for 64 or more channels (below). In cases like channel 01, we may not get any spikes, and in cases like channel 02, the unsupervised clustering may be unable to cleanly separate the spikes from noisy waveforms (blue-outlined cluster), requiring another round of clustering before the spikes are accepted.

At this point, we are fully aware that a lot of our effort is being spent on removing noise from our data, and there are concrete advantages to being able to 1) “weed out” noise, and 2) identify signal (that is, spikes). Formally:

  1. Computational resources and time are wasted in clustering “noise”. This can be mitigated by removing noise before clustering.
  2. If noise waveforms are robustly reduced, this will improve user-based data selection as well.
  3. If we can reliably flag channels with no spikes, the user doesn’t even have to go through those channels, again reducing the manual time and effort for this process.
  4. Finally, if we are confident in the predictions of the classifier, these predictions can also be used as a feature in the unsupervised clustering step, likely further improving the quality of the clustering.

This identification of signal vs. noise can easily be cast as a classification problem. However, the constraint is that we do not want to falsely discard any spikes (i.e. we must avoid false negatives). Hence, we will attempt to maximize recall (disregarding precision for now, since noise can still be discarded in the later manual step) and see where that gets us. Here, I chose to target a recall of 0.99.

The goal is not to build a perfect classifier that will remove all noise from the process, but to include an additional preprocessing step that will significantly improve the input data quality and reduce processing time (for both machine and user).

Building the classifier

Since each spike waveform is a temporal snapshot with fairly high dimensionality (75 timepoints per waveform), it is useful to perform dimensionality reduction before moving forward. Potential features for this classification task include:

  1. Amplitude of waveform
  2. Energy of waveform
  3. (PCA of) Power Spectral Density (useful for separating periodic noise from spikes)
  4. PCA of waveform
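For concreteness, the first three candidate features could be computed along the following lines (the sampling rate and the `welch` parameters here are assumptions):

```python
import numpy as np
from scipy.signal import welch

def candidate_features(waveforms, fs=30000.0):
    """Compute candidate features for each waveform (rows of `waveforms`)."""
    amplitude = waveforms.min(axis=1)        # trough amplitude of each waveform
    energy = np.sum(waveforms ** 2, axis=1)  # total energy
    # Power spectral density: periodic noise concentrates power in narrow bands
    _, psd = welch(waveforms, fs=fs, nperseg=waveforms.shape[1], axis=1)
    return amplitude, energy, psd
```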

Here, for simplicity, I opted to use the first 10 principal components of z-scored waveforms, which explained 95% of the variability in the data. Z-scoring removes amplitude information (which could itself be useful for classification); however, given how stereotyped spike waveforms are compared to noise, we can still expect good classification performance.
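A minimal sketch of this feature extraction, assuming the waveforms are stored as an (n_waveforms, 75) array:

```python
import numpy as np
from sklearn.decomposition import PCA

def zscore_pca(waveforms, n_components=10):
    """Z-score each waveform, then project onto the first principal components."""
    z = (waveforms - waveforms.mean(axis=1, keepdims=True)) \
        / waveforms.std(axis=1, keepdims=True)
    pca = PCA(n_components=n_components).fit(z)
    print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2f}")
    return pca.transform(z), pca
```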

I opted to use XGBoost for this problem because of its speed and power (Refs. 4,5). Furthermore, XGBoost can be piped into GridSearchCV from Scikit-Learn for hyperparameter tuning/model selection which makes this process even smoother.

Does it work?

The dataset I used here contained a total of ~2.5 million waveforms, with ~48% true spikes and ~52% noise (from manually labelled data). The data was split 0.50/0.375/0.125 into train/test/validation sets. Below is a plot of the probability of being classified as a spike, as predicted by the classifier on the validation set. The distributions for “True Spike” and “True Noise” are well separated, suggesting that the classifier learned good representations for each class.

Given the true labels, we can also determine a threshold that suits our conservative recall needs. To obtain a conservative classifier, I used the test set to titrate the decision threshold, obtaining one that gives 0.99 recall. Below, this threshold is applied to the validation set. We can see that given this threshold, we can discard 93% of the noise waveforms while still retaining 99% of the true spikes.
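One simple way to titrate such a threshold on the labelled test set is to take the probability value below which only 1% of true spikes fall (a sketch; the exact procedure used in the analysis may differ):

```python
import numpy as np

def threshold_for_recall(y_true, spike_probs, target_recall=0.99):
    """Return the highest probability threshold that still retains
    `target_recall` of the true spikes."""
    # Sort the probabilities of the true spikes; the threshold that keeps
    # 99% of them is (approximately) their 1st percentile
    spike_p = np.sort(spike_probs[y_true == 1])
    idx = int(np.floor((1 - target_recall) * len(spike_p)))
    return spike_p[idx]
```

The returned threshold is then applied to the held-out validation set to estimate how much noise is rejected at that recall level.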

Below are example waveforms for different levels of “Spike Probability”. We can see that even at probability=0.2, the waveforms look “spike-like” but the quality of spikes increases significantly in the probability=0.8–1 range (which is where most of the “true spikes” lie). Furthermore, we can see that waveforms in the 0–0.2 probability range are quite poor quality, which might make us feel better about discarding “true spikes” in that range as these might actually be noise waveforms that were mislabelled when the dataset was created.

Representative waveforms at different levels of “Spike Probability”. The ratio between the deflection amplitude and waveform noise increases with the probability

Although this is a preliminary analysis, these results are quite encouraging: we are able to discard a large fraction of the noise while negligibly impacting the “true signal”. Furthermore, naively benchmarking the classifier, prediction on ~2.5 million datapoints (comparable in size to most of our datasets) took ~22 seconds on a single processor thread. This makes the preprocessing step easy to incorporate into the data processing pipeline without significant overhead.

However, this classifier can also be useful at a higher level. While it is evidently effective at removing noise at the single waveform level, given that the classifier has accurately learned what a “true spike” is supposed to look like, we can use it to scan across all our channels and flag channels with no true spikes. This will allow users to skip spike-sorting of those channels altogether.

While this may not initially seem like an issue worth tackling, if we look at the fraction of channels per dataset containing neurons (below), we see that while the distribution is fairly broad, the median fraction is fairly low (~0.3). This means that MOST channels do not contain waveforms from neurons (i.e. significant counts of “true spikes”). Hence, flagging these channels will make the process significantly faster.

We can perform this flagging similarly to the waveform classification above: iterate over channels and flag those with fewer predicted true spikes than a certain threshold. As a first pass, we will forego optimizing this required count and use my lab’s convention of at least 2000 waveforms. While I will not go into the details of our experimental paradigm here, most of our experiments last 40 minutes; 2000 waveforms over that time corresponds to ~0.83 spikes/second, and anything slower is usually not useful for our purposes, and hence can be ignored.
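The flagging step can then be sketched as a simple count of predicted spikes per channel (the dictionary layout and names here are hypothetical):

```python
import numpy as np

def flag_channels(channel_probs, prob_thresh, min_spikes=2000):
    """Flag channels whose predicted true-spike count falls below `min_spikes`.
    `channel_probs` maps channel id -> per-waveform spike probabilities."""
    flagged = []
    for channel, probs in channel_probs.items():
        n_predicted_spikes = int(np.sum(probs >= prob_thresh))
        if n_predicted_spikes < min_spikes:
            flagged.append(channel)
    return flagged
```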

Using the same classification threshold determined from our previous analysis, we test how well our classifier can flag ~1700 channels worth of labelled data (with 28% of channels containing neurons, and the rest containing no discernible neurons, just noise). The previously determined threshold again gives us a recall of 0.99 (i.e. we correctly recovered 99% of electrodes containing at least 2000 “true spikes”). Despite this high recall, our true negative rate is 58% (i.e. P(pred=False | spike=False) = 0.58). Since 72% of the channels contain only noise, this means the classifier correctly discarded 42% (0.58 × 0.72) of all channels. Hence, the user will be able to skip fruitlessly inspecting 42% of the channels, on average, in a single dataset.

Conclusion

In the “Big Data” age, as we strive to collect more and more data, our ability to manually inspect such datasets continually decreases. This issue is further compounded in cases where the task of the user is to manually separate signal from noise (as is true for most versions of spike sorting). However, in cases like this, we can rely on machine learning approaches to semi-automate our work.

In this article, I showed how we can use XGBoost to 1) reduce noise present in neural electrophysiological data to reduce downstream processing time and effort, and 2) appropriately flag sections of the data so that the user can avoid manually inspecting sections without relevant “true” signal.

This preprocessing step will soon be added to the Katz Lab spike-sorting pipeline (Github). Code for this analysis is available here.

My sincere gratitude to Nishaat Mukadam, Jian-You Lin, and Hannah Germaine for their helpful feedback on this article.

References

  1. http://www.scholarpedia.org/article/Spike_sorting
  2. Differentiation and Functional Incorporation of Embryonic Stem Cell-Derived GABAergic Interneurons in the Dentate Gyrus of Mice with Temporal Lobe Epilepsy — Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Electrophysiology-of-transplanted-neurons-into-the-host-brain-circuitry-A-The-top-trace_fig7_51983957 [accessed 26 Jul, 2022]
  3. Mukherjee N., Wachutka J., Katz D.B. Python meets systems neuroscience: affordable, scalable and open-source electrophysiology in awake, behaving rodents. Proceedings of the 16th Python in Science Conference. 97–104
  4. https://xgboost.readthedocs.io/en/stable/
  5. https://en.wikipedia.org/wiki/XGBoost
