Crossover: neuroscience meets data science
In this post, we’ll explore the crossroads of two important scientific disciplines: neuroscience and data science. Specifically, we’ll look at certain data science methods and tools that have enabled significant advancements in neuroscientific research.
Picture a typical day where you find yourself contemplating the dataset of electrophysiological recordings you obtained from an animal subject, maybe one of our close relatives: the rhesus macaque (Macaca mulatta). Your statistical background tells you that in order to understand the activity of the neuron population in a specific brain region, it’s essential to study the activity of a representative sample of it. Based on this, you opted to obtain extracellular recordings. This involves implanting microelectrodes, with tip diameters of 2 to 3 micrometers, in such a way that their final position often ends up in the spaces between several small neurons, allowing for the simultaneous recording of their activity.
Therefore, you can’t stop thinking about the fact that the voltage recorded by your electrodes actually contains the diffuse activity of several neurons. Again, your statistical background tells you that you can only get an idea of the population’s activity by analyzing the individual activity of the neurons that constitute that population. This leads to the question central to this post: How can you discern and segregate the activity coming from different neurons?
Deep in your heart, your great motivation comes from the fact that the recordings were made during a behavioral task that took you months to train the animal to perform. Perhaps the task involved choosing between various options, and as a result you suspect that this work will provide insights into the relationship between the activity of the brain area and the decision-making process. Or maybe the task involved retaining information over a period, and you hope to get clues about how this neuron population participates in memory processes. Regardless of the specific behavioral task, you know that if you execute your work correctly, someday people might see your name in cognitive neuroscience books.
So, coming back to the important question, let me inform you it was answered long ago through the development of a technique called Spike Sorting. However, before I rush into explaining this tool, allow me to rephrase the question as: is there some type of marker that allows us to distinguish one neuron from another? Fortunately, the answer to this question is a resounding yes! Just like people in society can be uniquely identified by some specific markers like fingerprints, neurons in the brain also have their own identifying markers.
To continue the explanation, let me invoke your neuroscience background for a moment. As we know, neurons are cells characterized by their ability to produce action potentials, or spikes, in response to certain stimuli. A spike is a voltage oscillation caused by a brief structural modification in the cell membrane, which generates a momentary change in the flow of ions between the inside and outside of the neuron. As we might imagine, the specific course of this voltage oscillation depends on the original membrane structure in the first place. But other factors also matter, such as the cell shape and size, the ionic composition of the extracellular space, and the distance between the cell and the recording site, among others. All these factors together make each neuron identifiable by the specific shape of the action potential it produces, as seen from the electrode. This is the principle behind the aforementioned Spike Sorting technique.
The Spike Sorting technique is composed of several ordered steps, all related to the standard procedures data scientists employ for insightful data analysis. These steps are: data filtering, spike detection and alignment, feature selection and extraction, and clustering. It’s worth mentioning that the specific algorithm(s) used in each step can vary from one version of the technique to another. The remainder of this post describes the Spike Sorting technique, using as a reference the version that I used during my bachelor’s thesis, which you can consult here.
The first step is a simple high-pass filter. Given that the spike shape serves as the marker and the recordings encompass a spectrum of voltage oscillations, the process involves discarding slow oscillations and retaining only the rapid oscillations that may be action potentials. A lower cutoff frequency of 300 Hz is commonly used. That is, we retain only voltage oscillations with frequencies above 300 Hz.
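To make the filtering step concrete, here is a minimal sketch in Python. It is not the exact filter from the thesis: the Butterworth design, the filter order, the 30 kHz sampling rate, and the function name highpass are all assumptions chosen for illustration. Only the 300 Hz cutoff comes from the text above.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def highpass(signal, fs, cutoff_hz=300.0, order=4):
    """Zero-phase Butterworth high-pass filter (design and order are assumptions)."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=fs, output="sos")
    # Filtering forward and backward avoids shifting the spikes in time
    return sosfiltfilt(sos, signal)


fs = 30_000                                  # assumed sampling rate in Hz
t = np.arange(fs) / fs                       # one second of synthetic data
slow = 100.0 * np.sin(2 * np.pi * 10 * t)    # slow 10 Hz oscillation
fast = 5.0 * np.sin(2 * np.pi * 1000 * t)    # fast 1 kHz oscillation
filtered = highpass(slow + fast, fs)

# Away from the edges, what survives should be close to the fast component
middle = slice(5000, -5000)
residual = np.max(np.abs(filtered[middle] - fast[middle]))
print(f"largest deviation from the fast component: {residual:.4f}")
```

The toy signal mixes a large slow wave with a small fast one, which loosely mimics slow field fluctuations riding under spikes. After filtering, only the fast component should remain.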
Once putative action potentials have been isolated from the filtered signal (typically as the moments where the voltage crosses an amplitude threshold), the next step is to align them within a common temporal window, just like when you line up a group of children to take their measurements. This alignment lets us analyze their features without the disturbance of the temporal distance between them. The result is a bunch of overlapped putative action potentials. Occasionally, plotting these overlapped spikes is enough to reveal distinct shapes, implying the presence of multiple neurons. To enhance comparability, this stage typically includes noise elimination to discard aberrant voltage oscillations occurring during the spikes. Also, to ensure continuous signals throughout the window period, interpolation algorithms are commonly employed to fill gaps caused by recording imperfections.
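A rough sketch of detection and alignment could look like the following. The detection rule (a negative threshold set at a multiple of a median-based noise estimate), the window lengths, and the name detect_and_align are assumptions for illustration, and the trace is synthetic. The noise elimination and interpolation mentioned above are left out to keep it short.

```python
import numpy as np


def detect_and_align(signal, fs, k=5.0, pre_ms=0.5, post_ms=1.0):
    """Cut a fixed window around each negative threshold crossing, aligned on the trough.

    Returns (waveforms, peaks): an (n_spikes, n_samples) array and the trough indices.
    """
    pre = int(pre_ms * fs / 1000)
    post = int(post_ms * fs / 1000)
    # Robust noise estimate from the median absolute value (assumed detection rule)
    sigma = np.median(np.abs(signal)) / 0.6745
    below = signal < -k * sigma
    # Samples where the signal first drops below the threshold
    crossings = np.flatnonzero(below[1:] & ~below[:-1]) + 1

    waveforms, peaks = [], []
    for c in crossings:
        # Align on the most negative sample shortly after the crossing
        peak = c + int(np.argmin(signal[c:c + post]))
        if peak - pre < 0 or peak + post > len(signal):
            continue  # window would run off the recording
        if peaks and peak - peaks[-1] < post:
            continue  # same event detected twice
        waveforms.append(signal[peak - pre:peak + post])
        peaks.append(peak)
    return np.array(waveforms), np.array(peaks)


# Synthetic filtered trace: unit-variance noise plus five negative deflections
fs = 30_000
rng = np.random.default_rng(0)
trace = rng.normal(0.0, 1.0, fs)
true_times = np.array([3000, 9000, 15000, 21000, 27000])
offsets = np.arange(-20, 21)
template = -15.0 * np.exp(-0.5 * (offsets / 4.5) ** 2)
for t0 in true_times:
    trace[t0 + offsets] += template

waveforms, peaks = detect_and_align(trace, fs)
print(waveforms.shape)
```

Each row of waveforms is one putative spike, and every row has its trough at the same position in the window. Plotting all rows on top of each other gives the overlapped-spikes view described above.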
Continuing with the children’s example, suppose that after we have collected all their measurements, we are asked to form the fairest possible teams for a basketball game. What we would need to do first is identify the measured variables most relevant to this goal. While features like hair color or favorite music band turn out to be unimportant on the playground, we can’t ignore the age and height of the children if we don’t want to end up with a team of 1.5-meter 10-year-olds against a squad of towering 2-meter 15-year-olds. A similar process is involved in the next Spike Sorting step. We need to ask ourselves: which features of the spike shapes are most important for telling them apart? Fortunately, the answer to this question is that we don’t exactly know, but we also don’t exactly care. This is thanks to an old friend named Principal Component Analysis, or PCA. This tool constructs a set of new features known as principal components, built as specific linear combinations of the original features and ordered by how much of the data’s variance they capture. So, although we might not know exactly which original features go into each component, or with what weight, we can be sure that their linear combination forms a very informative feature.
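Here is a small sketch of that idea, written with plain NumPy so the linear combinations are visible. Each aligned waveform is treated as a row whose original features are its voltage samples. The function name pca_features, the two synthetic spike shapes, and the choice of two components are assumptions for illustration.

```python
import numpy as np


def pca_features(waveforms, n_components=2):
    """Project waveforms onto their first principal components via SVD."""
    centered = waveforms - waveforms.mean(axis=0)
    # Rows of vt hold the weights that combine the original samples
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T          # new features, one row per spike
    explained = (s ** 2) / np.sum(s ** 2)     # fraction of variance per component
    return scores, components, explained[:n_components]


# Synthetic aligned spikes from two hypothetical neurons with different shapes
rng = np.random.default_rng(1)
n = np.arange(45)
template_a = -10.0 * np.exp(-0.5 * ((n - 15) / 3.0) ** 2)   # narrow, large spike
template_b = -6.0 * np.exp(-0.5 * ((n - 15) / 6.0) ** 2)    # wider, smaller spike
waveforms = np.vstack([
    template_a + rng.normal(0, 0.5, (100, 45)),
    template_b + rng.normal(0, 0.5, (100, 45)),
])

scores, components, explained = pca_features(waveforms)
print(scores.shape, np.round(explained, 2))
```

Each row of components is a list of weights over the 45 samples of a waveform, and each column of scores is the resulting new feature. On data like this, a scatter plot of the first two columns of scores would be expected to show two separate clouds of points.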
Once we have the features to take into consideration, it is time to group the spikes that share similar values. To achieve this, a clustering algorithm such as k-means can be employed. This is an unsupervised learning algorithm that requires the number of groups to look for as an input parameter. Therefore, we now face the question of how many neurons we suspect there are in the analyzed recording. Although at this point we potentially have visual insights from the overlapped spikes plot and from the behavior of the data in principal component space, it is important to be sure if we don’t want to mistake extraneous voltage oscillations for neuron activity. Luckily, there are algorithms to estimate how good different group numbers are. An example is the Silhouette metric: you run the clustering with several candidate cluster numbers and compute a score for each resulting grouping. The score is based on how similar each element is to the other elements of its own group (cohesion) and how dissimilar it is to the elements of the nearest other group (separation). It falls within the range of -1 to 1, where a higher value implies a better grouping for that specific cluster number.
After this last step, you have segregated your recording into the activity coming from different neurons. Once you apply this technique to all of your recordings, you’ll have a whole sample of the neuron population activity obtained during the behavioral task. From this point on, whether your name makes it into a cognitive neuroscience book will depend on the results of the statistical analysis of this data. But that… will be for the next chapter.
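The clustering step and the choice of cluster number can be sketched with scikit-learn as follows. The helper name choose_k, the range of candidate cluster numbers, and the three synthetic clouds of points (standing in for spikes in principal component space) are assumptions for illustration. Note that the Silhouette score needs at least two clusters, so the search starts at 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def choose_k(features, candidate_ks=range(2, 7), seed=0):
    """Run k-means for each candidate k and keep the best Silhouette score."""
    scores = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic features: three well-separated clouds, as if from three neurons
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
features = np.vstack([c + rng.normal(0, 0.5, (100, 2)) for c in centers])

best_k, scores = choose_k(features)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.2f}")
print("selected number of clusters:", best_k)

# Final grouping with the selected number of clusters
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(features)
```

On this toy data the highest score should land on three clusters, matching the three clouds that were generated. With real recordings the clouds usually overlap more, so it helps to check the scores alongside the overlapped spikes plot rather than trust them blindly.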
References
Barreras, J. C. (2021). Papel de la corteza premotora ventral en el procesamiento de memoria de trabajo somatosensorial y auditiva. México: UNAM, Facultad de Medicina. Tesis de Licenciatura en Neurociencias. https://ru.dgb.unam.mx/bitstream/20.500.14330/TES01000833713/3/0833713.pdf
Paraskevopoulou, S. E., Barsakcioglu, D. Y., Saberi, M. R., Eftekhar, A., & Constandinou, T. G. (2013). Feature extraction using first and second derivative extrema (FSDE) for real-time and hardware-efficient spike sorting. Journal of Neuroscience Methods, 215(1), 29–37. https://doi.org/10.1016/j.jneumeth.2013.01.012
Pedreira, C., Martinez, J., Ison, M. J., & Quian Quiroga, R. (2012). How many neurons can we see with current spike sorting algorithms? Journal of Neuroscience Methods, 211(1), 58–65. https://doi.org/10.1016/j.jneumeth.2012.07.010
Quiroga, R. Q. (2012). Spike sorting. Current Biology, 22(2), R45–R46. https://doi.org/10.1016/j.cub.2011.11.005