Retrieving Audio Information from Brain Responses with Computer Vision

Adolfo Ramirez-Aristizabal
Published in Labs Notebook · 10 min read · Sep 15, 2022
125 channels of brain data for 1 second of listening to ‘Bonobo — First Fires’

The near future of wearable technologies consists of sensors that interpret information from your brain, much like a smartwatch interprets someone’s health and emotional states. Such technologies are developing with the promise to seamlessly integrate users into the metaverse, as well as to include users with limited motor capabilities. Given the historically extensive investment in Computer Vision (CV) research, a handful of recent studies have leveraged that investment to provide a proof of scalability. This approach increases the efficiency of modeling Electroencephalogram (EEG) data, i.e., voltage potentials of the brain recorded at the scalp, by shifting from what was traditionally treated as time-series processing to formatting EEG data as images that serve as inputs to state-of-the-art CV models. Here the EEG data of focus is that which corresponds to acoustic stimuli, such as music listening. This shift in data processing affords an immediately practical way to scale up applications in Brain Computer Interface (BCI) technologies.

Acoustic Information Retrieval

When processing EEG data, the first step is to clean the signal of artifacts and environmental noise. Cleaning methods include typical time-series processing aimed at isolating the signal of interest, which in this case is brain activity. After cleaning, the next question is whether that information can be used to predict what people are thinking, or how their brain signals correlate with some stimulus, e.g., music or speech.

EEG data collection varies in how many experimental parameters are imposed to elicit a desired response. This spans anywhere from highly parameterized experiments, where participants are presented with a ~1 second vocalization 100 times at specific presentation intervals while also being tasked to track visual stimuli, to simply instructing a participant to stare at a blank screen while they listen to a song.

Whether your data collection paradigm includes a handful or many experimental controls, the act of then processing those brain signals to gather insights falls under the umbrella of Information Retrieval, and specifically Acoustic Information Retrieval for brain responses corresponding to acoustic stimuli.

Visualization of 1 minute of single-channel EEG data being processed by Deep Neural Networks.

Feature Extraction for EEG in ML/DL

Given that EEG data is a collection of voltage potentials over time across recording channels located on the scalp, the instinct of researchers has been to process that data as a time-series. Feature extraction is often needed to facilitate Machine Learning (ML)/Deep Learning (DL) modeling. For example, you can simply pass EEG channel data one by one to models. Let’s say that you have a system with 64 electrode channels; this means you now have 64 time-series per participant and per stimulus condition as input to the model, as in the sketch below.
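As a minimal sketch of that per-channel formatting, assuming a hypothetical recording of 64 channels sampled at 125 Hz (both numbers are illustrative, not tied to any dataset), the recording simply gets sliced into one time-series per channel:

```python
import numpy as np

# Hypothetical EEG recording: 64 channels, 5 seconds sampled at 125 Hz.
n_channels, sfreq, seconds = 64, 125, 5
eeg = np.random.randn(n_channels, sfreq * seconds)

# Passing channels one by one: each channel becomes its own 1-D time series,
# so a single recording yields 64 inputs per participant and stimulus condition.
per_channel_inputs = [eeg[ch][np.newaxis, :] for ch in range(n_channels)]
print(len(per_channel_inputs), per_channel_inputs[0].shape)  # 64 (1, 625)
```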

If having this many repeated measures is too much, you can also create a grand average across all recording channels. It may also be useful to consider that certain channels are more relevant than others because they sit over relevant brain regions. Clustering methods such as k-means can take all of your EEG channels and organize them into clusters based on their scalp topography. With this you can simply pass to the model the EEG channels in specific clusters, such as those over the frontal or occipital cortex. Both ideas are sketched below.
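Here is a minimal sketch of both ideas, assuming a hypothetical (channels x samples) EEG array and a 2-D array of electrode scalp coordinates (e.g., taken from a standard montage); the cluster count and which cluster counts as “frontal” are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical inputs: eeg is (64 channels x samples); positions holds each
# electrode's 2-D scalp coordinates, e.g., from a standard montage file.
eeg = np.random.randn(64, 625)
positions = np.random.rand(64, 2)

# Option 1: a grand average across all channels -> one time series.
grand_average = eeg.mean(axis=0)

# Option 2: cluster channels by scalp topography and keep one cluster,
# e.g., the channels grouped over frontal sites.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(positions)
frontal_like = eeg[kmeans.labels_ == 0]  # pick the cluster of interest
```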

Selecting channels by their scalp locations can be arbitrary, so clustering methods can instead consider the amplitude of the signal or its energy at specific frequencies. You can also move away from channel space entirely and decompose the data into a matrix of independent components, or, if you prefer to reduce dimensionality, select the components that explain the most variance. The above-mentioned techniques can also be applied once the signal is processed in the frequency domain, where it is represented as frequency content over time. A sketch of the decomposition route follows.
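A minimal sketch of that decomposition route, assuming the same hypothetical (channels x samples) array; the component counts and the 95% variance threshold are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

eeg = np.random.randn(64, 625)  # hypothetical (channels x samples) array

# Independent components: decompose channel data into a component matrix.
# FastICA expects (samples x features), so channels are treated as features.
ica = FastICA(n_components=20, random_state=0)
components = ica.fit_transform(eeg.T).T   # (20 components x samples)

# Dimensionality reduction: keep the components explaining the most variance.
pca = PCA(n_components=0.95)              # keep 95% of the variance
reduced = pca.fit_transform(eeg.T).T
```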

For ML models the feature extraction process can be exhaustive, since every extracted feature can follow its own combination of the processing steps outlined above, and it is up to the researchers to validate how many and which features are necessary. DL approaches, on the other hand, offload feature extraction to the neural network layers that encode the input into a condensed representation. Even so, many researchers still find it necessary to transform the signal into a more accessible representation first. DL architectures used to process EEG inputs formatted as activity over time include subclasses of Recurrent Neural Networks (RNNs), which explicitly learn the temporal dependencies of their inputs. Other classes of architectures, such as Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Transformers, only care that the input is formatted as a 1- or 2-dimensional input vector. Unlike RNNs, these architectures learn spatially invariant relationships in their inputs.

Music classification from brain responses using CNNs.

EEG Studies

Despite the dependence on feature extraction, previous approaches have come a long way in retrieving information from brain responses. For example, researchers have found ways to optimize deep brain stimulation systems used on patients with Parkinson’s disease to lessen side effects and improve the battery life of their brain interfaces. They collected electrical signals with invasive implants directly on the brain. That data was then processed as single-channel inputs to 1-dimensional CNN layers, letting the model learn to extract features from the signal itself before passing them to an RNN. Performance reached 88% accuracy as the model recognized when a patient was pressing a button, reaching for a target, or speaking. Furthermore, a study by Lawhern et al. (2018) developed EEGNet, a model that uses 2-dimensional CNNs to process EEG data better than out-of-the-box state-of-the-art algorithms. Such a method takes advantage of EEG channel topography, as it formats the data to be processed as a channel-by-time matrix, and it generalizes to various data collection procedures. A sketch of that channel-by-time formatting is shown below.
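The sketch below illustrates the channel-by-time idea in PyTorch: an EEG trial is kept as a 2-D (channels x samples) matrix and treated as a one-channel image, with a temporal convolution followed by a spatial convolution across the scalp. The layer sizes are illustrative and do not reproduce the published EEGNet configuration:

```python
import torch
import torch.nn as nn

class ChannelTimeCNN(nn.Module):
    """Toy channel-by-time CNN; sizes are illustrative, not EEGNet's."""
    def __init__(self, n_eeg_channels=64, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=(1, 32), padding=(0, 16)),  # temporal filters
            nn.Conv2d(8, 16, kernel_size=(n_eeg_channels, 1)),      # spatial filters across the scalp
            nn.BatchNorm2d(16),
            nn.ELU(),
            nn.AdaptiveAvgPool2d((1, 8)),
        )
        self.classifier = nn.Linear(16 * 8, n_classes)

    def forward(self, x):                 # x: (batch, 1, channels, samples)
        return self.classifier(self.features(x).flatten(1))

trial = torch.randn(1, 1, 64, 125)        # one EEG trial as a grayscale image
logits = ChannelTimeCNN()(trial)          # (1, 4) class scores
```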

In the category of acoustic stimuli, Stober et al. (2014) presented participants with 24 different 100 ms long sinusoid rhythms originating from tribal cadences. Without depending on feature extraction of the brain signals, their model could recognize which specific cadence participants were listening to at 24.4% performance. When classifying speech stimuli, a study by Moinnereau et al. (2018) showed 83.2% classification accuracy. Their RNN architecture, which did depend on feature extraction, was tasked with recognizing from brain signals when someone was listening to the English vowels (~0.5 sec) ‘a’, ‘i’, and ‘u’. Yu et al. (2018) took this further by classifying 8 different types of (~10 sec) vocalizations at ~81%. Interestingly, their analysis showed that without relying on feature extraction performance was ~61%, but it jumped ~20% when depending on features extracted from the stimulus. These studies have since become precursors for the explicit Computer Vision approaches used by newer studies.

Examples of EEG as images used explicitly in Computer Vision. Left: an EEG image (125 x 125) of raw voltage without feature extraction. Right: an EEG image (63 x 125) where each channel’s spectral information was extracted and formatted into a rectangular image input.

EEG as Images

Classification

The previous examples of processing EEG data with DL approaches were a step toward a fully end-to-end procedure. In some cases, researchers understood that the strength and practicality of 2-dimensional CNNs would fit their data but were not able to develop end-to-end approaches. But why is it important for these models to be end-to-end?

Prior studies have been able to develop strong proof-of-concepts without their modeling needing to be end-to-end. But for the purposes of scaling these insights into usable applications, many of these studies fall short in replicability, efficiency, and naturalistic use context. The more parameterization and validation steps in the data collection, feature extraction, or data modeling, the harder a study becomes to reproduce outside the ‘chef’s kitchen’. This puts into question whether such findings can be scaled into consumer use cases: the more a study depends on careful feature extraction, the narrower its use cases become. We can also see how this affects the efficiency of applications if the data, much like the food at pop-up Michelin star restaurants, can only be handled at such a highly specialized scale. Lastly, a core concern is whether the methodology allows the data to be more naturalistic and better match scenarios of everyday experiences.

Sonawane et al. (2021) addressed many of these concerns by comparing the performance of modeling procedures that treat their EEG data in the time domain versus as a type of spectral image. They collected data from participants passively listening to 12 familiar songs for ~2 minutes per song. Given ~1 second of EEG data, their strongest model could recognize which song participants were listening to at 84.96% accuracy. Performance dropped drastically to ~9% when the EEG data was not presented with its extracted frequency information.

This was taken a step further in another study where the researchers developed end-to-end models for music recognition, much like a Shazam app but for your brain (Ramirez-Aristizabal et al., 2022). The researchers explicitly studied CV modeling procedures, as the architecture of their main model (~88.69%) followed that of AlexNet, a historically state-of-the-art image processing model. They also demonstrated the capability for transfer learning (~93.05%) with an ImageNet-pretrained ResNet-50, which is commonly used in industry applications for image classification. The strongest result showed that the data did not need feature extraction steps: high performance was achieved simply by formatting the data as a [Channel x Samples] matrix, which the models interpret as a grayscale image. Furthermore, the dataset used in this study came from brain responses to 10 unfamiliar songs of 4-minute length, making the data a tougher test case along those dimensions and a better match to a naturalistic scenario of music listening.
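As a hedged sketch of that transfer-learning setup (not the authors’ exact pipeline), an ImageNet-pretrained ResNet-50 from torchvision can be repurposed for EEG ‘images’: the grayscale [Channel x Samples] matrix is replicated to three channels and the final layer is swapped for a 10-song classifier. The input size and batch are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-50 (downloads weights on first use)
# and replace its final layer with a 10-song classification head.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = nn.Linear(resnet.fc.in_features, 10)

eeg_image = torch.randn(8, 1, 125, 125)   # batch of EEG trials as grayscale images
rgb_like = eeg_image.repeat(1, 3, 1, 1)   # replicate to 3 channels for the RGB stem
logits = resnet(rgb_like)                 # (8, 10) class scores, ready for fine-tuning
```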

Reconstruction

Moving beyond classification, other studies have sought to recreate the audio signal participants listened to, using their EEG as input to a model. This requires mapping some input to an image target, typically the time-aligned Mel-Spectrogram image of the audio. Ofner & Stober (2018) developed a model to reconstruct music spectra from participants’ brain responses. The method succeeded in establishing a proof-of-concept that music information beyond the simpler classification case could be retrieved from brain responses, but it depended on feature extraction inputs from the stimuli along with more complex training procedures.
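For illustration, a time-aligned Mel-Spectrogram target might be built with librosa along these lines; the file name, hop length, and the choice of ~125 spectrogram frames per second (to pair with EEG segments) are hypothetical:

```python
import librosa
import numpy as np

# Build the regression target: a Mel-Spectrogram of the song, framed so its
# time axis can be aligned with EEG segments of the listening session.
audio, sr = librosa.load("song_01.wav", sr=22050, mono=True)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128,
                                     hop_length=sr // 125)   # ~125 frames/sec
mel_db = librosa.power_to_db(mel, ref=np.max)                # log-scaled image
print(mel_db.shape)   # (128 mel bands, ~125 frames per second of audio)
```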

Another study building from this developed deep 2D CNN regressor models in an image-to-image translation task (Ramirez-Aristizabal et al., 2022). Such a method explicitly borrows CV methodology used to denoise images, with the underlying modeling assumption that the EEG ‘images’ are a noisy version of the music spectra. Despite that assumption not having much neuroscientific basis, the models successfully reconstructed the Mel-Spectrograms of the songs. Furthermore, those reconstructed spectra were turned into listenable waveforms and presented to new participants, who discriminated them in a two-alternative match-to-sample task at ~85% classification performance. In other words, the spectral reconstructions were strong enough that other people could listen to music retrieved from someone else’s brain and tell which song it was.
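A minimal sketch of an image-to-image regressor of this kind (layer sizes and image dimensions are illustrative, not the published architecture) maps an EEG ‘image’ to a Mel-Spectrogram ‘image’ with a pixel-wise loss:

```python
import torch
import torch.nn as nn

class EEG2Mel(nn.Module):
    """Toy encoder-decoder that translates an EEG image into a spectrogram image."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

eeg_img = torch.randn(8, 1, 128, 128)      # EEG segments as grayscale images
mel_target = torch.randn(8, 1, 128, 128)   # time-aligned Mel-Spectrogram targets
loss = nn.MSELoss()(EEG2Mel()(eeg_img), mel_target)  # pixel-wise regression loss
loss.backward()
```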

Computer Vision for Efficient and Naturalistic Applications

Earlier approaches to acoustic information retrieval have shown us what is possible to extract from someone’s brain signal. Subsequent studies taking a CV approach have since demonstrated a proof of scalability. In terms of efficiency, it was shown that end-to-end classification is possible without many trainable parameters. The Sonawane et al. (2021) study’s best model required one feature extraction step and had 1,678,156 trainable parameters. On the other hand, the Ramirez-Aristizabal et al. (2022) classification study demonstrated end-to-end classification with a model of 179,132 trainable parameters. This bodes well for deployment, where such an application needs less processing and a smaller memory footprint. More importantly, using different datasets, both studies open a window into how a Shazam app for your brain could work efficiently using CV.
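For reference, trainable-parameter counts like those quoted above can be computed for any PyTorch model with a one-liner; the toy model here is only a stand-in, not either study’s architecture:

```python
import torch.nn as nn

# Count trainable parameters, the efficiency figure quoted for each study.
toy_model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 10),
)
n_trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
print(n_trainable)
```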

Naturalistic music-listening scenarios require the stimuli to be complex: containing vocals, overlapping genres, and digital/acoustic textures, and spanning a song’s full length. This contrasts with other studies not using CV methods, which depend on short, repeated, and simple acoustic stimuli. What allows this methodology to handle brain responses to long audio recordings is that, unlike RNNs, which explicitly learn the temporal dependencies between time points, CV architectures learn spatially invariant features. This makes it possible to learn things ‘frame by frame’. For example, second 10 of the brain response to song #1 would be the input aligned to the target Mel-Spectrogram of the original song #1 at the 10th second. Putting the focus of resolution at a ‘frame by frame’ level makes it easier for models to map brain data to acoustic stimuli and to produce interpretable outputs, like a song someone can recognize.
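That frame-by-frame pairing might look like the following; the 125 Hz EEG rate, 125 spectrogram frames per second, and 4-minute song length are assumptions carried over from the earlier sketches:

```python
import numpy as np

# Illustrative pairing of 'frames': second t of the EEG response becomes the
# input aligned to second t of the song's Mel-Spectrogram.
eeg = np.random.randn(125, 240 * 125)   # channels x samples for a 4-minute song
mel = np.random.randn(128, 240 * 125)   # mel bands x frames for the same song

def frame_pair(eeg, mel, second, eeg_rate=125, frame_rate=125):
    """Return the (input, target) pair for one second of listening."""
    x = eeg[:, second * eeg_rate:(second + 1) * eeg_rate]
    y = mel[:, second * frame_rate:(second + 1) * frame_rate]
    return x, y

x10, y10 = frame_pair(eeg, mel, second=10)   # second 10 of song #1
```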

Figures of EEG data were generated using the publicly available NMED-T dataset (Losorelli et al., 2017) from the Stanford Research Data collection.
https://exhibits.stanford.edu/data/catalog/jn859kj8079
Latest research demonstrating reconstruction of music from brain responses.
https://arxiv.org/pdf/2207.13845.pdf

Following this article will be a series of Python notebook tutorials guiding you step by step on using Computer Vision for brain responses to acoustic stimuli.

Adolfo Ramirez-Aristizabal
Associate Principal Researcher at Accenture Labs — Digital Experiences