Simple P300 classifier on open data

Vladislav Goncharenko
Impulse Neiry
Jan 22, 2020

Recent neurotech developments like Neuralink and the FastMRI initiative bring us to explore the current state of the Brain-Computer Interface (BCI) field, as most of our communication with machines may soon occur by thought alone. The most straightforward way to implement a BCI is to use electroencephalography: just record electrical potential differences off the scalp and infer internal brain processes from the collected data.

To start with, we rely on the P300 Event-Related Potential (ERP), a kind of Evoked Potential (EP) related to decision making and stimulus discrimination. We visualize the electroencephalogram (EEG) signal, explore the ERP structure, build some classifiers and measure their performance.

There are plenty of tutorials on how to build a simple-to-moderate BCI solution, as well as explanations of the underlying brain mechanics: BCI2000, the Backyard Brains articles and the Swartz Center video lectures, to name a few (check them out). But all of them were made a long time ago, and most of them use Matlab, which is not widely used by data scientists nowadays, so we provide a tutorial relying on the Python infrastructure.

All the code used in this post is available in our repository.

To acquire a BCI dataset you may call your friends, make a VR game about Raccoons and Demons, record the EEG of all your colleagues and write a scientific article about it (I'll tell you that story another time). A much easier way is to download an existing open dataset previously recorded in a laboratory.

From a data scientist's point of view, P300 is just a surge in the EEG at a certain time in certain brain areas. There are plenty of ways to trigger it: for example, you concentrate on one stimulus and it gets activated (by changing colour, shape or brightness, or moving a bit) at a random moment. Here is how this process was implemented at the dawn of BCI:

The general scheme is as follows: a person sees a few stimuli (usually 3 to 7), chooses one of them and focuses attention on it (a good way to do this is to count its activations); then the stimuli get activated one by one in random order. Knowing the activation times, we can look at the chunk of EEG right after each activation and guess whether a P300 was present. Since the person focuses on only one particular stimulus, there should be only one spike. Thus the BCI is able to choose one of a few options (letters to spell, actions in a game, etc.). When there are more than 7 options to choose from, they are placed on a grid and the choice is reduced to a row+column choice. The video above shows the classical implementation of this approach, the P300 speller, which is used by paralyzed people to communicate.

The visual component used to record today's dataset was derived from the famous game Space Invaders. The interface used to look like this:

Brain Invaders gameplay

In reality, it's the same P300 speller, the only difference being that the letters are replaced by game aliens. A gameplay video and a technical report are also available.

All in all, we have data that we can load from the Internet; it contains 16 channels of EEG and one channel of markers indicating the starts of both target (chosen by the player) and empty (the rest) stimuli activations.

Most BCI datasets were recorded by neurophysiologists, and these guys usually don't care about compatibility, so the formats are quite diverse: from different versions of .mat files to "standard" formats .edf and .gdf. The main thing you need to know is that you don't want to parse all this on your own or work with it directly. Luckily, a group of enthusiasts called NeuroTechX wrote loaders for some of the open datasets. These loaders are part of the moabb project, which aims to standardize BCI solutions.

Loading raw dataset

At this point, we have acquired a RawEDF structure containing the EEG recordings. This structure comes from the mne package, which is usually utilized by biologists to process data: there are methods to filter and plot signals, store additional annotations and much more. But let's not go this way, as the package interface is not stable and insufficiently documented (e.g. we use package version 0.17 instead of the current 0.19 because the latest version fails to load our dataset).

What we take from the mne structure is the channel labels in the 10–20 system. This is an international system for electrode positioning on the human head, created to fix the relations between brain zones and electrode positions. Below you can see the electrode positions in the 10–10 system (a higher-density electrode setting compared to 10–20), with the electrodes used in our dataset marked in red.

First, for each participant we extract arrays of raw EEG 16 sec long along with all the marks (these are just another channel in the signal).

At this stage, we preserve the maximal-length continuous signal to reduce edge effects at the filtering stage.
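As a minimal sketch of this extraction step (assuming the recording is already a numpy array with 16 EEG channels plus a final marker row; `split_eeg_and_markers` is a hypothetical helper, not the repository code):

```python
import numpy as np

def split_eeg_and_markers(raw_array):
    """Split a (17, n_samples) array into 16 EEG channels and 1 marker channel."""
    eeg = raw_array[:-1, :]      # first 16 rows: EEG channels
    markers = raw_array[-1, :]   # last row: stimulus markers (0 = no event)
    return eeg, markers

# Toy example: 17 channels, 1000 samples
raw = np.zeros((17, 1000))
raw[-1, [100, 400, 700]] = [1, 2, 1]   # three stimulus activations
eeg, markers = split_eeg_and_markers(raw)
event_samples = np.nonzero(markers)[0]  # sample indices of activations
```

Keeping the signal as one long continuous array (rather than pre-cut chunks) is exactly what lets the filters below run without extra edge artifacts.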

Filtering and epoch slicing

For an in-depth overview of preprocessing and classification approaches I recommend the excellent review by the neuroscience maîtres. Another overview, covering Deep Learning methods, emerged this year. As follows from these works, the typical pipeline in the BCI field looks like this:

The minimal pipeline includes three steps:

  • decimation
  • filtering
  • scaling

We use the sklearn paradigm of Transformers and Pipelines to achieve extensibility. Custom transformers performing the needed operations are implemented in the transformers.py file and then assembled into the pipeline.

Decimation
For some mystic reason, in a number of articles I've encountered the decimation step performed by a straightforward sample drop like eeg = eeg[:, ::10]. This is obviously incorrect (it aliases high-frequency components into the band of interest), so we use the standard scipy implementation, which applies an anti-aliasing filter under the hood.
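A small demonstration of why the naive sample drop is wrong, assuming a 500 Hz recording decimated to 50 Hz (the rates here are illustrative, not taken from the dataset):

```python
import numpy as np
from scipy.signal import decimate

fs = 500       # assumed original sampling rate, Hz
factor = 10    # decimation factor: 500 Hz -> 50 Hz

t = np.arange(0, 2, 1 / fs)
clean = np.sin(2 * np.pi * 5 * t)                   # 5 Hz component we want to keep
noisy = clean + 0.5 * np.sin(2 * np.pi * 240 * t)   # 240 Hz component that will alias

naive = noisy[::factor]                          # sample drop: 240 Hz folds down to 10 Hz
proper = decimate(noisy, factor, ftype="fir")    # anti-aliasing low-pass filter first
clean_down = clean[::factor]                     # ground truth at 50 Hz
```

After the naive drop, the 240 Hz component masquerades as a 10 Hz oscillation, right in the EEG band; `decimate` removes it before downsampling. For multichannel data you would pass `axis=1`.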

Filtering
Here we also use scipy filters, particularly a 4th-order bandpass Butterworth filter applied forward and backward (filtfilt), resulting in an 8th-order filter without phase shift. The cut-off frequencies are 0.5–20 Hz, which is a standard choice for EEG filtering.
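The filtering step described above can be sketched as follows (the 50 Hz post-decimation rate and the toy drift signal are assumptions for illustration):

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 50.0              # sampling rate after decimation, Hz
low, high = 0.5, 20.0  # standard EEG band, Hz

# 4th-order Butterworth bandpass; filtfilt applies it forward and backward,
# doubling the effective order and cancelling the phase shift
b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="bandpass")

t = np.arange(0, 10, 1 / fs)
alpha = np.sin(2 * np.pi * 10 * t)        # in-band 10 Hz component to keep
drift = 3 * np.cos(2 * np.pi * 0.05 * t)  # slow baseline drift to remove
filtered = filtfilt(b, a, alpha + drift)
```

The zero-phase property matters for ERPs: an ordinary causal filter would shift the P300 peak in time.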

Scaling
We use a channelwise StandardScaler (subtracts the mean and divides by the std) fitted to the whole dataset. Strictly speaking, at this point we introduce a data leak, because the Scaler uses test data to fit its parameters, but on a large enough dataset this effect is insignificant. Scaling is performed channelwise to preserve the ability to mix signals from different sources (which therefore have different magnitudes), for example electrodermal activity (EDA).
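Channelwise scaling can be sketched like this (synthetic channels with very different magnitudes; to avoid the leak mentioned above, you would call `fit` on training data only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic channels with very different offsets and magnitudes
rng = np.random.default_rng(0)
eeg = rng.normal(loc=[[10.0], [-5.0]], scale=[[2.0], [0.1]], size=(2, 1000))

# StandardScaler standardizes columns, so transpose to make channels columns
scaler = StandardScaler()
scaled = scaler.fit_transform(eeg.T).T  # each channel now has mean 0, std 1
```

After this step a 2 µV EEG channel and a high-magnitude EDA channel live on the same scale and can be fed to one classifier.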

For now, we don't do any special filtering of artefacts like eye blinks or muscle contractions; we leave this for the next article.

After assembling our preprocessing pipeline, we apply it to the continuous EEG signal and slice it into so-called epochs. An epoch is a chunk of EEG in which specific brain processes start and finish (usually 0.5–2 seconds long); in our case, it's the period from stimulus activation to 900 ms after. We could shrink the epoch to 600 ms, for example, but we preserved some margin just to be sure.
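The slicing itself is simple indexing; a sketch (the helper name and toy event positions are made up for illustration):

```python
import numpy as np

def slice_epochs(eeg, event_samples, fs, epoch_len_s=0.9):
    """Cut a fixed-length epoch starting at each stimulus activation."""
    n = int(epoch_len_s * fs)
    epochs = [eeg[:, s:s + n] for s in event_samples if s + n <= eeg.shape[1]]
    return np.stack(epochs)  # (n_epochs, n_channels, n_samples)

fs = 50  # Hz, after decimation
eeg = np.random.default_rng(1).normal(size=(16, 5000))
events = np.array([100, 400, 4990])  # the last event is too close to the edge
epochs = slice_epochs(eeg, events, fs)
```

Events that don't leave room for a full 900 ms window are dropped rather than zero-padded.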

We have 16 channels and, after decimation, a 50 Hz sampling rate, so one epoch is shaped (16, 45): 900 ms at 50 Hz gives 45 time samples.

The labels provided with this dataset are only binary: they denote the starts of target (1) and empty (0) stimulus epochs.

Eventually, we've made a PyTorch-style dataset with each item storing one person's game: preprocessed epochs and binary labels. We will use PyTorch facilities in later posts to train neural networks.
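A PyTorch-style dataset just needs `__len__` and `__getitem__`; here is a dependency-free sketch of the idea (class and attribute names are hypothetical, not the repository's):

```python
import numpy as np

class P300Dataset:
    """PyTorch-style dataset: one item is one person's preprocessed game."""

    def __init__(self, epochs_per_person, labels_per_person):
        # epochs_per_person[i]: (n_epochs_i, 16, 45); labels_per_person[i]: (n_epochs_i,)
        self.epochs = epochs_per_person
        self.labels = labels_per_person

    def __len__(self):
        return len(self.epochs)

    def __getitem__(self, i):
        return self.epochs[i], self.labels[i]

rng = np.random.default_rng(2)
ds = P300Dataset(
    [rng.normal(size=(100, 16, 45)), rng.normal(size=(80, 16, 45))],
    [rng.integers(0, 2, 100), rng.integers(0, 2, 80)],
)
```

The real thing would subclass `torch.utils.data.Dataset`, but the interface is the same.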

Using it, we can test each person separately (for example with cross-validation) as well as test a transfer (aka calibration-less or pre-trained) classifier.

Exploring and visualizing data

First, let's have a brief look at the filtered EEG data we've obtained.

Here is one continuous part of the filtered EEG signal; it looks more or less like noise.

Now, if we zoom in to a one-second interval, we see that the signal has some structure: in this case, a rise from 400 to 600 ms after epoch start. That is what we are searching for: an evoked P300 potential.

One epoch signal

We have about 35k epochs over all participants, with approximately 1300 to 1750 per person; the variation is caused by different success rates in killing aliens in the game. As stated in the dataset announcement, the overall binary class ratio is 1/5, because a 6-by-6 table was used and every time only one of six stimuli was chosen. We'll return to this in the metrics discussion.

Now we can look at the target and non-target signals in general.

On the chart on the left, we can see that the target stimulus gives a stronger mean response. You can also see a non-specific response around 180 ms in both signals, stronger in the target one. Then the target has the typical hump from 250 to 500 ms: the proverbial P300 response.

With such a drastic contrast between the signals, one could consider the classification task easy, but if we look at the same chart with stds added, we realize that both signals are quite noisy, so it's not that effortless.

(Actually, we haven't been completely honest with the visualisation on these charts: the number of empty signals is 5 times bigger, so their mean is averaged over a bigger number of samples and looks smoother. But this doesn't help a lot, because the stds have the same magnitude.)

Furthermore, it's worth taking a look at the average signals of a single person.

Here we can see that both signals have larger amplitude. This brings us back to the previous remark on averaging: as mentioned there, fewer samples mean a rougher result.

Another meaningful characteristic is that this particular person's P300 has a slightly different shape from the global average. Interpersonal variability of neurophysiological reactions is high; we will see the influence of this factor later at the classification stage. Intrapersonal variability is high too, and depends on stress level, mood, fatigue, etc.

Here we can see the channelwise signals. The positions of the plots correspond to the 10–10 system layout shown above.

We can see how the response changes with the electrodes' position on the scalp. For example, in the Fp1 and Fp2 channels two negative peaks are clearly expressed before the positive one. Also, some channels have two distinct positive peaks, others have only one, and the rest fall somewhere in between.

Different electrodes capture different parts of the response and contribute differently to prediction quality; we will measure this in later posts using mutual information and stepwise regression.

Having enough electrodes, one could even interpolate the potentials over the whole head and plot a voltage map at any moment. With 16 electrodes this map is barely accurate, but it gives a qualitative understanding of the ongoing process. Note that mne by default expects the signal to be measured in volts, but we've already applied scaling, so the absolute values of the signal are no longer meaningful.

Classification

Now it's time to apply some ML techniques to our data to solve the main problem: discriminating the selected stimulus.

The classifiers we use are Logistic Regression, SVM, and some other techniques based on ERP-specific correlation analysis taken from the pyriemann package. Details of these methods can be found in the packages' documentation and the accompanying papers; one thing to note is that they are used in several competition-winning solutions.
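A sketch of the sklearn baselines on synthetic epochs (the synthetic "P300 bump" and all numbers here are made up for illustration; the pyriemann estimators follow the same fit/predict interface and can be swapped in):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n, ch, t = 600, 16, 45
X = rng.normal(size=(n, ch, t))
y = rng.integers(0, 2, n)
X[y == 1, :, 15:30] += 0.5        # crude P300-like bump in target epochs

X_flat = X.reshape(n, -1)          # sklearn estimators expect 2-D features
clfs = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC(probability=True)),
}
for name, clf in clfs.items():
    clf.fit(X_flat[:500], y[:500])  # simple holdout split
```

Flattening the (channels, time) epoch into one vector is the simplest featurization; the Riemannian methods instead work on epoch covariance matrices.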

The most popular neurointerface scheme is "calibration + prediction". Calibration means the person has to provide some data samples with known labels: usually the subject is told to focus on a pointed-out stimulus for some time. This data is used to train the classifier, and only after that is the system able to predict the person's choices. This approach has the obvious disadvantage of the need to stare at predefined objects. That is not a problem for medical usage, but it can be fatal to the gameplay experience.

Cross-validation for one person

To test our classifiers in this mode, let's perform cross-validation over one subject's epochs. The accuracy metric is insufficient here due to the unbalanced dataset (the constant-prediction baseline is 5/6 ≈ 83%), so I prefer to keep track of the precision/recall/f1 triplet.
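The metric triplet can be collected in one call; a sketch on synthetic data with the dataset's 1/5 class ratio (the feature shift of 0.3 is an arbitrary illustrative signal strength):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(4)
n_target, n_empty = 250, 1250   # ~1/5 class ratio, as in the dataset
X = np.vstack([
    rng.normal(0.3, 1.0, size=(n_target, 16 * 45)),  # target epochs, shifted
    rng.normal(0.0, 1.0, size=(n_empty, 16 * 45)),   # empty epochs
])
y = np.array([1] * n_target + [0] * n_empty)

# Stratified 5-fold CV with all three metrics at once
scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    scoring=["precision", "recall", "f1"],
)
```

With an integer `cv` and a classifier, sklearn uses stratified folds, so each fold keeps the 1/5 ratio.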

Cross-validation on the whole dataset

To observe performance on the whole dataset, we average the cross-validation scores over all persons. In general, the best models' results are high compared with what we get at Neiry "in the field" (note that this dataset was recorded in a lab).

Returning to the initial task, one could wonder how to answer the final question of the multiclass stimulus choice task (which, by the way, is balanced). To solve it, a number of activations is fixed (for example, 5 activations of each of 6 stimuli) and all stimuli get activated in random order. We then acquire 30 epochs, and for each stimulus the probabilities of its epochs being target are summed. The stimulus with the maximum score is acknowledged as the target. We will implement this approach in a future post on a suitable dataset, since this one has only binary labels.
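The aggregation step just described fits in a few lines (the probabilities here are synthetic; a real system would take them from the binary classifier's predict_proba):

```python
import numpy as np

def choose_stimulus(epoch_probs, stimulus_ids):
    """Sum per-epoch target probabilities per stimulus; pick the maximum."""
    scores = {}
    for p, s in zip(epoch_probs, stimulus_ids):
        scores[s] = scores.get(s, 0.0) + p
    return max(scores, key=scores.get)

# 6 stimuli x 5 activations = 30 epochs; stimulus 3 is the (hypothetical) target
stimulus_ids = np.tile(np.arange(6), 5)
rng = np.random.default_rng(5)
epoch_probs = rng.uniform(0.0, 0.4, size=30)
epoch_probs[stimulus_ids == 3] += 0.5   # classifier rates target epochs higher

chosen = choose_stimulus(epoch_probs, stimulus_ids)
```

Summing over repeated activations averages out single-epoch noise, which is why even a mediocre binary classifier can yield a decent multiclass accuracy.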

The second prediction scheme is called transfer learning, which here means predicting the epochs of unknown people. The thing is, when we train only on one person's epochs we in fact overfit to a particular peak form; predictions on that kind of peak have relatively good quality, but such a classifier doesn't recognize the general P300 concept.

Transfer learning trained on 1 person

We conduct two experiments: train one classifier on one person and predict on five different ones, then increase the training set to 10 persons (distinct from the testing five) to confirm the increase in prediction ability.

Transfer learning trained on 10 persons

So f1 increased from 0.23 to 0.4 for the best classifier (in both cases it happened to be Logistic Regression with the same regularization term).

This means the prediction ability increased from unsatisfying to acceptable. In our experience, classifiers of this quality may result in a multiclass-task accuracy of around 75%.

Finally, note that this example is a bit primitive, as evidenced by the high level of regularization in the Logistic Regression: the channels are strongly correlated (an issue that may be solved, for example, with embeddings).

Conclusion

Today we've observed the P300 evoked potential and assembled a simple pipeline for a neurointerface. I recommend checking out the notebook with the code yourself (it's available in our repository) and experimenting with the visualizations and classifiers.

Having basic knowledge of EEG signal-processing methods, we will be able to examine this field in depth in later posts: advanced preprocessing methods and plenty of neural network architectures.

To be continued…
