Pre-Processing of Muse 2 EEG Data

A first journey into DIY Brain Computer Interfaces, part 3

Tim de Boer
Building a bedroom BCI
8 min read · Sep 3, 2021


Fig 1. Designed by rawpixel.com / Freepik

Disclaimer: This series of blog posts is written to make myself aware of the details and concepts in the field of BCIs and neuroscience as I go through my very first BCI project. Right now, I really have no idea how this project will turn out.

Therefore, note that this series of blog posts should not be used as a definitive guide for your own journey. Rather, I hope you will take inspiration from my first journey, and maybe not make the same mistakes as I might do ;)

Update: In the summer of 2022, I started a new series of blog posts with updated, more advanced information, and way better results than achieved in this series of blog posts. Click here to go to part 1 of that series.

In part 2, I described how I acquired the data from my Muse 2 EEG device. In that blog post, I also described the experiment I want to conduct: online classification of motor imagery of lifting a bottle with the left and right hand, which I eventually want to use as input to control a simple game.

In this third part of the series, I will describe all the steps I have taken for the pre-processing part of my BCI project. We will start with outlier removal, then move on to transforming the data with PCA and ICA, and finish by engineering new temporal and frequency-based features.

Before we start, I want to add a quick additional disclaimer: the data I acquired from the Muse device is already filtered into the specific brain wave bands (gamma, beta, alpha, theta, delta) rather than raw EEG data, so I was often unsure whether a pre-processing step was needed or correctly applied. I also did not do any additional filtering myself. Later, I might try to work with raw EEG data, applying filters and these pre-processing steps where they make more sense, but for now I stick to the brain wave data from Muse and try to work with, and learn about, the pre-processing steps anyway.

Now, let’s dive in!

The code corresponding to the project described in this series of blog posts can be found at this Github repository. Specifically, we first discuss step 2, which can be found in the Python file 2_outliers.py and the folder step_2_preprocess. Then, we discuss feature engineering, which can be found in the Python file 3_features.py and the folder Step_3_feature_engineering.

Outlier Detection and Missing Values

To recap, our data consists of 20 features, which are the signal levels of the 5 different brain waves (gamma, beta, alpha, theta, delta), for each of the 4 sensors.

The first step of the pre-processing pipeline is outlier detection and removal. As I did not really know which outlier detection algorithm I should use, I searched Google for some papers about outlier detection for EEG data. I stumbled across an article from 2011 where the authors used a mixture model.

The mixture model is a distribution-based outlier detection algorithm, in which a combination of distributions is fitted to the data. Each datapoint that lies considerably far away from the fitted combination of distributions is considered an outlier. You could say that the mixture model is a more complex version of Chauvenet’s criterion, which works in the same manner but fits only a single distribution to the data.

In our case, EEG data would be difficult to fit properly with only one distribution. With a combination of distributions, we might get a more accurate fit. Thus, we will use a mixture model.

The distributions of a mixture model are weighted to maximize the product of the probabilities of observing our feature values (the likelihood). The number of distributions is chosen by the user; as the authors of the paper I found used 3 distributions, I followed their lead and chose 3 as well.

Then, a datapoint would be considered an outlier when the probability of occurring in the fitted combination of distributions was lower than 0.0005. For my data, the resulting number of outliers for a couple of test datasets was on average around 10, with a maximum of 40 outliers in a dataset with 2000 datapoints. In figure 2 below, we see that points with a low probability are mostly at the beginning when contact with the skin may have been lost for a bit, and at the high peak at the end.

Figure 2: An example of outlier detection using a mixture model. Above, we see the original data. Below, we see for each datapoint the probability of occurring in the fitted combination of distributions. Notice that for peaks, the probability of occurrence is lower.
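For reference, here is a minimal sketch of how such a mixture-model check could look in Python with scikit-learn, where the mixture density at each point is treated as its ‘probability of occurring’. The function and column names are my own illustrations, not necessarily the exact code in the repository.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

def mark_outliers(df: pd.DataFrame, col: str, n_components: int = 3,
                  threshold: float = 0.0005) -> pd.Series:
    """Flag datapoints whose probability under a fitted mixture of
    n_components normal distributions falls below the threshold."""
    values = df[col].to_numpy().reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, max_iter=100)
    gmm.fit(values)
    # score_samples returns log densities; exponentiate to get the density
    probabilities = np.exp(gmm.score_samples(values))
    return pd.Series(probabilities < threshold, index=df.index)

# Hypothetical usage on one of the 20 brain wave columns:
# mask = mark_outliers(df, 'alpha_TP9')
# df.loc[mask, 'alpha_TP9'] = np.nan   # removed outliers become missing values
```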

So, outlier detection did not find too many outliers in my data, which I think is logical: I use the already filtered brain wave data, so most outliers would already have been filtered out. However, I do have to mention that before starting the pre-processing, I manually looked at all my data and discarded all experiments which had a considerable period of zero-values, indicating that the sensors had lost contact with the skin for a longer period. Had I included those, the number of outliers for those experiments would likely have been different.

One last thing to mention in this part: I linearly interpolated the missing values (which I had created by deleting the outliers).
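In pandas, that interpolation step can be as simple as the sketch below, assuming the removed outliers have been set to NaN as in the previous snippet:

```python
# Linearly interpolate the NaN gaps left by the removed outliers;
# back-fill in case the very first samples were outliers themselves.
df = df.interpolate(method='linear').bfill()
```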

PCA and ICA

Now, we move on to data transformation / feature engineering. More specifically, we will first transform the data with PCA and ICA.

A quick explanation of PCA (Principal Component Analysis) comes down to this: we want to find vectors that represent lines (or hyperplanes if we have more than two attributes) and order them by how much of the variance in the data they explain.

PCA is often used to reduce the dimensions of the data, for instance for visualization: reducing the dimensions to 2 or 3 makes it possible to plot the data in a 2D or 3D graph. Here, however, we will use PCA as a means to add more features to our data, which a machine learning model might find useful for its predictions. The big disadvantage of applying PCA for feature engineering is that we often lose the ability to interpret our models, as we now have a new space with values that are not immediately interpretable by a domain expert.

For PCA, we have to decide how many components we want to add to our data. Remember, the first component of PCA explains the most variance, and the last component explains the least. To decide, we look at the amount of explained variance for each component in figure 3 below, and choose the point after which the explained variance no longer decreases much (referred to as the ‘elbow method’). I chose 4 components, but 5, 6 or 7 components could also have been chosen, as this method isn’t that strict.

Figure 3: Explained variance for each component of PCA.
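In code, this step could look roughly like the sketch below with scikit-learn, where `feature_cols` is a placeholder for the 20 brain wave columns:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first: PCA is sensitive to the scale of the features
X = StandardScaler().fit_transform(df[feature_cols])

# Fit on all components once to inspect the explained variance (figure 3)
print(PCA(n_components=len(feature_cols)).fit(X).explained_variance_ratio_)

# After picking the elbow, keep 4 components and add them as new features
pca = PCA(n_components=4)
components = pca.fit_transform(X)
for i in range(components.shape[1]):
    df[f'pca_{i + 1}'] = components[:, i]
```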

ICA (Independent Component Analysis) is often explained as: a method to separate independent sources linearly mixed in several sensors. For instance, when recording electroencephalograms (EEG) on the scalp, ICA can separate out artifacts embedded in the data (since they are usually independent of each other) [Copied from ICA for dummies].

So, when acquiring raw EEG data from our sensors, we could apply ICA to remove artifacts and obtain the actual brain signal data we’re interested in. In our case, again, we already have the pre-processed brain wave data, so applying ICA isn’t strictly necessary for artifact removal. However, just for experimental reasons, I applied FastICA (a variant of ICA) with 20 components (which is just the number of features I had before).
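A corresponding sketch for the FastICA step, again assuming the hypothetical `feature_cols` from above:

```python
from sklearn.decomposition import FastICA

# Decompose the 20 brain wave features into 20 independent components
ica = FastICA(n_components=20, max_iter=1000)
sources = ica.fit_transform(df[feature_cols])
for i in range(sources.shape[1]):
    df[f'ica_{i + 1}'] = sources[:, i]
```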

Temporal and Frequency Based Features

Lastly, we move on to feature engineering in the time and frequency domain.

For temporal data, it is often difficult to make predictions based on a single datapoint. Rather, we would like to chop up the data into small pieces and try to make predictions based on the group of datapoints in each piece. This is exactly what we will do now: we chop up our data into ‘windows’, and calculate some metrics (in the time or frequency domain) to summarize the small set of datapoints in that specific window.

We then only keep 1 datapoint per window: the summarized datapoint. Thus, with this method, we add a lot of new features, but we decrease the number of instances in our data. To limit this decrease, we allow for some overlap between windows; generally, 50% overlap is chosen. For the window sizes, we choose 1, 2 and 3 seconds.

In the time domain, various metrics can be used to summarize the data in each window. We will calculate the mean, standard deviation, maximum, minimum, median and slope, where the slope is that of the line we get when fitting a linear regression to the data in the window.
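Here is a minimal sketch of how this windowing and aggregation could look in Python; the function name, column handling and the 10 Hz sampling rate are my own assumptions, not necessarily what the repository does.

```python
import numpy as np
import pandas as pd

def time_domain_features(df: pd.DataFrame, cols, window_s: float = 1.0,
                         fs: int = 10, overlap: float = 0.5) -> pd.DataFrame:
    """Summarize each (possibly overlapping) window with time-domain metrics."""
    window = int(window_s * fs)          # samples per window
    step = int(window * (1 - overlap))   # 50% overlap -> step of half a window
    rows = []
    for start in range(0, len(df) - window + 1, step):
        segment = df[cols].iloc[start:start + window]
        row = {}
        for col in cols:
            values = segment[col].to_numpy()
            row[f'{col}_mean_{window_s}s'] = values.mean()
            row[f'{col}_std_{window_s}s'] = values.std()
            row[f'{col}_max_{window_s}s'] = values.max()
            row[f'{col}_min_{window_s}s'] = values.min()
            row[f'{col}_median_{window_s}s'] = np.median(values)
            # slope of a least-squares line fitted to the window
            row[f'{col}_slope_{window_s}s'] = np.polyfit(np.arange(window), values, 1)[0]
        rows.append(row)
    return pd.DataFrame(rows)
```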

In the frequency domain, we will compute the amplitudes of the frequencies in each window using a Fourier Transformation (FT).

From these amplitudes, we can engineer multiple features. An obvious one is to add the amplitude of each of the relevant frequencies found in a window.

Also, we can engineer features which summarize the found frequencies in the windows:

1. We will add the frequency with the highest amplitude, as an indication of the most important frequency in that window.

2. We can calculate the weighted signal average of the frequencies by multiplying each frequency by its amplitude, taking the sum of these products, and dividing by the sum of the amplitudes.

3. Lastly, we can calculate the power spectral entropy. We first compute the power spectral density (squaring the amplitudes and normalizing by the number of found frequencies), then normalize the values to a total sum of 1 so that we can view them as a probability density function, and lastly compute the entropy via the standard entropy calculation.

As a large window would give us a huge number of features (since we add the amplitude of each found frequency), we will only calculate frequency-domain features for the smallest window size we chose, which is 1 second.
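Below is a minimal sketch of how these frequency-domain features could be computed for a single 1-second window with NumPy; at 10 Hz such a window holds 10 samples, and the helper name is my own.

```python
import numpy as np

def frequency_features(values: np.ndarray, fs: int = 10) -> dict:
    """Compute FFT-based features for the samples of one window."""
    n = len(values)
    amplitudes = np.abs(np.fft.rfft(values))   # amplitude per frequency bin
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)     # the matching frequencies in Hz

    # Amplitude of each found frequency as its own feature
    features = {f'amp_{f:.1f}Hz': a for f, a in zip(freqs, amplitudes)}

    # 1. frequency with the highest amplitude
    features['max_freq'] = freqs[np.argmax(amplitudes)]

    # 2. amplitude-weighted average frequency
    features['weighted_freq'] = np.sum(freqs * amplitudes) / np.sum(amplitudes)

    # 3. power spectral entropy
    psd = amplitudes ** 2 / len(freqs)         # power spectral density
    psd_norm = psd / np.sum(psd)               # normalize to a total sum of 1
    features['spectral_entropy'] = -np.sum(psd_norm * np.log(psd_norm + 1e-12))
    return features
```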

Conclusion

By now, we have created a staggering 584 features, while we started with only 20! Some of the features are displayed in figure 4 below. Due to the sampling of the data into windows, the number of instances has decreased to around 1 datapoint per 1.5 seconds (to recap, it was originally 10 samples per second).

Figure 4: Some of the features of our dataset right now.

I’m not sure (yet) if all added features are needed, useful or even correct… but we may find out in part 4, where we will apply various machine learning models and evaluate their performance!

Click here to go back to part 1, or part 2. Part 4 is available here.

Any feedback for this project and blog post series is welcome!
