Decoding the Human Brain

--

“The brain is the most outstanding organ. It works 24/7, 365 from birth until you fall in love.” ― Sophie Monroe, Afflicted

"How does visual input influence the functioning of the brain?" For example, when you see a stranger who looks like your best friend, you will form a positive image of the stranger; on the other hand, if the stranger looks like the person who bullied you in high school, you might form a negative image of them. In this neuroscience project, we try to unlock a small part of this mystery.

1. Background

Brain decoding is a popular topic in neuroscience. Its purpose is to reconstruct, from brain activity data, the stimulus that was presented to a sensory system. This project is based on a Kaggle competition whose goal is to predict the category of a visual stimulus presented to a subject from the concurrent brain activity. The brain activity is captured with an MEG device, which records 306 timeseries at 1 kHz of the magnetic field associated with the brain currents. There are two categories of visual stimulus in this project: face and scrambled face. A stimulus and the concurrent MEG recording are called a trial, and thousands of randomized trials were recorded from multiple subjects. The trials of some of the subjects, i.e., the train set, are provided to create prediction models. The remaining trials, i.e., the test set, belong to different subjects and are used to score the prediction models. Because of the variability across subjects in brain anatomy and in the patterns of brain activity, a certain degree of difference is expected between the data of different subjects and thus between the train set and the test set.

2. Dataset

The training data consist of 9,414 trials (MEG (magnetoencephalography) recordings and their class labels, Face/Scramble) from 16 subjects. The face images were of famous and unfamiliar people. The scrambled faces were created by taking the 2D Fourier transform of a face image, permuting the phases, transforming back to image space, and masking with the outline of the original face image [2]. The test set comprises 4,058 MEG recordings from 7 subjects, without class labels. Each subject has approximately 580–590 trials. Each trial consists of 1.5 seconds of MEG recording (starting 0.5 seconds before the stimulus onset) and the related class label, Face or Scramble. The data were down-sampled to 250 Hz and high-pass filtered at 1 Hz. For each trial, 306 timeseries were recorded, one per channel. The trials of each subject are arranged into a 3D data matrix (trial x channel x time) of size roughly 580 x 306 x 375.

The 306 sensors are grouped, three at a time, in 102 locations. At each location, two orthogonal gradiometers and one magnetometer record the magnetic field induced by the brain currents. The magnetometer measures the z (radial) component of the magnetic field, while the gradiometers measure the x and y spatial derivatives of the magnetic field [1].

3. Challenges in the Dataset

The main competition goal is simple: binary classification over two classes with MEG data as input. However, after further reading of the related literature, the problem can be explored further in terms of generalization, i.e., how the human brain reacts to face images. During early group discussions, the question arose of why the project focuses on face images rather than other objects, since a bias may be introduced by the fact that people react differently to faces. For example, if the presented face resembles one of the subject's family members (e.g., their mother), it may cause (slightly?) different brain activity compared with other faces because of the subject's feelings. This problem is formalized in [1], which refers to it as structural/functional differences between individuals' brains. This adds complexity to the choice between across-subject and single-subject training.

Across-Subject vs. Single-Subject Modelling

According to the literature [1,2], a practical phenomenon in neuroscience is that the prediction accuracy of models trained and tested on a single subject's trials usually outperforms across-subject models by roughly 15% [1]. Single-subject modelling refers to training and testing on data collected from the same subject, while across-subject modelling follows the usual practice of mixing data from different subjects. It is easy to see that the across-subject setting has larger variance within the data set, but that is not the only source of difficulty.

There is a cornerstone assumption in machine learning about the relationship between the training and test distributions: they are assumed to come from the same distribution (i.e., the training and test data are generated by the same source). But given the across-subject setting together with the structural/functional variation, it follows that the training and test sets differ at least to some degree, which affects the quality of the predictions. However, the magnitude of this effect is still unclear.

Binary classification vs Generative Model

The across-subject setting relates to a fundamental question in learning theory: system identification versus imitation of a system's operator. The former tries to estimate the underlying stochastic dependencies (i.e., to infer the distribution of brain activity, face versus scrambled signals, for every individual), while the latter only tries to approximate the behaviour of the brain activity for an individual. System identification is much harder than imitation, as it is naturally ill-posed, but it is more appropriate for this project, since it would be interesting to discover the underlying activity of how the brain responds to human faces.

4. Experimental Work

4.1. Main idea

Fig 1. Autoencoders framework [3]

Following the previous direction towards generative modelling, we chose the Variational AutoEncoder (VAE) [3] to conduct experiments on the MEG data set. The VAE is one of the fundamental generative models in deep neural networks; it is based on unsupervised learning and specializes in feature extraction (shown in Fig 1).

The main idea of this specially arranged structure is to let the neural network learn to reconstruct the original input by extracting its important features. The extraction task is done by the encoder, which compresses the original input into an output (latent) space of smaller dimension than the original one; this output is then fed to the decoder, which expands it back to the original space by reconstructing from the latent variable and comparing the result with the original input for improvement.

While the above summarizes how an autoencoder works, the variational part focuses on the latent-space distribution, on which it introduces an additional constraint, namely the Kullback–Leibler divergence (KLD), in short a measure of the distance between two distributions [4,5].

From the encoder's point of view, it tries to map each individual input to a latent space that should be as general as possible (within each class) so that the decoder can reconstruct the output better. While the encoder performs inductive inference onto the hidden space, the KL divergence conceptually helps separate the distributions of the different classes. In practice, each latent distribution is instead pushed towards a standard Gaussian, i.e., its KL distance to the standard Gaussian is minimized. (Why? 1. The latent distribution of each class is unknown. 2. Maximizing the distance between distributions is unbounded.)

The loss function is included here for reference:

Loss = BCELoss(x_hat, x) + (-0.5 * sum(1 + v - m^2 - exp(v)))

where the first term is the reconstruction loss and the second term is the KL divergence loss; m and v denote the mean and log-variance of the latent distribution.
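As a minimal PyTorch sketch of this loss (assuming the inputs have been Min-Max scaled to [0, 1] so that BCE applies, and that the decoder ends in a sigmoid):

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar):
    """Reconstruction (BCE) + KL divergence to a standard Gaussian N(0, I)."""
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    # KL(q(z|x) || N(0, I)) = -0.5 * sum(1 + logvar - mu^2 - exp(logvar))
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld
```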

4.2 Experiment Processes

  • Data preparation

The MEG data were already de-noised by the competition host and exported in MATLAB format, so there was not much to do in the pre-processing phase.

Fig 2. 102 location map (3 sensors per point)

The data were collected from an experiment conducted with 23 subjects, resulting in 13,472 trials, further separated into a training set (16 subjects, 9,414 trials) and a test set (7 subjects, 4,058 trials). During each trial, the subject was presented with either a normal or a scrambled human face image, in random order, for 1 second, and their brain activity was recorded by 306 MEG sensors placed at 102 locations (3 channels per location) over different parts of the head. Fig 2 shows the location of each sensor group.

Each subject performed roughly 588 trials with a balanced number of face and scrambled images. The data therefore have shape (13472 x 306 x 375).
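A minimal sketch of loading one subject's file with scipy (the file names and the 'X'/'y' variable keys are assumptions based on the competition release; adjust them if the export differs):

```python
import numpy as np
from scipy.io import loadmat

def load_subject(path):
    """Load one subject's trials: X is (trials, 306, 375), y is (trials,)."""
    mat = loadmat(path)
    X = mat['X']           # MEG recordings
    y = mat['y'].ravel()   # labels: 1 = face, 0 = scrambled face
    return X, y

# Stack all 16 training subjects into one (9414, 306, 375) array, e.g.:
# X_train = np.concatenate([load_subject(f'train_subject{i:02d}.mat')[0]
#                           for i in range(1, 17)])
```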

  • FFT Transformation

The MEG data are transformed into the frequency domain. Figs 3 and 4 visualize the spectra of the 3 channels from the same location, for a face image and a scrambled image.

Fig 3. Random Sample 1
Fig 4. Random Sample 2

The plots show that most of the significant frequency content is concentrated between 0 and roughly 50 Hz, and that no particular pattern distinguishing face from scrambled images can be identified in the frequency domain.

The sharp peak that appears in all three channels at around 50 Hz is suspicious; it may be the result of noise or a numerical artifact. Further investigation found that all transformed signals show the same pattern. This finding will be used in modelling with and without that sharp-peak component. The resulting data dimension is reduced to (13472, 306, 188).
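A sketch of this step with numpy (we assume the magnitude spectrum is taken; the real FFT of a 375-sample window yields 188 frequency bins, matching the shape above):

```python
import numpy as np

def to_frequency_domain(X):
    """Real FFT along the time axis: (N, 306, 375) -> (N, 306, 188)."""
    return np.abs(np.fft.rfft(X, axis=-1))

# With fs = 250 Hz and n = 375 samples, the bin resolution is 250/375 = 0.67 Hz,
# so the suspicious ~50 Hz peak sits near bin 75 and can be masked out when
# testing the "without peak" variant.
```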

  • Min-Max Transformation

As the values after the FFT become exceptionally low in magnitude, a Min-Max transformation is applied along the last axis after the FFT to avoid numerical instability.
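A minimal sketch of this scaling along the frequency axis (the function name and the epsilon guard are ours):

```python
import numpy as np

def min_max_scale(S, eps=1e-12):
    """Scale each channel's spectrum to [0, 1] along the last (frequency) axis."""
    lo = S.min(axis=-1, keepdims=True)
    hi = S.max(axis=-1, keepdims=True)
    return (S - lo) / (hi - lo + eps)
```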

  • Model Input
Fig 5. visualization of random samples

Usually, the data would be flattened into a row/column vector as VAE input. However, in this experiment each location comes with 3 channels of input, so it is interesting to test whether convolutional layers can be adopted, as in image processing. The data are therefore further reshaped to (13472, 3, 102, 188). Fig 5 shows a visualization of random samples from both classes.
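The reshape could look like the following sketch, assuming the 306 channels are ordered location-major (three consecutive channels per location); if not, a channel-index lookup would be needed first:

```python
# S: (13472, 306, 188) Min-Max scaled spectra from the previous steps (name assumed)
S = S.reshape(-1, 102, 3, 188)   # (N, location, sensor, frequency)
S = S.transpose(0, 2, 1, 3)      # (N, 3, 102, 188): channels-first, as Conv2d expects
```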

4.3. Model Process

  • 2 Phase Process

The modelling process is divided into two steps. The first step trains the VAE to produce latent variables from the model input. The resulting latent vectors are then used as input to nuSVC (Nu-Support Vector Classification) for classification.

Pipeline: input data -> VAE (latent vectors) -> nuSVC -> 2 classes
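A rough sketch of the two phases (the `vae.encode` method and variable names are assumptions, not the exact implementation):

```python
import torch
from sklearn.svm import NuSVC

# Phase 1: encode every trial with the trained VAE to get latent vectors
with torch.no_grad():
    mu, logvar = vae.encode(torch.from_numpy(S_train).float())
    z_train = mu.numpy()             # (n_trials, latent_dim)

# Phase 2: classify the latent vectors with nuSVC
clf = NuSVC(nu=0.5, kernel='rbf')
clf.fit(z_train, y_train)
# accuracy = clf.score(z_test, y_test)
```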

First Phase: Hyperparameter Tuning

Three hyperparameters were identified for the VAE with convolutional layers: kernel size, latent dimension, and the number of layers in the encoder and decoder. The parameter exploration was done manually, guided by the VAE loss. The following table lists the parameters used in the first phase.
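As a rough illustration of where these three hyperparameters enter, here is a skeleton of the encoder half (channel widths and layer arrangement are assumptions, not the architecture actually used; the decoder mirrors it with ConvTranspose2d layers):

```python
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, kernel_size=4, latent_dim=102, n_layers=2):
        super().__init__()
        chans = [3] + [16 * 2 ** i for i in range(n_layers)]
        blocks = []
        for i in range(n_layers):
            blocks += [nn.Conv2d(chans[i], chans[i + 1], kernel_size,
                                 stride=2, padding=1),
                       nn.ReLU()]
        self.conv = nn.Sequential(*blocks, nn.Flatten())
        self.fc_mu = nn.LazyLinear(latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.LazyLinear(latent_dim)  # log-variance of q(z|x)

    def forward(self, x):                           # x: (N, 3, 102, 188)
        h = self.conv(x)
        return self.fc_mu(h), self.fc_logvar(h)
```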

Second Phase: Hyperparameter Tuning

The second-phase tuning relies on sklearn's GridSearchCV applied to nuSVC over a small grid of its parameters.
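The concrete grid is not reproduced in the text, so the values below are placeholders; a minimal sketch of the search could be:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import NuSVC

param_grid = {                      # placeholder values, not the actual grid
    'nu': [0.25, 0.5, 0.75],
    'kernel': ['rbf', 'linear'],
    'gamma': ['scale', 'auto'],
}
search = GridSearchCV(NuSVC(), param_grid, cv=5, n_jobs=-1)
# search.fit(z_train, y_train)
# print(search.best_params_, search.best_score_)
```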

[Figures: KL divergence loss, reconstruction loss, reconstructed images, and latent-variable distribution over training]

5. Results

In summary, the experiment did not produce a promising result: the best classification accuracy was only 52% (nuSVC with full parameter tuning), compared with about 70% accuracy for the method in [1]. For a two-class problem this is essentially no better than random guessing. Nevertheless, the results for each phase of the model are presented below.

The reconstruction loss is rebased against 0.14 (around 14% of the pixels are non-black), since the images consist mostly of black pixels and a trivially black reconstruction should not count as a good result.

First phase model (VAE)

From the above summary table, it is observed that all KL divergence losses are zero (or very close to zero), which indicates that both classes share the same distribution in the latent space. This observation is confirmed by the latent-variable histogram plots recorded during training, shown earlier. However, the histograms also show that, although the two classes share the same distribution of values, their modes are not the same. With this finding, we decided to continue the exploration with kernel size 4 and latent dimension 102.

Latent Variable exploration

The input trials are passed through the VAE to generate a 13,472 x 102 latent-space dataset. The following plot shows the mode (most common value) and descriptive statistics for each latent-space dimension.

The plot clearly shows a distinct pattern between face and scrambled images in the most common value per latent dimension, in contrast to the mean values. This finding is examined further by applying different rounding to the latent-variable dataset to see how the pattern changes.

By rounding the data to 1, 2, and 8 decimal places, we verified that the mode pattern only appears once the data are rounded to 8 decimal places.
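A small sketch of this check (the function name is ours): compute the per-dimension mode after rounding and compare the two classes.

```python
import numpy as np

def latent_modes(z, decimals):
    """Most common (mode) value of each latent dimension after rounding."""
    z_r = np.round(z, decimals)
    modes = np.empty(z_r.shape[1])
    for j in range(z_r.shape[1]):
        vals, counts = np.unique(z_r[:, j], return_counts=True)
        modes[j] = vals[np.argmax(counts)]
    return modes

# e.g., compare the per-dimension modes of the two classes at several precisions
# for d in (1, 2, 8):
#     diff = latent_modes(z[y == 1], d) - latent_modes(z[y == 0], d)
```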

This observation may suggest a new direction for how the face and scrambled latent variables can be interpreted in the column space.

6. Conclusion and Future Work

The attempt to use a VAE to discover the latent-space distributions of Face and Scramble images for classification failed, likely because both classes share nearly the same hidden-space distribution with only slight differences. However, there is still tuning that could be done to assess whether the VAE can capture the distribution, by:

  1. Changing the model to use fully connected layers instead of convolutional layers. Convolution may be best suited to image-processing tasks such as edge detection, but given the lack of visible patterns in the visualized MEG images it may fail and result in poor performance.
  2. Not using weight sharing in the convolutional layers.
  3. Further studying the mode of the latent variables.

7. References

[1] DecMeg2014 — Decoding the Human Brain

[2] A parametric empirical Bayesian framework for the EEG/MEG inverse problem: generative models for multi-subject and multi-modal integration

[3] MEG decoding across subjects

[4] AntixK/PyTorch-VAE: A Collection of Variational Autoencoders (VAE) in PyTorch.

[5] How to Calculate the KL Divergence for Machine Learning

Appendix I. Roles in the project

Team Members: ChengHan Chung(chenghan@ut.ee), Chan Wai Tik (waiti84@ut.ee)

Model implementation: Chan Wai Tik

Parameter Optimization and code review: ChengHan Chung

Presentation and Blog: Chan Wai Tik, ChengHan Chung
