Data overview

Kristiina Uusna
Published in Truth or lie?
4 min read · Jan 13, 2020

In this section we take a closer look at the structure of the EEG dataset, which comes from the Bag-of-Lies database [1].

We have 35 unique subjects, and EEG data is available for 30 of them. Each subject (user) has a different number of trials (runs), with a minimum of 3 and a maximum of 10 runs per user. The data set has a total of 325 annotated recordings, 162 lies and 163 truths, with 271,430 observations overall. The runs do not have a fixed number of measurements: a single run contains anywhere from 59 to 4,867 observations. The data set consists of 13 EEG feature values, a quality measure for each of them, and the deception variable (truth). At the recording level the variable truth is well balanced, as the counts above show. At the observation level the split is somewhat less even, with 57% of the observations being truthful and 43% being deceptive.
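The structure described above (runs per subject, observations per run, class balance) can be checked with a few group-by calls. The sketch below is only an illustration on synthetic data: the column names `user`, `run` and `truth` are our assumptions, not the documented schema of the Bag-of-Lies files, and the toy numbers do not reproduce the figures quoted above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the EEG table. The column names `user`, `run`
# and `truth` are assumptions made for this example.
rng = np.random.default_rng(0)
frames = []
for user in range(1, 4):                          # 3 toy subjects
    n_runs = int(rng.integers(3, 6))              # 3 to 5 runs per subject
    for run in range(1, n_runs + 1):
        n_obs = int(rng.integers(59, 200))        # run length varies
        label = int(rng.integers(0, 2))           # 1 = truth, 0 = lie (one label per run)
        frames.append(pd.DataFrame({
            "user": user,
            "run": run,
            "truth": label,
            "F3": rng.normal(-120, 15, n_obs),    # one toy EEG channel
        }))
df = pd.concat(frames, ignore_index=True)

# Number of runs per subject and number of observations per run
runs_per_user = df.groupby("user")["run"].nunique()
obs_per_run = df.groupby(["user", "run"]).size()

# Class balance per recording (each run counted once) ...
balance_by_run = (
    df.groupby(["user", "run"])["truth"].first().value_counts(normalize=True)
)
# ... and per observation (longer runs weigh more)
balance_by_obs = df["truth"].value_counts(normalize=True)

print("runs per user: min", runs_per_user.min(), "max", runs_per_user.max())
print("observations per run: min", obs_per_run.min(), "max", obs_per_run.max())
print(balance_by_run)
print(balance_by_obs)
```

Computing the balance both ways is useful here because the two views can differ when runs have very different lengths.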

First 11 observations from dataframe

In this project we build models with 13 different features: F3, F4, F7, F8, FC5, AF4, FC6, O1, O2, P7, P8, T7, T8, which are all the available EEG channels. Below we plot the mean value of each feature. All means are negative: F4 has the lowest mean (-134.2) and P8 the highest (-100.7).
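As a rough sketch of how such per-channel means can be computed, the example below uses pandas on synthetic data. The channel names come from the list above, but the values are randomly generated for illustration, so the resulting means and their ordering do not match the ones reported in this post.

```python
import numpy as np
import pandas as pd

CHANNELS = ["F3", "F4", "F7", "F8", "FC5", "AF4", "FC6",
            "O1", "O2", "P7", "P8", "T7", "T8"]

# Synthetic readings: one column per channel, each with its own
# (arbitrary, negative) toy mean. These are not the real EEG values.
rng = np.random.default_rng(1)
toy_means = np.linspace(-135, -100, len(CHANNELS))
df = pd.DataFrame(
    rng.normal(toy_means, 15.0, size=(5000, len(CHANNELS))),
    columns=CHANNELS,
)

# Mean of every channel, sorted from lowest to highest
channel_means = df[CHANNELS].mean().sort_values()
lowest, highest = channel_means.index[0], channel_means.index[-1]

print(channel_means.round(1))
print(f"lowest mean: {lowest}, highest mean: {highest}")
```

The same call works for any other set of numeric columns, so it can be reused for other per-channel summaries.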

Every feature has its own quantitative quality value recorded with each measurement. Below we plot the mean of each quality measure. F8 has the lowest mean quality value (1178) and AF4 the highest (4398).

Now let’s look at density plots to see how the feature values and their qualities are distributed. The feature densities peak between -150 and -100, and the highest peaks belong to features F8 and F7. The density plots suggest an approximately normal distribution for each feature, with roughly the same mean across features. Therefore, when we classify with these feature values as covariates, we can expect methods that assume Gaussian densities to be more stable, provided the normality carries over to the joint, multidimensional distribution.
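A visual impression of normality can be complemented with simple numbers. The sketch below, again on synthetic data, computes sample skewness and excess kurtosis (both are close to 0 for a Gaussian) and locates the peak of a histogram-based density estimate. The three columns and their parameters are assumptions chosen for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic Gaussian channels; F7 and F8 are given a smaller spread than P8.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "F7": rng.normal(-125, 10, 20000),
    "F8": rng.normal(-125, 10, 20000),
    "P8": rng.normal(-110, 20, 20000),
})

# Skewness and excess kurtosis: both near 0 for normally distributed data
shape = pd.DataFrame({"skew": df.skew(), "excess_kurtosis": df.kurt()})

# Histogram-based density estimate: where is the peak and how high is it?
peaks = {}
for col in df.columns:
    density, edges = np.histogram(df[col], bins=60, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    peaks[col] = (centers[density.argmax()], density.max())

print(shape.round(3))
for col, (location, height) in peaks.items():
    print(f"{col}: peak near {location:.1f}, density {height:.4f}")
```

Note that a taller peak in a density plot reflects a smaller spread of the values, since the area under every density curve is 1. Marginal checks like these say nothing about the joint distribution across channels, which would need a separate multivariate check.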

The density plots for the feature qualities show no clear peaks, meaning that the quality values are spread over a wide range rather than concentrated around a typical level. There is still an accumulation of quality values at 0, mainly for the F8, F7 and T7 features. Overall, the quality measures range from 0 to over 8000.
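The accumulation at 0 can also be quantified directly as the share of zero readings per quality column. This is a minimal sketch on synthetic data: the column names `F8_quality` and `AF4_quality`, the helper `toy_quality` and the zero shares are hypothetical and only serve to illustrate the calculation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 10000

def toy_quality(zero_share):
    """Toy quality values spread over a wide range, with a share forced to 0."""
    values = rng.uniform(0, 8000, n)
    values[rng.random(n) < zero_share] = 0.0
    return values

# Hypothetical column names; the real files may label qualities differently.
quality = pd.DataFrame({
    "F8_quality": toy_quality(0.30),
    "AF4_quality": toy_quality(0.02),
})

# Share of observations whose quality value is exactly 0, per column
zero_share = (quality == 0).mean()
# Overall range of each quality column
value_range = quality.agg(["min", "max"])

print(zero_share.round(3))
print(value_range.round(1))
```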

The correlation matrix below shows the correlation coefficients between feature values. We can see strong correlation between features O2 and P8, and also between FC6 and O2 and between FC6 and P8. Because our models focus on classification accuracy rather than on causal interpretation, we mostly ignore the strong correlation between the variables (which indicates possible multicollinearity).
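For readers who want to list strongly correlated pairs instead of reading them off a heatmap, here is a small sketch. The data is synthetic: we deliberately build O2, P8 and FC6 from a shared component so that they correlate, and the 0.8 threshold is an arbitrary choice for the example, not a value used in this project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 5000
shared = rng.normal(0, 1, n)   # common component that induces correlation

# Synthetic channels: O2, P8 and FC6 share a component, F3 is independent.
df = pd.DataFrame({
    "O2": -110 + 10 * shared + rng.normal(0, 3, n),
    "P8": -105 + 10 * shared + rng.normal(0, 3, n),
    "FC6": -115 + 10 * shared + rng.normal(0, 5, n),
    "F3": rng.normal(-120, 10, n),
})

corr = df.corr()   # Pearson correlation by default

# Keep only the upper triangle so that every pair appears exactly once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

# Pairs whose absolute correlation exceeds the (arbitrary) threshold
strong_pairs = pairs[pairs.abs() > 0.8]

print(corr.round(2))
print(strong_pairs.round(2))
```

The same steps applied to the quality columns would show whether any pair of quality measures moves together.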

When we look at the correlations between the quality variables, all of them are close to zero, meaning the qualities are far less correlated than most of their corresponding feature values. We can therefore assume that the measurement quality of one feature does not depend on the measurement quality of another.

References:

[1] V. Gupta, M. Agarwal, M. Arora, T. Chakraborty, R. Singh, M. Vatsa. Bag-of-Lies: A Multimodal Dataset for Deception Detection. IEEE Conference on Computer Vision and Pattern Recognition Workshop on Challenges and Opportunities for Privacy and Security, 2019.
