# Unveiling DBDP’s new Exploratory Data Analysis Module

The Digital Biomarker Discovery Pipeline (DBDP) is an open-source platform that provides accessible code for analyzing wearable health data. This walkthrough covers the Exploratory Data Analysis module, which we added this summer.

You can follow along with this tutorial by cloning the GitHub repository here. (Here’s a tutorial on cloning a GitHub repository.)

# What is Exploratory Data Analysis?

Exploratory data analysis (EDA) is the first step in the data analysis process¹. It can be defined as a process “to analyze and investigate data sets and summarize their main characteristics.”² EDA involves using a number of tools to get a picture of what’s happening within a dataset; more explanation of EDA can be found here. When done right, EDA can highlight new aspects of a dataset, inform researchers where to focus their effort, and lead to new conclusions. EDA can be divided into two categories: univariate and multivariate. Univariate EDA focuses on one variable within your dataset at a time, while multivariate EDA looks at the relationships between variables. When splitting a dataset into multiple folds of training data, EDA can be performed on each fold of the training set to ensure a similar data distribution across all folds.

# What does the DBDP EDA Module look like?

We’ll go through what our EDA module contains, its potential utility for a researcher, and an example of the module being used to analyze a wearable device dataset. This walkthrough does require some basic Python programming experience, but you can check out this resource for an introduction to Python for data science and some of the libraries we’ll be using.

The EDA module on DBDP currently has seven categories of functions:

- Preliminary EDA
- Generation of covariance matrices
- Missing value analysis
- Distribution visualization
- Data visualizations
- Clustering
- Principal components analysis

Each category provides researchers with methods to understand their data and get a sense of its characteristics through various lenses. This could be considered analogous to looking at a painting by standing back 20 feet, and then coming closer and looking again at the colors, then looking again at the shapes, then again at the brush strokes.

# Preliminary EDA

Preliminary EDA gives the most basic overview of your data. We will use methods from the Pandas library, an open-source Python library focused on data analysis and manipulation, to obtain different statistical descriptions of our dataset. When you’re first importing a data file into your Jupyter Notebook, Pandas can be a good place to start.

Let’s look at an example of performing preliminary EDA on a reference dataset. The reference dataset that we’re using is the STEP dataset from the DBDP’s digital health data repository (DHDR). The STEP data includes heart rate and skin tone data collected from 7 different wearable devices with each row of data categorized by activity type. This dataset was initially collected to investigate inaccuracies between wearable devices and includes data taken from common wearables like Fitbit and Apple Watch. The results of this study can be found here.

The first step is to read in our dataset. Replace the ‘filename’ variable with the path to the dataset you’re using. It should look something like this:
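A minimal sketch of the read-in step; the in-memory CSV and its column names below are synthetic stand-ins, and in practice `filename` would point at your copy of the STEP file:

```python
import io

import pandas as pd

# Synthetic stand-in for the STEP csv so this sketch runs end to end;
# normally 'filename' would be the path to your copy of the dataset.
csv_text = "ECG,Apple Watch,Skin Type\n72.0,70.2,3\n88.5,87.9,1\n95.1,96.3,5\n"
filename = io.StringIO(csv_text)  # e.g. replace with "data/step.csv"

df = pd.read_csv(filename)
print(df.head())
```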

From here, using the remainder of the Preliminary EDA section is straightforward.

Next we use the built-in len() function and the Pandas methods .info() and .describe() to get some basic information about our dataset:
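On a tiny synthetic frame (the columns here are illustrative, not the real STEP schema), those calls look like this:

```python
import pandas as pd

# Synthetic stand-in for the STEP data (column names are assumptions).
df = pd.DataFrame({
    "ECG": [72.0, 88.5, 95.1, 101.4],
    "Apple Watch": [70.2, 87.9, 96.3, 103.0],
    "Activity": ["Rest", "Rest", "Type", "Walk"],
})

n_rows = len(df)          # total number of rows
df.info()                 # column names, dtypes, non-null counts
summary = df.describe()   # mean, std, quartiles for numeric columns
print(n_rows)
print(summary)
```

Note that .info() prints its report directly rather than returning it, which is why it isn’t assigned to a variable.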

In this case, len() tells us that our dataset has over 230,000 rows. Using .info(), we learn that our data has 10 different columns: 7 are floats, 2 are objects, and 1 is an int (check this out for a refresher on data types). We use .describe() to obtain common statistical summaries (e.g., mean, standard deviation) of the float and int columns within our dataset.

The Preliminary EDA sub-module also provides examples of the generation of correlation plots using the seaborn library. Correlation plots indicate the level of correspondence between different variables in a dataset. The correlation coefficient is a measure of the linear relationship between two variables. It should be noted that correlation plots can only be run on numerical columns. Let’s run the correlation plot code chunk on our STEP dataset:
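A sketch of the correlation-plot chunk with synthetic data in place of STEP (the column names and the correlated/uncorrelated structure are assumptions made for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Two heart-rate columns that track each other, plus an unrelated skin-tone
# column; all values are synthetic stand-ins for the STEP data.
rng = np.random.default_rng(0)
ecg = rng.normal(90, 15, 200)
df = pd.DataFrame({
    "ECG": ecg,
    "Apple Watch": ecg + rng.normal(0, 3, 200),  # closely follows ECG
    "Skin Tone": rng.integers(1, 7, 200),        # independent of heart rate
})

corr = df.corr()  # pairwise correlation coefficients (numeric columns only)
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.close("all")
```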

This tells us that some variables are highly correlated, like the ECG and Apple Watch data (correlation coefficients close to 1, see figure legend of correlation coefficient heatmap), and some aren’t very correlated at all, like the ECG and Skin Tone data (correlation coefficients close to 0). This should make sense. The ECG and Apple Watch columns of our dataset are measuring the heart rate of the same participant at the same time, so we can expect them to be correlated. The Skin Tone column, however, is a number 1–6 indicating the skin tone of the participant. This column should have no correlation with heart rate.

You’ll also notice there’s no correlation data between Miband and Apple Watch. This is a result of missingness in our dataset, which we’ll explore later in this walkthrough.

Scatterplots are another method to get insight into the correlation between different variables. Here we compare our six other wearable devices to the ECG data in order to assess their accuracy:
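One way to sketch those device-versus-ECG panels; three hypothetical devices are shown here, and the names and values are synthetic:

```python
import matplotlib.pyplot as plt
import numpy as np

# Each wearable gets its own panel, plotted against the ECG reference.
rng = np.random.default_rng(1)
ecg = rng.normal(90, 15, 150)
devices = {
    "Apple Watch": ecg + rng.normal(0, 3, 150),
    "Garmin": ecg + rng.normal(0, 6, 150),
    "Fitbit": ecg + rng.normal(0, 8, 150),
}

fig, axes = plt.subplots(1, len(devices), figsize=(12, 4), sharey=True)
for ax, (name, values) in zip(axes, devices.items()):
    ax.scatter(ecg, values, s=8, alpha=0.5)
    ax.set_xlabel("ECG (BPM)")
    ax.set_title(name)
axes[0].set_ylabel("Device heart rate (BPM)")
plt.close(fig)
```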

# Covariance matrices

Covariance is a measure of how much two variables vary together. While correlation is a standardized measure of how closely two variables are related, covariance captures the direction of the relationship in the variables’ original units; it isn’t standardized, so its magnitude depends on those units. This can give you a sense of what interesting relationships may exist in your data that you can explore further.

Let’s take a look at the covariance that exists within our example STEP dataset. To run the covariance method, drop all columns that are non-numeric. It should look something like this:

This code chunk utilizes the StandardScaler() function from the sklearn library to standardize our data to a normal distribution. This process allows easy comparison between the covariance values in our data.
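A sketch of that chunk, with synthetic numeric columns in place of the STEP data (Apple Watch is simulated to track ECG, Fitbit to be independent):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic numeric stand-ins for the STEP columns.
rng = np.random.default_rng(2)
ecg = rng.normal(90, 15, 200)
df = pd.DataFrame({
    "ECG": ecg,
    "Apple Watch": ecg + rng.normal(0, 3, 200),
    "Fitbit": rng.normal(90, 15, 200),
})

# Standardize each column to mean 0 / variance 1, then compute the covariance
# matrix; after scaling, covariances are comparable across variable pairs.
scaled = StandardScaler().fit_transform(df)
cov = pd.DataFrame(np.cov(scaled, rowvar=False),
                   index=df.columns, columns=df.columns)
print(cov.round(2))
```

Once the columns are standardized, the covariance matrix matches the correlation matrix up to a small sample-size factor, which is what makes the values easy to compare.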

One conclusion we can draw from this is that the ECG and Apple Watch data often vary together (as indicated by higher values), while the ECG and Fitbit data rarely vary together (as indicated by values closer to 0). If we’re viewing the ECG heart rate as the true heart rate, then the covariance matrix indicates that the Apple Watch does a better job tracking heart rate fluctuation than the Fitbit does.

# Missing Value Analysis

Missing Value Analysis can tell a researcher where potential gaps may exist in their dataset. This can be important when assessing the completeness of a dataset. The missing value analysis section of the EDA repository utilizes the ‘missingno’ package.

Let’s look at how we can use the missing value analysis sub-module to analyze our STEP dataset.
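The sub-module draws its figure with missingno’s msno.matrix(df); the pandas-only sketch below computes the same per-column missingness on a synthetic frame (the gap locations are invented):

```python
import numpy as np
import pandas as pd

# Synthetic frame with deliberate gaps standing in for the STEP data.
df = pd.DataFrame({
    "ECG": [72.0, 88.5, 95.1, 101.4, 84.0],          # complete column
    "Apple Watch": [70.2, np.nan, 96.3, np.nan, 85.5],
    "Fitbit": [np.nan, np.nan, 94.0, 100.2, np.nan],
})

missing_counts = df.isna().sum()      # number of gaps per column
missing_fraction = df.isna().mean()   # share of observations missing
print(missing_counts)
print(missing_fraction.round(2))
```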

In this figure, the gray represents data that is present and the white represents data that is missing. The columns are the variables, and the rows are the observations of the data (indexed by time). From this figure, we can see that our ECG data is relatively complete across all participants, while our Apple Watch, Garmin, Fitbit, and Miband data have significant gaps. In this example, assessing the amount of white space in a column allows researchers to see how frequently a wearable health device measures heart rate.

# Plot Distribution

The plot distribution sub-module gives researchers a sense of the balance of different variables within their dataset. The EDA Module has methods that allow a researcher to determine whether the dataset is imbalanced, plot the distributions of the values of each variable, and perform outlier analysis.

Let’s use this sub-module to visualize the distribution of our ‘Skin Type’ column. By setting y=‘Skin Type’ we get the following output:
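A sketch of that call, assuming seaborn’s countplot and a small synthetic ‘Skin Type’ column:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Synthetic skin-type values standing in for the STEP column.
df = pd.DataFrame({"Skin Type": [3, 3, 3, 2, 2, 4, 1]})

ax = sns.countplot(y="Skin Type", data=df)  # one horizontal bar per type
counts = df["Skin Type"].value_counts()     # the numbers behind the bars
plt.close("all")
```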

Here we can see that skin type 1 was the least represented while skin type 3 was the most represented. This can give researchers a sense of any potential bias that may exist in their dataset.

We can also break down our analysis of variable distribution by columns. By filling in the columns we want to separate our data by and the variable we want to analyze, we get the following distributions:
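One way to sketch that split, assuming a hypothetical ‘Activity’ column to group by:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: resting heart rates sit lower than walking heart rates.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "ECG": np.concatenate([rng.normal(75, 5, 100), rng.normal(105, 8, 100)]),
    "Activity": ["Rest"] * 100 + ["Walk"] * 100,
})

axes = df["ECG"].hist(by=df["Activity"])   # one histogram per Activity value
group_means = df.groupby("Activity")["ECG"].mean()
plt.close("all")
```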

# Data Visualizations

Data visualization is the largest EDA sub-module. It allows researchers to understand their data in many ways by providing different visual representations of their dataset. The visualizations can be adjusted to better fit a particular dataset and produce the most useful views, and they utilize the matplotlib library. The data visualization sub-module contains seven different visualization tools that a researcher can use to understand their wearable data. Let’s go through them with our STEP dataset:

**Histogram**

Suppose you want to understand what values exist in your dataset and how often they appear. Functions like .value_counts() can provide a numerical description; however, a visual representation can be useful if the dataset contains a large number of unique values and you wish to get a sense of their distribution. The histogram visualization helps a researcher understand the frequency of values of a variable in a dataset. By changing the column of the dataset that is selected, we can change which variable’s frequency is being visualized. Here, we visualize the frequency of the Apple Watch column. The heart rate measurement from the Apple Watch shows a roughly symmetric, bimodal distribution, with one peak around 80 BPM and another around 100 BPM. This is likely a result of different activity types among participants: the data clustered around 80 BPM may be from when participants were resting, while the data clustered around 100 BPM may be from when they were active.
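A sketch of the histogram call on a synthetic bimodal column shaped like that pattern (the peaks near 80 and 100 BPM are simulated):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulate resting (~80 BPM) and active (~100 BPM) readings.
rng = np.random.default_rng(4)
hr = np.concatenate([rng.normal(80, 4, 300), rng.normal(100, 4, 300)])
df = pd.DataFrame({"Apple Watch": hr})

ax = df["Apple Watch"].plot(kind="hist", bins=30,
                            title="Apple Watch heart rate")
ax.set_xlabel("BPM")
plt.close("all")
```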

**Box Plot**

The box plot visualization helps visualize the statistical distribution (center, spread, and skewness) of a dataset. Here, we visualize 8 different columns within our dataset. This visualization shows us that the ECG data has the largest heart rate spread, while the Miband data has the smallest.
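A sketch with three synthetic columns whose spreads differ on purpose (in practice you would pass your dataset’s numeric columns):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic columns: ECG has the widest spread, Miband the narrowest.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "ECG": rng.normal(90, 20, 200),
    "Apple Watch": rng.normal(90, 10, 200),
    "Miband": rng.normal(90, 4, 200),
})

ax = df.plot(kind="box")   # one box per column: median, quartiles, outliers
spreads = df.std()
plt.close("all")
```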

**Leaf Plot**

The leaf plot (stem-and-leaf) visualization can help a researcher understand how a dataset is distributed by grouping its values into numerical bins while keeping each individual value visible. This can be useful when looking for spikes within your dataset. Here, we use leaf plots to categorize our data into rest vs. active periods.
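Neither matplotlib nor pandas ships a stem-and-leaf function, so here is a tiny hand-rolled sketch of the idea on invented BPM readings (packages such as stemgraphic offer a fuller version):

```python
from collections import defaultdict

# Invented heart-rate readings for illustration.
values = [72, 75, 78, 81, 84, 84, 95, 97, 101, 103]

stems = defaultdict(list)
for v in sorted(values):
    stems[v // 10].append(v % 10)   # tens digit = stem, ones digit = leaf

lines = [f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}"
         for stem in sorted(stems)]
print("\n".join(lines))
```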

**Run Chart**

A run chart allows the researcher to see how their data changes with time and is useful for time series data. The run chart can be created by specifying the column that you want to plot with regard to time. Here, we see how the ECG data changes over time in the STEP dataset.
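A sketch assuming a time-indexed ECG series (the timestamps and values are synthetic):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic ECG series sampled once per second.
rng = np.random.default_rng(6)
times = pd.date_range("2019-01-01 09:00", periods=120, freq="s")
ecg = pd.Series(90 + 0.1 * rng.normal(0, 5, 120).cumsum(),
                index=times, name="ECG")

ax = ecg.plot(title="ECG over time")  # a run chart is simply value vs. time
ax.set_ylabel("BPM")
plt.close("all")
```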

**Scatter Plot**

The scatter plot visualization allows a researcher to understand how two variables are related. Let’s use the scatter plot method in the EDA module to see how the Apple Watch and ECG columns are related in our dataset. As you can see from the resulting figure, there is a positive correlation between the heart rate measured from the Apple Watch and the ECG.
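A sketch of that scatter plot on synthetic, positively correlated columns:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-ins: Apple Watch readings closely track ECG.
rng = np.random.default_rng(9)
ecg = rng.normal(90, 15, 150)
df = pd.DataFrame({"ECG": ecg,
                   "Apple Watch": ecg + rng.normal(0, 3, 150)})

ax = df.plot.scatter(x="ECG", y="Apple Watch")  # each point = one observation
r = df["ECG"].corr(df["Apple Watch"])           # quantifies the visual trend
plt.close("all")
```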

**Multivariate Chart**

The multivariate chart allows a researcher to understand how two variables are related while considering another categorical or numerical variable. Here’s an example of us looking at the association between wearables while distinguishing between skin tones. More data points plotted between two wearables indicates a stronger association.
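A sketch using seaborn’s hue parameter to fold in the categorical variable (the columns and values are synthetic stand-ins):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Two correlated wearable columns plus a categorical skin-tone column.
rng = np.random.default_rng(10)
ecg = rng.normal(90, 15, 150)
df = pd.DataFrame({
    "ECG": ecg,
    "Apple Watch": ecg + rng.normal(0, 3, 150),
    "Skin Tone": rng.integers(1, 7, 150),
})

# hue= colors each point by the third variable.
ax = sns.scatterplot(data=df, x="ECG", y="Apple Watch", hue="Skin Tone")
plt.close("all")
```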

**Bubble Chart**

The bubble chart visualization can help a researcher understand how two variables relate while considering a third numerical variable. Let’s use the bubble chart method in the EDA module to see how ECG and Apple Watch data is related while accounting for the participant’s skin type.
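A sketch where marker size carries the third variable (all values are synthetic, and the size scaling factor is arbitrary):

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic readings; skin type (1-6) sets the bubble size.
rng = np.random.default_rng(11)
ecg = rng.normal(90, 15, 100)
apple_watch = ecg + rng.normal(0, 3, 100)
skin_type = rng.integers(1, 7, 100)

sizes = skin_type * 20                    # scale categories into point sizes
plt.scatter(ecg, apple_watch, s=sizes, alpha=0.5)
plt.xlabel("ECG (BPM)")
plt.ylabel("Apple Watch (BPM)")
plt.close("all")
```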

# Clustering

There are multiple methods for clustering data in the EDA Module. This allows for data to be grouped based on shared attributes using unsupervised machine learning. Unsupervised machine learning is a branch of machine learning focused on categorizing and analyzing datasets lacking labels. Learn more about unsupervised machine learning here. The two types of data clustering that the EDA clustering sub-module contains are K-means clustering and hierarchical clustering.

Let’s look at a k-means clustering of the ECG data. K-means clustering groups data points into k subgroups by shared attributes. For more information about how to choose the number of clusters, check this out. To run this code, you need to specify how many clusters to group your data into. Here, I chose 3 to represent low, average, and elevated heart rate. We can see that, across different activity types, our data falls into different clusters:
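A sketch of a 3-cluster k-means on synthetic ECG readings with low, average, and elevated bands (the bands are simulated, not the STEP values):

```python
import numpy as np
from sklearn.cluster import KMeans

# Simulate three heart-rate bands: low, average, elevated.
rng = np.random.default_rng(7)
ecg = np.concatenate([rng.normal(60, 3, 50),
                      rng.normal(85, 3, 50),
                      rng.normal(120, 3, 50)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ecg.reshape(-1, 1))
labels = km.labels_                            # cluster id for every reading
centers = sorted(km.cluster_centers_.ravel())  # cluster centers, low to high
print([round(c) for c in centers])
```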

# Principal Component Analysis (PCA)

PCA allows researchers to get a sense of the dimensions (variables) that explain the greatest variance in their data. This can be useful when researchers are deciding which variables to focus their efforts on and when lowering the dimensionality of the data. The EDA module provides code for PCA that utilizes methods from the sklearn library and visualizations from matplotlib.

Using that same subsection of the STEP dataset with just the ECG columns, we can create a PCA plot. A 2-D PCA plot uses the two ‘principal components’ (the dimensions that explain the greatest variation in the data) to visualize high-dimensional data in fewer dimensions while maximizing the likelihood of seeing subgroups in the data. In the ECG data, we can see that PC1 separates the data into two groups (rest vs. active), and PC2 helps to distinguish the breathing activity points from the typing activity points.
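A sketch of the PCA step on synthetic two-group data (the group structure is simulated to mimic a rest-vs-active split):

```python
import numpy as np
from sklearn.decomposition import PCA

# Two synthetic groups in 4 dimensions, standing in for labelled ECG data.
rng = np.random.default_rng(8)
rest = rng.normal(75, 3, (100, 4))
active = rng.normal(110, 3, (100, 4))
X = np.vstack([rest, active])

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)                  # 2-D coordinates for plotting
explained = pca.explained_variance_ratio_  # variance captured by PC1, PC2
print(explained.round(3))
```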

The new Exploratory Data Analysis module from the DBDP gives researchers an accessible code outline to do preliminary analysis of their wearable health data. This can give researchers insight into the relationships that exist within their data, where to focus their research, and allow them to come to new conclusions. For more explanations and walkthroughs of the modules within the DBDP, check out our other blog posts.

I’d like to thank the Duke Big Ideas Lab for their support of the DBDP and Dr. Md Mobashir Hasan Shandhi, Dr. Jessilyn Dunn, Karnika Singh, and Hayoung for their support in writing and revising this walkthrough.

[1]: Komorowski, M., Marshall, D., Salciccioli, J., & Crutain, Y. (2016). Exploratory Data Analysis. doi:10.1007/978-3-319-43742-2_15

[2]: IBM Cloud Education (2020). Exploratory Data Analysis. https://www.ibm.com/cloud/learn/exploratory-data-analysis