# Exploratory Data Analysis (EDA) in R — A Comprehensive Guide

Humans are visual animals. EDA therefore plays a major role in helping you understand data and make better decisions.

This EDA in R course, which grew out of the course I taught here, will help you understand data from a visual perspective, which in turn will prompt you to make better data decisions. EDA is always the first step in understanding data.

EDA is not only about visualization, but visualization is an integral part of it. It is an adventure with the data. EDA helps you ask the right questions. Every statistical analysis starts with a question in your mind. EDA gives you a head start when you haven’t collected the data yourself and are likely to have no domain knowledge of the topic.

Let’s get started.

There are two broad categories of data:

## Quantitative (Numeric)

## Qualitative (Categorical / Factor)

Our Learning Path in EDA will move ahead like this.

# Content

Q_UEDA — Quantitative Univariate EDA

Ql_UEDA — Qualitative Univariate EDA

Ql_Ql_EDA — Qualitative vs Qualitative EDA

Q_Ql_EDA — Quantitative vs Qualitative EDA

Q_Q_EDA — Quantitative vs Quantitative EDA

# Data

We will use the `ais` data set from the “DAAG” package in R. You can find the details of the data here.

`install.packages("DAAG")  # run once`
`library(DAAG)`
`data <- ais`
`head(data, n = 3)`
`str(data)`

# Variables

(We are only interested in five for demonstration purposes)

1. hg (hemoglobin concentration, in g per deciliter)
2. ht (height, cm)
3. wt (weight, kg)
4. sex (a factor with levels `f` `m`)
5. sport (a factor with levels `B_Ball` `Field` `Gym` `Netball` `Row` `Swim` `T_400m` `T_Sprnt` `Tennis` `W_Polo`)

# Q_UEDA — Quantitative Univariate EDA

Our Univariate Quantitative Data: hg (hemoglobin concentration, in g per deciliter)

# 1. Summary

`summary(data$hg)`
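`summary()` reports location (minimum, quartiles, mean, maximum) but says little about spread. A minimal sketch of a few complementary base R statistics follows; the vector `hg_demo` holds made-up values for illustration only and is not taken from `ais`.

```r
# Made-up hemoglobin-like values, for illustration only (not from ais)
hg_demo <- c(12.1, 13.4, 13.9, 14.5, 14.8, 15.2, 15.9, 16.3)

sd(hg_demo)                                    # standard deviation
IQR(hg_demo)                                   # interquartile range (Q3 - Q1)
quantile(hg_demo, probs = c(0.05, 0.5, 0.95))  # selected percentiles
```

With the article's data loaded, the same calls apply to `data$hg` in place of `hg_demo`.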

# 2. Box Plot

`boxplot(data$hg, main = toupper("Boxplot of Hemoglobin concentration"), ylab = "Hemoglobin concentration", col = "red")`
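A box plot draws points beyond the whiskers individually, but it does not tell you their values. As a sketch, base R's `boxplot.stats()` returns the five numbers behind the box and the list of flagged points without drawing anything; `hg_demo` is again a made-up vector, not the `ais` data.

```r
# Made-up hemoglobin-like values, for illustration only (not from ais)
hg_demo <- c(12.1, 13.4, 13.9, 14.5, 14.8, 15.2, 15.9, 19.2)

bs <- boxplot.stats(hg_demo)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values lying more than 1.5 * IQR beyond the hinges
```

With the article's data loaded, `boxplot.stats(data$hg)$out` lists the hemoglobin values that the box plot shows as separate points.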

# 3. Histogram

`hist(data$hg, xlab = "Hemoglobin concentration", probability = TRUE, main = "Histogram of Hemoglobin concentration")`
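The `probability = TRUE` argument puts the histogram on the density scale, so the bar areas add up to 1 instead of the bar heights adding up to the sample size. The sketch below checks this on a made-up vector (`hg_demo` is illustrative, not the `ais` data), using `plot = FALSE` so that only the bin computation runs.

```r
# Made-up hemoglobin-like values, for illustration only (not from ais)
hg_demo <- c(12.1, 13.4, 13.9, 14.5, 14.8, 15.2, 15.9, 16.3)

h <- hist(hg_demo, plot = FALSE)       # compute the bins without drawing
bar_area <- sum(h$density * diff(h$breaks))
bar_area                               # total bar area on the density scale
```

Because the scale matches that of a kernel density estimate, calling `lines(density(data$hg))` right after the `hist()` call above overlays a smooth curve on the same axes.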

# 4. Kernel Density

`d <- density(data$hg)`
`plot(d, main = "Kernel density of Hemoglobin concentration", xlab = "Hemoglobin concentration")`
`polygon(d, col = "red", border = "blue")`
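How smooth the kernel density looks depends on its bandwidth. As a sketch, the `adjust` argument of `density()` scales the default bandwidth: values above 1 smooth more, values below 1 smooth less. The vector `hg_demo` is made up for illustration and is not the `ais` data.

```r
# Made-up hemoglobin-like values, for illustration only (not from ais)
hg_demo <- c(12.1, 13.4, 13.9, 14.5, 14.8, 15.2, 15.9, 16.3)

d_default <- density(hg_demo)                # default bandwidth rule
d_smooth  <- density(hg_demo, adjust = 2)    # twice the default bandwidth
d_rough   <- density(hg_demo, adjust = 0.5)  # half the default bandwidth

c(default = d_default$bw, smooth = d_smooth$bw, rough = d_rough$bw)
```

With the article's data loaded, plotting `density(data$hg, adjust = 0.5)` and `density(data$hg, adjust = 2)` side by side shows how much of the shape depends on this choice.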

# Ql_UEDA — Qualitative Univariate EDA

Our Univariate Qualitative Data: sport (a factor with levels `B_Ball` `Field` `Gym` `Netball` `Row` `Swim` `T_400m` `T_Sprnt` `Tennis` `W_Polo` )

# 1. Frequency Distribution Table

`table(data$sport)`
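Raw counts are easier to compare once they are turned into shares. As a small sketch, `prop.table()` converts a frequency table into proportions and `sort()` orders the categories; the `sport_demo` factor below is made up for illustration and is not the `ais` data.

```r
# Made-up sport labels, for illustration only (not from ais)
sport_demo <- factor(c("Row", "Swim", "Row", "Gym", "Row", "Swim"))

counts <- table(sport_demo)
shares <- round(100 * prop.table(counts), 1)  # percentage of observations per sport

sort(counts, decreasing = TRUE)  # most frequent category first
shares
```

With the article's data loaded, `prop.table(table(data$sport))` gives the share of athletes in each sport.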

# 2. Vertical Bar Plot

`barplot(table(data$sport), main = "Count of participants in different sports for study", xlab = "Sports", ylab = "Count", border = "red", col = "blue", density = 10)`

# 3. Horizontal Bar Plot

`barplot(table(data$sport), main = "Count of participants in different sports for study", xlab = "Count", ylab = "Sports", col = "darkred", horiz = TRUE)`

# 4. Pie Chart

`pie(table(data$sport), labels = levels(data$sport))`

# Ql_Ql_EDA — Qualitative vs Qualitative EDA

Our Qualitative vs Qualitative Data:

sex (a factor with levels `f` `m` )

sport (a factor with levels `B_Ball` `Field` `Gym` `Netball` `Row` `Swim` `T_400m` `T_Sprnt` `Tennis` `W_Polo` )

Central Idea:

We will compare the corresponding Univariate EDA.

1. Frequency Table
2. Bar Plot

# 1. Contingency Table (Frequency Table Comparison)

`sex_vs_sport <- data[, c("sex", "sport")]  # same as data[, 12:13]`
`table(sex_vs_sport)`
`xtabs(~ sex + sport, sex_vs_sport)  # gives the same table; the formula extends to more than two categorical variables`
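A contingency table of raw counts is hard to read when the groups differ in size. A minimal sketch, assuming a small made-up data frame `demo` (not the `ais` data): `addmargins()` appends row and column totals, and `prop.table()` with `margin = 1` turns each row into proportions.

```r
# Made-up sex and sport labels, for illustration only (not from ais)
demo <- data.frame(
  sex   = factor(c("f", "f", "f", "m", "m", "m", "m", "f")),
  sport = factor(c("Row", "Swim", "Row", "Row", "Swim", "Gym", "Row", "Gym"))
)

tab <- table(demo)                        # sex in rows, sport in columns
addmargins(tab)                           # counts with row and column totals
row_share <- prop.table(tab, margin = 1)  # each row sums to 1
round(row_share, 2)
```

With the article's data loaded, `prop.table(table(sex_vs_sport), margin = 1)` shows how the athletes of each sex are distributed across sports.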

# 2. Vertical Bar Plot (Bar Plot Comparison)

`barplot(table(sex_vs_sport), main = "Sports Participation Distribution by Sex", xlab = "Sport", col = c("red", "green"))`
`legend("topleft", c("Female", "Male"), fill = c("red", "green"))`

# 3. Beside Bar Plot (Bar Plot Comparison)

`barplot(table(sex_vs_sport), main = "Sports Participation Distribution by Sex", xlab = "Sport", col = c("red", "green"), beside = TRUE)`
`legend("topleft", c("Female", "Male"), fill = c("red", "green"))`

# Q_Ql_EDA — Quantitative vs Qualitative EDA

Our Quantitative vs Qualitative Data:

hg (hemoglobin concentration, in g per deciliter)

sport (a factor with levels `B_Ball` `Field` `Gym` `Netball` `Row` `Swim` `T_400m` `T_Sprnt` `Tennis` `W_Polo` )

sex (a factor with levels `f` `m` )

Central Idea:

We will compare the corresponding Univariate EDA w.r.t the Qualitative Data.

1. Summary Comparison
2. Box Plot Comparison
3. Kernel Density Comparison

# 1. Summary Comparison

`hg_vs_sport <- data[, c("hg", "sport")]  # same as data[, c(4, 13)]`
`hg_vs_sex <- data[, c("hg", "sex")]  # same as data[, c(4, 12)]`
`by(hg_vs_sex, hg_vs_sex$sex, summary)`
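`by()` prints a full summary per group. When a single statistic per group is enough, `tapply()` and `aggregate()` give a more compact result. The sketch below uses a made-up data frame `demo` for illustration; it is not the `ais` data.

```r
# Made-up hemoglobin-like values by sex, for illustration only (not from ais)
demo <- data.frame(
  hg  = c(13.1, 13.8, 14.0, 15.2, 15.9, 16.1),
  sex = factor(c("f", "f", "f", "m", "m", "m"))
)

med <- tapply(demo$hg, demo$sex, median)  # one median per level of sex
med

# Several statistics per group in a single data frame
agg <- aggregate(hg ~ sex, data = demo,
                 FUN = function(x) c(mean = mean(x), sd = sd(x)))
agg
```

With the article's data loaded, `tapply(data$hg, data$sport, median)` gives the median hemoglobin concentration for each sport.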

# 2. Box Plot Comparison

`boxplot(hg ~ sport, data = data, main = "Different boxplots for each sport", xlab = "Sport", ylab = "Hemoglobin concentration", col = "orange", border = "brown")`

# 3.1 Kernel Density Comparison (for Sport)

`library(ggplot2)`
`ggplot(hg_vs_sport, aes(hg, fill = sport)) + geom_density(alpha = 0.2)`

# 3.2 Kernel Density Comparison (for Sex)

`library(ggplot2)`
`ggplot(hg_vs_sex, aes(hg, fill = sex)) + geom_density(alpha = 0.2)`

# Q_Q_EDA — Quantitative vs Quantitative EDA

Our Quantitative vs Quantitative Data:

ht (height, cm)

wt (weight, kg)

# 1. Plot

`plot(wt ~ ht, data = data, xlab = "Height", ylab = "Weight", main = "Scatter Plot")`

# 2. Scatter Plot

`library(car)`
`scatterplot(wt ~ ht, data = data, ylab = "Weight", xlab = "Height", main = "Enhanced Scatter Plot")`
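A scatter plot shows the direction of a relationship; a correlation coefficient and a fitted line put numbers on it. A minimal sketch with base R's `cor()` and `lm()` follows, using a made-up data frame `demo` for illustration (not the `ais` data).

```r
# Made-up heights (cm) and weights (kg), for illustration only (not from ais)
demo <- data.frame(
  ht = c(160, 168, 172, 180, 185, 193),
  wt = c(52, 60, 66, 74, 80, 91)
)

r <- cor(demo$ht, demo$wt)       # Pearson correlation, between -1 and 1
fit <- lm(wt ~ ht, data = demo)  # least-squares line: wt = intercept + slope * ht

r
coef(fit)
```

With the article's data loaded, `abline(lm(wt ~ ht, data = data))` adds the fitted line to the base `plot()` shown above.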

Thank you for stopping by.

If you liked this article and think it will be helpful to others, please clap and share it, so that the Medium algorithm can bring it to people who have just started their journey in data science.

Srijit Mukherjee.

Thanks, Subhrajyotyroy, for your valuable suggestions.
