Srijit Mukherjee

Exploratory Data Analysis (EDA) in R — A Comprehensive Guide

Humans are visual animals. EDA therefore plays a major role in helping you understand data and make better decisions.

This EDA course in R, which grew out of a course I taught here, will help you understand data from a visual perspective, which in turn will prompt you to make better data decisions. This is always the first step in understanding data.

EDA is not only about visualization, but visualization is an integral part of it. It is an adventure with the data. EDA helps you ask the right questions. Every statistical analysis starts with a question in your mind. EDA gives you a head start when you haven’t collected the data yourself and are likely to have no domain knowledge of the topic.

Let’s get started.

There are two broad categories of data:

Quantitative (Numerical)

Qualitative (Categorical / Factor)

Our Learning Path in EDA will move ahead like this.

Content

Q_UEDA — Quantitative Univariate EDA

Ql_UEDA — Qualitative Univariate EDA

Ql_Ql_EDA — Qualitative vs Qualitative EDA

Q_Ql_EDA — Quantitative vs Qualitative EDA

Q_Q_EDA — Quantitative vs Quantitative EDA

Data

We will use the ais dataset from the “DAAG” package in R. You can find the details of the data here.

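install.packages("DAAG")  # only needed once
library(DAAG)
data <- ais               # Australian athletes data shipped with DAAG
head(data, n = 3)         # first three rows
str(data)                 # column names and types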

Variables

(We are only interested in five for demonstration purposes)

  1. hg (hemoglobin concentration, in g per deciliter)
  2. ht (height, cm)
  3. wt (weight, kg)
  4. sex (a factor with levels f m)
  5. sport (a factor with levels B_Ball Field Gym Netball Row Swim T_400m T_Sprnt Tennis W_Polo)
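Every later section depends on whether a variable is quantitative or qualitative, so it can help to check this programmatically before plotting. The snippet below is a minimal sketch, assuming DAAG is installed; the names vars and var_type are introduced here for illustration only.

```r
library(DAAG)

# The five variables used in this guide
vars <- c("hg", "ht", "wt", "sex", "sport")

# Label each column by how R stores it: numeric -> quantitative, otherwise qualitative
var_type <- sapply(ais[, vars], function(x) {
  if (is.numeric(x)) "quantitative" else "qualitative"
})
print(var_type)

# Count missing values per column before doing any EDA
print(colSums(is.na(ais[, vars])))
```

If a numeric column actually encodes categories, convert it with factor() first so that it is treated as qualitative.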

Q_UEDA — Quantitative Univariate EDA

Our Univariate Quantitative Data: hg (hemoglobin concentration, in g per deciliter)

1. Summary

summary(data$hg)
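summary() reports the minimum, quartiles, mean and maximum, but no direct measure of spread. A short sketch of how one might add those, assuming DAAG is installed (the names hg and spread are illustrative):

```r
library(DAAG)

hg <- ais$hg

# Spread measures that summary() does not print
spread <- c(sd = sd(hg), iqr = IQR(hg), range = diff(range(hg)))
print(round(spread, 2))

# Selected percentiles, including the tails
print(quantile(hg, probs = c(0.05, 0.25, 0.50, 0.75, 0.95)))
```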

2. Box Plot

boxplot(data$hg, main = toupper("Boxplot of Hemoglobin concentration"), ylab = "Hemoglobin concentration", col = "red")
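To read the box plot numerically, the values behind it can be pulled out with boxplot.stats() from base R. This is a small sketch, assuming DAAG is installed; bs is just an illustrative name.

```r
library(DAAG)

# boxplot.stats() returns the numbers that boxplot() draws
bs <- boxplot.stats(ais$hg)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# Observations beyond the whiskers (drawn as individual points)
print(bs$out)
```

Points listed in bs$out are candidates for a closer look, not automatically errors.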

3. Histogram

hist(data$hg, xlab = "Hemoglobin concentration", probability = TRUE, main = "Histogram of Hemoglobin concentration")

4. Kernel Density

d <- density(data$hg)
plot(d, main = "Kernel density of Hemoglobin concentration", xlab = "Hemoglobin concentration")
polygon(d, col = "red", border = "blue")
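The histogram and the kernel density estimate describe the same distribution, so it can be useful to draw them on one set of axes. The sketch below assumes DAAG is installed; the choice of 20 breaks and adjust = 2 is arbitrary and only meant to show how the smoothing changes the picture.

```r
library(DAAG)

hg <- ais$hg

# Histogram on the density scale so it is comparable with the kernel estimate
h <- hist(hg, breaks = 20, probability = TRUE,
          main = "Histogram and kernel density of hemoglobin concentration",
          xlab = "Hemoglobin concentration")

# Default bandwidth, and a smoother curve with twice the default bandwidth
lines(density(hg), col = "blue", lwd = 2)
lines(density(hg, adjust = 2), col = "red", lwd = 2, lty = 2)

# Tick marks for the individual observations
rug(hg)

legend("topright", c("default bandwidth", "adjust = 2"),
       col = c("blue", "red"), lwd = 2, lty = c(1, 2))
```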

Ql_UEDA — Qualitative Univariate EDA

Our Univariate Qualitative Data: sport (a factor with levels B_Ball Field Gym Netball Row Swim T_400m T_Sprnt Tennis W_Polo )

1. Frequency Distribution Table

table(data$sport)
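Raw counts are often easier to compare once they are sorted and expressed as percentages. A minimal sketch, assuming DAAG is installed (counts, sorted and freq are illustrative names):

```r
library(DAAG)

counts <- table(ais$sport)

# Sort from most to least frequent and convert to percentages
sorted <- sort(counts, decreasing = TRUE)
freq <- data.frame(
  sport   = names(sorted),
  count   = as.vector(sorted),
  percent = round(100 * as.vector(prop.table(sorted)), 1)
)
print(freq)
```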

2. Vertical Bar Plot

barplot(table(data$sport), main="Count of participants in different sports for study", xlab="Sports",ylab="Count", border="red", col="blue", density=10)

3. Horizontal Bar Plot

barplot(table(data$sport), main="Count of participants in different sports for study", xlab="Count", ylab="Sports", col = "darkred", horiz = TRUE)
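A horizontal bar plot is usually easier to read when the bars are sorted and the category names are printed horizontally. The following is one possible variation, assuming DAAG is installed; the margin values are a guess that may need adjusting for your device.

```r
library(DAAG)

# Ascending sort puts the largest group at the top of a horizontal bar plot
counts <- sort(table(ais$sport))

op <- par(mar = c(5, 6, 4, 2))   # widen the left margin for the sport names
barplot(counts, horiz = TRUE, las = 1,
        main = "Count of participants by sport",
        xlab = "Count", col = "darkred")
par(op)                          # restore the previous margins
```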

4. Pie Chart

pie(table(data$sport), labels = levels(data$sport))

Ql_Ql_EDA — Qualitative vs Qualitative EDA

Our Qualitative vs Qualitative Data:

sex (a factor with levels f m )

sport (a factor with levels B_Ball Field Gym Netball Row Swim T_400m T_Sprnt Tennis W_Polo )

Central Idea:

We will compare the univariate EDA of one variable across the levels of the other.

  1. Frequency Table
  2. BarPlot

1. Contingency Table (Frequency Table Comparison)

sex_vs_sport <- data[, c("sex", "sport")]
table(sex_vs_sport)
# xtabs() builds the same table from a formula, which extends easily to more than two categorical variables.
xtabs(~ sex + sport, data = sex_vs_sport)
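Counts alone can mislead when the groups differ in size, so a contingency table is often read together with its margins and proportions. A minimal sketch using base R, assuming DAAG is installed (tab and row_share are illustrative names):

```r
library(DAAG)

tab <- table(ais$sex, ais$sport, dnn = c("sex", "sport"))

# Add row and column totals to the counts
print(addmargins(tab))

# Share of each sport within each sex (rows sum to 1)
row_share <- prop.table(tab, margin = 1)
print(round(row_share, 2))
```

Using margin = 2 instead gives the share of each sex within each sport.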

2. Vertical Bar Plot (Bar Plot Comparison)

barplot(table(sex_vs_sport),
        main = "Sports Participation Distribution by Sex",
        xlab = "Sport", ylab = "Count",
        col = c("red", "green"))
legend("topleft",
       c("Female", "Male"),
       fill = c("red", "green"))

3. Beside Bar Plot (Bar Plot Comparison)

barplot(table(sex_vs_sport),
        main = "Sports Participation Distribution by Sex",
        xlab = "Sport", ylab = "Count",
        col = c("red", "green"),
        beside = TRUE)
legend("topleft",
       c("Female", "Male"),
       fill = c("red", "green"))
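Because the sports have different numbers of athletes, a bar plot of proportions can make the sex composition easier to compare than raw counts. This is a sketch of one way to do it, assuming DAAG is installed; col_share is an illustrative name.

```r
library(DAAG)

tab <- table(ais$sex, ais$sport)

# Share of each sex within each sport (columns sum to 1)
col_share <- prop.table(tab, margin = 2)

barplot(col_share,
        main = "Share of female and male athletes by sport",
        xlab = "Sport", ylab = "Proportion",
        col = c("red", "green"), cex.names = 0.7)
legend("topright", c("Female", "Male"),
       fill = c("red", "green"), bg = "white")
```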

Q_Ql_EDA — Quantitative vs Qualitative EDA

Our Quantitative vs Qualitative Data:

hg (hemoglobin concentration, in g per deciliter)

sport (a factor with levels B_Ball Field Gym Netball Row Swim T_400m T_Sprnt Tennis W_Polo )

sex (a factor with levels f m )

Central Idea:

We will compare the univariate EDA of the quantitative variable across the levels of the qualitative variable.

  1. Summary Comparison
  2. Box Plot Comparison
  3. Kernel Density Comparison

1. Summary Comparison

hg_vs_sport <- data[, c("hg", "sport")]
hg_vs_sex <- data[, c("hg", "sex")]
# Summarise the data separately for each sex
by(hg_vs_sex, hg_vs_sex$sex, summary)
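The same comparison can be made across sports, and a compact table with one row per group is often easier to scan than the by() output. A minimal sketch in base R, assuming DAAG is installed (groups and hg_by_sport are illustrative names):

```r
library(DAAG)

# Split hg into one vector per sport
groups <- split(ais$hg, ais$sport)

# One row per sport: group size, mean, median and standard deviation of hg
hg_by_sport <- t(sapply(groups, function(x) {
  c(n = length(x), mean = mean(x), median = median(x), sd = sd(x))
}))
print(round(hg_by_sport, 2))
```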

2. Box Plot Comparison

boxplot(hg ~ sport,
        data = data,
        main = "Different boxplots for each sport",
        xlab = "Sport",
        ylab = "Hemoglobin concentration",
        col = "orange",
        border = "brown")
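With ten sports, the boxes are easier to compare when they are ordered by their medians instead of alphabetically. The sketch below uses reorder() from base R and assumes DAAG is installed; sport_by_median is an illustrative name.

```r
library(DAAG)

# Re-level sport so that the boxes appear in order of increasing median hg
sport_by_median <- reorder(ais$sport, ais$hg, FUN = median)

boxplot(ais$hg ~ sport_by_median,
        main = "Hemoglobin concentration by sport, ordered by median",
        xlab = "Sport", ylab = "Hemoglobin concentration",
        col = "orange", border = "brown", cex.axis = 0.7)
```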

3.1 Kernel Density Comparison ( for Sport )

library(ggplot2)
ggplot(hg_vs_sport, aes(hg, fill = sport)) + geom_density(alpha = 0.2)

3.2 Kernel Density Comparison ( for Sex )

library(ggplot2)
ggplot(hg_vs_sex, aes(hg, fill = sex)) + geom_density(alpha = 0.2)
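Ten overlapping density curves can be hard to tell apart, so one option is to give each sport its own panel and overlay only the two sexes inside it. This is a sketch using ggplot2's facet_wrap(), assuming both DAAG and ggplot2 are installed; the labels are my own choice.

```r
library(DAAG)
library(ggplot2)

# One panel per sport, with the two sexes overlaid inside each panel
p <- ggplot(ais, aes(x = hg, fill = sex)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ sport) +
  labs(x = "Hemoglobin concentration", y = "Density", fill = "Sex")
print(p)
```

Some sports in this data include athletes of only one sex, so those panels show a single curve.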

Q_Q_EDA — Quantitative vs Quantitative EDA

Our Quantitative vs Quantitative Data:

ht (height, cm)

wt (weight, kg)

1. Scatter Plot

plot(wt ~ ht, data = data,
     xlab = "Height", ylab = "Weight",
     main = "Scatter Plot")
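A scatter plot is usually read together with a numerical measure of association. The sketch below adds the Pearson correlation and a least-squares line using base R only, assuming DAAG is installed (r and fit are illustrative names):

```r
library(DAAG)

# Pearson correlation between height and weight
r <- cor(ais$ht, ais$wt)
print(round(r, 2))

# Least-squares line of weight on height
fit <- lm(wt ~ ht, data = ais)

plot(wt ~ ht, data = ais,
     xlab = "Height (cm)", ylab = "Weight (kg)",
     main = "Scatter plot with least-squares line")
abline(fit, col = "red", lwd = 2)
```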

2. Enhanced Scatter Plot

library(car)  # install.packages("car") if needed
scatterplot(wt ~ ht, data = data,
            ylab = "Weight", xlab = "Height",
            main = "Enhanced Scatter Plot")
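The ideas from the earlier sections can be combined by colouring the points of a scatter plot by a qualitative variable. A minimal base R sketch, assuming DAAG is installed; the colour mapping sex_cols is an illustrative choice that matches the bar plots above.

```r
library(DAAG)

# Map each level of sex to a colour
sex_cols <- c(f = "red", m = "green")

plot(ais$ht, ais$wt,
     col = sex_cols[as.character(ais$sex)], pch = 19,
     xlab = "Height (cm)", ylab = "Weight (kg)",
     main = "Height vs weight by sex")
legend("topleft", c("Female", "Male"), col = sex_cols, pch = 19)
```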

Thank You for stopping by.

If you liked this article and think it will be helpful to others, please clap and share, so that the Medium algorithm can bring it to people who have just started their journey in data science.

Srijit Mukherjee.

Thanks, Subhrajyotyroy, for your valuable suggestions.
