Unveiling Hidden Insights: A Quick Start Guide to Exploratory Data Analysis (EDA)

Honzik J
8 min read · Mar 8, 2023


The three-stage exploratory data analysis

Exploratory Data Analysis is like rummaging through your fridge to see what ingredients you have to work with. Sometimes you find something unexpected and end up making a masterpiece, and other times you just make a sandwich. — Unknown

If you’re an aspiring data chef, you know how crucial it is to know your ingredients before you start cooking. Exploratory data analysis (EDA) is the secret sauce in the data science lifecycle: it helps uncover hidden patterns, anomalies, and relationships that can make the difference between success and failure in any data analysis project. But where do you start? What questions should you ask? Fear not: in this article we will explore the world of EDA, unveiling the key steps to follow and the crucial questions to ask. So let’s dive in!

The Three-Stage EDA:

There is no single correct way to carry out an EDA, as it is as much an art as a science. However, with practice, you can develop and refine your own methodology. To simplify the process, I have summarised the broad goals of EDA into the following stages:

  1. Understanding your data structure and quality
  2. Understanding the shape of your data
  3. Uncovering patterns and relationships in your data

Let’s dive into these in a little more detail.

Understand your Data Structure and Quality

EDA provides an opportunity to get familiar with the dataset you’ll be working with. This includes identifying the different features, checking for missing values, and assessing the overall quality of the data. Note that throughout this article I will use the term ‘features’ rather than ‘variables’; consider them interchangeable here and use whichever you are more comfortable with.

Understand the data structure

First, we observe the structure of our data from a high level, including the number of observations and features, their types, and how they are organised.
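In Python, a minimal sketch of this first look might be the following (assuming a pandas DataFrame loaded from a hypothetical data.csv, as in the later examples):

import pandas as pd

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Number of observations (rows) and features (columns)
print(data.shape)

# Feature names, data types and non-null counts
data.info()

# Preview the first few rows to see how the data is organised
print(data.head())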

Categorise features

Next, let’s categorise our features. This will inform how we visualise and process them; a quick sketch of one way to do this in pandas follows the list. We can define each feature as:

  • Categorical: These represent a set number of categories and can be either nominal or ordinal.
  • Continuous: Numeric data that can take an infinite range of values.
  • Other: You may have features such as free text. While these are technically categorical data, you may treat each value as an individual item rather than as a member of a category.
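As a rough first pass in pandas, you can split features by stored data type. This is only a heuristic sketch: a numeric column may still be categorical in meaning, so always sanity-check the result.

import pandas as pd

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Numeric columns are candidates for continuous features
continuous = data.select_dtypes(include="number").columns

# Object or category columns are candidates for categorical (or free-text) features
categorical = data.select_dtypes(include=["object", "category"]).columns

print(list(continuous))
print(list(categorical))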

This is also an excellent time to look at data types. If you have a categorical feature that is ordinal and represented by numbers, you’ll want to represent it with an integer rather than a float. Another common issue is numeric data stored in a string or character field.
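Both fixes are one-liners in pandas; this is a minimal sketch, and the column names (‘rating’, ‘amount’) are hypothetical:

import pandas as pd

# Load data from a CSV file
data = pd.read_csv("data.csv")

# An ordinal feature stored as a float (1.0, 2.0, 3.0) becomes an integer
data["rating"] = data["rating"].astype(int)

# Numeric data stored as strings ("42") becomes a proper numeric column;
# errors="coerce" turns unparseable values into NaN for later inspection
data["amount"] = pd.to_numeric(data["amount"], errors="coerce")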

Identify key features

Once we know the types of features we have, let’s identify the key features.
We can return to our problem formulation and business problem to assist with this.

With a supervised machine learning method, we will look into what we will use as our ‘target’. This feature represents the outcome we are trying to predict (also known as the ‘label’, ‘dependent variable’ or ‘response variable’). For example, if we are building a model to predict house prices, our target will be the price.

With an unsupervised method such as clustering, we can think about what features we’ll use as input to the clustering.

Assess data quality

Next, we identify missing values in each field and think about how we may handle them before analysis. Many machine learning models do not work with missing values, so we’ll need to deal with these in data pre-processing. It’s also a good idea to consider why they are missing, as this might be avoidable in the data collection step.
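In pandas, a quick way to assess this is to count the missing values per feature (hypothetical data.csv again):

import pandas as pd

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Count of missing values per feature
print(data.isna().sum())

# Percentage of missing values per feature
print((data.isna().mean() * 100).round(1))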

Understanding individual features — Univariate Analysis

Once we understand what features we have in the data and have an idea of how we might deal with them, we can start looking at the shape of the data they contain.

One of the primary tools we will use is visualisation. EDA often involves creating visualisations of the data to help identify patterns and relationships that may not be immediately apparent, such as correlations between variables, trends over time, and potential outliers.

Summary statistics

First, we will summarise the characteristics of individual features. We can create summary tables of the following descriptive statistics.

  • Central Tendency (mean, median, mode)
  • Variability (standard deviation, variance, range, interquartile range)

Here are code examples of summary statistics, first in Python:

import pandas as pd

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Get summary statistics
summary = data["column_name"].describe()

# Print summary statistics
print(summary)

And the equivalent in R:

# Load data from a CSV file
data <- read.csv("data.csv")

# Get summary statistics
summary <- summary(data$column_name)

# Print summary statistics
print(summary)

Plots

We can also produce plots of the distributions to get a visual feel for the features and understand:

  • Distribution type
  • Skewness

For categorical features, you can produce frequency bar charts (a sketch follows the examples below). For continuous features, you can plot a histogram.

Here is an example in Python:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data into a pandas DataFrame
data = pd.read_csv('data.csv')

# Plot a histogram with a KDE
sns.histplot(data, x='column_name', kde=True)
plt.title('Distribution of Data')
plt.xlabel('Column Name')
plt.ylabel('Frequency')

# Show the plot
plt.show()

And the equivalent in R:

# Load data from a CSV file
data <- read.csv("data.csv")

# Plot a histogram with a density overlay
library(ggplot2)
ggplot(data, aes(x=column_name)) +
  geom_histogram(aes(y=after_stat(density)), binwidth=0.5, color="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") +
  ggtitle("Distribution of Data") +
  xlab("Column Name") +
  ylab("Density")
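For a categorical feature, the equivalent frequency bar chart in Python might look like this (the column name is hypothetical):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Frequency bar chart of a categorical feature
sns.countplot(data=data, x="category_column")
plt.title("Frequency of Categories")
plt.xticks(rotation=45)

# Show the plot
plt.show()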

Identify outliers and anomalies

Outliers and anomalies in the data may require further investigation or correction. By understanding the distributions, we can more easily identify any anomalies.

Repeat this for each of your features one by one, and note any anomalies or noteworthy observations that may be useful in your analysis.
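A box plot is a handy visual for this, and the common 1.5 × IQR convention can be written directly in code. A minimal sketch (the column name is hypothetical, and the 1.5 × IQR rule is one convention, not the only one):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Box plot: points beyond the whiskers are candidate outliers
sns.boxplot(data=data, x="column_name")
plt.show()

# The same rule as code: flag values beyond 1.5 * IQR
q1, q3 = data["column_name"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = data[(data["column_name"] < q1 - 1.5 * iqr) |
                (data["column_name"] > q3 + 1.5 * iqr)]
print(outliers)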

Uncovering relationships in your data — Bivariate Analysis

After we have looked at our features in isolation, we can start looking at the relationships between pairs of features. This is also the stage that gives you a chance to generate hypotheses for further analysis.

Correlations between features and target

This helps us understand the relationship between each feature and the outcome we are predicting, and informs which features are important to include in our model (feature selection). An excellent way to do this is to calculate the correlation coefficient between your target and every other feature and plot the results as a bar chart.

Example in Python:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Calculate correlations between all pairs of numeric variables in the dataset
corr = data.corr(numeric_only=True)

# Get correlations for the target and sort in descending order,
# dropping the target's correlation with itself
corr = corr['target'].drop('target').sort_values(ascending=False)

# Create a barplot using the Seaborn library
sns.barplot(x=corr.index, y=corr)

# Rotate x-axis labels for readability
plt.xticks(rotation=90)

# Show the plot
plt.show()

And the equivalent in R:

# Load data from a CSV file
data <- read.csv("data.csv")

# Calculate correlations (all columns must be numeric)
corr <- cor(data)

# Get correlations for the target variable
target_corr <- corr["target", ]

# Sort correlations in descending order
target_corr <- sort(target_corr, decreasing=TRUE)

# Create a barplot
library(ggplot2)
ggplot(data.frame(variable=names(target_corr), correlation=target_corr),
       aes(x=variable, y=correlation)) +
  geom_bar(stat="identity", fill="blue") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title="Correlations with target variable")

Correlations between other variables

Correlated or redundant features in our data can adversely affect our models, and we typically want to reduce redundancy. The best way to visualise this is a correlation heatmap. Another useful type of plot is a ‘pair plot’, which summarises the relationships between every pair of features in a single command (a sketch follows the examples below).

Example in Python:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Calculate correlations between all pairs of numeric variables in the dataset
corr = data.corr(numeric_only=True)

# Create a heatmap of the correlations using the Seaborn library.
# The annot=True parameter adds numeric annotations to the heatmap
sns.heatmap(corr, annot=True, cmap="coolwarm")

# Show the plot
plt.show()

And the equivalent in R:

# Load data from a CSV file
data <- read.csv("data.csv")

# Calculate correlations between all pairs of variables in the dataset
corr <- cor(data)

# Create a heatmap using the corrplot library.
# The type="upper" parameter displays only the upper triangle of the heatmap
# (since the correlation matrix is symmetric).
# corrplot draws the plot directly, so no separate show call is needed
library(corrplot)
corrplot(corr, type="upper", method="circle", tl.col="black", tl.srt=45)

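The pair plot mentioned above is a single call in Seaborn. A minimal sketch (it can be slow on wide datasets, so you may want to pass only a subset of columns):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Scatter plots for every pair of numeric features,
# with each feature's distribution on the diagonal
sns.pairplot(data)
plt.show()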
We should take note of any groups of features that exhibit a correlation with each other and revisit this when we are building our model. Removing redundant features can often improve model performance.

Formulating Hypotheses

EDA involves formulating hypotheses and testing them against the data: asking questions about the data and checking whether the data supports or refutes them. We can also check the assumptions of statistical models or methods, such as normality or independence.
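For example, to check a normality assumption we might use the Shapiro–Wilk test from SciPy. A minimal sketch with a hypothetical column (for large samples, a visual check such as a Q–Q plot is often more informative):

import pandas as pd
from scipy import stats

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Shapiro–Wilk test for normality:
# a small p-value suggests the feature is not normally distributed
statistic, p_value = stats.shapiro(data["column_name"].dropna())
print(f"W = {statistic:.3f}, p = {p_value:.3f}")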

A useful tool

For Python users, a valuable tool called Pandas Profiling (since renamed ydata-profiling) is a quick and efficient way to profile your data. It does many of the above tasks in one shot and can be a huge time-saver. Once you have generated the report, you can open the HTML file in your web browser to view it. The report includes interactive visualisations and tables that allow you to explore your data in more detail.

Example in Python:

import pandas as pd
import pandas_profiling as pp

# Load data from a CSV file
data = pd.read_csv("data.csv")

# Generate the report
report = pp.ProfileReport(data)

# Save the report as an HTML file
report.to_file("report.html")

You can also render the report within your Jupyter notebook using the following code:

# Render the report as a Jupyter notebook widget
display(report)

Going deeper

If we want to go deeper, we can perform multivariate analysis. While it is difficult for the human brain to conceptualise relationships between more than two features, we can think of them as deeper patterns or interactions in the data.

Multivariate analysis is a powerful tool for EDA, as it allows us to examine the relationships between multiple variables simultaneously. Here are some common types of multivariate analysis:

  • Cluster Analysis: Cluster analysis is a technique that groups similar observations or variables into clusters based on their similarity or distance. It can help identify patterns and relationships in the data and segment it into meaningful groups.
  • Factor Analysis: Factor analysis is a statistical method that identifies underlying factors or constructs that are not directly observable in the data. It can help to identify common patterns and relationships between variables and reduce the number of variables in the analysis.
  • Principal Component Analysis (PCA): PCA is a technique that reduces the dimensionality of the dataset by transforming it into a set of uncorrelated variables called principal components. This can help to identify patterns and relationships in the data and simplify complex datasets (a minimal sketch follows this list).
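As an illustration, here is a minimal PCA sketch assuming scikit-learn is available; PCA is scale-sensitive, so we standardise the numeric features first:

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load data from a CSV file
data = pd.read_csv("data.csv")

# PCA expects numeric input and is sensitive to scale,
# so select numeric features, drop missing rows, and standardise
numeric = data.select_dtypes(include="number").dropna()
scaled = StandardScaler().fit_transform(numeric)

# Project onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# Proportion of variance explained by each component
print(pca.explained_variance_ratio_)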

Conclusion

By conducting thorough EDA, we can unlock the full potential of our data and ensure we end up with a meaningful analysis or model.

Once we know what’s in the refrigerator, we can cook up a masterpiece…or just make a sandwich.
