Exploratory Data Analysis (EDA) in Data Science

Chanaka

4 min read · Jun 18, 2024

Where Does EDA Come From?

Exploratory Data Analysis (EDA) is one step of the data science project life cycle.

The stages of a data science project include:

Problem definition (aka Framing the problem)
Data mining (aka Collecting data)
Data preparation (aka Preprocessing the data)
Exploratory data analysis (aka Exploring the data)
Feature engineering
Model building
Model evaluation (aka Consolidating the results)

Different people describe the same process using different names for these stages.

What is Exploratory Data Analysis (EDA)?

“Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.” — IBM

Why is EDA important?

Uncover patterns and relationships in data
Identify outliers and errors
Test assumptions for further analysis
Validate chosen statistical techniques
Ensure results are relevant to business goals
Identify the right questions to ask of the data
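
As a rough illustration of the first two points, the sketch below uses plain pandas on a small made-up DataFrame (the column names and values are invented for this example) to look at summary statistics, pairwise correlations, and outliers. The 1.5 × IQR rule used here is one common convention, not the only way to flag outliers.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "height_cm": [150, 160, 165, 170, 175, 180, 185, 250],
    "weight_kg": [50, 58, 62, 68, 72, 80, 85, 90],
})

# Summary statistics for each numeric column
print(df.describe())

# Pairwise Pearson correlations between columns
print(df.corr())

# Flag outliers in one column with the 1.5 * IQR rule
q1 = df["height_cm"].quantile(0.25)
q3 = df["height_cm"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["height_cm"] < lower) | (df["height_cm"] > upper)]
print(outliers)
```

Here only the 250 cm row falls outside the computed bounds, which is the kind of value worth checking for a data-entry error before modelling.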

Additional Benefits of EDA:

Understanding Stakeholder Needs: Align data analysis with stakeholder goals through KPIs and metrics.
Uncovering Hidden Patterns and Relationships: Reveal trends, correlations, and dependencies for informed decision-making.
Identifying Data Quality Issues: Ensure data integrity and reliability by detecting missing values, outliers, inconsistencies, and duplication.
Enhancing Decision-Making: Empower stakeholders with a comprehensive data understanding for problem-solving, strategy, and resource allocation.
Mitigating Risks: Identify and address potential risks by analyzing historical data and patterns.
Enabling Agile Decision-Making: Facilitate rapid data exploration and adaptation to market dynamics.
Improving Communication and Collaboration: Visualizations promote shared understanding and collaboration between analysts, stakeholders, and business professionals.
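
The data quality checks mentioned above (missing values, inconsistencies, and duplication) can be sketched with a few pandas calls. The DataFrame below is hypothetical and deliberately contains one problem of each kind; upper-casing the labels is just one possible way to resolve the inconsistency.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberate quality problems
df = pd.DataFrame({
    "customer": ["Ann", "Bob", "Bob", "Cara", "Dan"],
    "country": ["LK", "lk", "lk", "US", None],
    "spend": [120.0, 80.0, 80.0, np.nan, 45.0],
})

# Missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fully duplicated rows (the second "Bob" row repeats the first)
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)

# Inconsistent labels: "LK" and "lk" are counted as different values
print(df["country"].value_counts(dropna=False))

# One possible fix: normalise the case before counting
normalised_country = df["country"].str.upper()
print(normalised_country.value_counts(dropna=False))
```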

Most common Python libraries that can be used for EDA

1. Pandas Profiling

⚠️ The pandas-profiling package has been renamed. To continue profiling data, use ydata-profiling instead!

Documentation: https://github.com/ydataai/ydata-profiling

Python pip package: https://pypi.org/project/ydata-profiling/

Example:

Step 01:

# Install the package
!pip install ydata-profiling

Step 02:

# Import required dependencies
import numpy as np  # to generate a 2D array
import pandas as pd  # to create a DataFrame
from ydata_profiling import ProfileReport  # to build the profiling report

Step 03:

# Generate a random 2D array and put it inside a DataFrame
df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
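
Because np.random.rand draws new values on every run, the resulting report changes each time. A minimal sketch of a reproducible variant, assuming NumPy's seeded default_rng generator is an acceptable substitute for np.random.rand (the seed value 42 is arbitrary):

```python
import numpy as np
import pandas as pd

# A seeded generator makes the "random" data identical on every run
rng = np.random.default_rng(seed=42)  # 42 is an arbitrary choice
df = pd.DataFrame(rng.random((100, 5)), columns=["a", "b", "c", "d", "e"])

print(df.shape)
print(df.head())
```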

Step 04:

# Do the profiling
profile = ProfileReport(df, title="Profiling Report")

Step 05:

# View the generated profiling report (rendered inline in a notebook)
profile
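
Displaying the report inline only works inside a notebook. To share the report or to run the profiling from a plain script, it can be written to an HTML file with the report's to_file method. The helper name save_profile_report and the default file name below are hypothetical, chosen only for this sketch.

```python
def save_profile_report(df, path="profiling_report.html"):
    """Build a profile of df and write it to an HTML file at path."""
    # Imported inside the function so the helper can be defined
    # before the package is installed
    from ydata_profiling import ProfileReport

    profile = ProfileReport(df, title="Profiling Report")
    profile.to_file(path)  # writes a standalone HTML report
    return path
```

For example, calling save_profile_report(df) with the DataFrame from Step 03 should leave a profiling_report.html file in the working directory, which can be opened in any browser.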

2. Sweetviz

Python pip package: https://pypi.org/project/sweetviz/

Sample dataset: https://raw.githubusercontent.com/dataprofessor/data/master/penguins_cleaned.csv

Step 01: Install the package

# install sweetviz
! pip install sweetviz

Step 02:

# import required packages
import pandas as pd
import sweetviz as sv
import IPython
from sklearn.model_selection import train_test_split

Step 03:

# read the dataset
penguins = pd.read_csv('https://raw.githubusercontent.com/dataprofessor/data/master/penguins_cleaned.csv')

Step 04:

# separate the features (X) from the target (y)
X = penguins.drop('species', axis=1)
y = penguins['species']

Step 05:

# Create an EDA report using sweetviz
analyze_report = sv.analyze(penguins)
analyze_report.show_html('analyze.html', open_browser=False)

Step 06:

# View the generated EDA report using IPython
IPython.display.HTML(filename='analyze.html')

Side-by-Side Comparison of Train vs Test Set

Step 07:

# Data Split using 80/20 Split Ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train

X_test
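
The split in Step 07 is random, so the rows that land in each set change between runs, and a rare class can end up under-represented in the test set. A minimal sketch of a reproducible, stratified split follows. It uses a small made-up stand-in for the penguins data (the values and the class labels A, B, C are invented), and random_state=42 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the penguins data: 100 rows, 3 imbalanced classes
data = pd.DataFrame({
    "bill_length_mm": range(100),
    "species": ["A"] * 50 + ["B"] * 30 + ["C"] * 20,
})
X = data.drop("species", axis=1)
y = data["species"]

# random_state fixes the shuffle; stratify keeps the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_train.value_counts())
print(y_test.value_counts())
```

With stratification, the 50/30/20 class mix of the full data is preserved in both the train and the test set.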

Step 08:

Note: If you are trying this in a Jupyter notebook, it's better to set open_browser=True.

# create a Comparison Report
compare_report = sv.compare([X_train, 'Train'], [X_test, 'Test'])
compare_report.show_html('compare.html', open_browser=False)

Step 09:

# view the generated comparison report
IPython.display.HTML(filename='compare.html')
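
To get a feel for what a train vs test comparison is checking, the same idea can be approximated by hand with pandas. This is only a rough sketch on invented data (the body_mass_g values are randomly generated, not taken from the penguins file), and it is not how Sweetviz builds its report.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames, invented for illustration
rng = np.random.default_rng(seed=0)
X_train = pd.DataFrame({"body_mass_g": rng.normal(4200, 800, size=80)})
X_test = pd.DataFrame({"body_mass_g": rng.normal(4200, 800, size=20)})

# Put the same summary statistics for both sets side by side
comparison = pd.concat(
    [X_train.describe(), X_test.describe()],
    axis=1,
    keys=["Train", "Test"],
)
print(comparison)

# Relative gap between the two means; a large value hints that the
# split is not representative
train_mean = X_train["body_mass_g"].mean()
test_mean = X_test["body_mass_g"].mean()
relative_gap = abs(train_mean - test_mean) / train_mean
print(f"Relative difference in means: {relative_gap:.1%}")
```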

🌐 Follow me on LinkedIn: https://www.linkedin.com/in/chanakadev/

👨‍💻 Follow me on GitHub: https://github.com/ChanakaDev