Where Does EDA Come From?
Exploratory Data Analysis (EDA) is one step of the data science project life cycle.
The stages of a data science project include:
- Problem definition (aka Framing the problem)
- Data mining (aka Collecting data)
- Data preparation (aka Preprocessing the data)
- Exploratory data analysis (aka Exploring the data)
- Feature engineering
- Model building
- Model evaluation (aka Consolidating the results)
Different people use different names for the same stages, which is why each stage above is listed with an alternative name.
What is Exploratory Data Analysis (EDA)?
“Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.” — IBM
Why is EDA important?
- Uncover patterns and relationships in data
- Identify outliers and errors
- Test assumptions for further analysis
- Validate chosen statistical techniques
- Ensure results are relevant to business goals
- Identify the right questions to ask of the data
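To make the first point concrete, here is a minimal sketch of how summary statistics and a correlation matrix can surface relationships in data. The toy DataFrame and its column names are assumptions made up for illustration; column "b" is deliberately built to depend on "a", while "c" is independent noise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" depends on "a", "c" is independent noise
rng = np.random.default_rng(seed=0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Summary statistics: count, mean, std, min, quartiles and max per column
summary = df.describe()

# Pairwise Pearson correlations highlight linear relationships between columns
corr = df.corr()

print(summary)
print(corr.round(2))
```

In this sketch the correlation between "a" and "b" should come out close to 1, while "c" should show only a weak correlation with either, which is the kind of pattern EDA is meant to reveal early.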
Additional Benefits of EDA:
- Understanding Stakeholder Needs: Align data analysis with stakeholder goals through KPIs and metrics.
- Uncovering Hidden Patterns and Relationships: Reveal trends, correlations, and dependencies for informed decision-making.
- Identifying Data Quality Issues: Ensure data integrity and reliability by detecting missing values, outliers, inconsistencies, and duplication.
- Enhancing Decision-Making: Empower stakeholders with a comprehensive data understanding for problem-solving, strategy, and resource allocation.
- Mitigating Risks: Identify and address potential risks by analyzing historical data and patterns.
- Enabling Agile Decision-Making: Facilitate rapid data exploration and adaptation to market dynamics.
- Improving Communication and Collaboration: Visualizations promote shared understanding and collaboration between analysts, stakeholders, and business professionals.
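The data quality checks mentioned above (missing values, duplicates, outliers) can be sketched with plain pandas. The small DataFrame below is hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value, one duplicated row, one extreme value
df = pd.DataFrame({
    "age": [25, 31, 29, np.nan, 31, 27, 30, 120],
    "group": ["a", "b", "c", "b", "b", "a", "c", "b"],
})

# Missing values per column
missing_per_column = df.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(df.duplicated().sum())

# Flag outliers in "age" using the 1.5 * IQR rule (an assumed convention)
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print(outliers)
```

The profiling libraries covered next run checks of this kind automatically, but it helps to know how to do them by hand on a single column.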
The most common Python libraries that can be used for EDA
1. Pandas Profiling
⚠️ The pandas-profiling package was renamed. To continue profiling data, use ydata-profiling instead!
Documentation: https://github.com/ydataai/ydata-profiling
Python pip package: https://pypi.org/project/ydata-profiling/ (formerly https://pypi.org/project/pandas-profiling/)
Example:
Step 01:
# Install the package
! pip install ydata_profiling
Step 02:
# Import required dependencies
import numpy as np #to generate a 2D array
import pandas as pd #to create a DataFrame
from ydata_profiling import ProfileReport #to do the pandas profiling
Step 03:
# Generate a random 2D array and put it inside a DataFrame
df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
Step 04:
# Do the profiling
profile = ProfileReport(df, title="Profiling Report")
Step 05:
# View the created pandas profiling
profile
2. Sweetviz
Python pip package: https://pypi.org/project/sweetviz/
Sample dataset: https://raw.githubusercontent.com/dataprofessor/data/master/penguins_cleaned.csv
Step 01: Install the package
# install sweetviz
! pip install sweetviz
Step 02:
# import required packages
import pandas as pd
import sweetviz as sv
import IPython
from sklearn.model_selection import train_test_split
Step 03:
# read the dataset
penguins = pd.read_csv('https://raw.githubusercontent.com/dataprofessor/data/master/penguins_cleaned.csv')
Step 04:
# Separate the features (X) from the target (y)
X = penguins.drop('species', axis=1)
y = penguins['species']
X
y
Step 05:
# Create an EDA using sweetviz
analyze_report = sv.analyze(penguins)
analyze_report.show_html('analyze.html', open_browser=False)
Step 06:
# View generated EDA report using IPython
IPython.display.HTML('analyze.html')
Side-by-Side Comparison of Train vs Test Set
Step 07:
# Split the data using an 80/20 ratio (train_test_split was imported in Step 02)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train
X_test
Step 08:
Note: If you are trying this in a Jupyter notebook, it's better to set open_browser=True.
# create a Comparison Report
compare_report = sv.compare([X_train, 'Train'], [X_test, 'Test'])
compare_report.show_html('compare.html', open_browser=False)
Step 09:
# view the generated comparison report
IPython.display.HTML('compare.html')
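To get a feel for what a train vs test comparison checks, here is a rough pandas-only sketch that puts the mean and standard deviation of each split side by side. It is not how Sweetviz builds its report. The synthetic feature table and its column names are assumptions standing in for X, and DataFrame.sample is used as a simple substitute for train_test_split so the sketch needs no extra packages or downloads.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the feature table X (synthetic values)
rng = np.random.default_rng(seed=42)
X = pd.DataFrame({
    "bill_length_mm": rng.normal(44, 5, size=300),
    "body_mass_g": rng.normal(4200, 800, size=300),
})

# 80/20 split by sampling rows; a simple substitute for train_test_split
X_train = X.sample(frac=0.8, random_state=42)
X_test = X.drop(X_train.index)

# Put the mean and standard deviation of each split side by side
comparison = pd.concat(
    {
        "train": X_train.agg(["mean", "std"]),
        "test": X_test.agg(["mean", "std"]),
    },
    axis=1,
)

print(comparison.round(2))
```

If the two splits were drawn from the same data at random, their means and standard deviations should look similar. Large gaps would suggest the split is not representative.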
🌐 Follow me on LinkedIn: https://www.linkedin.com/in/chanakadev/
👨💻 Follow me on GitHub: https://github.com/ChanakaDev