Mastering Exploratory Data Analysis (EDA): A Comprehensive Python (Pandas) Guide for Data Insights and Storytelling

Nayeem Islam
Nov 28, 2023


Unlocking the Power of Data: Navigating Through EDA Techniques for Meaningful Insights


The Essence of Data Analysis and Exploratory Data Analysis (EDA)

In the universe of data science, Exploratory Data Analysis (EDA) stands as a pivotal stage in understanding and dissecting the story hidden within data. It’s akin to a detective meticulously combing through evidence, seeking patterns, anomalies, and insights that are not immediately apparent. EDA is not just a precursor to more advanced analyses and predictive modeling; it’s a deep and insightful conversation with the data itself.

“There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days.” ~ Eric Schmidt, Executive Chairman at Google

The Journey of EDA

Before we dive into the more sophisticated realms of machine learning and predictive analytics, it’s crucial to establish a solid foundation through EDA. This process begins right after data collection and involves a series of steps:

  1. Data Pre-processing: Cleaning and structuring data into a usable format.
  2. Feature Engineering: Extracting and selecting significant attributes from the data.
  3. Exploratory Analysis: Delving into the data, using various techniques to uncover underlying patterns, relationships, and insights.

Each step is critical. Skipping or rushing through them can lead to skewed results and unreliable conclusions. EDA is about understanding the nuances of your dataset, its strengths, limitations, and the stories it’s ready to tell.

Python: The Chosen Tool for EDA

Python, with its rich ecosystem of libraries and tools, is a popular choice for conducting EDA. The simplicity and power of Python libraries make it an ideal choice for data scientists and analysts across the globe.

In the following sections, we’ll explore the various tools and techniques in Python for effective EDA. We’ll use a hands-on approach, with code snippets to illustrate key concepts and techniques.

Stay tuned as we embark on this exploratory journey through data, uncovering the mysteries hidden within and learning how to listen to the stories data wants to tell us.

Step 1: Toolkits for EDA with Python

To navigate the seas of data analysis efficiently, it’s essential to have the right set of tools. In Python, this translates to a selection of libraries, each specializing in different aspects of Exploratory Data Analysis (EDA).

Core Libraries for EDA:

  1. Pandas: The cornerstone for data manipulation in Python. Pandas offers data structures (like DataFrames) and operations for manipulating numerical tables and time series.
  2. NumPy: A library for numerical computing in Python. It provides support for arrays, matrices, and a large collection of mathematical functions to operate on these arrays.
  3. Matplotlib: A plotting library for creating static, animated, and interactive visualizations in Python.
  4. Seaborn: Built on top of Matplotlib, Seaborn is a statistical data visualization library that provides a high-level interface for drawing attractive and informative statistical graphics.
  5. Scikit-learn (sklearn): Although primarily known for machine learning, it also provides various tools for data preprocessing and model evaluation, which can be useful in EDA.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing, model_selection

Environment for EDA:

  • Jupyter Notebook or Google Colab: Interactive environments like Jupyter Notebook or Google Colab are ideal for EDA. They allow you to write and execute Python code, visualize data, and document your analysis in a single, cohesive document.
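
If you work in Google Colab, one common pattern is to mount your Google Drive and read datasets stored there. A minimal sketch (the file path below is a placeholder):

from google.colab import drive
drive.mount('/content/drive')
data = pd.read_csv('/content/drive/MyDrive/your_dataset.csv')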

Additional Tips:

  • Ignore Warnings: Sometimes, your code might output warnings that can clutter your notebook. To make your EDA cleaner, you might choose to ignore these warnings.
# To launch Jupyter Notebook, run this from a terminal (not inside the notebook):
# jupyter notebook

import warnings
warnings.filterwarnings('ignore')

These tools form the bedrock of EDA in Python, equipping you with the capabilities to load, manipulate, clean, and visualize your data effectively.

Step 2: Importing Data and Reading Dataset

Data can come from various sources, and Python’s flexibility allows us to handle most of them efficiently. Here’s a guide on how to import data from different sources like files, URLs, and more.


Importing Data from Files

CSV Files: Pandas provides a simple way to read CSV files.

Excel Files: For Excel files, use the read_excel method.

# For CSV
data = pd.read_csv("path/to/your/file.csv")

# For xlsx
data = pd.read_excel("path/to/your/file.xlsx")

Importing Data from URLs

When your dataset is hosted online, you can directly load it using the URL.

url = "http://your-data-source.com/data.csv"
data = pd.read_csv(url)

Other File Formats:

  • Pandas can handle various other formats like JSON, SQL databases, HTML, etc. The method generally follows the pattern pd.read_<format>(). Find more at https://pandas.pydata.org/
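
For instance, JSON files and SQL tables follow the same pattern; a quick sketch (the file name, database file, and table name here are placeholders):

# JSON
data_json = pd.read_json("path/to/your/file.json")

# SQL: read a table from an SQLite database
import sqlite3
conn = sqlite3.connect("path/to/your/database.db")
data_sql = pd.read_sql("SELECT * FROM your_table", conn)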

Reading the Dataset:

  • Once you’ve loaded the data into a DataFrame, use methods like head(), tail(), and info() to get an initial sense of your data.
# Display the first 5 rows of the DataFrame
data.head()

# Display the last 5 rows of the DataFrame
data.tail()

# Summary of the DataFrame including data types and non-null values
data.info()

By mastering these methods, you’re well-equipped to bring in data from most sources you’ll encounter in real-world scenarios.

Step 3: Analyzing the Data


Once the data is loaded into Python, the next step is to understand its basic structure and characteristics. With Pandas, the dataset lives in a DataFrame, which makes this inspection straightforward. Here are some fundamental techniques for basic data analysis:

Basic Data Inspection

  1. Shape of the Data: Understanding the size of the dataset (number of rows and columns).
  2. First and Last Few Rows: head() and tail() methods give a glimpse of the dataset.
  3. Data Types and Info: info() method shows the data type of each column and identifies if there are any null values.
data.shape  # Returns (number_of_rows, number_of_columns)
data.head() # Displays the first 5 rows
data.tail() # Displays the last 5 rows
data.info()

Basic Statistical Summary

  • The describe() method provides a statistical summary of numerical columns, including count, mean, standard deviation, min, and max values.
data.describe()

This step is crucial as it provides an initial understanding of the data’s nature and helps in planning further steps in the EDA process.

Step 4: Duplication and Missing Values Management

Dealing with duplications and missing values is crucial in preparing your dataset for analysis.

Handling Duplications:

  1. Identifying Duplicates: Use duplicated() to find duplicate rows.
  2. Removing Duplicates: Use drop_duplicates() to remove duplicates.
# Get duplicates
duplicates = data.duplicated()
#Remove Duplicates
data = data.drop_duplicates()

Managing Missing Values:

  1. Identifying Missing Values: The isnull() method, combined with sum(), helps identify missing values in each column.
missing_values = data.isnull().sum()

  2. Handling Missing Data: Drop rows or columns with missing values using dropna(), or fill them with a specific or computed value (such as the mean or median) using fillna().

# Filling missing values with the mean
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())

# Dropping rows with missing values
data = data.dropna()

These steps are fundamental in ensuring the quality and reliability of your data before proceeding to more complex analyses.

Step 5: Data Reduction

In data reduction, the goal is to simplify the dataset by removing unnecessary or redundant features. This step is essential for efficient analysis and model building.


Techniques for Data Reduction:

  1. Dropping Irrelevant Columns: Identify and remove columns that don’t contribute to the analysis or predictive modeling.
  2. Handling Constant Columns: Remove columns with constant values as they don’t add any information.
data = data.drop(['irrelevant_column1', 'irrelevant_column2'], axis=1)
data = data.loc[:, data.apply(pd.Series.nunique) != 1]

Criteria for Data Reduction:

  • Consider domain knowledge, the objective of your analysis, and statistical measures to determine the relevance of each column.
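
As a statistical heuristic (a sketch, not a prescribed rule; the 0.99 and 0.95 thresholds are assumptions), you can flag near-constant columns and columns that are almost duplicates of a highly correlated one:

# Drop columns whose most frequent value covers almost all rows (near-constant)
near_constant = [col for col in data.columns
                 if data[col].value_counts(normalize=True, dropna=False).iloc[0] > 0.99]
data = data.drop(columns=near_constant)

# Drop one column from each pair of highly correlated numeric columns
corr = data.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
highly_correlated = [col for col in upper.columns if (upper[col] > 0.95).any()]
data = data.drop(columns=highly_correlated)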

Reducing data effectively can significantly enhance the performance and interpretability of your analytical models.


Step 6: Feature Engineering

Feature Engineering is the process of using domain knowledge to create new features from the raw data, which can significantly improve the performance of machine learning models.

Techniques for Feature Engineering:

  1. Creating Derived Features: Based on domain knowledge, derive new features from existing ones.
  2. Categorizing Continuous Variables: Convert continuous variables into categorical ones, if beneficial.
  3. Extracting Information from Text Data: Extract useful information from text data, like names or addresses.
# Example: Creating a feature for the age of a car from the 'Year' column
from datetime import date
current_year = date.today().year
data['Car_Age'] = current_year - data['Year']

# Example: Binning ages into groups
data['Age_Group'] = pd.cut(data['Age'], bins=[0, 18, 35, 60, 100], labels=['Youth', 'Young Adult', 'Adult', 'Senior'])

# Example: Extracting a brand name from a product description
data['Brand'] = data['Product_Name'].apply(lambda x: x.split()[0])

Feature engineering is both an art and a science, requiring creativity and domain expertise to uncover the most impactful attributes hidden within your data.

Step 7: Creating Features

Creating features is another critical part of feature engineering, involving the transformation or combination of raw data into inputs that better represent the underlying problem to the predictive models.

Techniques for Creating Features:

  1. Interaction Terms: Create features that represent the interaction between two or more variables.
  2. Polynomial Features: Generate polynomial and interaction features.
  3. Grouping and Aggregation: Aggregate data by categories to create summary features.
# Example: Interaction between age and income
data['Age_Income_Interaction'] = data['Age'] * data['Income']

# Example: Generate polynomial and interaction features.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
data_poly = poly.fit_transform(data[['Feature1', 'Feature2']])

# Example: Average income by region
group_features = data.groupby('Region')['Income'].mean().rename('Avg_Income_By_Region')
data = data.join(group_features, on='Region')

The creation of features is a dynamic and iterative process. It requires experimentation and domain knowledge to identify the most significant and useful features for your analysis or modeling task.

Step 8: Data Cleaning/Wrangling

Data Cleaning or Data Wrangling is a critical step in ensuring the quality of your data before analysis. It involves rectifying inconsistencies, handling missing values, and making the data more suitable for analysis.


Key Aspects of Data Cleaning/Wrangling:

  1. Handling Inconsistent Data: Correcting typos and standardizing text data.
  2. Data Type Conversion: Converting data types for proper analysis, like changing a column to a datetime or categorical type.
  3. Handling Missing Data: Imputing or removing missing values.
  4. Renaming and Reordering Columns: Adjusting column names and order for better readability and consistency.
# Example: Standardizing text data
data['Brand'] = data['Brand'].str.lower().str.strip()

data['Date'] = pd.to_datetime(data['Date'])
data['Category'] = data['Category'].astype('category')

data['Column'] = data['Column'].ffill() # Forward fill (fillna(method='ffill') is deprecated in recent pandas)

data.rename(columns={'old_name': 'new_name'}, inplace=True)
data = data[['column1', 'column2', 'new_name']] # Reordering

Cleaning and wrangling data is essential to remove noise and simplify the analysis, leading to more accurate and meaningful insights.

Step 9: EDA — Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an open-ended process where you calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. The aim is not only to understand what the data is telling us but also to uncover the underlying structure, extract important variables, detect outliers and anomalies, and test underlying assumptions.

Key Components of EDA:

  1. Descriptive Statistics: Summarize and describe the main features of the dataset.
  2. Data Visualization: Use various plots (histograms, scatter plots, box plots) to see the distribution and relationship between variables.
  3. Correlation Analysis: Determine how variables are related to each other.
  4. Outlier Detection: Identify unusual observations in the dataset.
# Describe Data
data.describe()

# Basic Data visualization
sns.histplot(data['Variable1'])
sns.scatterplot(x='Variable1', y='Variable2', data=data)

# Find correlations
sns.heatmap(data.corr(numeric_only=True), annot=True)

# Detect outliers
sns.boxplot(x=data['Variable'])
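
Beyond the box plot, a common numeric rule flags points that fall outside 1.5 times the interquartile range; a small sketch (assuming a numeric column named 'Variable'):

# Flag observations outside 1.5 * IQR as potential outliers
q1 = data['Variable'].quantile(0.25)
q3 = data['Variable'].quantile(0.75)
iqr = q3 - q1
outliers = data[(data['Variable'] < q1 - 1.5 * iqr) | (data['Variable'] > q3 + 1.5 * iqr)]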

EDA is iterative and exploratory in nature. It guides how you frame your analysis, leading to hypothesis generation and further data analysis or modeling.

Step 10: Statistics Summary

The statistics summary in EDA is about providing a comprehensive overview of the data through various statistical measures. This step is crucial for understanding the distribution, central tendency, and dispersion of the data.


Key Statistical Measures:

Descriptive Statistics: Use describe() to get a summary that includes count, mean, standard deviation, min, max, and percentiles for numerical columns.

data.describe()

Central Tendency: Measures like mean, median, and mode.

mean_val = data['Column'].mean()
median_val = data['Column'].median()
mode_val = data['Column'].mode()[0]

Dispersion Measures: Standard deviation, variance, range, and interquartile range (IQR).

std_dev = data['Column'].std()
variance = data['Column'].var()
range_val = data['Column'].max() - data['Column'].min()
iqr = data['Column'].quantile(0.75) - data['Column'].quantile(0.25)

Skewness and Kurtosis: Understanding the asymmetry and tailedness of the distribution.

skewness = data['Column'].skew()
kurtosis = data['Column'].kurt()

This step helps identify if the data has any outliers, errors, or peculiarities that need attention.

Step 11: EDA — Univariate Analysis

Univariate analysis focuses on understanding each variable in isolation. It’s the simplest form of analyzing data where we examine each variable separately.

Techniques for Univariate Analysis:

  1. Histograms: Great for visualizing the distribution of a single continuous variable.
  2. Box Plots: Useful for spotting outliers and understanding the spread of the data.
  3. Count Plots: Ideal for categorical data to show the frequency of each category.
  4. Pie Charts: Visually appealing for showing the proportion of categories in a variable.
  5. Bar Charts: Another option for displaying the frequency of categorical data.
sns.histplot(data['Variable'], bins=20)

sns.boxplot(x=data['Variable'])

sns.countplot(x=data['Categorical_Variable'])

data['Categorical_Variable'].value_counts().plot.pie(autopct='%1.1f%%')

data['Categorical_Variable'].value_counts().plot.bar()

Univariate analysis provides insights into the range, central tendency, dispersion, and shape of the distribution of each variable.

Step 12: Data Transformation

Data Transformation involves modifying the data to a more suitable format or structure for analysis. This step can include normalization, scaling, or applying mathematical transformations.

Common Data Transformations:

Normalization and Scaling: Adjusting the scale of the data without distorting differences in the ranges of values.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data[['Numeric_Column']])

Log Transformation: Useful for handling skewed data and making it more normally distributed.

data['Log_Variable'] = np.log(data['Variable'] + 1)

Encoding Categorical Variables: Converting categorical variables into a form that could be provided to ML algorithms.

data_encoded = pd.get_dummies(data['Categorical_Column'])

Feature Scaling: Standardizing the range of independent variables.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_standardized = scaler.fit_transform(data[['Numeric_Column']])

These transformations help in preparing the data for more accurate and efficient analysis, especially when building machine learning models.

Step 13: EDA — Bivariate Analysis

Bivariate Analysis is about exploring the relationship between two different variables and understanding the correlation and patterns between them.

Techniques for Bivariate Analysis:

  1. Scatter Plots: Ideal for visualizing the relationship between two continuous variables.
  2. Correlation Analysis: Calculating the correlation coefficient to quantify the strength of the relationship.
  3. Cross Tabulations: Useful for examining the relationship between two categorical variables.
  4. Grouped Bar Charts: Comparing categories across different groups.
sns.scatterplot(x='Variable1', y='Variable2', data=data)

correlation = data[['Variable1', 'Variable2']].corr()

pd.crosstab(data['Categorical_Var1'], data['Categorical_Var2'])

sns.barplot(x='Categorical_Var1', y='Numeric_Var', hue='Categorical_Var2', data=data)

Bivariate analysis helps in identifying associations between variables, which is crucial for hypothesis testing and model building.

Step 14: EDA — Multivariate Analysis

Multivariate Analysis involves examining the relationship among multiple variables simultaneously. It helps in understanding complex interactions and patterns in the dataset.

Techniques for Multivariate Analysis:

  1. Heatmaps for Correlation: Visualize the correlation matrix to understand the relationships between multiple variables.
  2. Pair Plots: Useful for visualizing pairwise relationships in the dataset, particularly useful when you have several continuous variables.
  3. 3D Scatter Plots: For a more detailed view of the interactions between three variables.
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')

sns.pairplot(data)

# for 3D plot
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(data['Variable1'], data['Variable2'], data['Variable3'])

Multivariate analysis is essential for uncovering hidden structures and patterns that are not apparent in univariate or bivariate analysis.

P.S. You might also want to explore third-party packages and other data-visualization techniques. Good visuals of your data make it much easier to spot outliers and other issues.
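
For example, an interactive library such as Plotly Express (one option among many; the article doesn't prescribe a package) lets you hover over individual points to inspect suspicious values:

# Interactive scatter plot: hover over points to inspect them
import plotly.express as px
fig = px.scatter(data, x='Variable1', y='Variable2')
fig.show()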

Step 15: Observing Data, Insights, and Storytelling

This step is about interpreting the findings from your EDA and weaving them into a coherent narrative. It’s where data analysis transcends into the realm of storytelling and helps in decision making.

Techniques and Tips:

Observation:

  • Look for trends, patterns, and anomalies in your analysis.
  • Example: “Sales spike in Q4, possibly due to holiday shopping.”

Insight Generation:

  • Translate observations into actionable insights.
  • Example: “Product X has the highest sales, suggesting a market preference.”

Storytelling:

  • Craft a narrative around your data, focusing on how your findings relate to the business or research objectives, and keep your audience's expectations and mindset in mind.
  • Use visualizations to highlight key points and make your story compelling; people grasp charts, tables, and highlighted numbers quickly (see the sketch below).
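
As an illustration, the quarterly sales figures below are made up purely for this sketch, showing how a single annotation can carry the "Q4 spike" story:

# Hypothetical quarterly sales figures, used only to illustrate an annotated chart
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
sales = [120, 135, 128, 210]

fig, ax = plt.subplots()
ax.bar(range(len(quarters)), sales, tick_label=quarters)
ax.annotate('Holiday-season spike', xy=(3, 210), xytext=(1, 200),
            arrowprops=dict(arrowstyle='->'))
ax.set_title('Sales spike in Q4')
ax.set_ylabel('Sales')
plt.show()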

Communicating Results:

  • Present your findings in a way that’s understandable to your audience, using non-technical language for non-experts.

Remember, the goal is to turn data into information, and information into insights that can inform decisions and strategies. Your data is the script; you are the narrator.

Concluding Your Summary

The conclusion is where you encapsulate the key findings and insights from your Exploratory Data Analysis. This is the part where you reflect on what the data has told you and how it answers your initial questions or impacts your business decisions.

Elements of a Good Conclusion:

  1. Summary of Findings: Briefly summarize the most important insights and patterns observed in your analysis.
  2. Implications: Discuss the implications of your findings. How do they affect decision-making, strategies, or future analyses?
  3. Limitations: Acknowledge any limitations or uncertainties in your analysis.
  4. Recommendations: Based on your findings, provide recommendations or suggest areas for further research or analysis.
  5. Engage and Get Feedback: Always engage your audience: ask guiding questions, make sure everyone is following the same thread, and ensure they feel part of your story rather than getting lost.

Remember, a good conclusion not only wraps up your analysis neatly but also sets the stage for the next steps or actions to be taken.


Exploratory Data Analysis (EDA) is an indispensable stage in the data science lifecycle. This comprehensive journey, leveraging Python’s robust toolkit, empowers analysts and data scientists to transform raw data into meaningful insights. Through systematic steps like data importing, cleaning, feature engineering, and various levels of analysis, EDA serves as the backbone of informed decision-making. Whether you’re a beginner or an experienced professional, mastering EDA deepens your understanding of the data and paves the way for advanced analytical and predictive modeling. Embrace EDA as your roadmap to uncovering the hidden stories within your data, driving impactful outcomes and strategic decisions.


Find me on LinkedIn:

https://bd.linkedin.com/in/nayeemislam60053
