Exploratory Data Analysis

Rina Mondal
Women in Technology
10 min read · Jun 25, 2024

Exploratory Data Analysis is a very important step of any Data Science/Machine Learning project. After we formulate the problem and gather the relevant data, our job is to perform Exploratory Data Analysis. Exploratory Data Analysis (EDA) is an essential preliminary step in data analysis where analysts examine and visualize data to uncover patterns, spot anomalies, and test hypotheses, with the goal of extracting meaningful insights. A complete explanation of EDA is available on my YouTube channel.

Why is EDA Important?

  1. Understanding the Data: EDA provides a first look at the data, helping analysts understand its structure, distributions, and basic statistics.
  2. Identifying Patterns and Relationships: By visualizing data through EDA techniques, analysts can detect patterns, trends, correlations, and potential relationships between variables.
  3. Detecting Anomalies and Outliers: EDA helps in identifying data anomalies such as outliers or missing values, which can skew analysis results if not properly addressed.
  4. Feature Selection and Engineering: EDA informs feature engineering by highlighting which features are most relevant or informative for modeling.

The Six Processes of EDA:

1. Discovering: Discovering involves the initial exploration and familiarization with the dataset.
2. Structuring: Structuring involves organizing and preparing the dataset for analysis.
3. Cleaning: Cleaning focuses on ensuring data quality and integrity.
4. Joining: Joining involves integrating multiple datasets or combining different sources of data.
5. Validating: Validating ensures that the data meets the expected quality and assumptions.
6. Presenting: Presenting involves visualizing and communicating findings effectively.

Let’s discuss these in detail:

1. Discovering:

Whenever a Data Professional works with a new data set, the first step is to understand the context of the data during the discovery stage. Data professionals familiarize themselves with the data so they can start conceptualizing how to use it.

Before discovering the raw data: data can come from many different sources, and it usually needs to be loaded into a DataFrame from its preexisting format (e.g. CSV, Excel, JSON, or a database).
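For example, pandas provides a loader for each of these formats. The file names and the connection string below are hypothetical placeholders, just to show the calls:

import pandas as pd
from sqlalchemy import create_engine

# Each loader returns a DataFrame; the paths are placeholders
df_csv = pd.read_csv('data.csv')
df_excel = pd.read_excel('data.xlsx')
df_json = pd.read_json('data.json')

# For a database, read the result of a SQL query into a DataFrame
engine = create_engine('sqlite:///data.db')
df_sql = pd.read_sql('SELECT * FROM my_table', engine)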

Data discovery gives us knowledge about:

Data Overview: Obtain a high-level summary of the dataset, including its size, dimensions, and basic statistics (mean, median, variance, etc.).
Data Types: Identify the types of data (numerical, categorical, text, datetime) present in the dataset.
Previewing Data: Take a glance at a few rows and columns to get a feel for the dataset’s structure and content.

In the code below, I load a dataset named penguins from the seaborn library and walk through how to do data discovery.

import seaborn as sns

# Load the penguins dataset
df = sns.load_dataset('penguins')

# Display the first few rows of the DataFrame
print(df.head())

# Display a concise summary of the DataFrame (info() prints its output directly)
df.info()

# Generate descriptive statistics of the DataFrame's numerical columns
print(df.describe())

# Display a random sample of rows from the DataFrame (one row by default)
print(df.sample())

# Return the number of elements in the DataFrame
print(df.size)

# Return the dimensions of the DataFrame (rows, columns)
print(df.shape)
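The data-types bullet above can also be checked directly. Two small additions to the same walkthrough (not in the original snippet) that are often useful at this stage:

# Show the data type of each column
print(df.dtypes)

# Count missing values per column before making any cleaning decisions
print(df.isnull().sum())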

2. Structuring:

Structuring is the part of the EDA process where we organize and prepare the data for analysis: converting data types, normalizing or scaling numerical data as necessary, and creating new features or deriving meaningful variables that may enhance predictive modeling. It can be done in various ways (a sketch of type conversion, scaling, and feature creation follows the code below):

  1. Sorting- The process of arranging data into meaningful order for analysis.
  2. Extraction- The process of retrieving data from a dataset or source for further processing
  3. Filtering- The Process of selecting a smaller part of your dataset based on specific parameters and using it for viewing or analysis
  4. Slicing- A method for breaking information down into smaller parts to facilitate efficient examination and analysis from different viewpoints
  5. Grouping- Aggregating individual observations of a variable into groups.
import pandas as pd

# Sample dataset
data = {
'Student Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Subject': ['Math', 'Science', 'Math', 'English', 'Science'],
'Grade': [85, 95, 78, 88, 92]
}

# Creating a DataFrame
df = pd.DataFrame(data)
print(df)

# Sorting the dataset by 'Grade' in descending order
sorted_df = df.sort_values(by='Grade', ascending=False)
print(sorted_df)

# Extracting specific columns
extracted_df = df[['Student Name', 'Grade']]
print(extracted_df)

# Filtering the dataset for grades above 80
filtered_df = df[df['Grade'] > 80]
print(filtered_df)

# Slicing the dataset to get the first three rows
sliced_df = df.iloc[:3]
print(sliced_df)

# Grouping the dataset by 'Subject' and calculating the average grade
grouped_df = df.groupby('Subject')['Grade'].mean().reset_index()
print(grouped_df)
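The structuring step above also mentions converting data types, scaling numerical data, and deriving new features, which the snippet does not show. Here is a minimal sketch on the same DataFrame; the 'Grade (scaled)' and 'Pass' columns, and the pass threshold of 80, are illustrative assumptions rather than part of the original walkthrough:

# Converting a data type: store 'Subject' as a categorical column
df['Subject'] = df['Subject'].astype('category')

# Min-max scaling of 'Grade' to the range [0, 1]
df['Grade (scaled)'] = (df['Grade'] - df['Grade'].min()) / (df['Grade'].max() - df['Grade'].min())

# Deriving a new feature: a pass/fail flag based on an assumed threshold of 80
df['Pass'] = df['Grade'] >= 80

print(df)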

3. Data Cleaning:

When working with datasets, the terms dirty data and messy data often come up, and while they are related, they refer to different issues that can affect data quality.

Here’s a breakdown of the differences between the two:

Dirty Data
Dirty data refers to inaccuracies and errors in the dataset that can be attributed to a variety of factors. These issues often result in data that is incorrect or misleading.

Common types of dirty data include:

1. Duplicates: Multiple entries of the same data point.
2. Incorrect Data: Values that are wrong or do not make sense.
3. Outliers: Data points that are significantly different from others and may be due to errors.
4. Missing Data: Gaps where data should be but isn’t.

We’ll use a simple dataset of individuals with their age, weight, height, and income:

import pandas as pd
import numpy as np

# Define the dataset with the provided data
data = {
'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jack Brown', 'Emily White', 'Sarah Green', 'Chris Johnson', 'John Doe', np.nan, 'Eva Wang'],
'Age': [25, 30, 25, 45, 28, 35, 22, 25, 40, 27],
'Gender': ['M', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
'DOB': ['1990-05-32', '1988-02-15', '1990-05-32', '1975-08-20', '1992-04-18', '1985-12-10', '2000-11-05', '1990-05-32', '1978-06-25', '1993-09-28'],
'Weight (kg)': [75, 60, 75, 85, 55, 65, 70, 75, 90, 50],
'Height (cm)': [180, 165, 180, 175, 170, 160, 185, 180, np.nan, 168],
'Income ($)': [50000, 60000, 50000, 75000, 45000, 55000, 40000, 50000, 80000, np.nan]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Removing duplicate rows (the same person appears under several IDs, so 'ID' is excluded from the comparison)
df.drop_duplicates(subset=df.columns.difference(['ID']), inplace=True)

# Handling incorrect data in DOB (assuming correction)
df['DOB'] = pd.to_datetime(df['DOB'], errors='coerce')

# Handling missing data by imputing the mean/median
df['Height (cm)'] = df['Height (cm)'].fillna(df['Height (cm)'].mean())
df['Income ($)'] = df['Income ($)'].fillna(df['Income ($)'].median())

# Detecting outliers in Age and Weight (kg)
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound_age = Q1 - 1.5 * IQR
upper_bound_age = Q3 + 1.5 * IQR
outliers_age = df[(df['Age'] < lower_bound_age) | (df['Age'] > upper_bound_age)]
Q1_weight = df['Weight (kg)'].quantile(0.25)
Q3_weight = df['Weight (kg)'].quantile(0.75)
IQR_weight = Q3_weight - Q1_weight
lower_bound_weight = Q1_weight - 1.5 * IQR_weight
upper_bound_weight = Q3_weight + 1.5 * IQR_weight
outliers_weight = df[(df['Weight (kg)'] < lower_bound_weight) | (df['Weight (kg)'] > upper_bound_weight)]

print("Data after cleaning:")
print(df)
print("\nOutliers in Age:")
print(outliers_age)
print("\nOutliers in Weight (kg):")
print(outliers_weight)

Messy Data
Messy data, on the other hand, refers to data that is disorganized or poorly structured, making it difficult to analyze. It may be technically correct but not in a usable format.

Common characteristics of messy data include:

1. Poor Formatting: Data that is in an unstructured format, such as free text or mixed types in the same column.
2. Lack of Standardization: Different conventions used within the same dataset, like mixed date formats or inconsistent naming conventions.
3. Mixed Data Types: Columns containing multiple data types, such as numbers and text mixed together.
4. Unnecessary Data: Extra data that is not needed for analysis but is present in the dataset.
5. Complex Nesting: Data that is deeply nested or spread across multiple tables without clear relationships.

We’ll use a dataset of students to explore these issues:

import pandas as pd
import numpy as np

# Sample dataset with issues

data = {
'Student Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Details': ['Math: 85, Science: A', 'Science: 95, Math: 90', 'Math: seventy-eight', 'English: 88', 'Science: 92, English: 85'],

# Poor formatting: Mixed types in 'Details' column, free text and numeric data mixed
'Enrollment Date': ['2021-01-10', '2021/02/12', 'Jan 15, 2021', '10-01-2021', '2021-02-15'],

# Lack of standardization: Different date formats
'Grade': [85, 95, 'A+', 88, 'Ninety'],

# Mixed data types: Numbers and text mixed together
'Unnecessary Column': ['Yes', 'No', 'Yes', 'No', 'Yes'],
# Unnecessary data
}

# Additional nested data to represent complex nesting
nested_data = {
'Student Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Sports': [{'Basketball': True, 'Soccer': False}, {'Basketball': False, 'Soccer': True}, {'Basketball': False, 'Soccer': False}, {'Basketball': True, 'Soccer': True}, {'Basketball': True, 'Soccer': False}]
# Complex nesting: Nested dictionary structure
}

# Creating DataFrames
df = pd.DataFrame(data)
df_nested = pd.DataFrame(nested_data)

# Displaying the dataset with issues
print("Dataset with Issues:")
print(df)
print("\nNested DataFrame:")
print(df_nested)

# Cleaning the dataset

# 1. Handling poor formatting by parsing the free-text 'Details' column into one column per subject
details_split = df['Details'].apply(
    lambda s: pd.Series({k.strip(): v.strip() for k, v in (p.split(':') for p in s.split(','))})
)
df_cleaned = df.join(details_split)
df_cleaned.drop(columns=['Details'], inplace=True)

# 2. Standardizing the 'Enrollment Date' format
df_cleaned['Enrollment Date'] = pd.to_datetime(df_cleaned['Enrollment Date'], format='mixed', errors='coerce')  # format='mixed' (pandas >= 2.0) parses each entry independently

# 3. Converting 'Grade' to numeric, handling non-numeric values
df_cleaned['Grade'] = pd.to_numeric(df_cleaned['Grade'], errors='coerce')

# 4. Removing unnecessary columns
df_cleaned.drop(columns=['Unnecessary Column'], inplace=True)

# 5. Flattening nested data for 'Sports' column
sports_df = df_nested['Sports'].apply(pd.Series)
df_flattened = df_nested.drop(columns=['Sports']).join(sports_df)

# Displaying the cleaned dataset
print(df_cleaned)
print("\nFlattened Nested DataFrame:")
print(df_flattened)

These are the basic steps we can take to clean the data.

Please read these blogs on Data Cleaning

  1. Complete Data Cleaning.
  2. Remove Outliers using InterQuartile Range
  3. Remove Outliers using Z-Score
  4. Using Log Transformation to mitigate the effect of outliers

Once you have read all these blogs, you are done with data cleaning. Two of those techniques are sketched briefly below.
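As a quick taste of two of the techniques covered in those posts, here is a minimal sketch of z-score-based outlier removal and a log transformation. The series values are made up for illustration, and the 3-standard-deviation cutoff is a common convention rather than a rule from this post:

import numpy as np
import pandas as pd

# Made-up data: thirty values around 50 plus one extreme point
rng = np.random.default_rng(0)
s = pd.Series(np.concatenate([rng.normal(50, 2, 30), [500]]))

# Z-score outlier removal: keep points within 3 standard deviations of the mean
z_scores = (s - s.mean()) / s.std()
s_no_outliers = s[z_scores.abs() <= 3]

# Log transformation: compress the scale so extreme values have less influence
s_log = np.log1p(s)

print(s_no_outliers.describe())
print(s_log.describe())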

4. Data Joining:

Merge: The merge function in pandas is used to combine two or more DataFrames into a single DataFrame based on common columns or indices.

Concatenate: The concat function in pandas is used to concatenate two or more DataFrames along rows (axis=0) or columns (axis=1). It stacks DataFrames vertically or horizontally; by default the original indexes are kept, and passing ignore_index=True creates a new continuous index. A vertical (axis=0) concatenation is sketched after the code below.

import pandas as pd

# Student information dataset
data_students = {
'Student ID': [1, 2, 3, 4, 5],
'Student Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Grade': [85, 95, 78, 88, 92]
}

df_students = pd.DataFrame(data_students)

# Student sports activities dataset
data_sports = {
'Student ID': [1, 2, 3, 6],
'Sport': ['Basketball', 'Soccer', 'Tennis', 'Swimming']
}

df_sports = pd.DataFrame(data_sports)

# Merging the two datasets on 'Student ID'
df_merged = pd.merge(df_students, df_sports, on='Student ID', how='inner')

# Displaying the merged dataset
print("Merged Dataset:")
print(df_merged)

# Student grades dataset
data_grades = {
'Student Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Math': [85, 90, 78, 88, 92],
'Science': [88, 95, 80, 85, 90]
}

df_grades = pd.DataFrame(data_grades)

# Student attendance dataset
data_attendance = {
'Student Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Attendance': [95, 85, 90, 92, 88]
}

df_attendance = pd.DataFrame(data_attendance)

# Concatenating the two datasets along columns
df_concatenated = pd.concat([df_grades, df_attendance.drop(columns=['Student Name'])], axis=1)

# Displaying the concatenated dataset
print(df_concatenated)
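The concat description above also covers stacking DataFrames vertically (axis=0), which the snippet does not show. Here is a minimal sketch using a second, hypothetical batch of students:

# A second, hypothetical batch of students with the same columns
data_grades_new = {
'Student Name': ['Frank', 'Grace'],
'Math': [70, 65],
'Science': [60, 95]
}
df_grades_new = pd.DataFrame(data_grades_new)

# Stacking the two DataFrames vertically and resetting the index
df_stacked = pd.concat([df_grades, df_grades_new], axis=0, ignore_index=True)
print(df_stacked)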

5. Data Validation:

Data validation in the context of Exploratory Data Analysis (EDA) typically involves identifying anomalies, inconsistencies, or errors that could undermine the quality or reliability of the dataset.

Here are common techniques used for validating data during EDA:

import pandas as pd
import numpy as np

# Create a hypothetical dataset (eight rows, so the normality test further down has enough observations to run)
data = {
'numeric_var': [10, 15, 20, 25, 30, 35, 40, 45],
'categorical_var': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
'datetime_var': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08']),
'missing_var': [1, 2, np.nan, 4, 5, 6, np.nan, 8],
'duplicate_var': ['X', 'Y', 'X', 'Z', 'Y', 'X', 'Z', 'Y']
}

df = pd.DataFrame(data)

# Mean, Median, Mode
mean_val = df['numeric_var'].mean()
median_val = df['numeric_var'].median()
mode_val = df['numeric_var'].mode().values[0]

# Standard Deviation, Range
std_dev = df['numeric_var'].std()
data_range = df['numeric_var'].max() - df['numeric_var'].min()

print(f"Mean: {mean_val}, Median: {median_val}, Mode: {mode_val}")
print(f"Standard Deviation: {std_dev}, Range: {data_range}")

from scipy.stats import normaltest

# Normality test on numeric_var (normaltest needs at least 8 observations and warns for fewer than 20)
stat, p = normaltest(df['numeric_var'].dropna())
print(f"Statistic: {stat}, p-value: {p}")

if p < 0.05:
    print("Data is not normally distributed.")
else:
    print("Data follows a normal distribution.")

Summary Statistics:

  • Mean, Median, and Mode: Calculate central tendency measures to check for extreme values.
  • Standard Deviation, Range: Assess variability and spread of data points.

Statistical Tests:

  • Normality Tests: Check if numerical data follow a normal distribution.
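The hypothetical dataset above also contains missing, duplicate, categorical, and datetime columns that the snippet never checks. A few quick validation checks on them (a sketch; the allowed category set and date range are assumptions for illustration):

# Count missing values per column
print(df.isnull().sum())

# Count repeated values in 'duplicate_var'
print(df['duplicate_var'].duplicated().sum())

# Check that 'categorical_var' only contains the expected categories (the allowed set is assumed)
allowed_categories = {'A', 'B', 'C'}
print(set(df['categorical_var'].unique()) <= allowed_categories)

# Check that 'datetime_var' falls within an expected date range (the range is assumed)
print(df['datetime_var'].between('2023-01-01', '2023-12-31').all())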

6. Presenting:

Exploratory Data Analysis (EDA) involves using various types of plots to visualize and understand the data better.

Below are different types of plots using a sample dataset to illustrate EDA.

Data Visualization: Use charts (histograms, scatter plots, box plots, etc.) and graphs to represent patterns, trends, and relationships.
Summary Reports: Create concise summaries and reports that highlight key insights and findings from the EDA process.
Interactive Dashboards: Develop interactive visualizations or dashboards to facilitate exploration and sharing of insights with stakeholders (a small interactive example is sketched after the plots below).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Sample dataset
data = {
'Student Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
'Math': [85, 95, 78, 88, 92, 70, 65],
'Science': [88, 85, 80, 90, 75, 60, 95],
'English': [82, 78, 85, 88, 92, 75, 80],
'Attendance': [95, 85, 90, 92, 88, 80, 85],
'Sport': ['Basketball', 'Soccer', 'Tennis', 'Basketball', 'Soccer', 'Tennis', 'Basketball']
}

df = pd.DataFrame(data)

# Histogram for Math scores
plt.figure(figsize=(8, 6))
plt.hist(df['Math'], bins=5, edgecolor='k')
plt.title('Distribution of Math Scores')
plt.xlabel('Math Scores')
plt.ylabel('Frequency')
plt.show()

# Box plot for scores in different subjects
plt.figure(figsize=(8, 6))
df[['Math', 'Science', 'English']].boxplot()
plt.title('Box Plot of Scores in Different Subjects')
plt.ylabel('Scores')
plt.show()

# Scatter plot between Math and Science scores
plt.figure(figsize=(8, 6))
plt.scatter(df['Math'], df['Science'])
plt.title('Scatter Plot between Math and Science Scores')
plt.xlabel('Math Scores')
plt.ylabel('Science Scores')
plt.show()

# Bar plot for average scores by sport
average_scores_by_sport = df.groupby('Sport')[['Math', 'Science', 'English']].mean().reset_index()
plt.figure(figsize=(8, 6))
plt.bar(average_scores_by_sport['Sport'], average_scores_by_sport['Math'], label='Math')
plt.bar(average_scores_by_sport['Sport'], average_scores_by_sport['Science'], bottom=average_scores_by_sport['Math'], label='Science')
plt.bar(average_scores_by_sport['Sport'], average_scores_by_sport['English'], bottom=average_scores_by_sport['Math'] + average_scores_by_sport['Science'], label='English')
plt.title('Average Scores by Sport')
plt.xlabel('Sport')
plt.ylabel('Average Scores')
plt.legend()
plt.show()

# Pie chart for the distribution of students by sport
sport_counts = df['Sport'].value_counts()
plt.figure(figsize=(8, 6))
plt.pie(sport_counts, labels=sport_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Students by Sport')
plt.show()

# Heatmap for the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(df[['Math', 'Science', 'English', 'Attendance']].corr(), annot=True, cmap='coolwarm')
plt.title('Heatmap of Correlations')
plt.show()

# Pair plot for scores and attendance
sns.pairplot(df[['Math', 'Science', 'English', 'Attendance']])
plt.show()
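The presenting step above also mentions interactive dashboards. As a small taste, here is a single interactive chart built with Plotly Express; this assumes the plotly library is installed and is a sketch, not part of the original walkthrough:

import plotly.express as px

# Interactive scatter plot of Math vs Science scores, colored by sport, with student names on hover
fig = px.scatter(df, x='Math', y='Science', color='Sport', hover_name='Student Name',
                 title='Math vs Science Scores (interactive)')
fig.show()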

EDA is the part of a data science project where you spend most of your time; it is where you get to explore and research, and it can surface many unknown insights. In this blog, we have covered what EDA is; in the next blog, we will look at how to perform EDA, i.e. how to think through it…

If you found this guide helpful, why not show some love? Give it a Clap 👏, and if you have questions or topics you’d like to explore further, drop a comment 💬 below 👇. If you appreciate my hard work, please follow me; that is the only way I can continue my passion.
