How to Conduct an Effective Exploratory Data Analysis (EDA)

The Ultimate Guide

Published in

ILLUMINATION

6 min read · Nov 2, 2023

Hello there! I’m back with a hot topic in the data science world: Exploratory Data Analysis (EDA). Think of EDA as your preliminary detective work; it’s the part that can make or break your subsequent data modeling. In this article, I’ll explain EDA as I understand it. So, let’s get started!

Why EDA?

Imagine you’re an archaeologist. Would you dig randomly? No, you’d survey the land first, right?

Similarly, EDA helps you know where to dig for insights. It’s about preventing flawed models and wrong conclusions by understanding your data’s existing structure and variables (features).

A Structured Approach: The APP Framework

Let’s employ my beloved APP framework for this:

Attention: Understand your dataset.
Purpose: Establish your goals.
Process: Do the actual EDA.
Pay-off: Extract and implement insights.

Step 1: Attention — Know Your Data

Data Types

Run df.dtypes in Python to understand the variable types you have.

Knowing whether you're dealing with numerical or categorical data, for example, will inform your subsequent steps.

Data Size and Shape

To see the size and shape of your data, run df.shape.

If you're dealing with Big Data, you might have to employ specialized tools for your EDA.

Missing Values

Execute df.isnull().sum() to find the missing values.

This can indicate whether you can rely on a column for analysis or if you need to take steps for imputation.
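To make this first look concrete, here is a minimal sketch that runs the three checks together on a toy DataFrame. The column names and values are made up purely for illustration; with your own data you would load df from a file instead.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns, for illustration only)
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38],
    "city": ["Delhi", "Mumbai", "Delhi", None, "Pune"],
    "income": [42000.0, 58000.0, 61000.0, np.nan, np.nan],
})

print(df.dtypes)   # variable type of each column
print(df.shape)    # (rows, columns)

# Missing values as a count and as a percentage of all rows
missing = pd.DataFrame({
    "n_missing": df.isnull().sum(),
    "pct_missing": df.isnull().mean() * 100,
}).sort_values("pct_missing", ascending=False)
print(missing)
```

Looking at the percentage alongside the raw count makes it easier to judge whether a column is still usable.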

Step 2: Purpose — Goals & Objectives

Are you trying to understand user behavior? Perhaps predict future sales? By understanding your purpose, you can target your EDA effectively.

Step 3: Process — The Heart of EDA

Here’s where the action happens. We’ll dive into four essential components:

Data Summarization

Use df.describe() to summarize your numerical columns. This gives you the count, mean, standard deviation, minimum, maximum, and quartiles; the median is the row labeled 50%.

For categorical data, use df['column_name'].value_counts() to see the frequency of each category.
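Here is a small sketch of both summaries on made-up data (the income and segment columns are hypothetical). It also shows why it pays to compare the mean with the median: a single extreme value pulls the mean far away from the typical row.

```python
import pandas as pd

# Toy data (hypothetical columns); note the one extreme income
df = pd.DataFrame({
    "income": [30, 35, 40, 45, 400],
    "segment": ["A", "B", "A", "A", "C"],
})

summary = df["income"].describe()
# The mean is dragged up by the extreme value; the median ("50%") is not
print(summary["mean"], summary["50%"])

counts = df["segment"].value_counts()                 # frequency of each category
shares = df["segment"].value_counts(normalize=True)   # same, as proportions
print(counts)
print(shares)
```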

Data Cleaning

1. Handling Missing Values: If a column has too many missing values, consider dropping it. If only a few are missing, think about imputation methods. There are several imputation methods you can consider:

Mean/Median/Mode Imputation: Replace missing values with the mean (for normally distributed data), median (for skewed data), or mode (for categorical data).

df['column'] = df['column'].fillna(df['column'].mean())

Forward or Backward Fill: Particularly useful in time-series data, where you can fill missing values with the previous or next value.

df['column'] = df['column'].ffill()

K-Nearest Neighbors (KNN) Imputation: This involves filling missing values based on similar data points. It’s a bit more complex but can be more accurate.

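import pandas as pd
from sklearn.impute import KNNImputer
# KNNImputer works on numeric columns only and returns a NumPy array
num_cols = df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=5)
df_filled = pd.DataFrame(imputer.fit_transform(df[num_cols]), columns=num_cols, index=df.index)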

Model-Based Imputation: You can train a machine learning model (like a regression model) to predict missing values based on other variables.

Deletion: Sometimes, it’s best to simply remove the rows or columns with missing values: drop a column when most of its values are missing, or drop rows when only a small share of them is affected, so you don’t throw away much data.

df.dropna(inplace=True)
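The choices above can be combined into one pass. This is only a sketch under stated assumptions: the 50% cut-off (MAX_MISSING) is an arbitrary threshold I picked for illustration, and the toy columns are hypothetical. It drops columns that are mostly empty, then fills numeric columns with the median and everything else with the mode.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns, for illustration only)
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, 29],
    "city": ["Delhi", "Pune", None, "Delhi", "Delhi"],
    "notes": [None, None, None, "vip", None],
})

MAX_MISSING = 0.5  # assumption: drop columns that are more than 50% missing

missing_share = df.isnull().mean()
cleaned = df.loc[:, missing_share <= MAX_MISSING].copy()

for col in cleaned.columns:
    if pd.api.types.is_numeric_dtype(cleaned[col]):
        # Median is less sensitive to skew and outliers than the mean
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    else:
        # Mode = most frequent category
        cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

print(cleaned)
```

The right threshold depends on your data and your goal, so treat the number as a starting point rather than a rule.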

2. Outliers: Outliers can significantly skew your results. Here’s how you can detect and handle them:

Boxplot: A visual method to spot outliers. Values that fall outside of the whiskers are potential outliers.

import seaborn as sns
sns.boxplot(x=df['Column_Name'])

Z-Score: This tells you how many standard deviations a point is from the mean. Generally, a Z-score above 3 or below -3 is considered an outlier.

from scipy import stats
# nan_policy='omit' keeps missing values from turning every score into NaN
z_scores = stats.zscore(df['Column_Name'], nan_policy='omit')
outliers = df[(z_scores < -3) | (z_scores > 3)]

IQR (Interquartile Range): IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile). Values that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are potential outliers.

Q1 = df['Column_Name'].quantile(0.25)
Q3 = df['Column_Name'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Column_Name'] < (Q1 - 1.5 * IQR)) | (df['Column_Name'] > (Q3 + 1.5 * IQR))]

Once you’ve identified outliers, you can decide whether to remove them, cap them, or investigate further to understand their cause. Remember, sometimes outliers carry valuable information, so tread carefully!
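If you decide to cap outliers rather than remove them, one possible sketch is to clip values to the IQR fences. The numbers below are made up, and clipping to the fences is just one capping choice among several (percentile-based caps are another).

```python
import pandas as pd

# Toy values (hypothetical); 95 is the obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 11, 95], name="Column_Name")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (s < lower) | (s > upper)

# Cap instead of drop: values beyond the fences are pulled back to the fence
capped = s.clip(lower=lower, upper=upper)

print(s[is_outlier])
print(capped)
```

Capping keeps the row in your dataset, which matters when the other columns of that row are still informative.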


Data Transformation

Transforming your data into a more suitable form or structure can make it easier to generate insights and build models. Let’s dig deeper into some popular transformation techniques:

1. Normalization: Normalization helps to scale numeric data from different columns down to an equivalent scale so that no particular set of values dominates the other.

Min-Max Scaling: This method scales the data between a specified range, usually between 0 and 1.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['normalized_column'] = scaler.fit_transform(df[['column']])

Z-Score Normalization (Standardization): Data is transformed to have a mean of 0 and a standard deviation of 1. It’s especially useful for algorithms that rely on distances or gradients.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['standardized_column'] = scaler.fit_transform(df[['column']])
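To see what the two scalers compute, here is the same arithmetic written by hand in pandas, on made-up values. One detail worth knowing: StandardScaler divides by the population standard deviation (ddof=0), while pandas’ .std() defaults to the sample version (ddof=1), so the sketch sets ddof=0 explicitly to match.

```python
import pandas as pd

# Toy values (hypothetical)
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="column")

# Min-max scaling: (x - min) / (max - min), which lands in [0, 1]
min_max = (s - s.min()) / (s.max() - s.min())

# Standardization: (x - mean) / std, with ddof=0 to mirror StandardScaler
standardized = (s - s.mean()) / s.std(ddof=0)

print(min_max.tolist())
print(standardized.tolist())
```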

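Log Transformation: Useful for data that’s highly right-skewed. It can help in reducing the impact of outlier values. Keep in mind that the logarithm is only defined for strictly positive values, so zeros or negative numbers need handling first (for example, np.log1p works when the column contains zeros).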

import numpy as np
df['log_transformed_column'] = np.log(df['column'])
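As a quick check that the transformation is doing what you want, you can compare skewness before and after. This sketch uses made-up values and np.log1p, which computes log(1 + x) and therefore tolerates zeros; np.expm1 reverses it.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical): contains a zero and one very large value
s = pd.Series([0, 1, 2, 3, 5, 8, 1000], name="column")

log_s = np.log1p(s)   # log(1 + x), defined at x = 0

# Skewness should shrink after the transform
print(s.skew(), log_s.skew())

# expm1 undoes log1p, handy when you need results on the original scale
restored = np.expm1(log_s)
```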

2. Encoding: For categorical data, use One-Hot Encoding or Label Encoding to convert categories into numbers.

One-Hot Encoding: This method creates a new column for each category and indicates the presence with a 1 or 0. It’s suitable when there’s no ordinal relationship between the categories.

df_encoded = pd.get_dummies(df, columns=['categorical_column'])

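Label Encoding: Categories are transformed into a sequence of integers. It’s most appropriate when there’s a natural ordinal relationship between the categories. Note that scikit-learn’s LabelEncoder assigns integers in sorted (alphabetical) order and is designed for target labels, so check that the resulting order matches the ranking you intend.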

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['encoded_column'] = encoder.fit_transform(df['categorical_column'])
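When the categories do have a ranking, you may want to spell that ranking out yourself instead of relying on alphabetical order. Here is a minimal sketch using an ordered pandas Categorical; the size column and its small < medium < large ranking are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical column)
df = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# Alphabetical order would give large=0, medium=1, small=2
alphabetical = {c: i for i, c in enumerate(sorted(df["size"].unique()))}
print(alphabetical)

# Assumption: the ranking we actually mean
order = ["small", "medium", "large"]

# .codes gives 0, 1, 2 in the order listed; unseen categories would get -1
df["size_encoded"] = pd.Categorical(df["size"], categories=order, ordered=True).codes
print(df)
```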

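Target Encoding (Mean Encoding): For each category, the mean of the target variable is calculated and used as the new feature. Because this encoding is built from the target itself, compute the means on your training data only; otherwise information about the target leaks into your features.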

mean_encoded = df.groupby(['categorical_column'])['target'].mean().to_dict()
df['encoded_column'] = df['categorical_column'].map(mean_encoded)
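A sketch of how target encoding might look with a train/test split, on made-up data. The means are learned from the training rows only, and categories that never appeared in training fall back to the overall training mean; that fallback is my assumption, and other choices (such as smoothing) are possible.

```python
import pandas as pd

# Toy data (hypothetical); "c" appears only in the test set
train = pd.DataFrame({
    "categorical_column": ["a", "a", "b", "b", "b"],
    "target": [1, 0, 1, 1, 0],
})
test = pd.DataFrame({"categorical_column": ["a", "b", "c"]})

# Learn the mapping from training data only
global_mean = train["target"].mean()
means = train.groupby("categorical_column")["target"].mean()

train["encoded_column"] = train["categorical_column"].map(means)

# Unseen categories map to NaN, so fall back to the overall training mean
test["encoded_column"] = test["categorical_column"].map(means).fillna(global_mean)
print(test)
```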

Remember, transformation decisions should always be based on the underlying nature of your data and the specific requirements of your subsequent analysis or modeling.

When in doubt, visualization and summarization techniques can guide your decisions in this stage.

Data Visualization

1. Histograms: Use histograms to see the distribution of a numerical variable.

import matplotlib.pyplot as plt
plt.hist(df['Income'])
plt.show()

2. Correlation Heatmap: To see how different variables interact, create a heatmap.

import seaborn as sns
corr = df.corr(numeric_only=True)  # skip non-numeric columns
sns.heatmap(corr, annot=True)
plt.show()

3. Pair Plots: These provide pairwise relationships for a quick overview.

sns.pairplot(df)
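When a heatmap gets crowded, a plain ranked list of the strongest correlations can be a handy companion. This is only a sketch on made-up data: it keeps the numeric columns, takes the upper triangle of the correlation matrix so each pair shows up once, and sorts by absolute value.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns); "city" is non-numeric on purpose
df = pd.DataFrame({
    "income": [30, 45, 60, 75, 90],
    "spend": [12, 20, 25, 33, 41],
    "age": [50, 23, 41, 35, 29],
    "city": ["Delhi", "Pune", "Delhi", "Mumbai", "Pune"],
})

corr = df.corr(numeric_only=True)

# Upper triangle only (k=1 skips the diagonal), so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```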

Step 4: Pay-off — Reaping the Benefits

Now it’s time to gather the rewards. The insights could be as simple as understanding your customer base better or as intricate as identifying the right features for a machine learning model.

Conclusion

You’ve now traversed the winding paths of EDA and arrived, hopefully, at some eye-opening insights. By employing the APP framework and diving deep into data summarization, cleaning, transformation, and visualization, you’ve set a solid foundation for any data science project.

So, the next time you’re handed a raw dataset, you’ll know how to tackle it like a pro.

I hope this comprehensive guide brings you much success in your future data science journey. Comment on this article if you found it valuable, and stay tuned for more!



Written by Richard Warepam