EDA with Python

TANMOY — Wed, 11 Jun 2025 03:28:05 GMT

What is Exploratory Data Analysis?

Exploratory data analysis, or EDA, is the process of understanding a dataset by summarizing its key features, often with plots. This stage is crucial, particularly when the data will be used to train machine learning models.

1. Importing the required libraries for EDA

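import pandas as pd
import numpy as np
import seaborn as sns                # data visualisation
import matplotlib.pyplot as plt      # data visualisation
import warnings as wr

# Silence library warnings to keep the notebook output readable
wr.filterwarnings('ignore')
# IPython/Jupyter magic: render plots inside the notebook (remove this line in a plain .py script)
%matplotlib inline
sns.set(color_codes=True)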

2. Loading the data

Download the dataset from this link

# Load the CSV file into a pandas DataFrame (df)
df = pd.read_csv("../input/dataset/data.csv")
# Display the top 5 rows
df.head(5)
# Display the bottom 5 rows
df.tail(5)
# Check the data type of each column
df.dtypes
# shape gives the number of rows (observations) and columns (features) in the dataset, an overview of its size and structure.
df.shape
# (1143, 13)  -> (number of rows, number of columns)
# info() shows the number of non-null records in each column, the type of data, and how much memory the dataset uses.
df.info()
# describe() gives a statistical summary of each numerical column: count, mean, standard deviation, minimum, quartiles (the 50% row is the median) and maximum. It summarizes the central tendency and spread of the data.
df.describe()

# Convert the column names of the DataFrame into a Python list, which makes them easy to access and manipulate.
df.columns.tolist()

Dropping the duplicate rows


# Number of rows and columns
df.shape
# Rows that are exact duplicates of an earlier row
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape[0])
df.count()      # Number of non-null values in each column
df = df.drop_duplicates()
df.head(5)
df.count()
# nunique() tells us how many unique values exist in each column, which gives insight into the variety of data in each feature.
df.nunique()
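
As a quick check of how duplicated() and drop_duplicates() behave, here is a minimal sketch on a small made-up DataFrame. The toy values and the variable names (toy, deduped) are assumptions for illustration only and are not taken from the dataset used in this article.

```python
import pandas as pd

# Toy data (made up for illustration): rows 0 and 2 are identical.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 10.0],
    "quality": [5, 5, 5, 6],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"duplicate rows: {n_duplicates}")
print(f"rows before: {len(toy)}, rows after: {len(deduped)}")
```

Only rows that match in every column count as duplicates here, so two wines with the same quality but different alcohol values are both kept.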

Dropping the missing or null values.

print(df.isnull().sum())   # Missing values per column, before dropping
df = df.dropna()    # Drop every row that contains at least one missing value
df.count()
print(df.isnull().sum())   # After dropping the values
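
Because dropna() removes a whole row when any single column is missing, it can be worth checking how many rows are lost. The sketch below does that on a made-up DataFrame; the values and names are assumptions for illustration, not the article's data.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration) with two missing values in different rows.
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.1, 11.2],
    "pH": [3.51, 3.20, np.nan, 3.16],
    "quality": [5, 5, 6, 6],
})

# Count and share of missing values in each column.
missing_per_column = toy.isnull().sum()
missing_share = toy.isnull().mean()

# dropna() drops a row if at least one of its columns is missing.
cleaned = toy.dropna()
rows_lost = len(toy) - len(cleaned)

print(missing_per_column)
print(f"rows lost by dropna(): {rows_lost} of {len(toy)}")
```

In this toy case each column is missing only one value, yet half of the rows are dropped, because the gaps fall in different rows.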

Detecting Outliers

An outlier is a data point, or a set of points, that differs markedly from the rest of the data. It can be unusually high or unusually low. It is often a good idea to detect outliers and consider removing them, because they can make a model less accurate.

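# 'Price', 'HP' and 'Cylinders' are example column names; use numeric columns that exist in your DataFrame.
# Each box plot gets its own figure so the plots do not overlap.
for column in ['Price', 'HP', 'Cylinders']:
    plt.figure()
    sns.boxplot(x=df[column])
plt.show()

# Compute the quartiles on numeric columns only; quantile() can fail on text columns.
numeric_df = df.select_dtypes(include='number')
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

# Keep only the rows in which no numeric value lies more than 1.5 * IQR outside the quartiles.
is_outlier = (numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))
df = df[~is_outlier.any(axis=1)]
df.shape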
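
To make the 1.5 × IQR rule concrete, here is a small worked sketch on a single made-up column. The numbers and the variable names are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import pandas as pd

# Toy values (made up for illustration); 95 is far from the rest.
values = pd.Series([10, 12, 12, 13, 14, 15, 16, 18, 95], name="value")

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

# Anything beyond these fences is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (values < lower_fence) | (values > upper_fence)
kept = values[~is_outlier]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print("outliers:", values[is_outlier].tolist())
```

Here Q1 is 12 and Q3 is 16, so the IQR is 4 and the fences sit at 6 and 22; only the value 95 falls outside them. Applied to a whole DataFrame, the same rule drops a row if any one of its numeric columns is outside its fences, which can remove many rows.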

Univariate Analysis

Univariate data:

Univariate data refers to a type of data in which each observation or data point corresponds to a single variable. In other words, it involves the measurement or observation of a single characteristic or attribute for each individual or item in the dataset.

1. Bar plot for counting the wines at each quality rating

quality_counts = df['quality'].value_counts()

plt.figure(figsize=(8, 6))
plt.bar(quality_counts.index, quality_counts, color='deeppink')
plt.title('Count Plot of Quality')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()
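
The bar heights come straight from value_counts(). As a minimal sketch on made-up ratings (the values are assumptions, not the article's data), the same counts can be read as numbers and as proportions:

```python
import pandas as pd

# Toy quality ratings (made up for illustration).
quality = pd.Series([5, 6, 5, 7, 5, 6, 4, 5], name="quality")

# value_counts() sorts by frequency; sort_index() orders by rating instead.
counts = quality.value_counts().sort_index()

# normalize=True returns the share of each rating rather than the count.
shares = quality.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```

Sorting by index keeps the ratings in their natural order, which is usually easier to read for an ordinal variable such as a quality score.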

2. Histogram with kernel density estimate for understanding the distribution and skewness of each numerical feature

sns.set_style("darkgrid")

numerical_columns = df.select_dtypes(include=["int64", "float64"]).columns

# Two plots per row, so only half as many rows as features are needed.
n_rows = (len(numerical_columns) + 1) // 2
plt.figure(figsize=(14, n_rows * 3))
for idx, feature in enumerate(numerical_columns, 1):
    plt.subplot(n_rows, 2, idx)
    sns.histplot(df[feature], kde=True)
    plt.title(f"{feature} | Skewness: {round(df[feature].skew(), 2)}")

plt.tight_layout()
plt.show()
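
The skewness value in each plot title can be read as follows: close to zero for a symmetric distribution, positive when the right tail is longer, and negative when the left tail is longer. A minimal sketch on made-up samples (the values are assumptions for illustration):

```python
import pandas as pd

# Toy samples (made up for illustration).
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])   # one large value stretches the right tail
left_skewed = -right_skewed                     # mirror image, so the left tail is longer

skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()
skew_left = left_skewed.skew()

print(f"symmetric:    {skew_symmetric:.2f}")
print(f"right-skewed: {skew_right:.2f}")
print(f"left-skewed:  {skew_left:.2f}")
```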


3. Swarm plot for showing the outliers in the data

plt.figure(figsize=(10, 8))

sns.swarmplot(x="quality", y="alcohol", data=df, palette='viridis')

plt.title('Swarm Plot for Quality and Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()

Bivariate Analysis

In bivariate analysis two variables are analyzed together to identify patterns, dependencies or interactions between them. This method helps in understanding how changes in one variable might affect another.

1. Pair plot for showing the pairwise relationships and the distribution of the individual variables

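sns.set_palette("Pastel1")

# pairplot() creates its own figure, so no plt.figure() call is needed here.
g = sns.pairplot(df)

# Put the title on the pair plot's figure, slightly above the grid of plots.
g.figure.suptitle('Pair Plot for DataFrame', y=1.02)
plt.show()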

2. Violin Plot for examining the relationship between alcohol and Quality.

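# Convert 'quality' to str on a copy, so that it stays numeric in df for later steps such as df.corr().
plot_df = df.copy()
plot_df['quality'] = plot_df['quality'].astype(str)

plt.figure(figsize=(10, 8))

sns.violinplot(x="quality", y="alcohol", data=plot_df, palette={
               '3': 'lightcoral', '4': 'lightblue', '5': 'lightgreen', '6': 'gold', '7': 'lightskyblue', '8': 'lightpink'}, alpha=0.7)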

plt.title('Violin Plot for Quality and Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()

For interpreting the violin plot:

A wider section shows higher density, meaning more data points at that value.
A symmetrical shape shows a balanced distribution.
A peak or bulge marks the most common value in the distribution.
Longer tails show greater variability.
The median marker is the middle mark inside the violin. It helps in understanding central tendency.

3. Box Plot for examining the relationship between alcohol and Quality

sns.boxplot(x='quality', y='alcohol', data=df)

For interpreting the box plot:

The box represents the IQR: the longer the box, the greater the variability.
The median line in the box shows central tendency.
Whiskers extend from the box to the smallest and largest values within a specified range (by default, 1.5 × IQR beyond the box edges).
Individual points beyond the whiskers represent outliers.
A compact box shows low variability, while a stretched box shows higher variability.
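
The quantities a box plot draws can also be computed directly. The sketch below does this per quality group on a made-up DataFrame; the helper box_stats and the toy values are hypothetical and only serve to illustrate the reading rules above.

```python
import pandas as pd

# Toy data (made up for illustration): two quality groups.
toy = pd.DataFrame({
    "quality": [5, 5, 5, 5, 6, 6, 6, 6],
    "alcohol": [9.0, 9.5, 10.0, 10.5, 10.0, 11.0, 12.0, 13.0],
})

def box_stats(values):
    """Return the numbers a box plot is built from (hypothetical helper)."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    return pd.Series({
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers reach the most extreme data points inside these fences.
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    })

# One row of statistics per quality group.
summary = pd.DataFrame(
    {quality: box_stats(group) for quality, group in toy.groupby("quality")["alcohol"]}
).T

print(summary)
```

In this toy data the quality 6 group has twice the IQR of the quality 5 group, which would show up as a box twice as long.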

Multivariate Analysis

Multivariate analysis involves examining the interactions between three or more variables in a dataset at the same time. It aims to identify complex patterns and relationships, which helps in understanding how multiple variables collectively behave and influence each other.

plt.figure(figsize=(15, 10))

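# Make sure 'quality' is numeric (it may have been converted to str for the violin plot), then correlate numeric columns only.
corr = df.astype({'quality': int}).corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt='.2f', cmap='Pastel2', linewidths=2)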

plt.title('Correlation Heatmap')
plt.show()
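
The heatmap is a picture of the matrix returned by corr(). As a minimal sketch on made-up data (the column volatile_acidity and all values are assumptions for illustration), the same matrix can be used to rank features by the strength of their correlation with quality:

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "volatile_acidity": [0.70, 0.66, 0.60, 0.45, 0.40, 0.30],
    "quality": [4, 5, 5, 6, 7, 7],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

# Correlation of each feature with 'quality', strongest first by absolute value.
target_corr = corr["quality"].drop("quality").sort_values(key=abs, ascending=False)

print(corr.round(2))
print(target_corr.round(2))
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 indicates a weak one. Correlation only captures linear association and does not show that one variable causes another.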

Stories by TANMOY on Medium