“The EDA Journey: Uncovering Treasures within Your Data”.
Table of contents :-
Understanding the data.
What is EDA?
What are Univariate, Bivariate and Multivariate Analysis?
Pandas profiling.
Before diving into Exploratory Data Analysis (EDA), it is essential to develop a deep understanding of the data. This means getting to know the dataset's size, structure, data types, and quality.
To understand these aspects of the dataset, ask 7 questions.
Let's discuss these 7 questions with the help of the well-known Titanic dataset.
import pandas as pd
df = pd.read_csv("D:/Machine Learning/EDA/Titanic-Dataset.csv")
Que1) How Big is the Data?
df.shape
Output :-
This question focuses on the size of the dataset: df.shape returns the number of rows and columns. Knowing the size of the data helps us gauge the overall volume and complexity of the dataset.
Que2) What does the data look like?
df.head()
Output :-
This question focuses on getting a first look at the data. By inspecting the first few rows with df.head(), we can get a sense of the data's structure, format, and general patterns.
Que3) What is the data type of columns?
df.info()
Output :-
Understanding the data type of each column is crucial. This involves identifying whether each column contains numerical values (e.g., integers or floats), categorical values (stored in pandas as object or category columns), dates, or other specific data types. Knowing the data types helps us apply appropriate analysis techniques and transformations.
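As a rough illustration of separating columns by data type, the sketch below uses a small made-up DataFrame (the name toy and its values are hypothetical stand-ins for a few Titanic columns, not the real data).

```python
import pandas as pd

# Hypothetical toy data standing in for a few Titanic columns.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Split the columns into numeric ones and everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

# A text column with few distinct values can be stored as a categorical dtype.
toy["Sex"] = toy["Sex"].astype("category")

print(numeric_cols)
print(other_cols)
print(toy.dtypes)
```

Splitting the columns this way makes it easier to decide later which ones get histograms and which ones get count plots.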
Que4) Are there any missing values?
df.isnull().sum()
Output :-
Missing values are unknown entries in the dataset. Identifying and handling missing values is essential for accurate analysis.
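Counts alone can be hard to judge, so it often helps to look at the share of missing values too. A minimal sketch on a made-up DataFrame (the name toy and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values per column.
missing_count = toy.isnull().sum()

# Share of missing values per column, as a percentage.
missing_pct = (toy.isnull().mean() * 100).round(1)

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

A column that is mostly empty may be a candidate for dropping, while a column with only a few gaps may be a candidate for imputation.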
Que5) What does the data look like mathematically?
df.describe()
Output :-
This question focuses on exploring the statistical properties of the data. For each numeric column, df.describe() reports the count, the mean, the standard deviation, the minimum and maximum, and the quartiles (the 50% row is the median), giving a quick view of central tendency and dispersion.
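The statistics mentioned above can also be computed one at a time. The sketch below uses a made-up Age column (hypothetical values) to show how the median and the variance relate to the describe() output.

```python
import pandas as pd

# Hypothetical ages, chosen so the numbers are easy to check by hand.
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0, 50.0]})

stats = toy["Age"].describe()   # count, mean, std, min, 25%, 50%, 75%, max
median = toy["Age"].median()    # same value as the "50%" row
variance = toy["Age"].var()     # sample variance, i.e. std squared

print(stats)
print("median:", median)
print("variance:", variance)
```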
Que6) Are there duplicate values?
df.duplicated().sum()
Output :-
Identifying and handling duplicate values is crucial for maintaining data integrity. By checking for and removing duplicate entries, we ensure that each observation is unique and representative of distinct entities. Detecting duplicates avoids potential biases or inaccuracies in the analysis.
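A small sketch of counting and then removing duplicate rows, using a made-up DataFrame (the names and values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data in which the third row repeats the first.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cara"],
    "Age": [30, 41, 30, 25],
})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print("duplicate rows:", n_dupes)
print(deduped)
```

Both methods compare entire rows by default; passing subset=[...] restricts the comparison to chosen columns.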
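Que7) How are the columns correlated?
df.corr(numeric_only=True) # numeric_only=True skips text columns such as Name and Sex, which would otherwise raise an error in recent pandas versions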
Output :-
Understanding the correlation between columns helps uncover relationships and dependencies within the dataset. By calculating correlation coefficients, such as Pearson’s correlation or Spearman’s rank correlation, we can determine the strength and direction of the relationships between pairs of variables.
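To make the difference between the two coefficients concrete, here is a minimal sketch on made-up data (the columns x and y are hypothetical). Because y is the square of x, the relationship is perfectly monotonic but not linear.

```python
import pandas as pd

# Hypothetical data: y grows with x, but not along a straight line.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})

# Pearson measures linear association.
pearson = toy.corr(method="pearson").loc["x", "y"]

# Spearman measures monotonic association, computed on ranks.
spearman = toy.corr(method="spearman").loc["x", "y"]

print("pearson:", round(pearson, 3))
print("spearman:", round(spearman, 3))
```

Spearman reaches 1 here because the ordering of y matches the ordering of x exactly, while Pearson stays slightly below 1 because the points do not lie on a straight line.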
By answering these seven questions, we establish a strong foundation for conducting a useful Exploratory Data Analysis (EDA) and uncovering valuable insights from the data in a simple and understandable manner.
Code and dataset Link :-
What is Exploratory Data Analysis(EDA)?
EDA stands for Exploratory Data Analysis, which is the process of analyzing and visualizing data to uncover patterns, relationships, and insights.
Types of EDA techniques :-
- Univariate Analysis
- Bivariate Analysis
- Multivariate Analysis
1. Univariate Analysis :-
Univariate analysis in Exploratory Data Analysis (EDA) involves analyzing and summarizing a single variable to gain insights and patterns.
Let's understand univariate analysis with the help of the Titanic dataset.
import pandas as pd
import seaborn as sns
Categorical Data :-
1. Count plot :- A count plot displays the count or frequency of observations for each category.
sns.countplot(data=df, x='Survived') # recent seaborn versions expect keyword arguments
df['Survived'].value_counts()
#df['Survived'].value_counts().plot(kind='bar')
Output :- The graph below displays the count of passengers who survived and who did not.
2. Pie Chart :- A pie chart shows each category's share of the total as a percentage.
df['Survived'].value_counts().plot(kind='pie',autopct='%.2f')
Output :- 61.62% of the passengers died and only 38.38% survived.
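The percentages shown on a pie chart can also be computed directly. A minimal sketch on a made-up Survived column (hypothetical values, not the real Titanic counts):

```python
import pandas as pd

# Hypothetical outcomes: three passengers died (0), two survived (1).
survived = pd.Series([0, 0, 0, 1, 1], name="Survived")

# normalize=True returns shares instead of raw counts.
pct = (survived.value_counts(normalize=True) * 100).round(2)

print(pct)
```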
Numerical Data :-
1. Histogram :- A histogram represents the distribution of a continuous or numeric variable. It consists of a series of adjacent bars, where the width of each bar represents a range of values (a bin), and the height of each bar corresponds to the frequency or count of observations within that range.
import matplotlib.pyplot as plt
plt.hist(df['Age'].dropna(), bins=5) # drop missing ages before binning
Output :-
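To see the numbers behind the bars, the bin edges and counts can be computed with NumPy's histogram function. The ages below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, not taken from the Titanic file.
ages = pd.Series([2, 15, 22, 25, 31, 38, 47, 80], name="Age")

# Split the range from the minimum to the maximum into 5 equal-width bins.
counts, edges = np.histogram(ages, bins=5)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

Each bin here is 15.6 years wide, because the range from 2 to 80 is divided into 5 equal parts. Changing the bins value changes how coarse or fine the picture is.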
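2. Distribution plot :- It shows the distribution of the data as a histogram overlaid with a smoothed estimate of the probability density function. The y-axis shows probability density, not probability.
sns.histplot(data=df, x='Age', kde=True, stat='density') # replaces the deprecated sns.distplot(df['Age'])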
Output :-
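3. Box Plot :- A box plot is a graphical representation of the distribution of a numeric variable, showing its median, quartiles, and potential outliers. It summarizes the data with the five-number summary (minimum, first quartile, median, third quartile, maximum), and points beyond the whiskers are drawn separately as potential outliers.
sns.boxplot(data=df, x='Fare')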
Output :-
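The values a box plot draws can be reproduced by hand. The sketch below uses made-up fares and the common 1.5 × IQR rule for flagging outliers, which is also the default whisker length in seaborn and matplotlib.

```python
import pandas as pd

# Hypothetical fares with one very large value.
fares = pd.Series([7, 8, 8, 10, 13, 15, 26, 30, 200], name="Fare")

# Quartiles form the box; the median is the line inside it.
q1, median, q3 = fares.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are flagged as potential outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]

print("Q1, median, Q3:", q1, median, q3)
print("fences:", lower, upper)
print("outliers:", outliers.tolist())
```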
Code and dataset Link :-
2. Bivariate and Multivariate Analysis :-
Bivariate analysis in Exploratory Data Analysis (EDA) involves analyzing the relationship between two variables to understand their association and uncover patterns or correlations.
Multivariate analysis refers to the process of analyzing and visualizing three or more variables simultaneously to uncover patterns, relationships, and dependencies among them.
Let's understand bivariate and multivariate analysis with the help of the tips, flights, iris and Titanic datasets.
import pandas as pd
import seaborn as sns
tips = sns.load_dataset('tips')
flights = sns.load_dataset('flights')
iris = sns.load_dataset('iris')
titanic = pd.read_csv("D:/Machine Learning/EDA/Titanic-Dataset.csv")
1. Scatter Plot (Numerical - Numerical) :- A scatter plot is a graphical representation that displays the relationship and distribution of two continuous variables through a collection of points on a coordinate system. Extra variables can be encoded through color (hue), marker style, and marker size, which turns the plot into a multivariate one.
#sns.scatterplot(x=tips['total_bill'], y=tips['tip'], hue=tips['sex'], style=tips['smoker'], size=tips['size'])
sns.scatterplot(data = tips,x='total_bill',y='tip',hue='sex',style='smoker',size='size')
Output :-
2. Bar Plot (Numerical - Categorical) :-
A bar plot uses rectangular bars to compare a numeric variable across the levels of a categorical variable. By default, seaborn's barplot draws the mean of the numeric variable for each category, with an error bar showing the uncertainty around that mean.
sns.barplot(data=titanic, x='Pclass', y='Age', hue='Sex')
#sns.barplot(x=titanic['Pclass'], y=titanic['Age'], hue=titanic['Sex'])
Output :-
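With its default settings, seaborn's barplot sets each bar's height to a group mean, so the same numbers can be obtained with a groupby. The sketch below uses a made-up DataFrame (hypothetical values) with the same column names as the plot above.

```python
import pandas as pd

# Hypothetical passengers, two per class and sex.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex": ["male", "male", "female", "female",
            "male", "male", "female", "female"],
    "Age": [40, 50, 30, 36, 20, 30, 18, 22],
})

# Mean age for every class and sex combination, one row per class.
mean_age = toy.groupby(["Pclass", "Sex"])["Age"].mean().unstack()

print(mean_age)
```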
3. Box Plot (Numerical - Categorical) :- A box plot compares the distribution of a numeric variable across categories and helps identify outliers within each group.
sns.boxplot(data=titanic, x='Sex', y='Age', hue='Survived')
#sns.boxplot(x=titanic['Sex'], y=titanic['Age'], hue=titanic['Survived'])
Output :-
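4. KDE plot (Numerical - Categorical) :- A kernel density estimate (KDE) plot draws a smoothed curve of a numeric variable's distribution, so plotting one curve per category lets us compare the groups. The older sns.distplot function combined a histogram with a KDE but is deprecated, so sns.kdeplot is used here.
sns.kdeplot(data=titanic[titanic['Survived']==0], x='Age') # passengers who died
sns.kdeplot(data=titanic[titanic['Survived']==1], x='Age') # passengers who survived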
Output :-
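5. Heatmap (Categorical - Categorical) :- A heatmap is a graphical representation that uses colors to display the intensity or magnitude of values in a matrix or table. Here the table is a cross-tabulation that counts passengers for each combination of the two categories.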
sns.heatmap(pd.crosstab(titanic['Pclass'],titanic['Survived']))
Output :-
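The heatmap only colors a table, so it helps to look at the table itself. The sketch below builds a cross-tabulation on made-up data (hypothetical values) and also shows the row-normalized version, which gives the survival rate within each class.

```python
import pandas as pd

# Hypothetical passengers.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Raw counts for every class and outcome combination.
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

# normalize="index" divides each row by its total, giving rates per class.
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(counts)
print(rates.round(2))
```

Passing rates instead of counts to sns.heatmap makes classes of different sizes easier to compare.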
6. Cluster map(Categorical — Categorical) :- A cluster map is a visual representation that uses hierarchical clustering to group and display similarities or dissimilarities between variables or samples in a dataset.
sns.clustermap(pd.crosstab(titanic['Parch'],titanic['Survived']))
Output :-
7. Pair plot (Numerical - Numerical) :- A pair plot is a visual representation that displays pairwise relationships between variables in a dataset using scatter plots or other plot types.
It is a collection of scatter plots.
sns.pairplot(iris,hue='species')
Output :-
8. Line Plot (Numerical - Numerical) :- A line plot is a visual representation that displays the trend and pattern of a variable over a continuous axis, using connected line segments to represent the values over that axis.
A line plot is useful when we have time-related columns such as date, month, or year.
new = flights.groupby('year')['passengers'].sum().reset_index() # sum only passengers; the categorical month column cannot be summed
sns.lineplot(data=new, x='year', y='passengers')
Output :-
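The aggregation step before the line plot can be checked on a tiny table. The sketch below uses made-up monthly figures shaped like the flights dataset (hypothetical values), including a categorical month column.

```python
import pandas as pd

# Hypothetical monthly passenger counts for two years.
toy = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": pd.Categorical(["Jan", "Feb", "Jan", "Feb"]),
    "passengers": [112, 118, 115, 126],
})

# Select the numeric column before summing so the categorical month is ignored.
yearly = toy.groupby("year")["passengers"].sum().reset_index()

print(yearly)
```

The result has one row per year, which is the shape a line plot of yearly totals needs.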
Code and dataset Link :-
Pandas Profiling :-
Pandas profiling is a Python library that generates a comprehensive HTML report providing descriptive statistics and insights about a given dataset. With it, users can quickly generate an overview of the data, including information about data types, missing values, correlations, and basic statistical measures. The project has since been renamed ydata-profiling, and the pandas-profiling package name is deprecated.
import pandas as pd
df = pd.read_csv("D:/Machine Learning/EDA/Titanic-Dataset.csv")
# pandas-profiling has been renamed to ydata-profiling
!pip install ydata-profiling
from ydata_profiling import ProfileReport
prof = ProfileReport(df) # Create the profile report object.
prof.to_file(output_file='output.html') # Write the report to an HTML file.
After executing the above code, a new file named “output.html” will be created. This file contains the full report on the data.
HTML File Link :-
Code and dataset link :-
Thank you for joining me on this insightful journey through Exploratory Data Analysis (EDA).
I hope this blog has enhanced your understanding of EDA and its various techniques. Remember, EDA is not just a process; it’s a gateway to discovering meaningful insights and unlocking the true potential of your data.
Once again, thank you for joining me on this data exploration journey. May your future analyses be filled with valuable discoveries and exciting insights.