Data is of little use until you understand it and visualize it effectively. That process is what we call Exploratory Data Analysis (EDA).
Sounds intermediate?
Okay, let’s start the beginner way.
You are taking a data mining class, and your professor asks you to find a dataset on the internet and perform EDA on it before implementing any data mining algorithms.
Now you are in a dilemma: what is a dataset, what is EDA, and how do you do it?
Follow this journey and your dilemma will fade away.
We will cover:
- Understanding what a dataset is
- Finding a dataset online
- Doing the EDA cycle, with an interpretation of each plot:
  - Importing libraries/methods
  - Reading/importing data
  - A quick look at the dataset
  - A glance at the summary statistics
  - A quick look at the complete nature of the dataset columns (attributes)
  - Checking for missing values and treating them
  - Creating a histogram
  - Creating a scatter plot
  - Creating a line plot
  - Creating a box plot
  - Creating a heat map (correlation matrix)
What is a dataset, and where do you find one?
A dataset is a structured collection of data, generally associated with people, organizations, or any other subject. Note that datasets don’t have to be tabular.
Where can I download free, open datasets for analysis?
Here are the primary dataset finders:
Kaggle: A data science site that hosts a wide variety of externally contributed datasets. You can find all kinds of niche datasets in its master list, and it also provides tutorials on various data-related activities.
UCI Machine Learning Repository: One of the oldest sources of datasets on the web, and a great first stop when looking for interesting datasets. You can also download data directly from the UCI Machine Learning repository, without registration.
FYI: Also try the Google Dataset Search engine.
Now you know what a dataset is and where to find one.
The next step is to choose the dataset you will use for EDA.
For demo purposes, I am using the [Student Alcohol Consumption | Kaggle] dataset.
Student Alcohol Consumption
Social, gender and study data from secondary school students
Go to the above link and download it.
Once downloaded, we will get into the journey of exploring data.
As you are a beginner (I hope so!) and may not have installed Python and Jupyter yet, install them now because we will be using them.
Install Python and Jupyter Notebook (I prefer using [Anaconda] as it is simple and easy).
Once you install them, open up Jupyter Notebook.
Now we will start our journey through the EDA cycle:
EDA cycle: Understanding data quality, description, shape, patterns, relationships, and visualizing it for better understanding.
What is EDA then?
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
- Maximize insight into a data set
- Uncover underlying structure
- Extract important variables
- Detect outliers and anomalies
- Test underlying assumptions, etc.
Read more about EDA at:
1.1.1. What is EDA?
The first rule of data analysis is:
You should always have a clear idea of what your OBJECTIVE is.
Once you have a clear view of what you want to achieve from the data, it will be far easier to analyze it.
For this demo, my objective is to find the relationship between “age” and “absences.”
Snapshot of Dataset
Now let’s import the needed libraries/methods.
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Then let’s read our dataset and save it to “df”. You can give “df” any name you like.
# Reading the .csv file
df = pd.read_csv('student-por.csv')
Let’s have a quick look at the dataset.
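A quick look usually means checking the size of the table and printing its first few rows. Here is a minimal sketch of that step. The rows below are made up for illustration (only the column names "sex", "age", and "absences" come from the dataset); with the real file you would keep the `df` loaded by `pd.read_csv`.

```python
import pandas as pd

# Hypothetical stand-in rows, NOT real records from student-por.csv
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M"],
    "age": [15, 16, 17, 17, 18, 19],
    "absences": [0, 2, 30, 4, 6, 20],
})

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first five rows by default
```

In a Jupyter notebook you can also end a cell with just `df.head()` and the table is rendered for you.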
Let’s see some summary statistics.
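Summary statistics in pandas typically come from `describe()`, which reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column. The sketch below assumes a small made-up table rather than the real student data.

```python
import pandas as pd

# Hypothetical stand-in rows, NOT real records from student-por.csv
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M"],
    "age": [15, 16, 17, 17, 18, 19],
    "absences": [0, 2, 30, 4, 6, 20],
})

# describe() skips the text column "sex" and summarizes the numeric ones
summary = df.describe()
print(summary)
```

Reading the "min" and "max" rows is a quick way to spot values that look out of range.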
Let’s see the complete nature of dataset columns (attributes).
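One common way to see the nature of each column is `df.info()`, which lists every column with its non-null count and data type. This is a minimal sketch on made-up rows, assuming the same three column names used in this demo.

```python
import pandas as pd

# Hypothetical stand-in rows, NOT real records from student-por.csv
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "age": [15, 16, 17, 18],
    "absences": [0, 2, 4, 6],
})

df.info()         # column names, non-null counts, and dtypes
print(df.dtypes)  # just the data type of each column
```

Text columns such as "sex" are not numeric, which matters later because statistics like the mean and the correlation only apply to numeric columns.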
Let’s see whether we have any missing values.
# Inspect missing values in the dataset
print(df.isnull().values.sum())
We get ‘0’ (zero) as output because our data does not contain any missing values: it is clean.
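To see what this check looks like when the data is not clean, here is a small sketch on a made-up table with one gap. The total count tells you whether anything is missing, and the per-column count tells you where.

```python
import pandas as pd

# Hypothetical rows with one missing "absences" value (not from the real dataset)
df = pd.DataFrame({
    "age": [15, 16, 17, 18],
    "absences": [0, None, 4, 6],
})

total_missing = df.isnull().values.sum()  # one number for the whole table
per_column = df.isnull().sum()            # one number per column

print(total_missing)
print(per_column)
```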
Here is a tip: what if our data is missing some values, and how do we handle that?
In that case, do the following:
- Replace missing values with NaN
- Fill them in with mean imputation
- Count the number of NaNs to check whether all missing values have been filled
Read more about why and what mean imputation is:
6 Different Ways to Compensate for Missing Data (Data Imputation with examples)
Popular strategies to statistically impute missing values in a dataset.
# Inspect missing values in the dataset
print(df.isnull().values.sum())
# Replace the ' 's with NaN
df = df.replace(" ", np.nan)
# Impute the missing values with mean imputation (numeric columns only)
df = df.fillna(df.mean(numeric_only=True))
# Count the number of NaNs in the dataset to verify
print(df.isnull().values.sum())
We get ‘0’ again, so this data does not contain any missing values.
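Because the demo data has no gaps, the imputation step has nothing to do here. The following worked sketch uses a made-up column with one missing value to show what mean imputation actually computes.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one gap (not taken from the student dataset)
absences = pd.Series([0, 4, np.nan, 8], name="absences")

# mean() skips NaN by default: (0 + 4 + 8) / 3 = 4.0
column_mean = absences.mean()

# The gap is filled with the column mean
filled = absences.fillna(column_mean)

print(column_mean)
print(filled.isnull().sum())
```

Mean imputation keeps the column average unchanged, but it also shrinks the spread of the data, so it is best treated as a simple starting point.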
We are now ready to generate some graphs.
# seaborn histogram
sns.histplot(df['age'], bins=9, color='blue')
# Add labels
plt.xlabel('age')
plt.ylabel('count')
Interpretation of Histogram
- Students aged 17 make up the largest group
Interpretation of Scatter Plot
- The highest absence count, about 30, occurs among students aged 17
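A scatter plot like the one interpreted above can be sketched as follows. The axis assignment (age on x, absences on y), the title, and the rows are assumptions made for illustration; only the column names come from the dataset.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in rows, NOT real records from student-por.csv
df = pd.DataFrame({
    "age": [15, 16, 17, 17, 18, 19],
    "absences": [0, 2, 30, 4, 6, 20],
})

# One point per student: age on the x-axis, absences on the y-axis
ax = sns.scatterplot(x="age", y="absences", data=df)
ax.set_title("Absences by age")  # hypothetical title
```

Each dot is one student, so an isolated dot far above the rest points to an unusually high absence count for that age.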
Creating Line plot
sns.lineplot(x='absences',y='age', data=df )
Interpretation of Line Plot
- Age 19 is associated with a high absence count of about 20
- The blue shaded band shows the confidence interval
Creating a Box plot
sns.boxplot(y = 'age',data= df, x= 'sex')
Interpretation of Box Plot
- Female students are concentrated between the ages of 16 and 18
- Male students are concentrated in the same age range
- The median age is 17
Creating a Heat Map (a.k.a. Correlation Matrix)
# Heat map of the Pearson correlation matrix
corrmat = df.corr(numeric_only=True)  # use numeric columns only
f, ax = plt.subplots(figsize=(16, 12))
sns.heatmap(corrmat, vmax=.8, square=True);
Interpretation of Heat Map
- The Pearson correlation matrix shows how strongly each pair of columns is related
- Lighter cells, near 0.8 on the color scale at the right, mark highly correlated columns, while darker cells, around -0.2 or below, mark columns with little or weakly negative correlation
- This also helps with feature selection
If you find it hard to judge from the colors alone which columns are highly correlated, you can create the heat map in the following style instead.
plt.title('Pearson Correlation of Features', size = 15)
colormap = sns.diverging_palette(10, 220, as_cmap = True)
sns.heatmap(df.corr(numeric_only=True),
            cmap = colormap,
            square = True,
            annot = True)
Interpretation of Heat Map
- In the correlation matrix above, we also printed the numbers, so it is easy to see which columns are highly correlated: a value close to 1.00 means a strong correlation.
- As per our objective, we see that “age” and “absences” have a value of 0.15, which is below 0.25, so we can say that they are only weakly correlated.
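If you only care about one pair of columns, you can compute that single number directly instead of reading it off the heat map. This is a sketch on made-up rows, so the value it prints is not the 0.15 reported for the real data.

```python
import pandas as pd

# Hypothetical stand-in rows, NOT real records from student-por.csv
df = pd.DataFrame({
    "age": [15, 16, 17, 17, 18, 19],
    "absences": [0, 2, 30, 4, 6, 20],
})

# Pearson correlation between two columns (Pearson is the default method)
r = df["age"].corr(df["absences"])

# The same number appears in the full correlation matrix
corrmat = df.corr(numeric_only=True)

print(round(r, 2))
print(round(corrmat.loc["age", "absences"], 2))
```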
As the objective for this demo was to see the relationship between age and absences, we have completed our EDA cycle.
Remember that this much EDA does not complete the EDA cycle in general. How deep your EDA goes depends on the objective you are trying to reach. However, this article should give you (as a beginner in data analysis and data mining) an idea of what a dataset is, where to get one, and how to do basic EDA in Python.
Get the entire working Jupyter notebook on my GitHub.
Do you want one magical line of code that can build an entire EDA cycle for you? Check out my Medium article.
Stay tuned for the next data science post.