Exploratory Data Analysis for Beginners

Doing EDA for the First Time!

Mala Deep
Feb 14 · 7 min read
We will generate all of the graphs

Data is nothing until you understand it and visualize it effectively, and this is what we call Exploratory Data Analysis (EDA).

Suppose you are taking a data mining class, and your professor asks you to find a dataset on the internet and perform EDA on it before applying any data mining algorithms.

Now you are in a dilemma over terms like “dataset”, and over what EDA is and how to do it.

Follow this journey, and your dilemma will fade away.

We will cover:

  1. Understanding what a dataset is
  2. Finding a dataset online
  3. Doing the EDA cycle:
  • Importing libraries/methods
  • Reading/importing data
  • A quick look at the dataset
  • A glance at the summary statistics
  • A quick look at the complete nature of the dataset columns (attributes)
  • Checking for missing values and treating them
  • Creating a histogram
  • Creating a scatterplot
  • Creating a line plot
  • Creating a box plot
  • Creating a heat map (correlation matrix)

along with the interpretation of each.

What and where is the data set?

Where can I download free, open datasets for analysis?

Here are the primary dataset finders:

Kaggle: A data science site that contains a variety of externally-contributed new datasets. You can find all kinds of niche datasets in its master list, and it also provides a tutorial on various data-related activities.

UCI Machine Learning Repository: One of the oldest sources of datasets on the web, and a great first stop when looking for interesting datasets. You can also download data directly from the UCI Machine Learning repository, without registration.

FYI: Try using the Google Dataset Search engine.

Now you know what a dataset is and where to find one.

The next step is to choose the dataset you will use for EDA.

For demo purposes, I am using the [Student Alcohol Consumption | Kaggle] dataset.

Go to the above link and download it.

Once downloaded, we will get into the journey of exploring data.

If you are a beginner (hope so!) and haven’t installed Python and Jupyter yet, install them now because we will be using them.

Install Python and Jupyter Notebook (I prefer using [Anaconda] as it is simple and easy).

Once you install them, open up Jupyter Notebook.

Now we will start our journey through the EDA cycle:

EDA cycle: Understanding data quality, description, shape, patterns, relationships, and visualizing it for better understanding.

What is EDA then?

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to

  • Maximize insight into a data set
  • Uncover underlying structure
  • Extract important variables
  • Detect outliers and anomalies
  • Test underlying assumptions, etc.

Read more about EDA at:

Once you have a clear view of what you want to achieve from the data, it will be far easier to analyze it.

Snapshot of Dataset


Now let’s import the needed libraries/methods.

# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Then let’s read our dataset and save it to “df”; you can give “df” any name you like.

# Reading the .csv file
df = pd.read_csv('student-por.csv')
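As a minimal sketch of what read_csv does, the snippet below parses a tiny in-memory CSV instead of the downloaded file. The three rows are a hypothetical stand-in, not the real contents of student-por.csv, and the separator is an assumption: if your copy of the file uses semicolons, pass sep=";" instead.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for student-por.csv (not the real data).
csv_text = "school,sex,age,absences\nGP,F,18,4\nGP,M,17,2\nMS,F,16,0\n"

# read_csv accepts a file path or any file-like object; sep="," is the default.
df = pd.read_csv(io.StringIO(csv_text), sep=",")

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names taken from the header line
```

If the shape shows a single column, the separator is probably wrong, which is worth checking before going any further.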

Let’s have a quick look at the dataset.

df.head()

Let’s see some summary statistics.

df.describe()
Result of df.describe()
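By default, describe() summarizes only the numeric columns. The sketch below, on a small made-up DataFrame (the values are illustrative, not taken from the real dataset), shows that behavior and one way to summarize a text column as well.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "sex": ["F", "M", "F", "F"],
    "age": [15, 16, 17, 18],
    "absences": [0, 2, 4, 6],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = df.describe()

# A text column: count, number of unique values, most frequent value, its frequency.
sex_summary = df["sex"].describe()

print(numeric_summary)
print(sex_summary)
```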

Let’s see the complete nature of dataset columns (attributes).

df.info()
Result of df.info()
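The output of info() lists each column with its data type. If you want that split as Python lists, for example to decide which columns can go into a histogram or a correlation matrix, one possible approach is select_dtypes. The DataFrame here is toy data, not the real dataset.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [18, 17, 16],
    "absences": [4, 2, 0],
})

# Columns holding numbers versus everything else (here: text columns).
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```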

Missing Values?

Let’s see whether we have any missing values.

# Inspect missing values in the dataset
print(df.isnull().values.sum())

We get ‘0’ (zero) as output because our data does not contain any missing values: it is clean.
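The single total above does not say where missing values are. As a sketch on toy data (with gaps deliberately inserted, since the real dataset has none), isnull().sum() without .values gives a count per column.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries, for illustration only.
df = pd.DataFrame({
    "age": [15, np.nan, 17],
    "absences": [0, 2, np.nan],
    "sex": ["F", None, "M"],
})

# Missing values per column, then the grand total.
per_column = df.isnull().sum()
total_missing = int(per_column.sum())

print(per_column)
print(total_missing)
```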

Here is one tip: what if our data is missing some values, and how do we handle that?

In that case, do the following:

  • Replace missing values with NaN
  • Impute them with mean imputation
  • Count the number of NaNs to check whether all missing values have been filled in

Mean imputation applies only to numeric columns, since text columns have no mean.

Read more about: Why and What is Mean imputation?

# Inspect missing values in the dataset
print(df.isnull().values.sum())
# Replace the ' 's with NaN
df = df.replace(" ", np.nan)
# Impute the missing values in numeric columns with each column's mean
df = df.fillna(df.mean(numeric_only=True))
# Count the number of NaNs in the dataset to verify
print(df.isnull().values.sum())

We again get ‘0’, which confirms that this data does not contain any missing values.
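Because the real dataset has no gaps, the imputation step has nothing to do there. The sketch below uses a toy DataFrame with one missing age to show the effect; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one missing age, for illustration only.
df = pd.DataFrame({
    "age": [15.0, np.nan, 17.0],
    "sex": ["F", "M", "F"],
})

# Column means for numeric columns only; text columns are skipped.
column_means = df.mean(numeric_only=True)

# fillna with a Series fills each column using the value under its own name.
filled = df.fillna(column_means)

print(filled)
print(filled.isnull().values.sum())
```

The missing age becomes 16.0, the mean of 15 and 17, while the text column is left untouched.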

We are now ready to generate some graphs.

Creating Histogram

# seaborn histogram (histplot replaces the deprecated distplot)
sns.histplot(df['age'], bins=9, color='blue', edgecolor='black')
# Add labels
plt.title('Age count')
plt.xlabel('Age')
plt.ylabel('Count')
Histogram of age count

Interpretation of Histogram

  • Students aged 17 are the most numerous
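A reading like this one can be double-checked numerically. As a sketch, value_counts() returns the same counts that the histogram bars represent; the ages below are toy values, not the real column.

```python
import pandas as pd

# Toy ages for illustration only.
ages = pd.Series([15, 16, 17, 17, 17, 18, 16], name="age")

# Count per age, ordered by age like the histogram's x-axis.
counts = ages.value_counts().sort_index()

# The age with the tallest bar.
most_common_age = counts.idxmax()

print(counts)
print(most_common_age)
```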

Creating Scatterplot

#seaborn Scatterplot
sns.scatterplot(x=df['age'], y=df['absences'])
Scatterplot between age and absences

Interpretation of Scatter Plot

  • Some 17-year-old students have the most absences, with a count of 30
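The highest point in a scatterplot can also be found without reading it off the axes. The following sketch groups toy data (made-up values, chosen to mirror the shape of the observation above) by age and takes the maximum absences in each group.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "age": [15, 16, 17, 17, 18],
    "absences": [2, 4, 30, 6, 8],
})

# Highest absence count seen at each age.
max_absences = df.groupby("age")["absences"].max()

# The age at which the overall maximum occurs.
age_with_peak = max_absences.idxmax()

print(max_absences)
print(age_with_peak)
```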

Creating Line plot

#line plot 
sns.lineplot(x='absences',y='age', data=df )
Line plot between age and absences

Interpretation of Line Plot

  • Students aged 19 have a high number of absences: 20
  • The blue shaded band shows the confidence interval
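When several rows share the same x value, seaborn's lineplot by default draws the mean of y at that x, and the shaded band is a confidence interval around that mean. As a rough sketch on toy data, the line itself corresponds to a groupby mean:

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "absences": [0, 0, 2, 2, 4],
    "age": [15, 17, 16, 18, 19],
})

# Mean age at each absences value: the quantity the line traces
# when x='absences' and y='age'.
mean_age = df.groupby("absences")["age"].mean()

print(mean_age)
```

Keeping this in mind helps when reading the plot: each point on the line is an average over students, not a single student.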

Creating a Box plot

#Box plot
sns.boxplot(y = 'age',data= df, x= 'sex')
plt.xlabel('SEX')
Box plot between age and sex

Interpretation of Box Plot

  • Female students are concentrated in the 16–18 age range
  • Male students are concentrated in the same age range
  • The median age is 17
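The box in a box plot spans the first to third quartile, with the median as the line inside it. As a sketch, the same three numbers can be computed per group; the ages below are toy values, not the real data.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "sex": ["F", "F", "F", "F", "F", "M", "M", "M"],
    "age": [15, 16, 17, 18, 19, 16, 17, 18],
})

# First quartile, median, and third quartile of age for each sex.
quartiles = df.groupby("sex")["age"].quantile([0.25, 0.5, 0.75]).unstack()

print(quartiles)
```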

Creating Heat Map (a.k.a. Correlation Matrix)

# Heat map of the Pearson correlation matrix
# numeric_only=True skips text columns, which have no Pearson correlation
corrmat = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(16, 12))
sns.heatmap(corrmat, vmax=.8, square=True);
Pearson correlation matrix, which shows how the columns are correlated with each other

Interpretation of Heat Map

  • The Pearson correlation matrix shows how the columns are related to each other
  • Lighter colors (toward 0.8 on the scale at the right) indicate strong positive correlation; darker colors, from around 0 down to about -0.2, indicate little correlation or a weak negative one
  • This also helps in feature selection

If you find it hard to judge from the colors which columns are highly correlated, you can use the following style of heat map instead.

plt.figure(figsize=(30,30))
plt.title('Pearson Correlation of Features', size = 15)
colormap = sns.diverging_palette(10, 220, as_cmap = True)
sns.heatmap(df.corr(numeric_only=True),
cmap = colormap,
square = True,
annot = True,
linewidths=0.1,vmax=1.0, linecolor='white',
annot_kws={'fontsize':12 })
plt.show()

Interpretation of Heat Map

  • In the correlation matrix above, we also printed the coefficients, so it is easier to see which columns are highly correlated; a value close to 1.00 indicates high correlation.
  • As per our objective, we see that “age” and “absences” have a value of 0.15, which is below 0.25, so we can say that they are only weakly correlated.
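If only one pair of columns matters, the coefficient can be read directly from the matrix instead of from the plot. The sketch below uses toy values, so the number it prints is not the 0.15 reported for the real dataset.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F"],
    "age": [15, 16, 17, 18, 19],
    "absences": [4, 0, 6, 2, 8],
})

# Pearson correlation matrix over the numeric columns only.
corr = df.corr(numeric_only=True)

# Look up a single pair; the matrix is symmetric, so the order does not matter.
r = corr.loc["age", "absences"]

print(round(r, 2))
```

The same value is returned by df["age"].corr(df["absences"]), which avoids building the whole matrix.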

Since the objective of this demo was to see the relation between age and absences, this completes our EDA cycle.

Conclusion:

Remember that this much EDA does not complete the EDA cycle. How deep your EDA goes depends on the objective you are trying to achieve. However, this article gives you (as a beginner in data analysis and data mining) an idea of what a dataset is, where to get one, and how to do basic EDA in Python.

Get the entire working Jupyter notebook in my GitHub

Ahh! Tired??

Do you want one magical line of code that can build an entire EDA cycle for you? Check out my Medium article.

Stay tuned for the next Data Science related post.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data…


Mala Deep

Written by

Mala Deep

Data Science | Data Visualization | Community Work Focused | Philekoos | https://www.linkedin.com/in/maladeep/

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
