# Exploratory Data Analysis for Beginners

## Doing EDA for the First Time!

Feb 14 · 7 min read

Data is nothing until you understand it and visualize it effectively, and that is what we call Exploratory Data Analysis (EDA).

You are taking a data mining class, and your professor asks you to find a dataset on the internet and do EDA on it before implementing data mining algorithms.

Now you find yourself in a dilemma: what is a dataset, and how do you do EDA?

We will cover:

1. Understanding what a dataset is
2. Finding a dataset online
3. Doing the EDA cycle:
• Importing libraries/methods
• A quick look at the dataset
• A glance at the summary statistics
• A quick look at the nature of the dataset columns (attributes)
• Checking for missing values and treating them
• Creating a histogram
• Creating a scatter plot
• Creating a line plot
• Creating a box plot
• Creating a heat map (correlation matrix)

along with an interpretation of each.

## What and where is the data set?

Here are the primary dataset finders:

Kaggle: A data science site that hosts a variety of externally contributed datasets. You can find all kinds of niche datasets in its master list, and it also provides tutorials on various data-related activities.

UCI Machine Learning Repository: One of the oldest sources of datasets on the web, and a great first stop when looking for interesting datasets. You can download data directly from the repository, without registration.

FYI: Also try the Google Dataset Search engine.

Now you know what and where to find your dataset.

The next step is to choose the dataset you will use for EDA.

For demo purposes, I am using the [Student Alcohol Consumption | Kaggle] dataset.

As you are a beginner (I hope so!) and may not have installed Python and Jupyter yet, download them now, because we will be using them.

Install Python and Jupyter Notebook (I prefer using [Anaconda], as it is simple and easy).

Once you install them, open up Jupyter Notebook.

Now we will start our journey of doing EDA Cycle:

The EDA cycle: understanding the data's quality, description, shape, patterns, and relationships, and visualizing it for better understanding.
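To make those first checks concrete before we load the real dataset, here is a minimal sketch on a tiny invented DataFrame (the values and column names here are made up purely for illustration):

```python
import pandas as pd

# A tiny, invented stand-in for a real dataset
demo = pd.DataFrame({
    "age": [15, 16, 17, 17, 18],
    "absences": [2, 0, 30, 4, 6],
})

# Shape tells you (rows, columns)
print(demo.shape)

# dtypes shows the type of each column
print(demo.dtypes)

# describe() gives count, mean, std, min, quartiles, and max
print(demo.describe())
```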

What is EDA then?

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:

• Maximize insight into a data set
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions, etc.
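To illustrate one of these goals, a common way (though not the only one) to flag outliers is the 1.5 × IQR rule; a minimal sketch on invented numbers:

```python
import pandas as pd

# Invented absence counts, with one obvious outlier (60)
absences = pd.Series([0, 2, 3, 4, 4, 5, 6, 60])

q1, q3 = absences.quantile(0.25), absences.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*iqr, q3 + 1.5*iqr] are flagged as outliers
outliers = absences[(absences < q1 - 1.5 * iqr) | (absences > q3 + 1.5 * iqr)]
print(outliers.tolist())
```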

Once you have a clear view of what you want to achieve from the data, it will be far easier to analyze it.

Snapshot of Dataset

Now let’s import the needed libraries/methods.

```python
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
```

Then let's read our dataset and save it to `df` (you can name `df` anything you like).

```python
# Reading the .csv file
df = pd.read_csv('student-por.csv')
```

Let's have a quick look at the dataset.

`df.head()`

Let’s see some summary statistics.

`df.describe()`

Let’s see the complete nature of dataset columns (attributes).

`df.info()`

## Missing Values?

Let's see whether we have any missing values.

```python
# Inspect missing values in the dataset
print(df.isnull().values.sum())
```

We get 0 as output, so our data does not contain any missing values: it is clean.

Here is one tip: what if our data had some missing values, and how would we handle them?

To handle them, we would do the following:

• Replace the missing values with NaN
• Impute them with the column mean (mean imputation)
• Count the number of NaNs to verify that all missing values have been filled

Putting those steps together:

```python
# Inspect missing values in the dataset
print(df.isnull().values.sum())

# Replace the ' 's with NaN
df = df.replace(" ", np.nan)

# Impute the missing values with mean imputation (numeric columns only)
df = df.fillna(df.mean(numeric_only=True))

# Count the number of NaNs in the dataset to verify
print(df.isnull().values.sum())
```
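To see the replace-then-impute idea end to end, here is a self-contained toy version (the data is invented purely for illustration):

```python
import numpy as np
import pandas as pd

# Toy data where blanks (" ") stand in for missing grades
toy = pd.DataFrame({"grade": [10.0, " ", 14.0, " "]})

# Step 1: turn blanks into NaN
toy = toy.replace(" ", np.nan)

# The column held strings, so make it numeric before averaging
toy["grade"] = pd.to_numeric(toy["grade"])

# Step 2: fill NaNs with the column mean (mean of 10 and 14 is 12)
toy = toy.fillna(toy.mean(numeric_only=True))

# Step 3: verify no NaNs remain
print(toy["grade"].tolist())
print(toy.isnull().values.sum())
```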

We get 0 again, confirming that the data contains no missing values.

Now we are ready to generate some graphs.

## Creating Histogram

```python
# seaborn histogram
# (sns.distplot is deprecated in newer seaborn; sns.histplot is the modern equivalent)
sns.distplot(df['age'], hist=True, kde=False,
             bins=9, color='blue',
             hist_kws={'edgecolor': 'black'})

# Add labels
plt.title('Age count')
plt.xlabel('Age')
plt.ylabel('Count')
```

Interpretation of Histogram

• Age 17 has the highest count
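A histogram reading like this can be double-checked numerically; for instance, on invented ages, `value_counts()` shows which age is most frequent:

```python
import pandas as pd

# Invented ages mimicking the shape of such a histogram
ages = pd.Series([15, 16, 16, 17, 17, 17, 18, 19])

counts = ages.value_counts()
print(counts)

# The most frequent age
print(counts.idxmax())
```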

## Creating Scatterplot

```python
# seaborn scatter plot
sns.scatterplot(x=df['age'], y=df['absences'])
```

Interpretation of Scatter Plot

• Some 17-year-old students have the most absences, with a count of around 30

## Creating Line plot

```python
# line plot
sns.lineplot(x='absences', y='age', data=df)
```

Interpretation of Line Plot

• Students aged 19 have a high absence count of around 20
• The blue shade shows the confidence interval

## Creating a Box plot

```python
# box plot
sns.boxplot(y='age', data=df, x='sex')
plt.xlabel('SEX')
```

Interpretation of Box Plot

• Female ages are highly concentrated between 16 and 18
• Male ages show the same concentration as female ages
• The median age is 17
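The per-group medians that a box plot displays can be verified directly with a groupby; a sketch on invented data:

```python
import pandas as pd

# Invented (sex, age) pairs purely for illustration
demo = pd.DataFrame({
    "sex": ["F", "F", "F", "M", "M", "M"],
    "age": [16, 17, 18, 16, 17, 18],
})

# Median age per sex: the line inside each box of the box plot
medians = demo.groupby("sex")["age"].median()
print(medians)
```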

## Creating a Heat Map, a.k.a. Correlation Matrix

```python
# Heat map of the Pearson correlation matrix
corrmat = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(16, 12))
sns.heatmap(corrmat, vmax=.8, square=True)
```

Interpretation of Heat Map

• The Pearson correlation matrix shows how each column is related to the others
• Lighter colors (toward 0.8 on the scale at right) indicate high correlation, and darker colors (at or below about -0.2) indicate little correlation
• This also helps with feature selection
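To see numerically what "highly correlated" versus "weakly correlated" looks like, here is a toy correlation matrix on invented columns:

```python
import pandas as pd

demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exactly 2*a, so perfectly correlated with a
    "c": [5, 1, 4, 2, 3],   # shuffled values, so only weakly correlated with a
})

corr = demo.corr()
print(corr.round(2))
```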

If you find it hard to judge from color alone which pairs are highly correlated, you can create the heat map in the following style instead.

```python
plt.figure(figsize=(30, 30))
plt.title('Pearson Correlation of Features', size=15)
colormap = sns.diverging_palette(10, 220, as_cmap=True)
sns.heatmap(df.corr(numeric_only=True),
            cmap=colormap,
            square=True,
            annot=True,
            linewidths=0.1, vmax=1.0, linecolor='white',
            annot_kws={'fontsize': 12})
plt.show()
```

Interpretation of Heat Map

• In the correlation matrix above, the numbers are printed as well, so it is easy to see which pairs are highly correlated: values close to 1.00 indicate high correlation.
• As per our objective, we see that "age" and "absences" have a value of 0.15, which is below 0.25, so we can say they are not strongly correlated.

As the objective of this demo was to see the relation between age and absences, this completes our EDA cycle.

## Conclusion:

Remember, doing this much EDA does not complete the EDA cycle. How deep your EDA goes depends on the objective you are trying to achieve. However, this article should give you (as a beginner in data analysis and data mining) an idea of what a dataset is, where to get one, and how to do basic EDA in Python.

Get the entire working Jupyter notebook on my GitHub.

Ahh! Tired??

Do you want one magical line of code that can build an entire EDA cycle for you? Check out my Medium article.

Stay tuned for the next data science post.

## Analytics Vidhya
