# EDA — EXPLORATORY DATA ANALYSIS

## DEEP DIVE INTO EDA

So far I haven’t explained what EDA is. If you’ve made it this far, congratulations! We will now start the EDA process. A few heads-ups before we dive in:

• There are no thumb rules for EDA, nor is there a designated template

• It depends entirely on your data; the EDA will take its own course accordingly

• Interrogate the data and it will speak

I’ll try my best to help you get started; consider this a kick-start for any EDA of your own.

So, the million-dollar question: what the heck is EDA? In layman’s terms, it’s a process where you try to get meaningful insights from the given data. EDA is a process that gives you confidence in your data and puts you in a better position to build your model.

The “D” (data) in EDA is broadly classified into structured and unstructured forms. Structured data is basically tabular data that carries literal meaning in its current form, such as sales data or banking transactions.

Unstructured data would be your video files, image files and audio files.

We will be focusing on structured data in this post to understand the EDA process. Structured data is further divided into quantitative and categorical types. (Source: Intellspot.com)

I will be using Kaggle’s Beer Recipes data (jtrofe/beer-recipes) to move ahead with EDA.

Please note that your copy of the data may not match mine exactly. Also, try to keep visuals simple (2D) and easy to understand. Fancy visuals look good but are hard to grasp without explanation.

## LET’S GET STARTED

## a. Importing Python Libraries

```python
# Importing required libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# magic line to render visuals in the same notebook
%matplotlib inline
```

Use the code below only if you are using Google Colab. You will get a Choose File button with which you can upload the data.

```python
from google.colab import files
upload = files.upload()
```

To check if the file has been uploaded properly, run the code below. It will display all the files in your current working directory.

```
!ls
```

## b. Download the data from Kaggle

If you are going to run this notebook more than once, it is preferable to delete the files you previously downloaded and decompressed, to avoid errors and problems:

```
!rm -f beer-recipes.zip
!rm -f recipeData.csv
!rm -f styleData.csv
```

The Kaggle API is documented at https://github.com/Kaggle/kaggle-api

To download the data we use Kaggle’s API, then unzip the downloaded file:

```
!kaggle datasets download -d jtrofe/beer-recipes
!unzip beer-recipes.zip
```

## c. Loading the data

Now let’s load the file into a pandas data frame using the code below. Please note that in this problem the data was provided in CSV format, hence we use the “read_csv” method. However, in the real world you may be given a different format, and you then have to use a different reader based on the file format.
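As a hedged sketch (the file names here are hypothetical), pandas ships one reader per file format, so you can dispatch on the file extension:

```python
# A minimal sketch: map file extensions to the matching pandas reader.
import os
import pandas as pd

READERS = {
    ".csv": pd.read_csv,          # comma-separated values
    ".json": pd.read_json,        # JSON records
    ".xlsx": pd.read_excel,       # Excel workbooks (needs openpyxl)
    ".parquet": pd.read_parquet,  # columnar Parquet (needs pyarrow)
}

def load_table(path):
    """Pick the matching pandas reader based on the file extension."""
    ext = os.path.splitext(path)[1].lower()
    return READERS[ext](path)
```

For example, `load_table("recipeData.csv")` would call `pd.read_csv` under the hood.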

`raw_data = pd.read_csv("recipeData.csv", encoding='latin-1')`

Until now you have just loaded the raw data and have no idea what’s in the file. Let’s peek into the data using the code below; this will give you a gist of what the data looks like. The number inside the brackets controls how many lines you get; by default you get 5. head() returns lines from the top and tail() returns lines from the bottom.

`raw_data.head()`

`raw_data.tail(2)`

```python
# provides the number of rows/columns
raw_data.shape
# (73861, 23)

# Select categorical variables
raw_data.describe(include='object')
```

## d. Checking Data Types

For now, looking at the first few lines, we might assume that all the columns are of float (decimal) nature. In machine learning, columns are called features or independent variables, and the target column, the one we want to predict, is called the label or dependent variable. Let’s confirm our assumption: the code below tells you the data type of every field in your data.

```python
raw_data.dtypes
```

```
BeerID             int64
Name              object
URL               object
Style             object
StyleID            int64
Size(L)          float64
OG               float64
FG               float64
ABV              float64
IBU              float64
Color            float64
BoilSize         float64
BoilTime           int64
BoilGravity      float64
Efficiency       float64
MashThickness    float64
SugarScale        object
BrewMethod        object
PitchRate        float64
PrimaryTemp      float64
PrimingMethod     object
PrimingAmount     object
UserId           float64
dtype: object
```

With these two lines of code you now have a brief idea of what the data is and how it is structured. *Please note that the meaning of each column is domain specific, and before starting EDA you should get hold of these meanings. Do a lot of research to understand the implications of these features.*

Let’s now run a code to check certain statistics on each feature/column.

`raw_data.describe()`

With this one line of code you get most of the statistical summary of your data:

• Count – Number of items in that column

• Mean – Average of that column

• Std – Standard Deviation

• Min – Minimum value in that column

• Max – Maximum value in that column

• 25% - 25% of your data is below this value

• 50% - 50 % of your data is below this value

• 75% - 75% of your data is below this value

From this we can infer that not all features/columns have all the data; we have some missing values. And from the percentiles and the mean we can gauge the central tendency of the data.

*Please note that describe() will only provide statistics for quantitative data; for categorical columns you need to pass include='all' in the brackets: raw_data.describe(include='all').*

Now, this can be framed as a classification problem: based on the features, we predict the brew method of each recipe, so BrewMethod is our target (dependent) variable.

In the real world you normally get the entire dataset and split it into training, validation and test sets yourself. On Kaggle, some problems instead provide a separate test set on which you evaluate your model.

In this post we are doing EDA only on the training data to keep things simple. But if you make any changes to the training data, such as renaming a column or converting a column’s values, make sure you make the same changes to the test data.

Now, let’s check our target variable: how many recipes are “All Grain”, “extract”, “BIAB” or “Partial Mash”, and whether the target has any other categories.

```python
sns.catplot(x="BrewMethod", kind="count", data=raw_data)
raw_data.BrewMethod.value_counts(1) * 100
```

```
All Grain       67.277724
BIAB            16.268396
extract         11.678694
Partial Mash     4.775186
Name: BrewMethod, dtype: float64
```

It’s now evident that we have only four classes in our target variable (“All Grain”, “extract”, “BIAB” and “Partial Mash”); however, 67% of the data is “All Grain”, 16% “BIAB”, about 12% “extract” and only about 5% “Partial Mash”. This is highly imbalanced data. When I say the data is imbalanced, I mean that one category is dominant, so a model built on it is most likely to predict that category. As a rough rule of thumb, anything within an 80:20 proportion is not considered imbalanced. In the real world you will very often get imbalanced data. How to treat imbalanced data is out of scope for this post.

## e. Dropping Columns

At times you get data where not all columns are relevant. Though we don’t have any truly irrelevant columns yet, I’ll show you how to remove one. Below, axis=1 means columns and axis=0 means rows.

`raw_data = raw_data.drop(["BeerID"], axis=1)`

## f. Changing Column Name

When you pull data from different systems, there is always a chance of some weird column names, so to change a column name we can use the code below; it is always best practice to use meaningful names. I’ll show you how to rename a column, in this case StyleID.

`raw_data = raw_data.rename(columns={"StyleID":"ID"})`

raw_data.head()

## g. Checking Duplicate Rows

Most of the time when we pull data, there is a high probability of duplicate entries, and identifying those at the start is beneficial.

```python
duplicate_rows_raw = raw_data[raw_data.duplicated()]
print("Duplicate rows: ", duplicate_rows_raw.shape)
```

```
Duplicate rows:  (0, 22)
```
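Our data happens to contain no duplicates, but when they do show up, dropping them is one line. A hedged sketch on a made-up toy frame:

```python
import pandas as pd

# Made-up toy frame with one exact duplicate row
df = pd.DataFrame({"Style": ["American IPA", "American IPA", "Saison"],
                   "ABV":   [6.5, 6.5, 5.0]})

dupes = df[df.duplicated()]    # rows that repeat an earlier row
clean = df.drop_duplicates()   # keep only the first occurrence

print(len(dupes), len(clean))  # 1 duplicate found, 2 rows kept
```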

## h. Checking Missing Values

A missing value in a field may be due to incorrect data entry, or because the field is not mandatory. To get good model accuracy we need to treat missing values: either drop them, or replace them with the mean, median or mode, depending on the scenario.
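A hedged sketch of those options on a made-up column:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])  # toy column with one missing value

by_mean   = s.fillna(s.mean())     # replace NaN with the mean
by_median = s.fillna(s.median())   # ... or the median
by_mode   = s.fillna(s.mode()[0])  # ... or the most frequent value
dropped   = s.dropna()             # ... or drop the row entirely

print(by_mean.isna().sum(), len(dropped))  # 0 missing left, 3 rows kept
```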

```python
sns.heatmap(raw_data.isnull(), yticklabels=False, cbar=False, cmap='viridis')
```

```python
# Drop columns we won't use
raw_data = raw_data.drop(["Name","URL","ID","PrimingMethod","PrimingAmount",
                          "UserId","MashThickness","PitchRate","PrimaryTemp"], axis=1)

# Finding the null values
total = raw_data.isnull().sum().sort_values(ascending=False)
print(total)
```

```
BoilGravity    2990
Style           596
BrewMethod        0
SugarScale        0
Efficiency        0
BoilTime          0
BoilSize          0
Color             0
IBU               0
ABV               0
FG                0
OG                0
Size(L)           0
dtype: int64
```

```python
# Drop null values
raw_data = raw_data.dropna()
total = raw_data.isnull().sum().sort_values(ascending=False)
print(total)
```

```
BrewMethod     0
SugarScale     0
Efficiency     0
BoilGravity    0
BoilTime       0
BoilSize       0
Color          0
IBU            0
ABV            0
FG             0
OG             0
Size(L)        0
Style          0
dtype: int64
```

```python
raw_data.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 70517 entries, 0 to 73860
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Style        70517 non-null  object
 1   Size(L)      70517 non-null  float64
 2   OG           70517 non-null  float64
 3   FG           70517 non-null  float64
 4   ABV          70517 non-null  float64
 5   IBU          70517 non-null  float64
 6   Color        70517 non-null  float64
 7   BoilSize     70517 non-null  float64
 8   BoilTime     70517 non-null  int64
 9   BoilGravity  70517 non-null  float64
 10  Efficiency   70517 non-null  float64
 11  SugarScale   70517 non-null  object
 12  BrewMethod   70517 non-null  object
dtypes: float64(9), int64(1), object(3)
memory usage: 7.5+ MB
```

```python
# Convert object columns to categorical
raw_data['SugarScale'] = raw_data['SugarScale'].astype('category')
raw_data['BrewMethod'] = raw_data['BrewMethod'].astype('category')
```

We can clearly observe that two columns (BoilGravity and Style) have missing values. Now it’s a call you have to make: should we drop those rows, impute the columns, or replace the missing values with some constant or random value? That’s a topic of its own, and I will come up with a post that tackles it.

*Please note that at times the above code will show zero missing values, yet there may be garbage values in the field that Python considers valid entries; if that is the case, check those values with a domain expert.*
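One hedged way to handle such garbage entries (the "?" and -999 sentinels below are made up for illustration) is to convert them to NaN first, so the missing-value checks above can see them:

```python
import numpy as np
import pandas as pd

# Toy frame where "?" and -999 are placeholder garbage, not real data
df = pd.DataFrame({"Color": [5.0, -999, 7.5],
                   "Style": ["American IPA", "?", "Saison"]})

# Turn the agreed-upon garbage markers into real missing values
df = df.replace({"?": np.nan, -999: np.nan})
print(df.isnull().sum().sum())  # both garbage cells are now visible as NaN
```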

## i. Detecting Outliers

Outliers are data points that are not in sync with the rest of the data. In this task we need to identify the outliers and check with a domain expert on how to treat them: are they errors, or genuine one-off cases? Treating outliers helps simplify the data. The code below shows visually where each variable stands in terms of outliers.

```python
plt.figure(figsize=(5,5))
plt.title('Box Plot', fontsize=10)
raw_data.boxplot(vert=0)
```

It’s quite evident that most of the variables have outliers, and this will usually be the case in the real world. Note also that features on larger scales, such as Size(L) and BoilSize, dominate this unscaled plot.
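Beyond the visual check, you can count outliers numerically with the same 1.5×IQR fence that a boxplot whisker uses. A hedged sketch on made-up numbers:

```python
import pandas as pd

s = pd.Series([10, 11, 12, 11, 10, 95])  # toy column; 95 is the odd one out

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # boxplot whisker fences

outliers = s[(s < low) | (s > high)]
print(outliers.tolist())  # [95]
```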

## j. Scaling the Data

Standardizing means putting all the features on the same scale. We are going to use the sklearn library.

```python
from sklearn import preprocessing

# select numeric columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = raw_data.select_dtypes(include=numerics)
newdf.head()

x = newdf.values  # returns a numpy array
x
```

```
array([[21.77 ,  1.055,  1.013, ..., 75.   ,  1.038, 70.   ],
       [20.82 ,  1.083,  1.021, ..., 60.   ,  1.07 , 70.   ],
       [50.   ,  1.06 ,  1.01 , ..., 90.   ,  1.05 , 72.   ],
       ...,
       [10.   ,  1.059,  1.01 , ..., 60.   ,  1.034, 60.   ],
       [24.   ,  1.051,  1.014, ..., 60.   ,  1.043, 72.   ],
       [20.   ,  1.06 ,  1.01 , ..., 60.   ,  1.056, 70.   ]])
```

```python
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(newdf)

data_scale = pd.DataFrame(x_scaled)
data_scale.columns = ['Size(L)','OG','FG','ABV','IBU','Color',
                      'BoilSize','BoilTime','BoilGravity','Efficiency']
data_scale
```

```python
# Now check outliers on the scaled data
plt.figure(figsize=(5,5))
plt.title('Box Plot', fontsize=10)
data_scale.boxplot(vert=0)
```

## k. Checking Distribution

Checking the distribution is another way to see how each variable is spread out and whether we can transform it to simplify the data.

```python
raw_data.hist(figsize=(15, 5), bins=50, xlabelsize=5, ylabelsize=5)
```

Now, looking at the above plot, it’s evident that there is a bit of skewness in every variable, and several features are heavily right-skewed.

```python
data_scale.hist(figsize=(15, 5), bins=50, xlabelsize=5, ylabelsize=5)
```

As you can see, the shape of each distribution is the same with or without scaling; scaling changes the range, not the shape.
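You can also quantify the skewness you see in the histograms, and a log transform is one common way to tame a right-skewed feature. A hedged sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 50])  # toy right-skewed sample

raw_skew = s.skew()            # strongly positive: long right tail
log_skew = np.log1p(s).skew()  # log1p pulls that tail in

print(raw_skew > log_skew)  # True: the transform reduced the skew
```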

## l. Finding Correlation

A major part of EDA is correlation. Correlations are often genuinely interesting and are used to find the features most strongly related to our dependent variable, though not always.

```python
# Get the n most frequent values
n = 5
list_top = raw_data['Style'].value_counts()[:n].index.tolist()
list_top
```

```
['American IPA',
 'American Pale Ale',
 'Saison',
 'American Light Lager',
 'American Amber Ale']
```

```python
sns.catplot(x="Style", kind="count", data=raw_data)
raw_data.Style.value_counts(1) * 100
```

```
American IPA              16.370521
American Pale Ale         10.391820
Saison                     3.584951
American Light Lager      3.229009
American Amber Ale        2.746855
                            ...
Lichtenhainer             0.008509
Apple Wine                0.008509
Pyment (Grape Melomel)    0.005672
Traditional Perry         0.002836
French Cider              0.001418
Name: Style, Length: 175, dtype: float64
```

```python
# Keep only the five most frequent styles
data = raw_data[raw_data['Style'].isin(list_top)]
sns.catplot(x="Style", kind="count", data=data)
data.Style.value_counts(1) * 100
```

```
American IPA            45.069103
American Pale Ale       28.609354
Saison                   9.869603
American Light Lager     8.889670
American Amber Ale       7.562271
Name: Style, dtype: float64
```

```python
# Select categorical variables
data.describe(include='object')
```

The visual below clearly indicates how the independent variables are correlated with each other; if two independent variables are highly correlated, we can drop one of them, as the other is sufficient.

```python
# reduce dataset size
data_sample = data.sample(n=1000, random_state=1)
data_sample

plt.figure(figsize=(12,7))
sns.heatmap(data_sample.corr(), annot=True, cmap='Blues')
```

Based on the heatmap, let’s drop the redundant and non-numeric columns and keep a smaller set of features:

```python
# Drop redundant columns
data_sample = data_sample.drop(["OG","FG","Size(L)","SugarScale","BrewMethod",
                                "BoilTime","BoilGravity","Efficiency"], axis=1)
data_sample.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 25893 to 44807
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Style     1000 non-null   object
 1   ABV       1000 non-null   float64
 2   IBU       1000 non-null   float64
 3   Color     1000 non-null   float64
 4   BoilSize  1000 non-null   float64
dtypes: float64(4), object(1)
memory usage: 46.9+ KB
```

```python
sns.pairplot(data_sample, hue="Style")
```

The image may not be clearly visible here; you can check it in the Colab I’ve shared at the end of this module. From this pair plot we can figure out the relation between any two variables and how it impacts the output.

Let’s re-check the correlation heatmap on the reduced feature set:

```python
plt.figure(figsize=(12,7))
sns.heatmap(data_sample.corr(), annot=True, cmap='Blues')
```

The darker the blue, the higher the positive correlation, and vice versa; use the heatmap to spot pairs of features that still move together in the reduced set.
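You can also hunt for highly correlated pairs programmatically rather than by eye. A hedged sketch on a made-up frame (the 0.9 threshold is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Toy frame: "a" and "b" move together perfectly, "c" is unrelated noise
df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [2, 4, 6, 8, 10],
                   "c": [5, 1, 4, 2, 3]})

corr = df.corr().abs()
# keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

print(to_drop)  # one column from each highly correlated pair
```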

## Conclusion

If you have come this far, congratulations! You will have gathered that there is no one-size-fits-all for EDA. In this article I have tried to give you a basic idea of how to start, so that you don’t overthink where and how to begin the EDA process. Out there, datasets will always have missing values, errors, imbalance and bias. EDA is the first step in tackling a data science project: learn what data you have and evaluate its validity. Before I sign off, do check out Pandas Profiling and SweetViz as alternatives to what I’ve done above.

I hope you like it.

No matter what books or blogs or courses or videos one learns from, when it comes to implementation everything can look like “Outside the Curriculum”.

The best way to learn is by doing! The best way to learn is by teaching what you have learned!

## Never give up!

See you on Linkedin!