EDA — EXPLORATORY DATA ANALYSIS

DEEP DIVE INTO EDA

Oscar Rojo
The Startup
13 min read · Sep 10, 2020


So far I haven't explained what EDA actually is, and to those who have come this far, CONGRATULATIONS!!! We will now start the EDA process. A few heads-ups before we dive in:

• There are no hard-and-fast rules for EDA, nor is there a designated template to follow

• It depends entirely on your data, and the EDA will take its own course based on that data
• Interrogate the data and it will speak

I'll try my level best to help you get started; consider this a kick-start for any EDA of your own.

Photo by Jouwen Wang on Unsplash

So, the million-dollar question: what the heck is EDA? In layman's terms, it is the process of drawing meaningful insights out of the data you are given. EDA is the process that gives you confidence in your data and puts you in a better position to build your model.

The "D" (data) in EDA is broadly classified into structured and unstructured form. Structured data is basically tabular data that carries its literal meaning in its current form, like sales data, banking transactions, etc.

Unstructured data would be your video files, image files and audio files.

We will be focusing on Structured data in this post to understand the EDA process. The structured data is further divided as shown below. (Source — Intellspot.com)

I will be using Kaggle's Beer Recipes data to move ahead with the EDA.

Please note that your copy of the data may not look exactly the same as the one on Kaggle. Also, try to keep the visuals simple (2D) and easy to understand; fancy plots look good, but they are hard to grasp without an explanation.

LET’S GET STARTED

a. Importing Python Libraries

# Importing required libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# magic line to render plots inline in the notebook
%matplotlib inline

Use the code below only if you are using Google Colab. It shows a "Choose file" button that you can use to upload the data.

from google.colab import files

upload = files.upload()

To check whether the file has been uploaded properly, run the code below. It will list all the files available in your current working directory.

!ls

b. Download the data from Kaggle

If you are going to run this notebook more than once, it is preferable to delete the files you are about to download and decompress, to avoid errors and conflicts. We use Kaggle's API (https://github.com/Kaggle/kaggle-api) to download the data, then unzip the downloaded file.

!rm -f beer-recipes.zip
!rm -f recipeData.csv
!rm -f styleData.csv
!kaggle datasets download -d jtrofe/beer-recipes
!unzip beer-recipes.zip

c. Loading the data

Now let's load the file into a pandas data frame using the code below. Please note that in this problem the data is provided in CSV format, hence we use the "read_csv" method. In the real world you will be given data in many different formats, and you will have to use a different reader depending on the format. GOOGLE IT!!!!!!!!

raw_data = pd.read_csv("recipeData.csv", encoding='latin-1')
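In case it helps, here are a few of the other pandas readers you might need, depending on the file format; the file names below are just placeholders, not files from this dataset.

# Hypothetical examples -- pick the reader that matches your file format
# df = pd.read_excel("recipeData.xlsx")           # Excel workbook
# df = pd.read_json("recipeData.json")            # JSON records
# df = pd.read_parquet("recipeData.parquet")      # Parquet file
# df = pd.read_table("recipeData.tsv", sep="\t")  # tab-separated text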

Up to here you have just loaded the raw data and have no idea what is in the file. Let's now peek into the data using the code below to get a gist of what it looks like. The number inside the brackets controls how many rows you get; by default it returns 5. head() shows rows from the top and tail() shows rows from the bottom.

raw_data.head()

raw_data.tail(2)

# Provides the number of rows and columns
raw_data.shape

(73861, 23)

# Select categorical variables
raw_data.describe(include='object')

d. Checking Data Types

For now, looking at the first few rows, we can assume that most of the columns are numeric (floats with decimals). In machine learning, the columns used for prediction are called features or independent variables, and the column we want to predict is called the label or dependent variable. Let's confirm our assumption: the code below shows the data type of every field in the data.

raw_data.dtypes

BeerID             int64
Name object
URL object
Style object
StyleID int64
Size(L) float64
OG float64
FG float64
ABV float64
IBU float64
Color float64
BoilSize float64
BoilTime int64
BoilGravity float64
Efficiency float64
MashThickness float64
SugarScale object
BrewMethod object
PitchRate float64
PrimaryTemp float64
PrimingMethod object
PrimingAmount object
UserId float64
dtype: object

With these two lines of code you now have a brief idea of what the data is and how it is structured. Please note that the meaning of each column is domain specific, and before starting EDA you should get hold of those meanings. Do plenty of research to understand the implications of these features.

Let's now run some code to check summary statistics for each feature/column.

raw_data.describe()

With this one line of code you get most of the summary statistics of your data:

• Count – Number of non-null values in that column
• Mean – Average of that column
• Std – Standard deviation
• Min – Minimum value in that column
• Max – Maximum value in that column
• 25% – 25% of your data is below this value
• 50% – 50% of your data is below this value
• 75% – 75% of your data is below this value

From the counts we can infer that not every feature/column has a value in every row; we have some missing values. And with the percentiles and the mean we can gauge the central tendency of the data.

Please note that describe() only covers quantitative columns by default; for categorical columns you need to pass include='all' inside the brackets, as in the sketch below.
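A quick sketch of both options: custom percentiles for the numeric summary, and include='all' to cover the categorical columns as well.

# Add the 5th and 95th percentiles to the default quartiles
raw_data.describe(percentiles=[0.05, 0.95])

# Summarise numeric and categorical columns in one table
raw_data.describe(include='all')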

Now, we will treat this as a classification problem where, based on the features, we need to predict the brew method. So BrewMethod is our target column, and its classes are "All Grain", "BIAB", "extract" and "Partial Mash".

In the real world you normally get one whole set of data and you divide it yourself into training, validation and test sets (some Kaggle competitions already provide a separate training set and test set).

In this post we are doing EDA only on the training data to keep things simple. But if you make any changes to the training data, like renaming a column or converting its values, make sure you apply the same changes to the test data; a split sketch follows below.
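If you ever have to create the splits yourself, scikit-learn's train_test_split is the usual tool. A minimal sketch, assuming a 70/15/15 split (the proportions and random_state are arbitrary choices):

from sklearn.model_selection import train_test_split

# Hold out 30% of the rows, then split that holdout evenly into validation and test
train_df, holdout_df = train_test_split(raw_data, test_size=0.30, random_state=42)
valid_df, test_df = train_test_split(holdout_df, test_size=0.50, random_state=42)
print(train_df.shape, valid_df.shape, test_df.shape)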

Now, let's check our target variable: how many records are "All Grain", "extract", "BIAB" or "Partial Mash", and whether there are any other categories in it.

sns.catplot(x ="BrewMethod", kind ="count", data =raw_data)
raw_data.BrewMethod.value_counts(1)*100
All Grain 67.277724
BIAB 16.268396
extract 11.678694
Partial Mash 4.775186
Name: BrewMethod, dtype: float64

It's now evident that we have only four classes in our target variable ("All Grain", "BIAB", "extract" and "Partial Mash"). However, roughly 67% of the records are "All Grain", 16% "BIAB", 12% "extract" and under 5% "Partial Mash". This is imbalanced data: one category dominates, so when we build a model it is most likely to predict that category. As a rule of thumb, anything within an 80:20 proportion is not considered imbalanced, and in the real world you will very often get imbalanced data. How to treat imbalanced data is outside the scope of this post.

e. Dropping Columns

At times you get data where not all the columns are relevant. We don't really have an irrelevant column here, but I'll show you how to remove one anyway: BeerID is just an identifier, so we drop it. Below, axis=1 means columns and axis=0 means rows; see the sketch after the drop for the row case.

raw_data = raw_data.drop(["BeerID"], axis=1)
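For completeness, the same method removes rows when you pass index labels with axis=0; a quick illustrative sketch (the labels 0 and 1 are arbitrary):

# Drop the rows whose index labels are 0 and 1 (axis=0 means rows)
without_first_rows = raw_data.drop([0, 1], axis=0)
without_first_rows.shape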

f. Changing Column Name

When you pull data from different systems there is always a chance of some weird column names, and it is best practice to rename them to something meaningful. The code below shows how; here we rename the StyleID column to ID.

raw_data = raw_data.rename(columns={"StyleID":"ID"})
raw_data.head()
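If you inherit a whole set of messy headers, you can also tidy them in bulk rather than one by one; a small, purely illustrative sketch that strips whitespace and replaces spaces with underscores (none of the columns here actually need it):

# Clean every column name in one go; assign back to raw_data.columns to apply
cleaned_columns = raw_data.columns.str.strip().str.replace(' ', '_')
print(cleaned_columns.tolist())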

g. Checking Duplicate Rows

When we pull data there is often a high probability of duplicate entries, and identifying them at the start is beneficial.

duplicate_rows_raw = raw_data[raw_data.duplicated()]
print("Duplicate rows: ", duplicate_rows_raw.shape)
Duplicate rows:  (0, 22)

duplicate_rows_raw
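There happen to be no duplicates here, but if the count had come back non-zero, removing them is a one-liner; a quick sketch:

# Keep the first occurrence of each duplicated row and drop the rest
deduped = raw_data.drop_duplicates(keep='first')
print(deduped.shape)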

h. Checking Missing Values

A missing value in a field may be due to incorrect data entry, or simply because the field was not mandatory. To get good model accuracy we need to treat these missing values: either drop them or replace them with the mean, median or mode, depending on the scenario.

sns.heatmap(raw_data.isnull(), yticklabels=False, cbar=False, cmap='viridis')
# Drop columns that will not be used in this analysis
raw_data = raw_data.drop(["Name","URL","ID","PrimingMethod","PrimingAmount","UserId","MashThickness","PitchRate","PrimaryTemp"], axis=1)
# Finding the null values.
total =raw_data.isnull().sum().sort_values(ascending=False)
print(total)
BoilGravity 2990
Style 596
BrewMethod 0
SugarScale 0
Efficiency 0
BoilTime 0
BoilSize 0
Color 0
IBU 0
ABV 0
FG 0
OG 0
Size(L) 0
dtype: int64
# Drop null values

raw_data = raw_data.dropna()
total =raw_data.isnull().sum().sort_values(ascending=False)
print(total)
BrewMethod 0
SugarScale 0
Efficiency 0
BoilGravity 0
BoilTime 0
BoilSize 0
Color 0
IBU 0
ABV 0
FG 0
OG 0
Size(L) 0
Style 0
dtype: int64
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70517 entries, 0 to 73860
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Style 70517 non-null object
1 Size(L) 70517 non-null float64
2 OG 70517 non-null float64
3 FG 70517 non-null float64
4 ABV 70517 non-null float64
5 IBU 70517 non-null float64
6 Color 70517 non-null float64
7 BoilSize 70517 non-null float64
8 BoilTime 70517 non-null int64
9 BoilGravity 70517 non-null float64
10 Efficiency 70517 non-null float64
11 SugarScale 70517 non-null object
12 BrewMethod 70517 non-null object
dtypes: float64(9), int64(1), object(3)
memory usage: 7.5+ MB
# Convert object columns to category dtype
raw_data['SugarScale'] = raw_data['SugarScale'].astype('category')
raw_data['BrewMethod'] = raw_data['BrewMethod'].astype('category')

We can clearly observe that two columns (BoilGravity and Style) had missing values. Now it's a call you have to make: should we impute those columns, replace the missing entries with some constant or random value, or drop them? That is a topic in its own right, and I will come up with a post that tackles it.

Please note that at times the code above will show zero missing values even though the field contains garbage values that Python treats as valid entries; if that is the case, check with a domain expert on what those values mean.
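For reference, here is a minimal imputation sketch as an alternative to dropping rows; it works on a copy, and the '?' sentinel and 'SomeColumn' name are hypothetical, just to show how a garbage placeholder could be turned into a real missing value first.

# Work on a copy so the original frame is untouched
imputed = raw_data.copy()

# Hypothetical: convert a sentinel such as '?' into a proper NaN before imputing
# imputed['SomeColumn'] = imputed['SomeColumn'].replace('?', np.nan)

# Numeric column: fill with the median; categorical column: fill with the mode
imputed['BoilGravity'] = imputed['BoilGravity'].fillna(imputed['BoilGravity'].median())
imputed['Style'] = imputed['Style'].fillna(imputed['Style'].mode()[0])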

i. Detecting Outliers

Outliers are data points that are out of line with the rest of the data. Here we need to identify the outliers and check with a domain expert on how to treat them: are they errors, or genuine one-off cases? Treating outliers helps simplify the data. The code below shows visually how each variable stands in terms of outliers.

plt.figure(figsize=(5,5))
plt.title('Box Plot',fontsize=10)
raw_data.boxplot(vert=0)

It's quite evident that most of the variables have outliers, and this will usually be the case in the real world. Note also that, because the columns sit on very different scales, the plot is dominated by the variables with the largest values, which is one reason to rescale the data in the next step.
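To put numbers behind the box plot, a common rule of thumb flags anything outside the 1.5 × IQR fences as an outlier; a small sketch that counts such points per numeric column:

# Count values outside the 1.5 * IQR fences for each numeric column
numeric_cols = raw_data.select_dtypes(include='number')
q1 = numeric_cols.quantile(0.25)
q3 = numeric_cols.quantile(0.75)
iqr = q3 - q1
outside_fences = (numeric_cols < (q1 - 1.5 * iqr)) | (numeric_cols > (q3 + 1.5 * iqr))
print(outside_fences.sum().sort_values(ascending=False))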

j. Scaling the Data

Scaling puts every feature onto the same scale so that the variables can be compared directly. We are going to use the sklearn library; here, its MinMaxScaler.

from sklearn import preprocessing
# select numeric columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

newdf = raw_data.select_dtypes(include=numerics)
newdf.head()
x = newdf.values #returns a numpy array
x
array([[21.77 , 1.055, 1.013, ..., 75. , 1.038, 70. ],
[20.82 , 1.083, 1.021, ..., 60. , 1.07 , 70. ],
[50. , 1.06 , 1.01 , ..., 90. , 1.05 , 72. ],
...,
[10. , 1.059, 1.01 , ..., 60. , 1.034, 60. ],
[24. , 1.051, 1.014, ..., 60. , 1.043, 72. ],
[20. , 1.06 , 1.01 , ..., 60. , 1.056, 70. ]])
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(newdf)
data_scale = pd.DataFrame(x_scaled)
data_scale.columns= ['Size(L)' ,'OG' ,'FG','ABV','IBU','Color' ,'BoilSize' ,'BoilTime' ,'BoilGravity' ,'Efficiency']
data_scale
# Now check outliers
plt.figure(figsize=(5,5))
plt.title('Box Plot',fontsize=10)
data_scale.boxplot(vert=0)
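Min-max scaling squeezes every column into the [0, 1] range. If you would rather standardise the data (zero mean, unit variance), the call is almost identical; a sketch using StandardScaler:

# Standardise instead of min-max scaling: subtract the mean, divide by the std
std_scaler = preprocessing.StandardScaler()
data_std = pd.DataFrame(std_scaler.fit_transform(newdf), columns=newdf.columns)
data_std.describe().round(2)  # means ~0, standard deviations ~1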

k. Checking Distribution

Looking at the distribution is another way to check how each variable is spread out, and whether we can transform it to simplify the data set.

raw_data.hist(figsize=(15, 5), bins=50, xlabelsize=5, ylabelsize=5)

Now, looking at the plots above, it's evident that there is some skewness in every variable, and several of them are heavily right-skewed.
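You can back the visual impression up with a number: pandas computes sample skewness directly, where values near 0 mean roughly symmetric and large positive values mean a long right tail.

# Skewness per numeric column, largest (most right-skewed) first
print(newdf.skew().sort_values(ascending=False))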

data_scale.hist(figsize=(15, 5), bins=50, xlabelsize=5, ylabelsize=5)

As you can see, the distributions with and without scaling are the same: min-max scaling is a linear transformation, so it changes each variable's range but not the shape of its distribution.

l. Finding Correlation

A large part of EDA revolves around correlations. Correlations are often really interesting and are typically used to find the features that are highly correlated with the dependent variable, and with each other, though not always.

# Get the n most frequent values,
n = 5
list_top=raw_data['Style'].value_counts()[:n].index.tolist()
list_top
['American IPA',
'American Pale Ale',
'Saison',
'American Light Lager',
'American Amber Ale']
sns.catplot(x ="Style", kind ="count", data =raw_data)
raw_data.Style.value_counts(1)*100
American IPA 16.370521
American Pale Ale 10.391820
Saison 3.584951
American Light Lager 3.229009
American Amber Ale 2.746855
...
Lichtenhainer 0.008509
Apple Wine 0.008509
Pyment (Grape Melomel) 0.005672
Traditional Perry 0.002836
French Cider 0.001418
Name: Style, Length: 175, dtype: float64
data = raw_data[raw_data['Style'].isin(list_top)]
sns.catplot(x="Style", kind="count", data=data)
data.Style.value_counts(1)*100
American IPA 45.069103
American Pale Ale 28.609354
Saison 9.869603
American Light Lager 8.889670
American Amber Ale 7.562271
Name: Style, dtype: float64
# Select categorical variables
data.describe(include='object')

Also, the visual below clearly indicates how the independent variables are correlated with one another; if two independent variables are highly correlated, we can drop one of those columns/features, as the other is sufficient.

# reduce dataset size
data_sample=data.sample(n=1000, random_state=1)
data_sample
plt.figure(figsize=(12,7))
sns.heatmap(data_sample.corr(), annot=True, cmap='Blues')
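Rather than reading the heatmap by eye, you can also list the strongly correlated pairs programmatically; a small sketch (the 0.8 cut-off is an arbitrary choice):

# Absolute correlations between numeric columns, upper triangle only
corr = data_sample.select_dtypes(include='number').corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
# Pairs whose absolute correlation exceeds the chosen threshold
print(pairs[pairs > 0.8].sort_values(ascending=False))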

Guided by the heatmap above, and to keep the pair plot readable, we now drop the redundant or less interesting columns and keep only Style, ABV, IBU, Color and BoilSize.

# Drop columns that are redundant or not needed for the pair plot
data_sample = data_sample.drop(["OG","FG","Size(L)","SugarScale","BrewMethod","BoilTime","BoilGravity","Efficiency"], axis=1)
data_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 25893 to 44807
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Style 1000 non-null object
1 ABV 1000 non-null float64
2 IBU 1000 non-null float64
3 Color 1000 non-null float64
4 BoilSize 1000 non-null float64
dtypes: float64(4), object(1)
memory usage: 46.9+ KB
sns.pairplot(data_sample, hue="Style")

The image may not be clearly visible here; you can check it in the Colab notebook I've shared at the end of this post. From this plot we can figure out the relationship between each pair of variables and how it relates to the beer style.

Finally, let's check the correlations again, this time on the reduced feature set.

plt.figure(figsize=(12,7))
sns.heatmap(data_sample.corr(), annot=True, cmap='Blues')

The darker the blue, the higher the positive correlation between two features, and vice versa; this makes it easy to see which of the remaining features move together.

Conclusion

If you have come this far, congratulations!!! You will have got the gist that there is no one-size-fits-all recipe for EDA. In this article I have tried to give you a basic idea of how to start, so that you don't overthink where and how to begin the EDA process. Out there, datasets will always have missing values, errors, imbalanced classes and biases. EDA is the first step in tackling a data science project: it is how we learn what data we have and evaluate its validity. Before I sign off, do check out Pandas Profiling and SweetViz as alternatives to much of what I've done above; a rough sketch follows.
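As a pointer, both libraries build a full HTML report from a couple of lines; roughly like the sketch below (APIs as I recall them — check each library's current documentation, and note that pandas-profiling has since been renamed ydata-profiling).

# Pandas Profiling: one self-contained HTML report for the whole frame
from pandas_profiling import ProfileReport
ProfileReport(raw_data, title="Beer Recipes EDA").to_file("profile_report.html")

# SweetViz: a similar interactive HTML report
import sweetviz as sv
sv.analyze(raw_data).show_html("sweetviz_report.html")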

I hope you like it.

No matter what books or blogs or courses or videos one learns from, when it comes to implementation everything can look like “Outside the Curriculum”.

The best way to learn is by doing! The best way to learn is by teaching what you have learned!

Never give up!

See you on Linkedin!


Oscar Rojo
The Startup

Master in Data Science. Passionate about learning new skills. Former branch risk analyst. https://www.linkedin.com/in/oscar-rojo-martin/. www.oscarrojo.es