EDA — EXPLORATORY DATA ANALYSIS

DEEP DIVE INTO EDA

Oscar Rojo
The Startup
13 min read · Sep 10, 2020


So far I haven't explained what EDA actually is, and to those who have come this far, CONGRATULATIONS!!! We will now start the EDA process. A few heads-ups before we dive in:

• There are no hard-and-fast rules for EDA, nor is there a designated template to follow

• It depends entirely on your data, and the EDA will take its own course based on that data
• Interrogate the data and it will speak

I'll try my level best to help you get started; consider this a kick-start for any EDA of your own.

Photo by Jouwen Wang on Unsplash

So, the million-dollar question: what the heck is EDA? In layman's terms, it is the process of drawing meaningful insights out of the data you are given. EDA is the process that gives you confidence in your data and puts you in a better position to build your model.

The "D" (data) in EDA is broadly classified into structured and unstructured form. Structured data is basically tabular data that carries its literal meaning in its current form, like sales data, banking transactions, etc.

Unstructured data would be your video files, image files and audio files.

We will be focusing on Structured data in this post to understand the EDA process. The structured data is further divided as shown below. (Source — Intellspot.com)

I will be using Kaggle's Beer Recipes data to move ahead with the EDA.

Please note that your copy of the data may not look exactly the same as the one on Kaggle. Also, try to keep the visuals simple (2D) and easy to understand; fancy plots look good, but they are hard to grasp without an explanation.

LET’S GET STARTED

a. Importing Python Libraries

# Importing required libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# magic line to render plots inline in the notebook
%matplotlib inline

Use the code below only if you are using Google Colab. It shows a "Choose file" button that you can use to upload the data.

from google.colab import files

upload = files.upload()

To check whether the file has been uploaded properly, run the code below. It will list all the files available in your current working directory.

!ls

b. Download the data from Kaggle

If you are going to run this notebook more than once, it is preferable to delete the files you are about to download and decompress, to avoid errors and conflicts. We use Kaggle's API (https://github.com/Kaggle/kaggle-api) to download the data, then unzip the downloaded file.

!rm -f beer-recipes.zip
!rm -f recipeData.csv
!rm -f styleData.csv
!kaggle datasets download -d jtrofe/beer-recipes
!unzip beer-recipes.zip

c. Loading the data

Now let's load the file into a pandas data frame using the code below. Please note that in this problem the data is provided in CSV format, hence we use the "read_csv" method. In the real world you will be given data in many different formats, and you will have to use a different reader depending on the format. GOOGLE IT!!!!!!!!

raw_data = pd.read_csv("recipeData.csv", encoding='latin-1')
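In case it helps, here are a few of the other pandas readers you might need, depending on the file format; the file names below are just placeholders, not files from this dataset.

# Hypothetical examples -- pick the reader that matches your file format
# df = pd.read_excel("recipeData.xlsx")           # Excel workbook
# df = pd.read_json("recipeData.json")            # JSON records
# df = pd.read_parquet("recipeData.parquet")      # Parquet file
# df = pd.read_table("recipeData.tsv", sep="\t")  # tab-separated text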

Up to here you have just loaded the raw data and have no idea what is in the file. Let's now peek into the data using the code below to get a gist of what it looks like. The number inside the brackets controls how many rows you get; by default it returns 5. head() shows rows from the top and tail() shows rows from the bottom.

raw_data.head()

raw_data.tail(2)

# Provides the number of rows and columns
raw_data.shape

(73861, 23)

# Select categorical variables
raw_data.describe(include='object')

d. Checking Data Types

For now, looking at the first few rows, we can assume that most of the columns are numeric (floats with decimals). In machine learning, the columns used for prediction are called features or independent variables, and the column we want to predict is called the label or dependent variable. Let's confirm our assumption: the code below shows the data type of every field in the data.

raw_data.dtypes

BeerID             int64
Name object
URL object
Style object
StyleID int64
Size(L) float64
OG float64
FG float64
ABV float64
IBU float64
Color float64
BoilSize float64
BoilTime int64
BoilGravity float64
Efficiency float64
MashThickness float64
SugarScale object
BrewMethod object
PitchRate float64
PrimaryTemp float64
PrimingMethod object
PrimingAmount object
UserId float64
dtype: object

With these two lines of code you now have a brief idea of what the data is and how it is structured. Please note that the meaning of each column is domain specific, and before starting EDA you should get hold of those meanings. Do plenty of research to understand the implications of these features.

Let's now run some code to check summary statistics for each feature/column.

raw_data.describe()

With this one line of code you get most of the summary statistics of your data:

• Count – Number of non-null values in that column
• Mean – Average of that column
• Std – Standard deviation
• Min – Minimum value in that column
• Max – Maximum value in that column
• 25% – 25% of your data is below this value
• 50% – 50% of your data is below this value
• 75% – 75% of your data is below this value

From the counts we can infer that not every feature/column has a value in every row; we have some missing values. And with the percentiles and the mean we can gauge the central tendency of the data.

Please note that describe() only covers quantitative columns by default; for categorical columns you need to pass include='all' inside the brackets, as in the sketch below.
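A quick sketch of both options: custom percentiles for the numeric summary, and include='all' to cover the categorical columns as well.

# Add the 5th and 95th percentiles to the default quartiles
raw_data.describe(percentiles=[0.05, 0.95])

# Summarise numeric and categorical columns in one table
raw_data.describe(include='all')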

Now, we will treat this as a classification problem where, based on the features, we need to predict the brew method. So BrewMethod is our target column, and its classes are "All Grain", "BIAB", "extract" and "Partial Mash".

In the real world you normally get one whole set of data and you divide it yourself into training, validation and test sets (some Kaggle competitions already provide a separate training set and test set).

In this post we are doing EDA only on the training data to keep things simple. But if you make any changes to the training data, like renaming a column or converting its values, make sure you apply the same changes to the test data; a split sketch follows below.
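If you ever have to create the splits yourself, scikit-learn's train_test_split is the usual tool. A minimal sketch, assuming a 70/15/15 split (the proportions and random_state are arbitrary choices):

from sklearn.model_selection import train_test_split

# Hold out 30% of the rows, then split that holdout evenly into validation and test
train_df, holdout_df = train_test_split(raw_data, test_size=0.30, random_state=42)
valid_df, test_df = train_test_split(holdout_df, test_size=0.50, random_state=42)
print(train_df.shape, valid_df.shape, test_df.shape)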

Now, let's check our target variable: how many records are "All Grain", "extract", "BIAB" or "Partial Mash", and whether there are any other categories in it.

sns.catplot(x ="BrewMethod", kind ="count", data =raw_data)
raw_data.BrewMethod.value_counts(1)*100
All Grain 67.277724
BIAB 16.268396
extract 11.678694
Partial Mash 4.775186
Name: BrewMethod, dtype: float64

It's now evident that we have only four classes in our target variable ("All Grain", "BIAB", "extract" and "Partial Mash"). However, roughly 67% of the records are "All Grain", 16% "BIAB", 12% "extract" and under 5% "Partial Mash". This is imbalanced data: one category dominates, so when we build a model it is most likely to predict that category. As a rule of thumb, anything within an 80:20 proportion is not considered imbalanced, and in the real world you will very often get imbalanced data. How to treat imbalanced data is outside the scope of this post.

e. Dropping Columns

At times you get data where not all the columns are relevant. We don't really have an irrelevant column here, but I'll show you how to remove one anyway: BeerID is just an identifier, so we drop it. Below, axis=1 means columns and axis=0 means rows; see the sketch after the drop for the row case.

raw_data = raw_data.drop(["BeerID"], axis=1)
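For completeness, the same method removes rows when you pass index labels with axis=0; a quick illustrative sketch (the labels 0 and 1 are arbitrary):

# Drop the rows whose index labels are 0 and 1 (axis=0 means rows)
without_first_rows = raw_data.drop([0, 1], axis=0)
without_first_rows.shape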

f. Changing Column Name

When you pull data from different systems there is always a chance of some weird column names, and it is best practice to rename them to something meaningful. The code below shows how; here we rename the StyleID column to ID.

raw_data = raw_data.rename(columns={"StyleID":"ID"})
raw_data.head()
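If you inherit a whole set of messy headers, you can also tidy them in bulk rather than one by one; a small, purely illustrative sketch that strips whitespace and replaces spaces with underscores (none of the columns here actually need it):

# Clean every column name in one go; assign back to raw_data.columns to apply
cleaned_columns = raw_data.columns.str.strip().str.replace(' ', '_')
print(cleaned_columns.tolist())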

g. Checking Duplicate Rows

When we pull data there is often a high probability of duplicate entries, and identifying them at the start is beneficial.

duplicate_rows_raw = raw_data[raw_data.duplicated()]
print("Duplicate rows: ", duplicate_rows_raw.shape)
Duplicate rows:  (0, 22)

duplicate_rows_raw
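There happen to be no duplicates here, but if the count had come back non-zero, removing them is a one-liner; a quick sketch:

# Keep the first occurrence of each duplicated row and drop the rest
deduped = raw_data.drop_duplicates(keep='first')
print(deduped.shape)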

h. Checking Missing Values

A missing value in a field may be due to incorrect data entry, or simply because the field was not mandatory. To get good model accuracy we need to treat these missing values: either drop them or replace them with the mean, median or mode, depending on the scenario.

sns.heatmap(raw_data.isnull(), yticklabels=False, cbar=False, cmap='viridis')
# Drop columns that will not be used in this analysis
raw_data = raw_data.drop(["Name","URL","ID","PrimingMethod","PrimingAmount","UserId","MashThickness","PitchRate","PrimaryTemp"], axis=1)
# Finding the null values.
total =raw_data.isnull().sum().sort_values(ascending=False)
print(total)
BoilGravity 2990
Style 596
BrewMethod 0
SugarScale 0
Efficiency 0
BoilTime 0
BoilSize 0
Color 0
IBU 0
ABV 0
FG 0
OG 0
Size(L) 0
dtype: int64
# Drop null values

raw_data = raw_data.dropna()
total =raw_data.isnull().sum().sort_values(ascending=False)
print(total)
BrewMethod 0
SugarScale 0
Efficiency 0
BoilGravity 0
BoilTime 0
BoilSize 0
Color 0
IBU 0
ABV 0
FG 0
OG 0
Size(L) 0
Style 0
dtype: int64
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70517 entries, 0 to 73860
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Style 70517 non-null object
1 Size(L) 70517 non-null float64
2 OG 70517 non-null float64
3 FG 70517 non-null float64
4 ABV 70517 non-null float64
5 IBU 70517 non-null float64
6 Color 70517 non-null float64
7 BoilSize 70517 non-null float64
8 BoilTime 70517 non-null int64
9 BoilGravity 70517 non-null float64
10 Efficiency 70517 non-null float64
11 SugarScale 70517 non-null object
12 BrewMethod 70517 non-null object
dtypes: float64(9), int64(1), object(3)
memory usage: 7.5+ MB
# Convert object columns to category dtype
raw_data['SugarScale'] = raw_data['SugarScale'].astype('category')
raw_data['BrewMethod'] = raw_data['BrewMethod'].astype('category')

We can clearly observe that two columns (BoilGravity and Style) had missing values. Now it's a call you have to make: should we impute those columns, replace the missing entries with some constant or random value, or drop them? That is a topic in its own right, and I will come up with a post that tackles it.

Please note that at times the code above will show zero missing values even though the field contains garbage values that Python treats as valid entries; if that is the case, check with a domain expert on what those values mean.
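For reference, here is a minimal imputation sketch as an alternative to dropping rows; it works on a copy, and the '?' sentinel and 'SomeColumn' name are hypothetical, just to show how a garbage placeholder could be turned into a real missing value first.

# Work on a copy so the original frame is untouched
imputed = raw_data.copy()

# Hypothetical: convert a sentinel such as '?' into a proper NaN before imputing
# imputed['SomeColumn'] = imputed['SomeColumn'].replace('?', np.nan)

# Numeric column: fill with the median; categorical column: fill with the mode
imputed['BoilGravity'] = imputed['BoilGravity'].fillna(imputed['BoilGravity'].median())
imputed['Style'] = imputed['Style'].fillna(imputed['Style'].mode()[0])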

i. Detecting Outliers

Outliers are data points that are out of line with the rest of the data. Here we need to identify the outliers and check with a domain expert on how to treat them: are they errors, or genuine one-off cases? Treating outliers helps simplify the data. The code below shows visually how each variable stands in terms of outliers.

plt.figure(figsize=(5,5))
plt.title('Box Plot',fontsize=10)
raw_data.boxplot(vert=0)

It's quite evident that most of the variables have outliers, and this will usually be the case in the real world. Note also that, because the columns sit on very different scales, the plot is dominated by the variables with the largest values, which is one reason to rescale the data in the next step.
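To put numbers behind the box plot, a common rule of thumb flags anything outside the 1.5 × IQR fences as an outlier; a small sketch that counts such points per numeric column:

# Count values outside the 1.5 * IQR fences for each numeric column
numeric_cols = raw_data.select_dtypes(include='number')
q1 = numeric_cols.quantile(0.25)
q3 = numeric_cols.quantile(0.75)
iqr = q3 - q1
outside_fences = (numeric_cols < (q1 - 1.5 * iqr)) | (numeric_cols > (q3 + 1.5 * iqr))
print(outside_fences.sum().sort_values(ascending=False))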

j. Scaling the Data

Scaling puts every feature onto the same scale so that the variables can be compared directly. We are going to use the sklearn library; here, its MinMaxScaler.

from sklearn import preprocessing
# select numeric columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

newdf = raw_data.select_dtypes(include=numerics)
newdf.head()
x = newdf.values #returns a numpy array
x
array([[21.77 , 1.055, 1.013, ..., 75. , 1.038, 70. ],
[20.82 , 1.083, 1.021, ..., 60. , 1.07 , 70. ],
[50. , 1.06 , 1.01 , ..., 90. , 1.05 , 72. ],
...,
[10. , 1.059, 1.01 , ..., 60. , 1.034, 60. ],
[24. , 1.051, 1.014, ..., 60. , 1.043, 72. ],
[20. , 1.06 , 1.01 , ..., 60. , 1.056, 70. ]])
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(newdf)
data_scale = pd.DataFrame(x_scaled)
data_scale.columns= ['Size(L)' ,'OG' ,'FG','ABV','IBU','Color' ,'BoilSize' ,'BoilTime' ,'BoilGravity' ,'Efficiency']
data_scale
# Now check outliers
plt.figure(figsize=(5,5))
plt.title('Box Plot',fontsize=10)
data_scale.boxplot(vert=0)
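Min-max scaling squeezes every column into the [0, 1] range. If you would rather standardise the data (zero mean, unit variance), the call is almost identical; a sketch using StandardScaler:

# Standardise instead of min-max scaling: subtract the mean, divide by the std
std_scaler = preprocessing.StandardScaler()
data_std = pd.DataFrame(std_scaler.fit_transform(newdf), columns=newdf.columns)
data_std.describe().round(2)  # means ~0, standard deviations ~1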

k. Checking Distribution

Looking at the distribution is another way to check how each variable is spread out, and whether we can transform it to simplify the data set.

raw_data.hist(figsize=(15, 5), bins=50, xlabelsize=5, ylabelsize=5)

Now, looking at the plots above, it's evident that there is some skewness in every variable, and several of them are heavily right-skewed.
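You can back the visual impression up with a number: pandas computes sample skewness directly, where values near 0 mean roughly symmetric and large positive values mean a long right tail.

# Skewness per numeric column, largest (most right-skewed) first
print(newdf.skew().sort_values(ascending=False))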

data_scale.hist(figsize=(15, 5), bins=50, xlabelsize=5, ylabelsize=5)

As you can see, the distributions with and without scaling are the same: min-max scaling is a linear transformation, so it changes each variable's range but not the shape of its distribution.

l. Finding Correlation

A large part of EDA revolves around correlations. Correlations are often really interesting and are typically used to find the features that are highly correlated with the dependent variable, and with each other, though not always.

# Get the n most frequent values,
n = 5
list_top=raw_data['Style'].value_counts()[:n].index.tolist()
list_top
['American IPA',
'American Pale Ale',
'Saison',
'American Light Lager',
'American Amber Ale']
sns.catplot(x ="Style", kind ="count", data =raw_data)
raw_data.Style.value_counts(1)*100
American IPA 16.370521
American Pale Ale 10.391820
Saison 3.584951
American Light Lager 3.229009
American Amber Ale 2.746855
...
Lichtenhainer 0.008509
Apple Wine 0.008509
Pyment (Grape Melomel) 0.005672
Traditional Perry 0.002836
French Cider 0.001418
Name: Style, Length: 175, dtype: float64
data = raw_data[raw_data['Style'].isin(list_top)]
sns.catplot(x="Style", kind="count", data=data)
data.Style.value_counts(1)*100
American IPA 45.069103
American Pale Ale 28.609354
Saison 9.869603
American Light Lager 8.889670
American Amber Ale 7.562271
Name: Style, dtype: float64
# Select categorical variables
data.describe(include='object')

Also, the visual below clearly indicates how the independent variables are correlated with one another; if two independent variables are highly correlated, we can drop one of those columns/features, as the other is sufficient.

# reduce dataset size
data_sample=data.sample(n=1000, random_state=1)
data_sample
plt.figure(figsize=(12,7))
sns.heatmap(data_sample.corr(), annot=True, cmap='Blues')
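Rather than reading the heatmap by eye, you can also list the strongly correlated pairs programmatically; a small sketch (the 0.8 cut-off is an arbitrary choice):

# Absolute correlations between numeric columns, upper triangle only
corr = data_sample.select_dtypes(include='number').corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
# Pairs whose absolute correlation exceeds the chosen threshold
print(pairs[pairs > 0.8].sort_values(ascending=False))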

Guided by the heatmap above, and to keep the pair plot readable, we now drop the redundant or less interesting columns and keep only Style, ABV, IBU, Color and BoilSize.

# Drop columns that are redundant or not needed for the pair plot
data_sample = data_sample.drop(["OG","FG","Size(L)","SugarScale","BrewMethod","BoilTime","BoilGravity","Efficiency"], axis=1)
data_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 25893 to 44807
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Style 1000 non-null object
1 ABV 1000 non-null float64
2 IBU 1000 non-null float64
3 Color 1000 non-null float64
4 BoilSize 1000 non-null float64
dtypes: float64(4), object(1)
memory usage: 46.9+ KB
sns.pairplot(data_sample, hue="Style")

The image may not be clearly visible here; you can check it in the Colab notebook I've shared at the end of this post. From this plot we can figure out the relationship between each pair of variables and how it relates to the beer style.

Finally, let's check the correlations again, this time on the reduced feature set.

plt.figure(figsize=(12,7))
sns.heatmap(data_sample.corr(), annot=True, cmap='Blues')

The darker the blue, the higher the positive correlation between two features, and vice versa; this makes it easy to see which of the remaining features move together.

Conclusion

If you have come this far, congratulations!!! You will have got the gist that there is no one-size-fits-all recipe for EDA. In this article I have tried to give you a basic idea of how to start, so that you don't overthink where and how to begin the EDA process. Out there, datasets will always have missing values, errors, imbalanced classes and biases. EDA is the first step in tackling a data science project: it is how we learn what data we have and evaluate its validity. Before I sign off, do check out Pandas Profiling and SweetViz as alternatives to much of what I've done above; a rough sketch follows.
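As a pointer, both libraries build a full HTML report from a couple of lines; roughly like the sketch below (APIs as I recall them — check each library's current documentation, and note that pandas-profiling has since been renamed ydata-profiling).

# Pandas Profiling: one self-contained HTML report for the whole frame
from pandas_profiling import ProfileReport
ProfileReport(raw_data, title="Beer Recipes EDA").to_file("profile_report.html")

# SweetViz: a similar interactive HTML report
import sweetviz as sv
sv.analyze(raw_data).show_html("sweetviz_report.html")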

I hope you like it.

No matter what books or blogs or courses or videos one learns from, when it comes to implementation everything can look like “Outside the Curriculum”.

The best way to learn is by doing! The best way to learn is by teaching what you have learned!

Never give up!

See you on Linkedin!


Oscar Rojo
The Startup

Master in Data Science. Passionate about learning new skills. Former branch risk analyst. https://www.linkedin.com/in/oscar-rojo-martin/. www.oscarrojo.es