EDA — EXPLORATORY DATA ANALYSIS

DEEP DIVE INTO EDA

Oscar Rojo
The Startup
13 min readSep 10, 2020

--

Till now I’ve not explained what’s EDA and those who have come so far, CONGRATULATIONS!!!, we will now start the EDA process. Heads up before we dive.

I’ll try my level best to explain you to get started and consider this as a kick start for any of your EDA.

Photo by Jouwen Wang on Unsplash

So, million-dollar question, What the heck is EDA? In a layman terms, it’s a process where you try to get meaningful insights from given data. EDA is process which gives you confidence on your data and you are in better position to build your model.

The “D” data in EDA is the broadly classified into Structured and Unstructured form. Structured data is basically your tabular data, which tells the literal meaning in its current form, like sales data, banking transaction etc.

Unstructured data would be your video files, image files and audio files.

We will be focusing on Structured data in this post to understand the EDA process. The structured data is further divided as shown below. (Source — Intellspot.com)

I will be using Kaggle’s Pulsar Star data to move ahead with EDA.

Please make a note that data wouldn’t be same as we have in Kaggle. Also, try to make visuals simple(2D) to understand. Fancy stuff looks good, but hard to grasp without explanation.

LET’S GET STARTED

a. Importing Python Libraries

Use below code only if your using GoogleColab. You will get a choose file button using which you can upload the data.

from google.colab import files

upload = files.upload()

To check is the file has been uploaded properly, write the below code.

This will display all the file available in your current working directory.

!ls

b. Download the data from Kaggle

If you are going to use this notebook more than once, it is preferable that you delete the files that you are going to download and decompress, to avoid errors and problems. ! rm -f beer-recipes.zip ! rm -f recipeData.csv ! rm -f styleData.csv https://github.com/Kaggle/kaggle-api

To download we are use Kaggle’s API ! kaggle datasets download -d jtrofe/beer-recipes Unzip the file downloaded ! unzip beer-recipes.zip

c. Loading the data

Now let’s load the file into pandas data frame using below code. Please make a note that, in this problem the data was provided in csv format, hence we have used “read_csv” method. However, in real world you would be provided with different format and you have to use different method based on file format. GOOGLE IT!!!!!!!!

Until here, you have just loaded the raw data, and you have no idea what’s there in this file. Let’s now peek into the data by using below codes. This will help you to get the gist of how data look like in the file provided. Numbers inside the bracket can be changed based on how many lines you want. By default, it will throw 5 lines. Head is for lines from top and Tail is for lines from bottom.

png
png
png

d. Checking Data Types

For now, looking at the 15 lines we can assume that all the columns are of float (with decimals) nature. Columns in Machine Learning are called as Features or Independent Variables and Target column or the column which we want to predict are called Labels/Dependent Variables. Let’s confirm our assumptions. Below code will let you know the data type of all the fields in your data.

With these two lines of code you now have a brief idea about what my data is and its structure. Please note that meaning of each column are domain specific and before starting EDA you should get hold of these meaning. Do lot of research to understand the implication of these features

Let’s now run a code to check certain statistics on each feature/column.

png

With this one line of code you get most of the statistics inference of your data

With this we can infer that not all the feature/columns have all the data. We have some missing values. And with percentile number and mean we can gauge the central tendency of the data.

Please note that describe will only provide data for Quantitative data and for categorical you need to mention include = “all” in the brackets. raw_data.describe(include = ‘all’).

Now, we know this is a classification problem, where we need to predict based on the features whether it’s a pulsar star or not. So, 1 in target column means it’s a pulsar star and 0 means it’s not a pulsar star.

In real world you normally get entire set of data and you divide the set into Training, Validation and Test. In this problem, Kaggle has provided 2 sets Training which we loaded and Test set where we need to test our model.

In this post we are doing EDA only on train set to keep things simple. But make sure if you are making any changes to train data, like changing column name or converting the values of the column, make sure you are changing the same on test data.

Now, let check our target variable, how many are “All Grain”,”extract”,”BIAB” or “Partial Mash”. Also, if we have any more category in our target variable.

png

It’s now evident that we only have four class in our target variable (“All Grain”,”extract”,”BIAB” and “Partial Mash”), however, we have 67% of data with “All Grain”, 16% “BIAB”, 11% “extract” and only 4% are “Partial Mash”. It’s highly imbalance data. When I say it’s an imbalance data, it means any one category is dominant and when we build the model, its most likely to predict that category. Ideally, anything which is above 80:20 proportion is not an imbalance data. And in real world you will always get imbalance data. How to treat this imbalance data is not in scope of this post.

e. Dropping Columns

At times you get data where not all columns are relevant, though we don’t have any irrelevant column as of now but I’ll let you know how to remove the columns. Below axis = 1 means columns and axis = 0 means row

f. Changing Column Name

When you pull data from different system, there are always chances of some wired column name so to change the column name we can use below code and its always best practice to have some meaningful name. I’ll show you how to change the name of the output column in this case.

png

G. Checking Duplicate Rows

Most of the time when we pull data, there are high probability of duplicate entries and identifying those at the start would be beneficial.

png

h. Checking Missing Values

Missing value in a field may be because of incorrect data entry or data is not mandatory in that case. To get good model accuracy, we need to treat this missing value. Either to drop or replace it with mean, median or mode based on scenario.

png

We can clearly observe that there are 3 columns which have missing values. Now it’s a call you have to make, should we impute the columns which have missing values or replace it with come constant value or random value. It’s altogether different topic and will come up with a post that will tackle this problem.

Please note at times the above code will show zero missing values, but there may be some garbage values in the field which python with consider it as a valid entry, if this is the case you need to meet the domain expert to check this value.

i. Detecting Outliers

Outliers are basically data points or set of points that are not in sync with other data sets. With this task we need to identify the outliers and check with domain expert on how treat this, is it because of error or its one of a case. Treating outliers help in simplifying the data. Below code will let us know visually how each variable stands in terms of Outlier.

png

Its quite evident that all the variables have outliers and this would be the case in real world. However, we can see that “Skewness of the DM-SNR curve” and “Mean of the DM-SNR curve” are major contributors in outlier.

j. Checking Distribution

The process of standardizing corresponds to equalizing the information in its same scale. We are going to use skearn library

png
png
png

k. Checking Distribution

Distribution again is one of the methods to check how your dependent variable is distributed and if we can treat to simplify the data sets.

png

Now, looking at the above plot, its evident that there is bit of skewness in every variable, but Mean & SD of DM-SM curve are heavily right skewed.

png

As you can see the distribution in data with or without scaling are the same

l. Finding Correlation

Major weightage of EDA is on correlations, and often correlations are really interesting and used to find the features which are highly correlated with our dependent variable, but not always.

png
png
png

Also, below visual will clearly indicate how independent variable are corelated, and if, we have independent variables which are highly corelated, we can drop those columns/features as one of them is sufficient.

png
png

Also, below visual will clearly indicate how independent variable are corelated, and if, we have independent variables which are highly corelated, we can drop those columns/features as one of them is sufficient.

png

The image is not clearly visible, you can check it on the colab which I’ve shared at the end of this module. From this graph, we can clearly figure out relation between two variables and how it impacts the output.

Also, below visual will clearly indicate how independent variable are corelated, and if, we have independent variables which are highly corelated, we can drop those columns/features as one of them is sufficient.

png

It’s evident that, “Excess kurtosis of the integrated profile” and “Skewness of the integrated profile” are highly corelated with dependent variable. Darker the blue higher is the positive correlation and vice versa.

Conclusion

If you have come so far, Congratulations!!! and you would have got a gist that — there is no one size fits all for EDA. In this article, I have tried to give you basic idea on how to start, so that you don’t think to much on where and how to go about this EDA process Out there all datasets would always have missing values, errors in the data, unbalanced data, and biased data. EDA is the first step in tackling a data science project to learn what data we have and evaluate its validity. Before I sign-off, do check on Pandas Profiling and SweetViz as an alternative for what I’ve done above.

I hope you like it.

No matter what books or blogs or courses or videos one learns from, when it comes to implementation everything can look like “Outside the Curriculum”.

The best way to learn is by doing! The best way to learn is by teaching what you have learned!

Never give up!

See you on Linkedin!

--

--

Oscar Rojo
The Startup

Master in Data Science. Passionate about learning new skills. Former branch risk analyst. https://www.linkedin.com/in/oscar-rojo-martin/. www.oscarrojo.es