Conceptual Skeleton to Solve Every EDA Ever!

Harshini Raju
Published in CodeChef-VIT · Jul 9, 2022
Image: Exploratory Data Analysis (source: https://pianalytix.com/exploratory-data-analysis/)

At this very moment on the Internet, there are about a billion blogs that explain “How to perform Exploratory Data Analysis” with a detailed list of the functions and code to use. But do you ever really understand the conceptual flow of an EDA?

Well, you’ve come to the right place!

EDA, or Exploratory Data Analysis, is the initial process of performing basic operations to identify patterns, anomalies, and hypotheses in the data you have.

Literally anyone who starts with Data Science or Machine Learning begins by doing EDA on one of the following five most popular datasets:

  1. Iris Flower Dataset
  2. Titanic Dataset
  3. Train Dataset
  4. Breast Cancer Wisconsin Dataset
  5. Housing Dataset

All of these are available on Kaggle, and there are about a gazillion solutions for each of them. These solutions might help you do the first 10 or even 20 EDAs, but at the end of the day, you need to understand the point of the functions.
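As a quick illustration, the Iris dataset also ships with scikit-learn, so you can load it without a Kaggle download. A minimal sketch, assuming pandas and scikit-learn are installed:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris as a DataFrame: four measurement columns plus a "target"
# column holding the species label.
iris = load_iris(as_frame=True)
df = iris.frame

print(df.shape)   # 150 rows, 5 columns
print(df.head())  # a first look at the records
```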

Image source: https://devopedia.org/exploratory-data-analysis

In my experience of doing EDA, I figured out a simple skeletal pattern that can be applied to every dataset.

Here are the basic steps based on which you can solve any EDA-based problem:

1. Identify the objective of the EDA

While this may sound obvious, when jumping into an EDA we often forget to ask why we are doing it. It is important to understand the final goal in order to figure out a plan of attack, especially when trying to tackle real-life data.

2. Identify target attribute(s)

In a dataset, the columns are the attributes, and when performing an EDA there may be one or more target attributes based on which our outcome is defined.

For instance, in the Iris flower dataset, the category of the flower is the primary target attribute. Hence, the purpose of the EDA is to identify patterns and arrive at hypotheses that determine the influence of the other attributes on the target attribute.
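To make this concrete, here is a minimal sketch of separating the target attribute from the rest, using a hypothetical miniature Iris-style frame (the column names and values are purely illustrative):

```python
import pandas as pd

# Hypothetical miniature Iris-style data; "species" is the target attribute.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

target = "species"
features = [c for c in df.columns if c != target]

print(features)             # the attributes whose influence we study
print(df[target].unique())  # the categories the EDA tries to explain
```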

3. Cleaning before analysing

The gospel of Data Science is to clean the data before proceeding to do anything! Always ensure you work with clean data: no null values, no corrupt entries, and no missing records. It should become second nature to start by identifying the discrepancies and cleaning the data.

Cleaning and fixing the data before processing it. Reference notebook: https://github.com/HarshiniR4/Books-Recommendation-Engine/blob/main/Book%20Recommendation.ipynb
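The usual pandas moves for this step look roughly like the following sketch (the frame and its discrepancies are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with the usual discrepancies: a null and a duplicate.
raw = pd.DataFrame({
    "age":  [22.0, np.nan, 35.0, 35.0],
    "fare": [7.25, 71.28, 8.05, 8.05],
})

print(raw.isnull().sum())  # count the missing values per column

clean = raw.drop_duplicates()                         # drop the duplicate row
clean = clean.fillna({"age": clean["age"].median()})  # impute the missing age

print(clean)
```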

4. Start with the numerical details

Now that we have clean data, we can begin converting it into useful information. First and foremost, look into the numerical details that build an understanding of the data.

You can start with the basics, mean and median, and move up to variance. This is where an adept understanding of statistics gets really handy. But along with knowing the concepts, you should also understand how they affect the data and how to interpret the results of the functions you use.
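In pandas, these basics are one-liners; a minimal sketch on some made-up measurements:

```python
import pandas as pd

# Hypothetical petal measurements, purely for illustration.
df = pd.DataFrame({"petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.8]})

print(df["petal_length"].mean())    # central tendency
print(df["petal_length"].median())  # robust to outliers
print(df["petal_length"].var())     # spread around the mean
print(df.describe())                # all the basics in one table
```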

5. Visualisations, visualisations and… more visualisations!

The key to a good EDA is drawing the right inferences from vast data, and for that, you need visual representations. Personally, I have the most fun trying out every kind of graph and visualisation tool Python has to offer; the results are always invigorating to look at and provide comprehensive insights into your data. Also, it’s really pretty to look at, so: bonus!
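As a sketch of what this step can look like with matplotlib (the data is made up; in a notebook you would drop the Agg backend line and let the plots render inline):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; unnecessary inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical petal measurements, purely for illustration.
df = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.8],
    "petal_width":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.8],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(df["petal_length"], bins=5)                # one attribute's distribution
ax1.set_title("petal_length distribution")
ax2.scatter(df["petal_length"], df["petal_width"])  # relationship between two
ax2.set_title("length vs width")
fig.savefig("eda_plots.png")
```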

6. Using more EDA tools

Once you understand the rudimentary stats of the data, it is time to move on to the ever-so-famous concepts of clustering, classification, bivariate, and multivariate analysis, and other predictive models. Thanks to the overflow of information on the net, I am not going to detail the functions, but feel free to check out the links below for EDA examples (self-promotion *wink*).
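For instance, bivariate analysis often starts with a correlation matrix; a minimal pandas sketch on made-up measurements:

```python
import pandas as pd

# Hypothetical measurements, purely for illustration.
df = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.8],
    "petal_width":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.8],
    "sepal_width":  [3.5, 3.0, 3.2, 2.8, 3.3, 2.7],
})

corr = df.corr()             # pairwise Pearson correlations
print(corr["petal_length"])  # which attributes move together?
```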

Now that you have a pretty-looking, insightful and well-rounded EDA, you’re good to go ahead and build high-level ML and DL models!

Thanks for reading till here! While you’re at it, I would appreciate it if you left a clap and checked out my other blogs!



I’m a computer science student interested in Data Science and IoT. I also have a passion for writing and am a budding content writer.