Conceptual Skeleton to Solve Every EDA Ever!

Harshini Raju
Published in CodeChef-VIT · Jul 9, 2022
Image: Exploratory Data Analysis (source: https://pianalytix.com/exploratory-data-analysis/)

At this very moment on the Internet, there are about a billion blogs that explain “How to perform Exploratory Data Analysis” with a detailed list of the functions and code to use. But do you ever really understand the conceptual flow of an EDA?

Well, you’ve come to the right place!

EDA, or Exploratory Data Analysis, is the initial process of performing basic operations to identify patterns, anomalies, and hypotheses in the data you have.

Literally anyone who starts with Data Science or Machine Learning begins by doing EDA on one of the following five most popular datasets:

  1. Iris Flower Dataset
  2. Titanic Dataset
  3. Train Dataset
  4. Breast Cancer Wisconsin Dataset
  5. Housing Dataset

All of these are available on Kaggle, and there are about a gazillion solutions for each of them. These solutions might help you do the first 10 or even 20 EDAs, but at the end of the day, you need to understand the point of the functions.
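As a quick illustration, the Iris dataset also ships with scikit-learn, so you can load it without a Kaggle download. A minimal sketch, assuming pandas and scikit-learn are installed:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris as a DataFrame: four measurement columns plus a "target"
# column holding the species label.
iris = load_iris(as_frame=True)
df = iris.frame

print(df.shape)   # 150 rows, 5 columns
print(df.head())  # a first look at the records
```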

Image source: https://devopedia.org/exploratory-data-analysis

In my experience of doing EDA, I figured out a simple skeletal pattern that can be applied to every dataset.

Here are the basic steps based on which you can solve any EDA-based problem:

1. Identify the objective of the EDA

While this may sound obvious, when jumping into an EDA we often forget to ask why we are doing it. It is important to understand the final goal in order to figure out a plan of attack, especially when trying to tackle real-life data.

2. Identify target attribute(s)

In a dataset, the columns are the attributes, and when performing an EDA there may be one or more target attributes based on which our outcome is defined.

For instance, in the Iris flower dataset, the category of the flower is the primary target attribute. Hence, the purpose of the EDA is to identify patterns and arrive at hypotheses that determine the influence of the other attributes on the target attribute.
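To make this concrete, here is a minimal sketch of separating the target attribute from the rest, using a hypothetical miniature Iris-style frame (the column names and values are purely illustrative):

```python
import pandas as pd

# Hypothetical miniature Iris-style data; "species" is the target attribute.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

target = "species"
features = [c for c in df.columns if c != target]

print(features)             # the attributes whose influence we study
print(df[target].unique())  # the categories the EDA tries to explain
```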

3. Cleaning before analysing

The gospel of Data Science is to clean the data before proceeding to do anything! Always ensure you work with clean data: no null values, no corrupt entries, and no missing records. It should become second nature to start by identifying the discrepancies and cleaning the data.

Cleaning and fixing the data before processing it. Reference notebook: https://github.com/HarshiniR4/Books-Recommendation-Engine/blob/main/Book%20Recommendation.ipynb
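The usual pandas moves for this step look roughly like the following sketch (the frame and its discrepancies are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with the usual discrepancies: a null and a duplicate.
raw = pd.DataFrame({
    "age":  [22.0, np.nan, 35.0, 35.0],
    "fare": [7.25, 71.28, 8.05, 8.05],
})

print(raw.isnull().sum())  # count the missing values per column

clean = raw.drop_duplicates()                         # drop the duplicate row
clean = clean.fillna({"age": clean["age"].median()})  # impute the missing age

print(clean)
```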

4. Start with the numerical details

Now that we have clean data, we can begin converting it into useful information. First and foremost, look into the numerical details that build an understanding of the data.

You can start with the basics, mean and median, and move up to variance. This is where an adept understanding of statistics gets really handy. But along with knowing the concepts, you should also understand how they affect the data and how to interpret the results of the functions you use.
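In pandas, these basics are one-liners; a minimal sketch on some made-up measurements:

```python
import pandas as pd

# Hypothetical petal measurements, purely for illustration.
df = pd.DataFrame({"petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.8]})

print(df["petal_length"].mean())    # central tendency
print(df["petal_length"].median())  # robust to outliers
print(df["petal_length"].var())     # spread around the mean
print(df.describe())                # all the basics in one table
```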

5. Visualisations, visualisations and… more visualisations!

The key to a good EDA is drawing the right inferences from vast data, and for that, you need visual representations. Personally, I have the most fun trying out every kind of graph and visualisation tool Python has to offer; the results are always invigorating to look at and provide comprehensive insights into your data. Also, it’s really pretty to look at, so: bonus!
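As a sketch of what this step can look like with matplotlib (the data is made up; in a notebook you would drop the Agg backend line and let the plots render inline):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; unnecessary inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical petal measurements, purely for illustration.
df = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.8],
    "petal_width":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.8],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(df["petal_length"], bins=5)                # one attribute's distribution
ax1.set_title("petal_length distribution")
ax2.scatter(df["petal_length"], df["petal_width"])  # relationship between two
ax2.set_title("length vs width")
fig.savefig("eda_plots.png")
```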

6. Using more EDA tools

Once you understand the rudimentary stats of the data, it is time to move on to the ever-so-famous concepts of clustering, classification, bivariate, and multivariate analysis, and other predictive models. Thanks to the overflow of information on the net, I am not going to detail the functions, but feel free to check out the links below for EDA examples (self-promotion *wink*).
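For instance, bivariate analysis often starts with a correlation matrix; a minimal pandas sketch on made-up measurements:

```python
import pandas as pd

# Hypothetical measurements, purely for illustration.
df = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.8],
    "petal_width":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.8],
    "sepal_width":  [3.5, 3.0, 3.2, 2.8, 3.3, 2.7],
})

corr = df.corr()             # pairwise Pearson correlations
print(corr["petal_length"])  # which attributes move together?
```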

Now that you have a pretty-looking, insightful and well-rounded EDA, you’re good to go ahead and build high-level ML and DL models!

Thanks for reading till here! While you’re at it, I would appreciate it if you left a clap and checked out my other blogs!



I’m a computer science student interested in Data Science and IoT. I also have a passion for writing and am a budding content writer.