How do you do EDA❓

Immanuel Sanka
8 min read · Sep 11, 2023


About a year ago, during a casual weekend discussion, I first heard the term EDA. I remember pondering: “isn’t this just another name for data analysis?” 🤔 If this thought has crossed your mind too, then perhaps you’re experiencing the same curiosity I did back then.

As a researcher 👓 I didn’t strictly adhere to the conventions of EDA. Why? Because the datasets I worked with were typically pre-cleaned, and I either knew what to do with them or simply emulated methodologies from prior research or publications. However, in the world of Data Science, EDA emerges as a critical phase. It’s here that we often encounter the famous phrase:

“Garbage in, garbage out!”

Yeah, it’s a bit harsh, but the message is clear: the quality of your input data dictates the quality of your output. I can say that in the research world, if a research plan is poorly designed, the data may not be sufficient or eligible for a publication. It’s kind of the same thing, but from what I learned, EDA has its own unique structure, both technically and theoretically.

EDA stands for Exploratory Data Analysis. According to Igual and Seguí (2017), a primary objective of EDA is

“To visualize and summarize the sample distribution, thereby allowing us to make tentative assumptions about the population distribution.”

That being said, EDA gives us a comprehensive view of a dataset. When I began my journey with EDA, I thought it was quite complicated and might require various kinds of profiling, for instance profiling the data type (quantitative or categorical). The analysis might encompass a range of statistical profiles: mean, median, mode, quantiles, percentiles, variance, and more. These profiles offer a snapshot or summary of the data. In the book I mentioned above, EDA extends beyond mere statistical profiling. It necessitates an analysis of the data distribution, covering aspects like outlier detection, data asymmetry (e.g., skewness) profiling, distribution profiling (both the probability density function, PDF, and the cumulative distribution function, CDF), and estimation (including variance, mean squared error, etc.). I recommend this book since it’s quite straightforward and provides good insights for anyone entering the data science world.

The good news is that, to date, we don’t really need to run these analyses one by one. There are libraries that ease EDA, including SweetViz, D-Tale, Missingno, Sketch, and YData Profiling, as long as you can write a little Python or at least know how to execute code in Google Colaboratory, a Kaggle notebook, or Jupyter. There is a post which offers a detailed explanation of these libraries and you can find it here. Since I care about user-friendliness and my previous projects were around this corner, I feel that coding comprehension might still be a barrier for some. Therefore, I tried to make this EDA dashboard.
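Just to illustrate how little code these libraries need, here is a minimal sketch using YData Profiling (assuming the ydata-profiling package is installed; the file name and report title are placeholders):

import pandas as pd
from ydata_profiling import ProfileReport

# Load any tabular dataset (the path is just a placeholder)
df = pd.read_csv("your_dataset.csv")

# Build a single HTML report covering summary statistics, missing values,
# correlations, and distributions in one go
profile = ProfileReport(df, title="Quick EDA report")
profile.to_file("eda_report.html")  # or profile.to_notebook_iframe() in a notebook

One call replaces a lot of the manual profiling described above, which is exactly why these libraries are popular.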

EDA dashboard

If you have read my previous posts, you know I have been building different dashboards and deploying them with the Streamlit framework, e.g. a data processing pipeline and a Bio-stocks analytical dashboard. Now, I’d like to build the EDA dashboard using the same framework. The dashboard will include four points that cover the theory:

  1. Data summary 📓 (mean, median, quantiles, percentiles, variance, etc)
  2. Outlier profile 🔴
  3. Data asymmetry profile 📉 (e.g., skewness)
  4. Distribution profiling 📈 (PDF and CDF)

Since we got the theory, let’s start building it!

First and foremost, let’s begin by importing the necessary libraries and setting up a file upload placeholder capable of retrieving either .csv or .xlsx file formats. To offer a quick preview of the data, we’ll display its first five rows using st.write(data.head()).

You might observe that the data upload placeholder I have created is slightly more intricate than the standard st.file_uploader(). In my experience, especially from Kaggle, not all .csv files adhere to the same format. I sometimes need to experiment with various encodings to ensure the .csv file is readable. Otherwise, any Excel file should typically work seamlessly. 😃

import streamlit as st
import pandas as pd

uploaded_file = st.file_uploader("Choose a CSV or XLSX file", type=['csv', 'xlsx'])

if uploaded_file:
    if uploaded_file.name.endswith('.csv'):
        def try_read_csv(file, encodings=['utf-8', 'latin1', 'ISO-8859-1', 'cp1252']):
            for encoding in encodings:
                file.seek(0)  # Reset file pointer to start
                try:
                    return pd.read_csv(file, encoding=encoding)
                except (UnicodeDecodeError, pd.errors.EmptyDataError):
                    pass
            raise ValueError("None of the provided encodings worked or the file is empty!")

        data = try_read_csv(uploaded_file)
        st.write(data.head())
    else:
        data = pd.read_excel(uploaded_file)
        st.write(data.head())

Now that we have the data, we can start putting the EDA workflow together.

Data summary 📓

In Python, you can do this by describing the data and showing the data type of each column. However, only numeric data can be summarized directly this way. To see all data types and the counts of non-missing values, we can use the .info() function from the Pandas library.

# For assessing numeric data, Pandas' .describe() provides a statistical summary
st.write(data.describe())

# .info() prints an overview of data types and non-missing values to stdout,
# so capture its output in a buffer to display it in Streamlit
import io
buffer = io.StringIO()
data.info(buf=buffer)
st.text(buffer.getvalue())

As you can see, the data summary can be shown as a table. From this summary, we can immediately tell whether the data contain quantifiable (numeric) or categorical values. Usually, we can already see whether there are outliers, e.g. by looking at the data distribution and analyzing the gaps between quantiles and percentiles. I just took a random dataset from Kaggle, and here we have 12 parameters consisting of three numeric and nine categorical data types. You can start guessing at some anomalies, e.g. the high standard deviation and maximum value of the birth rate and the maximum value of the death rate. Do you see anything else that raises questions?

Data summary 📓 includes the data types, the numeric data summary, and the categorical data summary
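Since .describe() on its own only summarizes numeric columns, here is a small, hedged sketch of how the numeric and categorical summaries above could be produced (data is the uploaded DataFrame from earlier):

# Data types of each column
st.write(data.dtypes)

# Numeric columns: mean, std, quartiles, etc.
numeric_cols = data.select_dtypes(include="number")
st.write(numeric_cols.describe())

# Categorical columns: count, number of unique values, most frequent value
categorical_cols = data.select_dtypes(include=["object", "category"])
if not categorical_cols.empty:
    st.write(categorical_cols.describe())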

Outlier profile 🔴

There are different approaches to do this. Following the theory from Igual and Seguí (2017), we need to look for two things:

  • Samples that are too far from the median
  • Samples whose values exceed the mean by two or three standard deviations

To make a quick profile, we can plot the data as box plots to see whether the distribution helps us find outliers. I usually use the Matplotlib and Seaborn libraries; however, these two don’t provide much interactivity, so I moved to Plotly. Numeric data can be plotted directly; for categorical data, counting the members of each group is necessary to see the data counts. From the profile, we can take out data points that are too far from the median, or flag them based on the standard deviation as mentioned in the book. Or we can just check the dots that lie above the upper quartile (Q3). If you want to detect outliers using the upper and lower quartiles, we just need to measure the interquartile range (IQR) and use it in the formula below:

  • Outliers below Q1: values less than Q1 − 1.5 × IQR
  • Outliers above Q3: values greater than Q3 + 1.5 × IQR

or in Python, we can just type:

outliers = df[(df['charges'] < Q1 - 1.5 * IQR) | (df['charges'] > Q3 + 1.5 * IQR)]

Boxplots showing the outliers in numeric data
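For context, here is a fuller sketch of how both rules could be applied and how the box plot above could be drawn with Plotly, using the dashboard’s data DataFrame and the hypothetical 'charges' column from the one-liner above:

import plotly.express as px

# Interactive box plot of a numeric column
st.plotly_chart(px.box(data, y="charges"))

# IQR rule: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
Q1 = data["charges"].quantile(0.25)
Q3 = data["charges"].quantile(0.75)
IQR = Q3 - Q1
iqr_outliers = data[(data["charges"] < Q1 - 1.5 * IQR) | (data["charges"] > Q3 + 1.5 * IQR)]

# Rule from the book: values more than two or three standard deviations from the mean
mean, std = data["charges"].mean(), data["charges"].std()
std_outliers = data[(data["charges"] - mean).abs() > 3 * std]

st.write(iqr_outliers)
st.write(std_outliers)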

Data asymmetry profile 📉

Let’s touch on the data asymmetry profile! This is one of the things I rarely did during my research. Since the book doesn’t really explain what it is, I checked another resource, Jambu (1991). Based on this source, skewness can be defined as “the degree of asymmetry of a distribution”. For a simpler explanation, I would just refer to this Medium post from Ashish Kumar Singh. Besides skewness, Ashish also explains kurtosis, or “the degree of peakedness of a distribution” (Jambu, 1991). The book from Igual and Seguí (2017) is also quite straightforward in explaining this part. For the data asymmetry profile, I made two graphs: one showing a single column’s distribution (univariate) for skewness, and one box plot showing how long the “tail” is and how “fat” the box can be.

Example of skewness analysis (left) and kurtosis profile (right)
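As a rough sketch of what sits behind these two plots, skewness and kurtosis can be computed directly from a Pandas Series and the distributions drawn with Plotly (the column name here is hypothetical):

import plotly.express as px

col = "charges"  # hypothetical numeric column

# Pandas uses Fisher's definitions, so a normal distribution has
# skewness of about 0 and (excess) kurtosis of about 0
st.write(f"Skewness: {data[col].skew():.2f}")
st.write(f"Kurtosis: {data[col].kurtosis():.2f}")

# Univariate histogram to eyeball the asymmetry,
# and a box plot to see how long the tail is
st.plotly_chart(px.histogram(data, x=col))
st.plotly_chart(px.box(data, y=col))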

Distribution profiling

After the asymmetry, the fourth component is distribution profiling (PDF and CDF). Even though I didn’t use this approach often, I did some analyses that required these profiles, so I am quite familiar with them. The PDF describes the density of an absolutely continuous random variable, i.e. how likely values are around each point, while the CDF gives the probability that the variable takes a value less than or equal to a given point (Igual and Seguí, 2017; Wikipedia, 2023). To simplify, both distributions provide an overview of how the data behave, either point by point or cumulatively.

Data distribution represented by the PDF and CDF
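If you want to reproduce something like the figure above, here is a minimal sketch (again with a hypothetical column name) of the empirical PDF and CDF using NumPy and Plotly:

import numpy as np
import plotly.express as px

col = "charges"  # hypothetical numeric column
values = data[col].dropna()

# Empirical PDF: histogram normalised to a probability density
st.plotly_chart(px.histogram(data, x=col, histnorm="probability density", title="Empirical PDF"))

# Empirical CDF: fraction of samples less than or equal to each value
sorted_vals = np.sort(values)
cdf = np.arange(1, len(sorted_vals) + 1) / len(sorted_vals)
st.plotly_chart(px.line(x=sorted_vals, y=cdf, title="Empirical CDF",
                        labels={"x": col, "y": "P(X <= x)"}))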

Estimation💭

To further assess the data, estimation is also suggested for EDA and is mentioned in the book. Estimation is meant to approximate the values of unknown parameters of the data, e.g. of its distribution (Igual and Seguí, 2017). However, it is suggested that estimation be done after outliers have been treated, so it’s not just a guess! Since this post is meant to show how we can build an EDA dashboard, let’s keep this one for a later post. For now, let me point you to a reading on what estimation is: a Medium post from Selva Raj, here.

Image from Leads2business

So, what do you think about EDA as discussed in the book by Igual and Seguí (2017)❓ To me, it’s quite concise; yet reading only the EDA part of the book is not the best way to learn actual Exploratory Data Analysis. In the earlier pages, the book also discusses other important parts, including variable description, data preparation and selection, filtering, and so on. These parts matter because they help us avoid confusion when performing EDA. That’s why I said at the beginning that this is quite interesting both technically and theoretically. You can play around with your perspective and domain knowledge to go deeper in your analysis, but you could also get lost in a deep slumber of garbage data and find nothing.

To me, EDA is important and necessary to assess your data before processing and analysis, even modeling!

What’s your thought about this?

Do you have a better approach to perform EDA?

What’s your strategy?

If you want to play around with the EDA dashboard that I built, you can check here 👇 :

easy-eda.streamlit.app

Till the next post!

References:

  1. Igual, L. and Seguí, S. (2017). Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. pp. 32–50.
  2. Jambu, M. (1991). Chapter 3: 1-D Statistical Data Analysis. In M. Jambu (Ed.), Exploratory and Multivariate Data Analysis (pp. 27–62). Academic Press. https://doi.org/10.1016/B978-0-08-092367-3.50007-1

Feel free to continue the discussion in my post! You can reach out to me on Instagram @immanuelsanka or Twitter (or now X) @I_Sanka or LinkedIn Immanuel Sanka if you’re interested in a chat, collaboration, joint project or even hire me? 😄 Let me know! About me? Check my first post here!


Written by Immanuel Sanka

Complicated bioinformatician/ data scientist/ analyst/ whatever suits me :)