Automate the exploratory data analysis (EDA) to understand the data faster and easier
What is EDA?
EDA is one of the most important steps we take to understand a dataset better. Almost all data analytics and data science professionals go through this process before generating insights or building models. In practice, it can take a lot of time, depending on the complexity and completeness of the dataset: the more variables there are, the more exploring we have to do to get the summary we need before moving on to the next steps.
That’s why, in R and Python, the most common programming languages for data analysis, some packages help us do this process faster and easier, but not better. Why not better? Because they only show us a summary; it is still up to us to dig deeper into any variables we find “interesting”.
The “80/20 rule” applies: 80 percent of a data analyst or scientist’s valuable time is spent simply finding, cleansing, and organizing data, leaving only 20 percent to perform analysis.
Which libraries can we use?
In R, we can use these libraries:
dataMaid
DataExplorer
SmartEDA
In Python, we can use these libraries:
ydata-profiling
dtale
sweetviz
autoviz
Let’s try each of the libraries listed above to see what they look like and how they can help us do exploratory data analysis! In this post, I will use the iris dataset, which is commonly used when learning to code in R or Python.
In R, you can use this code to load the iris dataset:
# iris ships with base R, so we don't need to load any packages
df <- iris
# use "head()" to show the first 6 rows
head(df)
In Python, you can use this code to load the iris dataset:
# need to import these things first
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
# use load_iris
iris = load_iris()
# convert into a pandas data frame
df = pd.DataFrame(
    data=np.c_[iris['data'], iris['target']],
    columns=iris['feature_names'] + ['species']
)
# manually set the species column as a categorical variable
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
# use ".head()" to show the first 5 rows
df.head()
R: dataMaid
First, we need to execute the simple code below:
# install the dataMaid library
install.packages("dataMaid")
# load the dataMaid library
library(dataMaid)
# use makeDataReport with HTML as output
makeDataReport(df, output = "html", replace = TRUE)
From the first snapshot (Image 3), we already get a lot of information about the iris dataset:
- The number of observations is 150.
- The number of variables is 5.
- Variable checks were performed, depending on the data type of each variable, such as identifying miscoded missing values, levels with < 6 obs, and outliers.
From the second snapshot (Image 4):
- The summary table of the variables includes the variable class, unique values, missing observations, and whether any problems were detected. We can see that the Sepal.Width and Petal.Length variables have problems detected.
- Central measurements for Sepal.Length were provided, including a histogram showing its univariate distribution.
- Sepal.Width has possible outlier values, as listed. That's why the summary table says problems were detected.
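dataMaid applies its own outlier heuristics, but if you want a rough manual cross-check of flagged values, here is a minimal Python sketch of the common 1.5 × IQR rule, using the pandas df we built earlier (note the Python column names differ from the R ones):
# a minimal sketch of the common 1.5 * IQR outlier rule (uses the pandas df
# built earlier; dataMaid's own heuristic may flag slightly different values)
col = df["sepal width (cm)"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
print(col[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)])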
From the third snapshot (Image 5):
- Petal.Length has possible outlier values, as listed.
- Central measurements for Petal.Width were provided, including a histogram showing its univariate distribution.
- Species, the target variable, was detected as a factor, and the count is equal for each level: 50 observations per species.
Based on the data report above, created using dataMaid in R, we already get a lot of information about the iris dataset just by executing one line of code. 😃
R: DataExplorer
First, we need to execute the simple code below:
# install the DataExplorer library
install.packages("DataExplorer")
# load the DataExplorer library
library(DataExplorer)
# use create_report
create_report(df)
From the first through the sixth snapshots (Images 6 to 11), the information we get is not much different from the previous package.
From the seventh snapshot (Image 12), we got a QQ plot for each numerical variable in the iris dataset.
From the eighth snapshot (Image 13), we got the correlation matrix for each variable in the iris dataset. We can see some information, such as:
- Petal.Width and Petal.Length have a strong positive correlation of 0.96, which means that in the iris dataset, the wider the petal, the longer it tends to be.
- Species_setosa and Petal.Length have a strong negative correlation of -0.92, which means that in the iris dataset, the shorter the petal length, the higher the possibility that the species is setosa.
- Using the examples above as a guide, try reading the rest of the correlation matrix and see what findings you can extract yourself.
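If you want to reproduce this matrix by hand, here is a minimal Python sketch using the pandas df from earlier; the species column is one-hot encoded so that dummy columns (the analogue of Species_setosa) show up in the matrix:
# reproduce the correlation matrix by hand; one-hot encode species so that
# dummy columns (the analogue of Species_setosa) are included
encoded = pd.get_dummies(df, columns=["species"], dtype=float)
print(encoded.corr().round(2))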
The ninth snapshot (Image 14) uses principal component analysis (PCA) to show the percentage of variance explained by the principal components, with a note that the labels indicate the cumulative percentage of explained variance; it shows 62%, and the higher, the better. For a proper explanation of PCA, I think I need another post. 😆
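For the curious, here is a rough Python sketch of the cumulative explained variance that this plot is based on (not DataExplorer's exact implementation):
# rough sketch of PCA's cumulative explained variance (not DataExplorer's exact code)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df.select_dtypes("number"))
print(PCA().fit(X).explained_variance_ratio_.cumsum())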
The tenth snapshot (Image 15) shows the relative importance of each variable: Petal.Length has the highest importance, at almost 0.5, followed by Petal.Width, and so on.
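DataExplorer computes this importance in its own way; as a rough analogue in Python, you could look at the feature importances of a random forest (a different method, so the numbers will not match the report exactly):
# rough analogue of relative feature importance via a random forest
# (not the method DataExplorer uses, so numbers will differ)
from sklearn.ensemble import RandomForestClassifier

X, y = df.select_dtypes("number"), df["species"]
model = RandomForestClassifier(random_state=0).fit(X, y)
print(dict(zip(X.columns, model.feature_importances_.round(2))))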
R: SmartEDA
First, we need to execute the simple code below:
# install the SmartEDA library
install.packages("SmartEDA")
# load the SmartEDA library
library(SmartEDA)
# use ExpReport
ExpReport(df, op_file = 'SmartEDA_df.html')
From Images 16, 17, 18, 23, and 24, the information we get is not much different from the previous packages.
Image 19 shows us the density plot of each variable, including skewness and kurtosis measurements, which indicate whether the data is normally distributed or not. Skewness and kurtosis probably deserve another post as well, I guess 😅 but a quick way to compute them is sketched below.
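As a quick reference, pandas can compute both measures directly; values near zero suggest a roughly normal shape. A minimal sketch with the pandas df from earlier:
# skewness and kurtosis by hand; pandas' kurt() reports excess kurtosis (normal = 0)
numeric = df.select_dtypes("number")
print(numeric.skew())
print(numeric.kurt())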
Images 20, 21, and 22 show us scatter plots between the numerical variables available in the iris dataset, which display the correlations visually. They give us information similar to the correlation matrix, just in graphical rather than numeric form.
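To draw the same kind of pairwise scatter plots yourself, a minimal Python sketch with seaborn (assuming it is installed) and the pandas df from earlier:
# pairwise scatter plots, the visual counterpart of the correlation matrix
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, hue="species")
plt.show()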
R: Conclusion
Using the three packages above, we got a lot of information about the iris dataset, much faster than if we had tried to create it manually. But it's not enough; that's why the title says “…faster and easier…”. These reports only give us a glimpse of the iris dataset, but at least they tell us what we can start working on, rather than leaving us to search for a starting point. For example:
- No missing or miscoded values were found, so we can skip those cleaning steps.
- Outliers were detected in some variables, so we can start cleaning the data with an appropriate method (one simple option is sketched after this list), rather than checking each variable for outliers one by one manually.
- We can start handling variables that are not normally distributed, if needed.
- Based on the correlation matrix and scatter plots, we got a glimpse of which variables are strongly or weakly correlated.
- Principal component analysis (PCA) provides the percentage of variance explained by the principal components, with labels indicating the cumulative percentage of explained variance.
- The relative importance of each feature of the iris dataset is also shown in this automated EDA.
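As promised in the list above, here is a minimal Python sketch of one simple option for handling outliers, capping (winsorizing) them at the IQR fences; whether capping, removal, or something else is appropriate depends on your use case:
# one simple option: cap (winsorize) values at the 1.5 * IQR fences
col = df["sepal width (cm)"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
df["sepal width (cm)"] = col.clip(lower, upper)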
Python: ydata-profiling
First, we need to execute the simple code below:
# install the ydata-profiling package
pip install ydata-profiling
# load the ydata_profiling package
from ydata_profiling import ProfileReport
# use ProfileReport
pr_df = ProfileReport(df)
# show pr_df
pr_df
Mostly, it shows similar information, so I will just mention a few things that are quite different from the previous packages:
- In Image 26, we got a summary, in sentences, of which variable pairs have a high correlation.
- Overall, the output is more interactive compared to previous packages, because we can click to move to other tabs, and select specific columns to be displayed.
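One extra note: if you are working outside a notebook, the report can also be written to a standalone HTML file with to_file, which is part of ydata-profiling's documented API:
# save the profiling report as a standalone HTML file
pr_df.to_file("ydata_profiling_df.html")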
Python: dtale
First, we need to execute the simple code below:
# install the dtale package
pip install dtale
# load the dtale
import dtale
# use show
dtale.show(df)
The output of this package is very different from the previous packages in terms of how we use it. The content is quite similar, but the interactive grid lets us explore the data more freely.
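One practical note: dtale.show returns a handle to the running instance, so in a plain script you can open the UI in your browser explicitly; a minimal sketch based on the dtale documentation:
# dtale.show returns an instance handle; open the grid in the default browser
d = dtale.show(df)
d.open_browser()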
Python: sweetviz
First, we need to execute the simple code below:
# install the sweetviz package
pip install sweetviz
# load the sweetviz
import sweetviz
# use analyze
analyze_df = sweetviz.analyze([df, "df"], target_feat = 'species')
# then show
analyze_df.show_html('analyze.html')
With this package, the UI and UX are very different; please enjoy the show!
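Beyond analyze, sweetviz also offers a compare function that puts two datasets side by side, which is handy for things like train/test splits; a minimal sketch (here without a target feature):
# compare two subsets of the data side by side (e.g., a train/test split)
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)
compare_df = sweetviz.compare([train, "Train"], [test, "Test"])
compare_df.show_html("compare.html")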
Human beings are visual creatures: it is often claimed that the human brain processes images 60,000 times faster than text, and that 90 percent of the information transmitted to the brain is visual. Visual information makes it easier to collaborate and to generate new ideas that impact organizational performance. That is a big part of why data analysts spend so much of their time on data visualization.
Python: autoviz
First, we need to execute the simple code below:
# install the autoviz package
pip install autoviz
# load the autoviz
from autoviz import AutoViz_Class
# set AutoViz_Class()
av = AutoViz_Class()
# run AutoViz on df
avt = av.AutoViz(
    "",
    sep = ",",
    depVar = "",
    dfte = df,
    header = 0,
    verbose = 1,
    lowess = False,
    chart_format = "server",
    max_rows_analyzed = 10000,
    max_cols_analyzed = 10,
    save_plot_dir = None
)
Using the code above, several tabs are generated in the browser. The new things we can see with this package:
- The output is generated in multiple tabs in the browser, whereas the previous packages display all of their output in one tab.
- A violin plot of each variable. It's a hybrid of the box plot and the kernel density plot, and it still shows information similar to the previous packages; a quick sketch of how to draw one yourself follows this list.
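As mentioned in the list above, here is a minimal seaborn sketch of a violin plot for one variable, using the pandas df from earlier:
# a violin plot combines a box plot with a kernel density estimate
import seaborn as sns
import matplotlib.pyplot as plt

sns.violinplot(data=df, x="species", y="petal length (cm)")
plt.show()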
Python: Conclusion
Using the four packages above, we got a lot of information about the iris dataset. It is not too different from what the R packages give us, but having more perspectives is usually better than having fewer. Some notes:
- The output of Python packages is mostly more interactive compared to R packages.
- When installing the packages, some errors may occur. For dtale, the common error is about jinja and escape. You can find the solution by referring to this post.
- In some packages, the code is not as simple as in the R packages, but I don't think that's a major problem; as long as we are not too lazy to read the documentation, everything will be okay.
Conclusion
Which one do I have to use? Which one is the best? Which one is the most compatible with my dataset?
It depends. Anything that cuts the time we need for EDA is already a good thing. Let's explore each of the packages explained above and use them wisely, as a starting point rather than the whole solution. In my humble opinion, exploring the data should be the “fun” part of data analysis, so don't be afraid to get “dirty” by doing EDA manually; sometimes the non-automated method is still the best. 👍
Thank you for reading!
Whoa, I’ve just realized this post contains 43 images. If you’ve reached this point, I’m grateful that you took the time to read and learn about automating the EDA process through my post. I hope you enjoyed it and can apply it in your journey as a data analytics/science professional.
I am still learning to write, and mistakes are unavoidable even when I try my best. If you find any problems or mistakes, please let me know!