Review Python Libraries to Accelerate Exploratory Data Analysis (EDA) [part 1/2]

Srun Sompoppokasest
9 min read · Jun 14, 2022

--

I have worked in data analysis for quite a while, and exploratory data analysis (EDA) is genuinely helpful and important. Even if you don’t intend to build a classification model, a regression model, or any other data science project, EDA is still very useful for finding insights. I think all of you are familiar with pandas and its DataFrame. It provides many powerful functions that are easy to use: df.describe() and df.info() tell us the basic statistics and the types of our datasets, and it integrates easily with matplotlib and seaborn to create basic charts like line plots, count plots, bar plots, etc. Although these are easy to use, EDA still takes me a lot of time every round. To reduce that time, I read many articles and tested the tools out myself. In this part 1, I review the Python libraries that are popular and really, really easy to use.

I first review these 4 libraries:

  1. pandas_profiling
  2. SweetViz
  3. AutoViz
  4. Dtale

I also provide my code on GitHub: https://github.com/RunnyKub/Automate_eda/blob/main/Automate%20EDA%20tools%20part1.ipynb

Dataset

I used the California housing dataset, fetched with the sklearn library. As this dataset provides only numerical features, I added other feature types to it. First, I wrote a function that randomly generates datetimes, and derived a date-typed feature from it using date(). I also wanted a categorical feature, so I simply assumed that data points with a Latitude of 36 or greater are ‘North’ and the rest are ‘South’. Finally, looking at the mean of MedHouseVal, I created a column named ‘is_expensive’: True if MedHouseVal is at least 2.5, False otherwise.

import random
from datetime import datetime, timedelta
import sklearn.datasets

data = sklearn.datasets.fetch_california_housing(return_X_y=False, as_frame=True)
df = data["frame"]

def random_date(start, end, random_amount):
    # draw `random_amount` random dates between `start` and `end`
    delta = (end - start).days
    return [start + timedelta(days=random.randrange(delta)) for i in range(random_amount)]

d1 = datetime.strptime('2022-01-01', '%Y-%m-%d')
d2 = datetime.strptime('2022-03-31', '%Y-%m-%d')
df['datetime'] = random_date(d1, d2, df.shape[0])
df['date'] = df['datetime'].apply(lambda x: x.date())
df = df.reindex(columns=['datetime', 'date'] + list(df.columns[:-2]))
df['direction'] = df['Latitude'].apply(lambda x: 'North' if x >= 36 else 'South')
df['is_expensive'] = df['MedHouseVal'].apply(lambda x: x >= 2.5)
df.head()

Pandas_profiling

I read through https://pypi.org/project/pandas-profiling/ and tried it with the dataset I’d just prepared.

Image 1: show pandas_profiling report

Image 1 shows the pandas_profiling report in Jupyter (I work in JupyterLab). You can see in the code cell that I use only 4 lines of code to generate the report; in fact, only 3 lines are required if you don’t want to save it. I will not go through every detail, only the parts I find useful. It’s really easy: just pass your DataFrame into the function.

Image 2: pandas_profiling overview tab

The Overview tab contains a lot of useful information. It tells you the basics of the dataset: the number of features, the number of data points, the missing cells, the duplicate rows, and the types of the features. You may notice it shows ‘Unsupported’: that is the ‘date’ column I generated, so pandas_profiling may not support some unusual types. I still think this tool is very useful.

Image 3: pandas_profiling variable tab

In the Variables tab, you can see the basic statistics of every feature in your DataFrame, including a histogram. The sub-tabs of each feature show the common values, like the .value_counts() method rendered as a horizontal bar chart. The Extreme values sub-tab shows the counts for the maximum and minimum values of the feature.
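These sub-tabs mirror plain pandas operations; a rough equivalent, on illustrative data rather than the report itself:

```python
import pandas as pd

df = pd.DataFrame({"MedHouseVal": [0.8, 1.2, 2.5, 2.5, 3.9, 5.0],
                   "direction": ["South", "South", "North", "South", "North", "North"]})

common = df["direction"].value_counts()    # what the "Common values" sub-tab shows
largest = df["MedHouseVal"].nlargest(3)    # the "Maximum" extreme values
smallest = df["MedHouseVal"].nsmallest(3)  # the "Minimum" extreme values
```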

Image 4: pandas_profiling Interactions tab

In the Interactions tab, you see a scatter plot, and you can easily choose which combination of features to plot. If there are too many data points, the scatter plot automatically changes into a hexagonal binned plot so you can see more clearly.
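The scatter-versus-hexbin switch is easy to reproduce in plain matplotlib; a minimal sketch with synthetic data showing why hexagonal binning helps when points overplot:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20000)
y = 0.5 * x + rng.normal(size=20000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y, s=2, alpha=0.2)  # overplots badly at 20k points
ax1.set_title("scatter")
hb = ax2.hexbin(x, y, gridsize=40)  # density stays visible
ax2.set_title("hexagonal binning")
fig.colorbar(hb, ax=ax2)
fig.savefig("interactions.png")
```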

Image 5: pandas_profiling correlations tab

The Correlations tab shows a heatmap, and at the top right there is a ‘Toggle correlation descriptions’ button that shows you a description of each measure.
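The coefficients behind that heatmap can all be computed with pandas itself; a sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"MedInc": [1.0, 2.0, 3.0, 4.0],
                   "MedHouseVal": [0.9, 2.1, 2.9, 4.2]})

pearson = df.corr(method="pearson")    # linear correlation
spearman = df.corr(method="spearman")  # rank correlation
kendall = df.corr(method="kendall")    # rank concordance
```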

Image 5: pandas_profiling Duplicate rows tab

My dataset, with the features I added, has no duplicate rows, but when I tested on other datasets, like the Iris dataset, the report showed the Duplicate rows tab, which I find really useful. I have added the Python code and the report saved as HTML to my GitHub.
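The underlying check is the same as pandas’ duplicate detection; a small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 2, 3],
                   "b": ["x", "y", "y", "z"]})

n_dupes = df.duplicated().sum()            # rows identical to an earlier row
dupe_rows = df[df.duplicated(keep=False)]  # every copy of each duplicated row
```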

SweetViz

SweetViz provides document in https://pypi.org/project/sweetviz/.

Image 6: SweetViz

In just a few lines of code, you can generate the profiling report. In fact, SweetViz has many things in common with pandas_profiling. I suggest showing the report in HTML format: with the .show_html method, the report is saved to an .html file, and it opens automatically in a new tab if you set the open_browser option to True.

Image 7: SweetViz shows on correlations

At the top of the HTML report, the ‘Associations’ button shows us the correlation of each feature. As far as I can see, it shows only Pearson’s correlation coefficient, while pandas_profiling can switch to other correlation coefficients such as Spearman or Kendall.

Image 8: SweetViz shows on numerical feature

If I click on a numerical feature, I see a histogram. You can change the bin size by clicking the buttons at the top right of the histogram. It also shows the most frequent, smallest, and largest values of the feature.

Image 9: SweetViz shows on categorical feature

The report shows the same kind of information for categorical and boolean features: the percentage of each category along with its count.

SweetViz can also compare two DataFrames. It can compare a training and a testing dataset, or two subsets of the same DataFrame.

Image 10: SweetViz example command to compare two subsets of the same dataframe

I tried comparing on my custom categorical feature ‘direction’, setting the main subset to ‘North’ and the other subset to ‘South’.

Image 11: SweetViz compares subset

The information is the same as when profiling the entire dataset, but it shows the two subsets side by side in different colors.
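Outside SweetViz, a quick way to get a similar side-by-side summary is a pandas group-by; a rough sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"direction": ["North", "North", "South", "South", "South"],
                   "MedHouseVal": [3.9, 2.8, 1.2, 1.5, 2.0]})

# One summary row per subset, similar in spirit to SweetViz's paired bars
comparison = df.groupby("direction")["MedHouseVal"].agg(["count", "mean", "median"])
```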

AutoViz

AutoViz posts its new features at https://github.com/AutoViML/AutoViz. It shows a lot of charts covering many aspects of the data.

Image 12: AutoViz

If you don’t need to adjust any data before feeding it into the AutoViz function, you can provide the filename directly. You can also import data as a DataFrame. The names of some arguments are quite unusual to me, like ‘dfte’, which is used for passing in a DataFrame.

Image 13: AutoViz example charts

AutoViz shows so many charts that I cannot capture all of them for this article. What I really like is the pair-wise scatter plot of every numerical feature. It also shows distributions, box plots, violin plots, and correlations.

Image 14: AutoViz with bokeh (1)

AutoViz can also generate charts with ‘bokeh’ by setting the chart_format argument. This is really interesting because the result is an interactive chart: you can change the features on the X-axis and Y-axis yourself, and if you hover over a data point, it shows that point’s details. The bokeh interactive charts are really enjoyable.

Image 15: AutoViz with bokeh (2)

Dtale

When I tried to find out what Dtale can do (you can read the documentation at https://pypi.org/project/dtale/), I was really amazed that it handles so many tasks that reduce our coding. I will not go through every detail because it really has many useful functions, and some of them, such as merging DataFrames, I think I can do more easily in code (just my opinion).

Image 16: Dtale
Image 17: Dtale in new browser

You can open Dtale with just 3 lines of code. You can choose to open it inside Jupyter, but I suggest opening it in a new browser tab. It shows the DataFrame so that you can scroll left and right to see all columns, unlike pandas, where you have to set pandas’ display options beforehand. At the top left you can see the shape of your DataFrame, and when you adjust or filter something that changes the shape, the numbers update accordingly.
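For reference, the pandas display options alluded to above look like this; without them, a wide frame prints truncated:

```python
import pandas as pd

# Dtale lets you scroll through every column; plain pandas truncates wide
# frames unless you raise its display options first.
pd.set_option("display.max_columns", None)  # show every column
pd.set_option("display.width", None)        # don't clip at a fixed terminal width

wide = pd.DataFrame([[0] * 30], columns=[f"col{i}" for i in range(30)])
print(wide)  # all 30 columns are now printed
```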

Image 18: Dtale column header

When you click a column header, you can do many things. I think the type conversion and formatting options are really useful: you can easily convert types without coding, and under formats you can adjust the decimal digits and even choose how to display ‘nan’ values. At the bottom of the column-header menu, you can filter the values of that column.

Image 19: Dtale Custom Filter
Image 20: Dtale multiple filters
Image 21: Dtale Code Export

I really favour this Custom Filter function. It is like .query(), but with Dtale you apply it seamlessly, without having to reload the dataset. What is really powerful is that you can use the column-header filter and the Custom Filter at the same time, so it’s really flexible. You can export the code, filters included, by clicking the menu and selecting Code Export.
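For comparison, the .query() equivalent of such a filter looks like this (illustrative data; the filter expression is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"direction": ["North", "South", "North"],
                   "MedHouseVal": [3.9, 1.2, 2.8]})

# Roughly what a Dtale Custom Filter expression does, e.g.
#   direction == 'North' and MedHouseVal > 3
filtered = df.query("direction == 'North' and MedHouseVal > 3")
```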

Image 22: Dtale Describe

The Describe function is where you profile your data. Each data type shows different details, but they are all basic statistics for that feature. It shows a box plot, and you can switch tabs to see a histogram or even a Q-Q plot. Any active filter also affects the data shown here, and you can export the code behind this GUI.

Image 23: Dtale Summarize Data
Image 24: Dtale Time Series Analysis
Image 25: Dtale Chart

Dtale can still do other things. In the Summarize Data function, you can group by or pivot just by clicking. You can also analyze time series data and easily create charts: line charts, bar charts, and scatter plots can be built just by selecting the features and the aggregation function. It can even animate the change of data across time periods, but I really cannot find how to animate a normal scatter plot. If someone knows how, please tell me.
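The group-by and pivot that Summarize Data builds by clicking correspond to ordinary pandas calls; a small sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"direction": ["North", "North", "South", "South"],
                   "is_expensive": [True, False, False, False],
                   "MedHouseVal": [3.9, 2.0, 1.2, 1.5]})

# Group-by with one key and an aggregation function
summary = df.groupby("direction")["MedHouseVal"].mean()

# Pivot with a second grouping key spread across the columns
pivot = pd.pivot_table(df, values="MedHouseVal", index="direction",
                       columns="is_expensive", aggfunc="mean")
```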

Finally

This is the first article I have written. Please kindly comment if you see something wrong, and please give me suggestions. I think every library has its own strengths and shortcomings, but since they require very little coding, you can benefit from using a combination of 2 or 3 libraries and selecting only what fits you best. I also collect my code on GitHub: https://github.com/RunnyKub/Automate_eda/blob/main/Automate%20EDA%20tools%20part1.ipynb
