Quick Exploratory Data Analysis in Just 5 Minutes Using These Python Packages for Data Science Problems: Quick Start Guide
Here is one more reason I love the Python community: gone are the days when we needed to write many lines of code to perform exploratory data analysis on a dataset.
Self-service data exploration and analytics tools like Tableau are handy, but not everyone can afford the license cost. Some understanding of the tool is also necessary to unlock all of its data exploration capabilities. And you still have to invest some time building exploratory charts before you can perform any kind of analysis, though that might not sound challenging to BI experts.
But what if we could do all the data exploration and profiling automatically in just a few lines of code? Just type 2–3 lines in your Python notebook and go grab some coffee or tea. By the time you are back, you just have to download the generated reports with all the information you need. Doesn’t that sound cool?
Let’s look at 2 Python libraries that can help with exploratory data analysis in just a few lines of code.
First up is Sweetviz. Like its name, this Python package delivers what it promises. Sweetviz is an open source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code.
The system is built around quickly visualizing target values and comparing datasets. Its goal is to help with quick analysis of target characteristics, training vs. testing data, and other such data characterization tasks. Currently, the library generates 3 kinds of reports:
- Analyse a single data frame.
- Compare 2 data frames, e.g. training and test data frames.
- Compare 2 subsets of the same data frame based on specified categories.
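The three report types listed above can be sketched roughly as follows. This is a minimal sketch, not the notebook's code: the helper name `build_sweetviz_reports`, its parameters, and the output file names are hypothetical and chosen only for illustration, while `analyze`, `compare`, `compare_intra` and `show_html` are the Sweetviz calls it relies on.

```python
def build_sweetviz_reports(train_df, test_df, subset_mask, subset_names, target=None):
    """Write the three Sweetviz report types to HTML and return the file paths.

    Hypothetical helper for illustration only.
    subset_mask  : boolean Series that splits train_df into two groups
    subset_names : two labels, e.g. ["Male", "Female"]
    target       : optional name of the target column
    """
    # Imported inside the function so the sketch can be loaded even
    # before `pip install sweetviz` has been run.
    import sweetviz as sv

    paths = []

    # 1. Analyse a single data frame.
    report = sv.analyze(train_df, target_feat=target)
    report.show_html("sweetviz_single.html", open_browser=False)
    paths.append("sweetviz_single.html")

    # 2. Compare two data frames, e.g. training vs. test data.
    report = sv.compare([train_df, "Training"], [test_df, "Test"], target_feat=target)
    report.show_html("sweetviz_compare.html", open_browser=False)
    paths.append("sweetviz_compare.html")

    # 3. Compare two subsets of the same data frame:
    #    rows where subset_mask is True vs. all remaining rows.
    report = sv.compare_intra(train_df, subset_mask, subset_names, target_feat=target)
    report.show_html("sweetviz_subsets.html", open_browser=False)
    paths.append("sweetviz_subsets.html")

    return paths
```

With a Titanic-style dataset (the column names here are assumptions), a call might look like `build_sweetviz_reports(train, test, train["Sex"] == "male", ["Male", "Female"], target="Survived")`. Passing `open_browser=False` only writes the HTML files, which suits the "grab a coffee and download the report later" workflow.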
To try it yourself, download the Python notebook from here.
Pandas Profiling is another Python package. It generates profile reports containing all of the statistics listed below, either as a widget or as an HTML report.
- Type inference: detect the types of columns in a dataframe.
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
- File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information
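A report covering the statistics above can be produced along these lines. This is a minimal sketch under a few assumptions: the helper name `build_profile_report` and its default file name and title are hypothetical, and the import assumes the `pandas_profiling` package name used in this post (newer releases of the library are published as `ydata-profiling`, so the import may need adjusting).

```python
def build_profile_report(df, output_path="profile_report.html", title="Profile Report"):
    """Profile a pandas DataFrame and save the result as an HTML report.

    Hypothetical helper for illustration only.
    """
    # Imported inside the function so the sketch can be loaded even
    # before the profiling package has been installed.
    from pandas_profiling import ProfileReport

    # Build the profile: types, quantiles, correlations, missing values, etc.
    profile = ProfileReport(df, title=title)

    # Save the HTML report to disk so it can be downloaded or shared.
    profile.to_file(output_path)

    return profile
```

Inside a notebook, the returned object can also be rendered as a widget with `profile.to_widgets()` instead of opening the saved HTML file.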
Here is a sample notebook.
Note: A few functionalities have changed since this notebook was released, so make sure you install the latest version of the library.
During installation, you might also have to make sure the latest matplotlib version is installed in your virtual environment.
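One quick way to see what is currently installed in the active virtual environment is to query the package metadata from Python. The sketch below uses only the standard library. The helper name `installed_versions` and the default package list are assumptions for illustration, and the distribution names may differ depending on which release of each library you install.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("matplotlib", "sweetviz", "pandas-profiling")):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print a short summary for the current environment.
for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

If any of these shows up as outdated or missing, upgrading it with pip inside the same virtual environment before running the notebook is usually enough.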
In case you wish to read more about exploratory data analysis, you can head over to this detailed one-page article at Analytics Vidhya.
If you enjoyed the post, do check out my other posts too. Hope you find it helpful.