Pandas Profiling: One Line of Code for Your EDA

Omega Markos
Published in Analytics Vidhya
4 min read, Mar 19, 2020

Exploratory data analysis (EDA) is one of the most important parts of the data science pipeline. It helps us understand our data, discover patterns and anomalies, and test underlying assumptions. It also helps us define the insights we want to get from our data. There are many ways to perform EDA, and they all involve repeating the same steps for every variable. This becomes complicated and time-consuming, especially when we are dealing with a high-dimensional dataset.

The simplest and most commonly used method is the pandas describe() function. It gives us a few important statistics, but we still need to perform other tedious tasks to complete our EDA. A more powerful alternative is pandas profiling.
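For reference, here is a minimal sketch of what describe() returns. The DataFrame is a toy example with made-up values, not the actual data used later in this post.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
df = pd.DataFrame({
    "temp": [8.2, 18.0, 14.6, 8.3, 11.4],
    "wind": [6.7, 0.9, 1.3, 4.0, 1.8],
})

# describe() reports count, mean, std, min, quartiles and max
# for each numeric column.
summary = df.describe()
print(summary)
```

Everything beyond these eight rows (missing values, correlations, histograms and so on) has to be computed separately.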

Pandas profiling is an open-source Python module that generates EDA profile reports covering descriptive statistics, quantile statistics, most frequent values, histograms, correlations, and missing values. All this is done with one line of code.

To get started, we first need to install pandas-profiling:

pip install pandas-profiling

Or

conda install -c conda-forge pandas-profiling

The next step is to import pandas profiling and write the one line of code. For this post, I will be using the Forest Fires data set from the UCI Machine Learning Repository.

Code with default parameters
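A minimal sketch of the default call might look like the following. It assumes the data set has been saved locally as forestfires.csv (a hypothetical path) and that pandas-profiling is installed. The helper names `load_data` and `make_report` are introduced only for this example.

```python
import pandas as pd


def load_data(path="forestfires.csv"):
    """Read the CSV into a DataFrame (the default path is an assumption)."""
    return pd.read_csv(path)


def make_report(df, **options):
    """Build a profile report for df, passing any options through unchanged."""
    # Imported lazily so the rest of the script still runs when the
    # package is not installed yet.
    import pandas_profiling

    # This is the "one line": ProfileReport(df) with default parameters.
    return pandas_profiling.ProfileReport(df, **options)
```

In a Jupyter notebook, leaving `make_report(load_data())` as the last expression in a cell should render the report inline.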

This code displays a one-page report. To keep things simple, I will walk through its four major sections: Overview, Variables, Correlations, and Sample.

1. Overview

Dataset info: This section shows general information about our data, such as the number of variables (columns), the number of observations, and the missing values. It is very similar to the pandas .info() method.

Variable types: In addition to the common numeric and categorical types, other variable types such as boolean and date are also recognized.

Overview with default parameters
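For comparison, the same headline numbers can be computed by hand with plain pandas. This is only a sketch on made-up data, and `overview` is a name chosen for this example, not part of pandas-profiling.

```python
import pandas as pd


def overview(df):
    """Return the headline numbers shown in the Overview section."""
    n_rows, n_cols = df.shape
    cells = n_rows * n_cols
    missing = int(df.isna().sum().sum())
    return {
        "variables": n_cols,
        "observations": n_rows,
        "missing_cells": missing,
        "missing_pct": 100.0 * missing / cells if cells else 0.0,
        "duplicate_rows": int(df.duplicated().sum()),
        # How many columns pandas inferred for each dtype.
        "dtypes": df.dtypes.astype(str).value_counts().to_dict(),
    }


# Toy frame with one missing value and one duplicated row.
toy = pd.DataFrame({
    "month": ["mar", "oct", "oct", "mar"],
    "temp": [8.2, 18.0, 18.0, None],
})
result = overview(toy)
print(result)
```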

Warnings: This shows valuable information such as zero values, duplicates, and rejected variables. To show more in the Warnings section, I will rerun the report with the correlation threshold lowered to 0.4 from its default of 0.9.

With a correlation threshold of 0.4

Now our Warnings section shows the correlation coefficients and the variables rejected due to high correlation.
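A hedged sketch of that rerun follows. It assumes a 1.x release of pandas-profiling, where `ProfileReport` accepted a `correlation_threshold` keyword. Newer releases may configure this differently, so check the documentation for your installed version. The pandas-only helper `correlated_pairs` is introduced here just to illustrate which column pairs would cross the threshold. It is not how the library itself is implemented.

```python
import pandas as pd


def report_with_threshold(df, threshold=0.4):
    """Rerun the report with a lower correlation threshold (1.x-style keyword)."""
    import pandas_profiling  # lazy import; requires pandas-profiling

    return pandas_profiling.ProfileReport(df, correlation_threshold=threshold)


def correlated_pairs(df, threshold=0.4):
    """List (col_a, col_b, r) for numeric pairs with |Pearson r| above threshold."""
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs


# Toy data: b is an exact multiple of a, while c is uncorrelated with both.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [3, 1, 2, 1, 3],
})
print(correlated_pairs(toy, threshold=0.4))
```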

2. Variables

The following statistics are generated for each column:

Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range

Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness.

Histogram: The frequency distribution for continuous variables and the counts for the categorical variables.

Common Values: The most common values and their frequencies.

Extreme Values: The five minimum and maximum values and their frequencies.
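Most of these numbers are also available in plain pandas, which helps when you want to double-check what the report shows. The sketch below uses made-up values. `column_stats` is a name introduced for this example, and the report's own definitions (for example of MAD or kurtosis) may differ in detail.

```python
import pandas as pd


def column_stats(s):
    """Compute several of the per-column statistics listed above."""
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    return {
        "min": s.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": s.max(),
        "range": s.max() - s.min(),
        "iqr": q3 - q1,
        "mean": s.mean(),
        "std": s.std(),
        "sum": s.sum(),
        # Median absolute deviation: median distance from the median.
        "mad": (s - median).abs().median(),
        # Coefficient of variation: standard deviation relative to the mean.
        "cv": s.std() / s.mean(),
        "skewness": s.skew(),
        "kurtosis": s.kurt(),
    }


temp = pd.Series([2.0, 4.0, 4.0, 6.0, 9.0], name="temp")  # made-up values
stats = column_stats(temp)
print(stats)
print(temp.value_counts().head())  # common values and their counts
print(temp.nsmallest(5).tolist())  # five smallest values
print(temp.nlargest(5).tolist())   # five largest values
```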

The column “DC” ignored in the report

Highly correlated variables are rejected and excluded from the report. In our example, ‘DC’ is rejected because of its high correlation with ‘DMC’. You can still include such variables by passing correlation_overrides a list of the rejected columns you want to show.

Override the rejected variables
Including the rejected columns
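Here is a sketch of how the override could be passed. It assumes the same 1.x-style keywords (`correlation_threshold` and `correlation_overrides`). `report_keeping` is a hypothetical helper name used only in this example.

```python
def report_keeping(df, keep=("DC",), threshold=0.4):
    """Build a report that does not reject the columns listed in keep."""
    import pandas_profiling  # lazy import; requires pandas-profiling

    return pandas_profiling.ProfileReport(
        df,
        correlation_threshold=threshold,
        # Columns named here stay in the report even if highly correlated.
        correlation_overrides=list(keep),
    )
```

With the Forest Fires data loaded into `df`, calling `report_keeping(df, keep=["DC"])` would be expected to bring the ‘DC’ column back into the Variables section.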

3. Correlations

This section shows heatmaps of the correlation between variables, computed as Pearson, Spearman, and Kendall matrices.

Correlation
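The matrices behind these heatmaps can be reproduced with the pandas corr() method. The sketch below uses made-up values to show how the Pearson and Spearman coefficients can differ for the same pair of columns.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [1.0, 4.0, 9.0, 16.0, 25.0],  # monotonic but non-linear in x
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],
})

pearson = toy.corr(method="pearson")    # linear association
spearman = toy.corr(method="spearman")  # rank-based, monotonic association
# toy.corr(method="kendall") gives Kendall's tau; pandas relies on SciPy for it.

print(pearson.round(2))
print(spearman.round(2))
```

Because y always increases with x, the Spearman coefficient for that pair is 1, while the Pearson coefficient is slightly lower since the relationship is not a straight line.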

4. Sample

This is just like the pandas head() function: it shows the first five rows of the data.

The first five samples of the data

Finally, you can export the report in HTML format by including this in your code.

HTML
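A minimal sketch of the export step, assuming `profile` is a report object created earlier. `save_report` and the default file name are hypothetical and used only for this example.

```python
def save_report(profile, path="forest_fires_report.html"):
    """Write an existing profile report to an HTML file and return the path."""
    # The path is passed positionally because the keyword name for this
    # argument has differed between pandas-profiling releases.
    profile.to_file(path)
    return path
```

The resulting HTML file can be opened in any browser or shared with people who do not have Python installed.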

I hope this helps to speed up your EDA.

Thanks for reading!

