How to perform a comprehensive exploratory data analysis with 3 lines of code in Python

Exploratory data analysis can be a breeze with Pandas Profiling.

Kuan Rong Chan, Ph.D.
Omics Diary


Exploratory data analysis is critical for data analysis. How do we systematically do it?

The first and most important step in data analysis is exploratory data analysis (EDA). EDA involves checking for null and duplicate values, preprocessing the data, assessing the distribution of each variable, and identifying simple trends between variables. Depending on the complexity of the dataset, data scientists can spend 50% to 90% of their time on EDA to ensure that the data is properly processed for further in-depth analysis.
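The checks listed above can be sketched by hand with plain pandas before reaching for an automated report. The dataset below is a hypothetical toy example (the column names and values are assumptions, not from any real study), and this is only one reasonable way to run these checks.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [34, 51, None, 29, 51],
    "viral_load": [2.1, 3.8, 3.0, None, 3.8],
    "group": ["A", "B", "A", "B", "B"],
})

# Count missing values in each column.
null_counts = df.isnull().sum()

# Count rows that are exact duplicates of an earlier row.
n_duplicates = df.duplicated().sum()

# Summarise the distribution of each numeric variable.
summary = df.describe()

# Pairwise Pearson correlation between the numeric variables.
correlations = df.select_dtypes(include="number").corr()

print(null_counts)
print("Duplicate rows:", n_duplicates)
print(summary)
print(correlations)
```

Each call covers one item in the list: nulls, duplicates, distributions, and simple relationships between variables.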

The preprocessing methods you use will depend on the data. For instance, if only a few values in your data are null, it may be acceptable to drop them. However, if a large proportion of the values are null, you may want to fill in the missing values using the mean, the median, or machine learning methods. EDA is also useful when deciding which statistics to use: the statistical tests appropriate for variables with a Gaussian distribution differ from those appropriate for variables with a skewed distribution. Finally, the strength of correlation between different variables can provide insights for machine learning, where features with poor correlation or…
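As a rough illustration of the drop-or-fill decision and the skewness check described in this paragraph, here is a minimal sketch. The helper `handle_missing` and its 5% threshold are assumptions made for this example, not a standard rule; the right cut-off depends on the dataset.

```python
import pandas as pd


def handle_missing(series, max_drop_fraction=0.05):
    """Drop nulls when they are rare; otherwise fill them with the median.

    The 5% default is an arbitrary, illustrative threshold.
    """
    missing_fraction = series.isnull().mean()
    if missing_fraction <= max_drop_fraction:
        return series.dropna()
    return series.fillna(series.median())


# Hypothetical values: two of six are missing, so they are filled, not dropped.
values = pd.Series([1.0, 2.0, None, 4.0, None, 50.0])
cleaned = handle_missing(values)

# Sample skewness: values far from 0 suggest a non-Gaussian distribution,
# which may call for non-parametric tests or a transformation.
skewness = cleaned.skew()

print(cleaned.tolist())
print("Skewness:", skewness)
```

The median is used here instead of the mean because it is less sensitive to extreme values such as the 50.0 in the example.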


Kuan Rong Chan, PhD, Senior Principal Research Scientist in Duke-NUS Medical School. Virologist | Data Scientist | Loves mahjong | Website: kuanrongchan.com