Speed Dating With YData-Profiling

Eugenius Mauritz Rafael
Published in Python’s Gurus
5 min read · Jun 5, 2024

As an aspiring data enthusiast, I learn by immersing myself in datasets and analysis exercises. A significant share of any analysis goes to data cleaning and exploration. Good data quality creates a positive first impression of how the analysis will go, while poor data quality often creates a negative one. Dealing with poor data consumes a substantial amount of time, and the workload never seems to subside. A feature akin to a dating app profile, offering brief information about the data, would be immensely helpful. Luckily, such a feature exists in Python through the library known as “YData-Profiling” (formerly “Pandas Profiling”), which provides a quick overview of a dataset with just a few lines of code.

Here, I profile the “Penguins” dataset, loaded through Seaborn.

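import numpy as np
import pandas as pd
import seaborn as sns  # needed for sns.load_dataset
from ydata_profiling import ProfileReport

# Load the Penguins example dataset through Seaborn
df = sns.load_dataset('penguins')

# Build the report; leaving `profile` as the last line of a notebook cell renders it
profile = ProfileReport(df, title='Profile Report')
profile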

Data Overview

Data Overview by Profiling

In the initial stage of profiling, it’s crucial to gain an overview of the dataset. Similar to what pandas provides, the profile report shows the number of variables, total rows, missing cells, duplicate rows, and variable types. However, the variable types are inferred automatically and may not always be accurate. Users should therefore review the variable types before proceeding with further analysis. Even though the code needed to retrieve this information is simple, the results still need to be verified to ensure accurate insights into the dataset.
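One way to cross-check the overview is to recompute the same figures with plain pandas. The sketch below is only an illustration: it uses a small hand-made stand-in for the Penguins data (the values are invented), and the choice of which columns to treat as categorical is my own assumption.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Penguins data (values invented for illustration).
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Dream"],
    "bill_length_mm": [39.1, None, 46.1, 46.5],
    "body_mass_g": [3750.0, 3800.0, 4500.0, 3500.0],
    "sex": ["Male", "Female", None, "Female"],
})

# Inspect the types pandas inferred; text columns arrive as generic strings.
print(df.dtypes)

# Declare the columns we know to be categorical (an assumption about this dataset).
categorical_cols = ["species", "island", "sex"]
df = df.astype({col: "category" for col in categorical_cols})

# Recompute the overview figures to compare against the report.
overview = {
    "variables": df.shape[1],
    "rows": df.shape[0],
    "missing_cells": int(df.isna().sum().sum()),
    "duplicate_rows": int(df.duplicated().sum()),
}
print(overview)
```

If the types or counts here disagree with the report, that is a signal to look more closely at the affected columns before going further.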

Exploratory Data Analysis

After the overview, the data is examined further through Exploratory Data Analysis (EDA) to glean insights from the dataset. Categorical variables are automatically displayed using bar charts and analyzed for distinct count and missing data count. A more detailed view is also available to inspect the words contained within the variable. This makes it easier to spot values that are phrased inconsistently or mistyped, so users can correct the data as needed.

Categorical Variable EDA
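The same categorical checks can be reproduced by hand with pandas. This is a minimal sketch on an invented column with deliberately inconsistent spellings; it is not how the report computes its figures, only a way to verify them.

```python
import pandas as pd

# Hypothetical column with inconsistent spellings (invented for illustration).
island = pd.Series(
    ["Biscoe", "biscoe ", "Dream", "Dream", None, "Torgersen", "Torgerson"],
    name="island",
)

# Distinct and missing counts for the raw values.
n_distinct = island.nunique()          # missing values are not counted
n_missing = int(island.isna().sum())
print(n_distinct, n_missing)

# Frequency table of raw values; rare spellings stand out at the bottom.
print(island.value_counts())

# Normalizing case and whitespace merges trivially different spellings.
cleaned = island.str.strip().str.title()
print(cleaned.value_counts())
```

Normalization only merges variants that differ by case or spacing. A genuine typo such as "Torgerson" still shows up as its own value and has to be corrected manually.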

Numerical variables are automatically displayed using histograms and analyzed for distinct count and missing data count, similar to categorical variables. Additionally, zeros, negative numbers, and infinite values can be examined. A more detailed view provides descriptive statistics such as the minimum value, maximum value, quartiles (Q1, median, Q3), standard deviation, and more. We can also check common values, which are those that appear most frequently, and extreme values, which are the 10 smallest and 10 largest values.

Numerical Variable EDA
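For a numeric column, these counts and statistics can be approximated with pandas and NumPy. The values below are invented for illustration, and the sketch shows only three extreme values per side to keep the output short.

```python
import numpy as np
import pandas as pd

# Hypothetical body-mass values in grams (invented for illustration).
body_mass = pd.Series(
    [3750.0, 3800.0, 0.0, 4500.0, np.nan, 5400.0, 3250.0, np.inf],
    name="body_mass_g",
)

# Counts of special values in the column.
n_missing = int(body_mass.isna().sum())
n_zeros = int((body_mass == 0).sum())
n_negative = int((body_mass < 0).sum())
n_infinite = int(np.isinf(body_mass).sum())
print(n_missing, n_zeros, n_negative, n_infinite)

# Descriptive statistics on the finite values only (drops NaN and infinity).
finite = body_mass[np.isfinite(body_mass)]
print(finite.describe())

# Extreme values: the smallest and largest observations.
print(finite.nsmallest(3))
print(finite.nlargest(3))
```

A zero body mass is almost certainly a data entry problem rather than a real measurement, which is exactly the kind of value the zeros count helps to surface.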

A variable may also carry a “High Correlation” label, indicating that the feature is strongly correlated with other features. YData-Profiling provides the correlation values for numerical variables as well, along with heatmap and scatter plot visualizations.

Heatmap Correlations
Scatterplot Interactions
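To see which pairs sit behind a high correlation label, the correlation matrix can be computed directly. This is a sketch on invented measurements; the 0.8 threshold is my own assumption for illustration and is not taken from the library's settings.

```python
import pandas as pd

# Hypothetical measurements (invented for illustration).
df = pd.DataFrame({
    "flipper_length_mm": [181, 186, 195, 210, 217, 230],
    "body_mass_g": [3750, 3800, 3250, 4500, 5000, 5700],
    "bill_depth_mm": [18.7, 17.4, 18.0, 13.2, 14.1, 15.0],
})

# Pearson correlation matrix of the numeric columns.
corr = df.corr(numeric_only=True)
print(corr.round(2))

# List each column pair once if its absolute correlation exceeds the threshold.
threshold = 0.8  # assumed cut-off for this example
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```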

After identifying the correlations between features in the dataset, we delve deeper into the details of missing values. An interesting aspect is the visualization of missing values using a nullity matrix or MissingNo matrix. This visualization helps quickly identify patterns of missing data, making it easy to spot rows or columns that are completely missing or where data is incomplete in some features. Consequently, users can consider the most suitable cleaning or imputation methods for the related cases.

Nullity Matrix
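The information behind a nullity matrix can also be read numerically. The sketch below uses an invented four-row frame and counts missing cells per column and per row, then picks out rows where every measurement is missing at once.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Penguins data (values invented for illustration).
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
    "bill_length_mm": [39.1, np.nan, 46.1, np.nan],
    "body_mass_g": [3750.0, np.nan, 4500.0, 5000.0],
    "sex": ["Male", None, "Female", None],
})

# Boolean nullity matrix: True where a cell is missing.
nullity = df.isna()

# Missing cells per column and per row.
missing_per_column = nullity.sum()
missing_per_row = nullity.sum(axis=1)
print(missing_per_column)
print(missing_per_row)

# Rows where all measurement columns are missing at the same time.
measurement_cols = ["bill_length_mm", "body_mass_g", "sex"]
fully_missing_rows = df[nullity[measurement_cols].all(axis=1)]
print(fully_missing_rows)
```

Rows with no measurements at all carry little information beyond species and island, so dropping them may be more defensible than imputing every field.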

Based on the nullity matrix, the species and island of origin are recorded for every penguin, while the other features (bill length, bill depth, flipper length, body mass, and sex) have missing values in some rows. We can assume that penguins of the same species on the same island share similar genetics and grow in a similar environment, which influences their growth. Therefore, missing numerical values can be imputed with the mean (or the median, which is more robust to outliers) of the same species and island. For the ‘sex’ category, the mode of the same species and island can be used.
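The group-wise imputation described above could be sketched as follows. The data is invented for illustration, `group_mode` is a helper name of my own, and the median is used for the numeric column; swapping in the mean only requires changing the string passed to `transform`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Penguins data (values invented for illustration).
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "island": ["Torgersen"] * 3 + ["Biscoe"] * 3,
    "body_mass_g": [3700.0, np.nan, 3900.0, 5000.0, 5200.0, np.nan],
    "sex": ["Male", "Male", None, "Female", None, "Female"],
})

group_keys = ["species", "island"]

# Numeric column: fill each gap with the median of its species/island group.
group_median = df.groupby(group_keys)["body_mass_g"].transform("median")
df["body_mass_g"] = df["body_mass_g"].fillna(group_median)


def group_mode(values: pd.Series):
    """Return the most frequent non-missing value, or NaN if the group has none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else np.nan


# Categorical column: fill each gap with the mode of its species/island group.
group_sex_mode = df.groupby(group_keys)["sex"].transform(group_mode)
df["sex"] = df["sex"].fillna(group_sex_mode)

print(df)
```

When a group has a tie for the most frequent value, `mode()` returns the candidates in sorted order, so this helper picks the first one. Whether that is acceptable depends on the analysis.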

The final section of the profile report includes a dataset sample, displaying the head of the dataset to provide a glimpse of its structure and contents.

Head Dataset Sample

Don’t forget to save the profile report if you have other tasks to prioritize. With a relative file name like the one below, the report is written to the current working directory, which is usually the folder containing your Jupyter notebook.

profile.to_file('Penguins.html')

Working with datasets that exhibit various data characteristics can be challenging. However, YData-Profiling offers a one-stop solution to understand your data with just a few lines of code. With its assistance, we can easily observe our data and determine the best strategy to clean it. Despite its usefulness, automated profiling may sometimes misinterpret variables and display unsuitable charts. It’s essential to double-check that the data is interpreted accurately according to our needs. Hopefully, this article provides informative insights and assists fellow data enthusiasts in their endeavors. Let’s stay motivated and productive!


Eugenius Mauritz Rafael
Python’s Gurus

Aspiring data enthusiast with a robust background in Mechanical Engineering, currently transitioning into the field of Data Science and Machine Learning.