Data Profiling with Python

Seckin Dinc
9 min read · Apr 4, 2023

What is Data Profiling?

Data profiling is the process of examining and analyzing data to gain insights into its structure, quality, completeness, and other characteristics. It involves the use of various techniques and tools to collect and analyze metadata, statistics, and patterns from a dataset, including its size, type, range, format, and relationships between its variables.

Typically, data profiling involves tasks such as identifying data types, analyzing data distributions, checking data quality, detecting data anomalies, and visualizing data patterns. Some common tools used for data profiling include data profiling software, data quality tools, data visualization tools, and statistical software packages.

The information gathered during data profiling can be used to make decisions about data management, such as data cleaning, data transformation, and data integration. Data profiling is often used as a preliminary step before data analysis, data migration, or data warehousing.

What are the Most Critical Capabilities of a Data Profiling Library?

Python is the most common programming language for data operations, with hundreds of open-source libraries solving various data problems. Data profiling is no exception: there are several data profiling libraries in Python, including Great Expectations, ydata-profiling, Lux, and DataProfiler.

Some of the top capabilities that a data profiling library should offer are;

  1. Data analysis: A data profiling library can provide various methods for data analysis, such as statistical analysis, data visualization, and pattern recognition.
  2. Data visualization: A data profiling library can provide various methods for data visualization, such as histograms, scatter plots, and heat maps.
  3. Data quality checks: A data profiling library can provide checks for data quality, such as data completeness, data accuracy, and data consistency.
  4. Data profiling reports: A data profiling library can generate detailed reports on data quality, data completeness, data accuracy, and data consistency.
  5. Data profiling automation: A data profiling library can provide capabilities for automating data profiling tasks, such as scheduling data profiling runs, integrating with other data management tools, and generating alerts for data quality issues.

Data Profiling Libraries

For this part, I will exclude Great Expectations since I have a dedicated article for it. In case you want to learn more, you can check my previous article;

In the following sections, I will test the Python libraries and compare their capabilities. For testing purposes, I will use the NBFI Vehicle Loan Repayment Dataset on Kaggle.

1- ydata-profiling

Pandas-profiling originated as a library for profiling Pandas objects. As it evolved to also support Spark, it was rebranded as ydata-profiling.

ydata-profiling is not a built-in Python package. You need to install it within your terminal with the pip install ydata-profiling command.

Key features

Below you can find the key features of the ydata-profiling library;

  • Type inference: automatic detection of columns’ data types (Categorical, Numerical, Date, etc.)
  • Warnings: A summary of the problems/challenges in the data that you might need to work on (missing data, inaccuracies, skewness, etc.)
  • Univariate analysis: including descriptive statistics (mean, median, mode, etc) and informative visualizations such as distribution histograms
  • Multivariate analysis: including correlations, a detailed analysis of missing data, duplicate rows, and visual support for variables' pairwise interaction
  • Time-Series: including different statistical information relative to time-dependent data such as auto-correlation and seasonality, along ACF and PACF plots.
  • Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic), and blocks (ASCII, Cyrillic)
  • File and Image analysis: file sizes, creation dates, dimensions, an indication of truncated images, and the existence of EXIF metadata
  • Compare datasets: one-line solution to enable a fast and complete report on the comparison of two datasets (a short sketch follows this list)
  • Flexible output formats: all analyses can be exported as an HTML report that can be easily shared with different parties, as JSON for easy integration into automated systems, or as a widget in a Jupyter Notebook.
  • Integrations: automating the profiling operation across various steps is crucial for ongoing operations. The library supports integrations with other major open-source tools in the modern data stack; Great Expectations, Airflow, Prefect, etc.
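
As a quick illustration of the dataset comparison feature, below is a minimal sketch that profiles a hypothetical train/test split of the same file and compares the two reports; the split only exists here to produce two datasets.

import pandas as pd
from ydata_profiling import ProfileReport, compare

df = pd.read_csv("data/Train_Dataset.csv")

# Hypothetical 80/20 split, only to have two datasets to compare
train_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(train_df.index)

train_report = ProfileReport(train_df, title="Train")
test_report = ProfileReport(test_df, title="Test")

# One-line comparison report, exported as a shareable HTML file
comparison_report = compare([train_report, test_report])
comparison_report.to_file("comparison_report.html")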

Example

In the Jupyter Notebook, I will import the dataset and create a ProfileReport object to get the full report;

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("data/Train_Dataset.csv")
profile = ProfileReport(df, title="Pandas Profiling Report")
profile

The report is automatically generated and it is an interactive object. You can scroll between different tabs and columns to get what you seek. The report contains the sections below;

  • Overview

Under the overview section, we have Overview, Alerts, and Reproduction sections.

The Overview section gives us a bird's-eye view of the data set with general data quality statistics.

Image by the author

The Alerts section gives us the potential misfits in the data set. For example, 67 alerts were generated for our data set;

Image by the author

The Reproduction section gives us the details in case we want to reproduce this report in the future. The config.json object contains all the configurations of the generated report.

Image by the author
  • Variables

The variables section focuses on univariate reports about the variables. By clicking the “More details” button, we get detailed descriptive statistics, distribution information, and more about each variable.

Image by the author
  • Interactions

In the interactions section, through the interactive chart, we can visualize the interactions between two variables;

Image by the author
  • Correlations

In the correlation section, we can visualize and report the correlations between variables;

Image by the author
  • Missing values

In the missing values section, we can deep dive into the missingness of the data set;

Image by the author
  • Sample

In the sample section, we can visualize the first or last rows of the data set.

Image by the author
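
Besides rendering the report interactively in the notebook, the same profile object can be exported to the flexible output formats mentioned earlier; the file name below is just a placeholder.

# Standalone HTML report that can be shared with different parties
profile.to_file("train_dataset_report.html")

# JSON string for easy integration into automated systems
report_json = profile.to_json()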

Great Expectations Integration

Great Expectations is the leading tool for validating, documenting, and profiling your data to maintain quality and improve communication between teams. For detailed information about the tool, you can check my article;

ydata-profiling provides a simple to_expectation_suite() method that returns a Great Expectations ExpectationSuite object containing a set of Expectations. Below I am going to create an Expectation Suite directly from the profile object.

import great_expectations as ge

data_context = ge.data_context.DataContext(
    context_root_dir="../great_expectations"
)

suite = profile.to_expectation_suite(
    suite_name="profiling-demo",
    data_context=data_context,
    save_suite=False,
    run_validation=False,
    build_data_docs=False,
)

# Save the suite
data_context.save_expectation_suite(suite)

# Run validation on your dataframe
batch = ge.dataset.PandasDataset(df, expectation_suite=suite)

results = data_context.run_validation_operator(
    "action_list_operator", assets_to_validate=[batch]
)

validation_result_identifier = results.list_validation_result_identifiers()[0]

# Build and open data docs
data_context.build_data_docs()
data_context.open_data_docs(validation_result_identifier)
Image by the author

DAG Execution Tools

With its Python, command-line, and Jupyter interfaces, ydata-profiling integrates seamlessly with DAG execution tools like Airflow, Dagster, Kedro, and Prefect, allowing it to easily become a building block of data ingestion and analysis pipelines. Check out the available examples here.
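
As a minimal sketch of what such a pipeline step could look like, the snippet below wraps the profiling call in an Airflow PythonOperator; the DAG id, schedule, and file paths are hypothetical.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def profile_train_dataset():
    import pandas as pd
    from ydata_profiling import ProfileReport

    # Profile the daily extract and publish the report as a static HTML file
    df = pd.read_csv("data/Train_Dataset.csv")
    ProfileReport(df, title="Daily Profiling Report").to_file("reports/train_dataset.html")


with DAG(
    dag_id="profiling_demo",
    start_date=datetime(2023, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="profile_train_dataset", python_callable=profile_train_dataset)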

Streamlit Integration

Streamlit is an open-source Python library made to build web apps for machine learning and data science. For detailed information about the tool, you can check my article;

If you want to build a web application to visualize the profile report, you can directly integrate Streamlit with ydata-profiling. Check out the available examples here.
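
Below is a minimal sketch of such an app, assuming the streamlit-pandas-profiling component is installed alongside Streamlit (pip install streamlit streamlit-pandas-profiling).

import pandas as pd
import streamlit as st
from streamlit_pandas_profiling import st_profile_report
from ydata_profiling import ProfileReport

df = pd.read_csv("data/Train_Dataset.csv")
profile = ProfileReport(df, title="Pandas Profiling Report")

st.title("NBFI Vehicle Loan Profiling")
# Renders the full interactive report inside the Streamlit app
st_profile_report(profile)

Saving this as app.py and running streamlit run app.py serves the report as a web application.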

2- Lux

Lux is a Python library that facilitates fast and easy data exploration by automating the visualization and data analysis process. By simply printing out a dataframe in a Jupyter notebook, Lux recommends a set of visualizations highlighting interesting trends and patterns in the dataset.

Lux is not a built-in Python package. You need to install it within your terminal with the pip install lux-api command. After the installation;

To activate the Jupyter Notebook extension:

jupyter nbextension install --py luxwidget
jupyter nbextension enable --py luxwidget

To activate the JupyterLab extension:

jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install luxwidget

Example

In the Jupyter Notebook, I will import the dataset into a Pandas DataFrame;

import pandas as pd
import lux

df = pd.read_csv("data/Train_Dataset.csv")

Lux has an interactive widget that gets activated as soon as we display the Pandas DataFrame object.

Image by the author

When we click the “Toggle Pandas/Lux” button, Lux converts the table view of the Pandas DataFrame into interactive charts in three sections;

Correlation

In the correlation section, we can visualize the correlations between two variables;

Image by the author

Distribution

In the distribution section, we can visualize the univariate distributions of the quantitative variables;

Image by the author

Occurrence

In the occurrence section, we can visualize the frequency distributions of categorical variables;

Image by the author

Below you can find a demonstration of the interactive charts;

Image by the author

Intent

Lux can generate the default recommendations shown above. In some cases, we may want to focus on specific variables in our analyses. In these situations, Lux enables users to specify the variables of interest through its intent mechanism. For detailed information, you can follow the official page.
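
A minimal sketch of the intent mechanism; the column names below are placeholders from the loan dataset and may differ in your copy.

import pandas as pd
import lux

df = pd.read_csv("data/Train_Dataset.csv")

# Tell Lux which variables we care about; the recommended charts
# will now be centered on these columns
df.intent = ["Client_Income", "Credit_Amount"]
df  # displaying the DataFrame in Jupyter now shows intent-focused recommendations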

3- DataProfiler

The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.

With a single command, the library automatically formats and loads files into a DataFrame. While profiling, it identifies the schema, statistics, entities (PII / NPI), and more. The resulting data profiles can then be used in downstream applications or reports.

DataProfiler has an extensive statistics portfolio for structured, unstructured, and graph data formats. For the detailed statistics list you can check the official documentation.

DataProfiler supports integration with the Great Expectations library. You can check some examples in the official documentation.

Example

In the Jupyter Notebook, I will import the dataset and create a Profiler object to get the full report;

from dataprofiler import Data, Profiler
import json

df = Data("data/Train_Dataset.csv")
profile = Profiler(df)
report = profile.report(report_options={"output_format": "pretty"})
print('\nREPORT:\n' + '='*80)
print(json.dumps(report, indent=4))

DataProfiler doesn’t have any visual representation. It collects all the statistics in a dictionary, which we need to print out ourselves. Below you can find a glimpse of the generated report;

Image by the author
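
Since the report is a plain dictionary, individual statistics can also be accessed programmatically instead of printing the whole thing. The keys below reflect the structure of the report I generated and may change between library versions.

# Dataset-level statistics
print(report["global_stats"]["column_count"])
print(report["global_stats"]["row_count"])

# Per-column statistics, e.g. the name and inferred data type of the first column
first_column = report["data_stats"][0]
print(first_column["column_name"], first_column["data_type"])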

Conclusion

In this article, I covered three different data profiling libraries. These libraries target different user groups and use cases;

  • Enterprise solution for the whole data team: ydata-profiling does everything you may expect from an open-source library. While it provides easy-to-read text formatted reports, it creates great visuals at the same time. Integrations with Great Expectations, web applications, reproducibility, and DAG execution tools position it as the best solution in the market today.
  • Visual-focused data analysts: Lux aims to provide easy-to-use interactive charts in the Jupyter Notebook with almost no integration or effort needed.
  • Data engineers focused on reproducible data quality: DataProfiler aims to provide a wide range of statistics on various data formats.

Thanks a lot for reading 🙏

If you are interested in data quality topics, you can check my other articles;

If you are interested in data and leadership topics, you can check out my glossary for all other articles;

If you want to get in touch, you can find me on Linkedin and Mentoring Club!
