Data Profiling with Python

Seckin Dinc
9 min read · Apr 4, 2023

What is Data Profiling?

Data profiling is the process of examining and analyzing data to gain insights into its structure, quality, completeness, and other characteristics. It involves the use of various techniques and tools to collect and analyze metadata, statistics, and patterns from a dataset, including its size, type, range, format, and relationships between its variables.

Typically, data profiling involves tasks such as identifying data types, analyzing data distributions, checking data quality, detecting data anomalies, and visualizing data patterns. Some common tools used for data profiling include data profiling software, data quality tools, data visualization tools, and statistical software packages.

The information gathered during data profiling can be used to make decisions about data management, such as data cleaning, data transformation, and data integration. Data profiling is often used as a preliminary step before data analysis, data migration, or data warehousing.

What are the Most Critical Capabilities of a Data Profiling Library?

Python is the most common programming language for data operations, with hundreds of open-source libraries solving various data problems. Data profiling is no exception: there are several data profiling libraries in Python, including Great Expectations, ydata-profiling, Lux, and DataProfiler.

Some of the top capabilities that a data profiling library should offer are;

  1. Data analysis: A data profiling library can provide various methods for data analysis, such as statistical analysis, data visualization, and pattern recognition.
  2. Data visualization: A data profiling library can provide various methods for data visualization, such as histograms, scatter plots, and heat maps.
  3. Data quality checks: A data profiling library can provide checks for data quality, such as data completeness, data accuracy, and data consistency.
  4. Data profiling reports: A data profiling library can generate detailed reports on data quality, data completeness, data accuracy, and data consistency.
  5. Data profiling automation: A data profiling library can provide capabilities for automating data profiling tasks, such as scheduling data profiling runs, integrating with other data management tools, and generating alerts for data quality issues.

Data Profiling Libraries

For this part, I will exclude Great Expectations since I have a dedicated article for it. In case you want to learn more, you can check my previous article;

In the following sections, I will test the Python libraries and compare their capabilities. For testing purposes, I will use the NBFI Vehicle Loan Repayment Dataset on Kaggle.

1- ydata-profiling

Pandas-profiling originated as a library for profiling Pandas objects. As it evolved to also support Spark, it was rebranded as ydata-profiling.

ydata-profiling is not a built-in Python package. You need to install it within your terminal with the pip install ydata-profiling command.

Key features

Below you can find the key features of the ydata-profiling library;

  • Type inference: automatic detection of columns’ data types (Categorical, Numerical, Date, etc.)
  • Warnings: A summary of the problems/challenges in the data that you might need to work on (missing data, inaccuracies, skewness, etc.)
  • Univariate analysis: including descriptive statistics (mean, median, mode, etc) and informative visualizations such as distribution histograms
  • Multivariate analysis: including correlations, a detailed analysis of missing data, duplicate rows, and visual support for variables' pairwise interaction
  • Time-Series: including different statistical information relative to time-dependent data such as auto-correlation and seasonality, along ACF and PACF plots.
  • Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic), and blocks (ASCII, Cyrillic)
  • File and Image analysis: file sizes, creation dates, dimensions, an indication of truncated images, and the existence of EXIF metadata
  • Compare datasets: one-line solution to enable a fast and complete report on the comparison of two datasets (a short sketch follows this list)
  • Flexible output formats: all analyses can be exported as an HTML report that can be easily shared with different parties, as JSON for easy integration into automated systems, or as a widget in a Jupyter Notebook.
  • Integrations: automating the profiling operation across various steps is crucial for ongoing operations. The library supports integrations with other major open-source tools in the modern data stack; Great Expectations, Airflow, Prefect, etc.
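
As a quick illustration of the dataset comparison feature, below is a minimal sketch that profiles a hypothetical train/test split of the same file and compares the two reports; the split only exists here to produce two datasets.

import pandas as pd
from ydata_profiling import ProfileReport, compare

df = pd.read_csv("data/Train_Dataset.csv")

# Hypothetical 80/20 split, only to have two datasets to compare
train_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(train_df.index)

train_report = ProfileReport(train_df, title="Train")
test_report = ProfileReport(test_df, title="Test")

# One-line comparison report, exported as a shareable HTML file
comparison_report = compare([train_report, test_report])
comparison_report.to_file("comparison_report.html")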

Example

In the Jupyter Notebook, I will import the dataset and create a ProfileReport object to get the full report;

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("data/Train_Dataset.csv")
profile = ProfileReport(df, title="Pandas Profiling Report")
profile

The report is automatically generated and it is an interactive object. You can scroll between different tabs and columns to get what you seek. The report contains the sections below;

  • Overview

Under the overview section, we have Overview, Alerts, and Reproduction sections.

The Overview section gives us a bird's-eye view of the data set with general data quality statistics.

Image by the author

The Alerts section gives us the potential misfits in the data set. For example, 67 alerts were generated for our data set;

Image by the author

The Reproduction section gives us the details in case we want to reproduce this report in the future. The config.json object contains all the configurations of the generated report.

Image by the author
  • Variables

The variables section focuses on univariate reports about the variables. By clicking the “More details” button, we get detailed descriptive statistics, distribution information, and more about each variable.

Image by the author
  • Interactions

In the interactions section, through the interactive chart, we can visualize the interactions between two variables;

Image by the author
  • Correlations

In the correlation section, we can visualize and report the correlations between variables;

Image by the author
  • Missing values

In the missing values section, we can deep dive into the missingness of the data set;

Image by the author
  • Sample

In the sample section, we can visualize the first or last rows of the data set.

Image by the author
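
Besides rendering the report interactively in the notebook, the same profile object can be exported to the flexible output formats mentioned earlier; the file name below is just a placeholder.

# Standalone HTML report that can be shared with different parties
profile.to_file("train_dataset_report.html")

# JSON string for easy integration into automated systems
report_json = profile.to_json()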

Great Expectations Integration

Great Expectations is the leading tool for validating, documenting, and profiling your data to maintain quality and improve communication between teams. For detailed information about the tool, you can check my article;

ydata-profiling provides a simple to_expectation_suite() method that returns a Great Expectations ExpectationSuite object containing a set of Expectations. Below I am going to create an Expectation Suite directly from the profile object.

import great_expectations as ge

data_context = ge.data_context.DataContext(
    context_root_dir="../great_expectations"
)

suite = profile.to_expectation_suite(
    suite_name="profiling-demo",
    data_context=data_context,
    save_suite=False,
    run_validation=False,
    build_data_docs=False,
)

# Save the suite
data_context.save_expectation_suite(suite)

# Run validation on your dataframe
batch = ge.dataset.PandasDataset(df, expectation_suite=suite)

results = data_context.run_validation_operator(
    "action_list_operator", assets_to_validate=[batch]
)

validation_result_identifier = results.list_validation_result_identifiers()[0]

# Build and open data docs
data_context.build_data_docs()
data_context.open_data_docs(validation_result_identifier)
Image by the author

DAG Execution Tools

With its Python, command-line, and Jupyter interfaces, ydata-profiling integrates seamlessly with DAG execution tools like Airflow, Dagster, Kedro, and Prefect, allowing it to easily become a building block of data ingestion and analysis pipelines. Check out the available examples here.
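
As a minimal sketch of what such a pipeline step could look like, the snippet below wraps the profiling call in an Airflow PythonOperator; the DAG id, schedule, and file paths are hypothetical.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def profile_train_dataset():
    import pandas as pd
    from ydata_profiling import ProfileReport

    # Profile the daily extract and publish the report as a static HTML file
    df = pd.read_csv("data/Train_Dataset.csv")
    ProfileReport(df, title="Daily Profiling Report").to_file("reports/train_dataset.html")


with DAG(
    dag_id="profiling_demo",
    start_date=datetime(2023, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="profile_train_dataset", python_callable=profile_train_dataset)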

Streamlit Integration

Streamlit is an open-source Python library made to build web apps for machine learning and data science. For detailed information about the tool, you can check my article;

If you want to build a web application to visualize the profile report, you can directly integrate Streamlit with ydata-profiling. Check out the available examples here.
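
Below is a minimal sketch of such an app, assuming the streamlit-pandas-profiling component is installed alongside Streamlit (pip install streamlit streamlit-pandas-profiling).

import pandas as pd
import streamlit as st
from streamlit_pandas_profiling import st_profile_report
from ydata_profiling import ProfileReport

df = pd.read_csv("data/Train_Dataset.csv")
profile = ProfileReport(df, title="Pandas Profiling Report")

st.title("NBFI Vehicle Loan Profiling")
# Renders the full interactive report inside the Streamlit app
st_profile_report(profile)

Saving this as app.py and running streamlit run app.py serves the report as a web application.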

2- Lux

Lux is a Python library that facilitates fast and easy data exploration by automating the visualization and data analysis process. By simply printing out a dataframe in a Jupyter notebook, Lux recommends a set of visualizations highlighting interesting trends and patterns in the dataset.

Lux is not a built-in Python package. You need to install it within your terminal with the pip install lux-api command. After the installation;

To activate the Jupyter Notebook extension:

jupyter nbextension install --py luxwidget
jupyter nbextension enable --py luxwidget

To activate the JupyterLab extension:

jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install luxwidget

Example

In the Jupyter Notebook, I will import the dataset into a Pandas DataFrame;

import pandas as pd
import lux

df = pd.read_csv("data/Train_Dataset.csv")

Lux has an interactive widget that gets activated as soon as we display the Pandas DataFrame object.

Image by the author

When we click the “Toggle Pandas/Lux” button, Lux converts the table view of the Pandas DataFrame into interactive charts in three sections;

Correlation

In the correlation section, we can visualize the correlations between two variables;

Image by the author

Distribution

In the distribution section, we can visualize the univariate distributions of the quantitative variables;

Image by the author

Occurrence

In the occurrence section, we can visualize the frequency distributions of categorical variables;

Image by the author

Below you can find a demonstration of the interactive charts;

Image by the author

Intent

Lux can generate the default recommendations shown above. In some cases, we may want to focus on specific variables in our analyses. In these situations, Lux enables users to specify the variables of interest through its intent mechanism. For detailed information, you can follow the official page.
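
A minimal sketch of the intent mechanism; the column names below are placeholders from the loan dataset and may differ in your copy.

import pandas as pd
import lux

df = pd.read_csv("data/Train_Dataset.csv")

# Tell Lux which variables we care about; the recommended charts
# will now be centered on these columns
df.intent = ["Client_Income", "Credit_Amount"]
df  # displaying the DataFrame in Jupyter now shows intent-focused recommendations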

3- DataProfiler

The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.

With a single command, the library automatically formats and loads files into a DataFrame. While profiling, it identifies the schema, statistics, entities (PII / NPI), and more. The resulting data profiles can then be used in downstream applications or reports.

DataProfiler has an extensive statistics portfolio for structured, unstructured, and graph data formats. For the detailed statistics list you can check the official documentation.

DataProfiler supports integration with the Great Expectations library. You can check some examples in the official documentation.

Example

In the Jupyter Notebook, I will import the dataset and create a Profiler object to get the full report;

from dataprofiler import Data, Profiler
import json

df = Data("data/Train_Dataset.csv")
profile = Profiler(df)
report = profile.report(report_options={"output_format": "pretty"})
print('\nREPORT:\n' + '='*80)
print(json.dumps(report, indent=4))

DataProfiler doesn’t have any visual representation. It collects all the statistics in a dictionary, which we need to print out ourselves. Below you can find a glimpse of the generated report;

Image by the author
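
Since the report is a plain dictionary, individual statistics can also be accessed programmatically instead of printing the whole thing. The keys below reflect the structure of the report I generated and may change between library versions.

# Dataset-level statistics
print(report["global_stats"]["column_count"])
print(report["global_stats"]["row_count"])

# Per-column statistics, e.g. the name and inferred data type of the first column
first_column = report["data_stats"][0]
print(first_column["column_name"], first_column["data_type"])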

Conclusion

In this article, I covered three different data profiling libraries. These libraries target different user groups and use cases;

  • Enterprise solution for the whole data team: ydata-profiling does everything you may expect from an open-source library. While it provides easy-to-read text formatted reports, it creates great visuals at the same time. Integrations with Great Expectations, web applications, reproducibility, and DAG execution tools position it as the best solution in the market today.
  • Visual-focused data analysts: Lux aims to provide easy-to-use interactive charts in the Jupyter Notebook with almost no integration or effort needed.
  • Data engineers focused on reproducible data quality: DataProfiler aims to provide a wide range of statistics on various data formats.

Thanks a lot for reading 🙏

If you are interested in data quality topics, you can check my other articles;

If you are interested in data and leadership topics, you can check out my glossary for all other articles;

If you want to get in touch, you can find me on Linkedin and Mentoring Club!
