The Power of Data Profiling in Analytics: A Practical Guide

Siladitya Ghosh
4 min read · Feb 6, 2024

In the realm of analytics, understanding the intricacies of your dataset is fundamental to extracting meaningful insights. Data profiling emerges as a crucial process, offering a comprehensive examination of data characteristics, quality, and structure. In this article, we’ll explore what data profiling entails, its significance in analytics, and demonstrate how to leverage Python to perform data profiling on datasets residing in a data lake, specifically on Amazon S3.

What is Data Profiling?

Data profiling is a systematic analysis of datasets to uncover valuable insights into their structure, quality, and potential issues. It involves examining statistical summaries, identifying patterns, and understanding the distribution of data values. The primary goal is to gain a deep understanding of the data, enabling data practitioners to make informed decisions.
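
Even before reaching for a dedicated profiling library, a few lines of pandas capture the spirit of the process. The sketch below (using a small, hypothetical DataFrame) surfaces summary statistics, missing values, value distributions, and inferred types:

import pandas as pd

# A small, hypothetical dataset for illustration
df = pd.DataFrame({
    'age': [22, 35, None, 41, 29],
    'city': ['NY', 'SF', 'NY', 'NY', 'LA'],
})

# Statistical summary of numeric columns (count, mean, std, quartiles)
print(df.describe())

# Missing values per column
print(df.isna().sum())

# Distribution of values in a categorical column
print(df['city'].value_counts(normalize=True))

# Inferred data types, useful for spotting schema issues
print(df.dtypes)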

Using Data Profiling in Analytics

In analytics, data profiling serves multiple purposes:

Understanding Data Quality:

  • Data profiling helps assess the quality of data, identifying missing values, outliers, and inconsistencies (see the sketch after this list).

Informing Data Cleaning Strategies:

  • Insights from data profiling guide data cleaning efforts, ensuring that the dataset is suitable for analysis.

Schema Discovery:

  • Data profiling assists in discovering the underlying structure of the data, including relationships between variables.

Enhancing Data Exploration:

  • Profiling results guide exploratory data analysis, providing a roadmap for further investigation.
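
To make the data-quality point concrete, here is a minimal sketch of the kind of check a profiler automates, using the common interquartile-range (IQR) rule to flag outliers in a hypothetical salary column:

import pandas as pd

# Hypothetical numeric column containing an obvious outlier
salaries = pd.Series([55000, 60000, 70000, 70000, 900000], name='salary')

# Count missing values
print('missing:', salaries.isna().sum())

# IQR rule: values beyond 1.5 * IQR from the quartiles are flagged
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
outliers = salaries[(salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)]
print(outliers)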

Below is a sample Python script that uses Boto3 to read data from Amazon S3 and produce a data profile with the pandas_profiling package (which has since been renamed ydata-profiling; the older name is used here). Ensure you have both boto3 and pandas-profiling installed in your Python environment.

import boto3
import pandas as pd
from pandas_profiling import ProfileReport

# Specify your AWS credentials and S3 bucket details
# (hard-coded keys are for illustration only; in production, prefer
# IAM roles or the default credential chain)
aws_access_key_id = 'YOUR_ACCESS_KEY'
aws_secret_access_key = 'YOUR_SECRET_KEY'
region_name = 'YOUR_REGION'
bucket_name = 'YOUR_S3_BUCKET'
object_key = 'path/to/your/datafile.csv'  # Update with your actual object key

# Initialize the Boto3 S3 client
s3_client = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name
)

# Read data from S3 into a Pandas DataFrame
s3_object = s3_client.get_object(Bucket=bucket_name, Key=object_key)
df = pd.read_csv(s3_object['Body'])

# Generate the data profile using pandas-profiling
profile = ProfileReport(df, title='Data Profile Report')

# Save the data profile report to a file (optional)
profile.to_file("data_profile_report.html")

# Display the data profile report inline (in a Jupyter notebook)
profile.to_notebook_iframe()

Make sure to replace 'YOUR_ACCESS_KEY', 'YOUR_SECRET_KEY', 'YOUR_REGION', 'YOUR_S3_BUCKET', and 'path/to/your/datafile.csv' with your actual AWS credentials and S3 details.

This code assumes that your data is in CSV format. If your data is in a different format, you may need to adjust the pd.read_csv line accordingly. Additionally, ensure that your environment has the required packages installed:

pip install boto3 pandas pandas-profiling

This script reads a CSV file from an S3 bucket into a Pandas DataFrame, generates a data profile using pandas_profiling, and optionally saves the profile report as an HTML file.
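
If your file were Parquet rather than CSV, for example, only the read step needs to change. Here is a minimal sketch, assuming pyarrow or fastparquet is installed and reusing the s3_client from above (pd.read_parquet needs a seekable buffer, so the object body is wrapped in BytesIO):

import io

# Read a Parquet object from S3 into a Pandas DataFrame
s3_object = s3_client.get_object(Bucket=bucket_name, Key='path/to/your/datafile.parquet')
df = pd.read_parquet(io.BytesIO(s3_object['Body'].read()))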

What’s Included in the Data Profile Report?

The generated data profile report includes:

Overview:

  • General statistics such as the number of variables, observations, and missing values.

Variables Section:

  • Detailed information about each variable, including type, unique values, missing values, and data distribution.

Correlations:

  • Insights into variable relationships, aiding in understanding dependencies between different features.

Warnings and Distributions:

  • Flags potential issues such as high cardinality or zero variance and visualizes data distributions.
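
A practical note on performance: the full report can be expensive to compute on wide or very large datasets. pandas-profiling provides a documented minimal mode that disables the costliest analyses (such as correlations), which is often a sensible first pass over data-lake files:

# Minimal mode skips expensive computations such as correlations
profile = ProfileReport(df, title='Data Profile Report', minimal=True)
profile.to_file("data_profile_report_minimal.html")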

Sample Data and Output

Consider a simplified sample dataset stored in the S3 data lake: five employee records spread across four columns, including Age and Salary.
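
The original file is not reproduced here, but as a purely hypothetical reconstruction, the sketch below builds a five-row dataset whose Age and Salary values match the statistics in the report that follows (the names and departments are illustrative) and uploads it using the S3 client from the earlier script:

import pandas as pd

# Hypothetical data consistent with the sample report below
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
    'Age': [22, 25, 31, 38, 42],  # mean 31.6, min 22, max 42
    'Department': ['Sales', 'Sales', 'IT', 'IT', 'HR'],
    'Salary': [55000, 60000, 70000, 70000, 90000],  # mean 69000, min 55000, max 90000
})

# Upload the CSV to S3, reusing the client and bucket details from above
s3_client.put_object(
    Bucket=bucket_name,
    Key=object_key,
    Body=df.to_csv(index=False).encode('utf-8'),
)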

Upon running the provided code, the data profile report includes insightful information such as an overview of general statistics, variable details, correlations, warnings, distributions, and a sample of the dataset. Below is a snippet of the sample data profile report:

Overview:

  • Number of variables: 4
  • Number of observations: 5
  • Missing cells: 0

Variables Section:

  • Variable “Age” has a mean of 31.6, a minimum of 22, and a maximum of 42.
  • Variable “Salary” has a mean of $69,000, a minimum of $55,000, and a maximum of $90,000.

Correlations:

  • Positive correlation between “Age” and “Salary.”

Warnings and Distributions:

  • No missing values, high cardinality, or zero variance observed.

Sample:

  • A snapshot of the dataset showcasing the first few rows.

This sample report provides a holistic view of the dataset, facilitating data analysts and scientists in making informed decisions during the analytical process.

There are several other Python packages for data profiling and exploratory data analysis. While pandas_profiling is popular and widely used, different packages may offer unique features or be better suited for specific use cases. Here are a few alternatives:

D-Tale:

  • GitHub Repository: man-group/dtale
  • Description: D-Tale provides an interactive web-based interface for exploring and visualizing Pandas DataFrames. It offers features for filtering, sorting, and visualizing data.

SweetViz:

  • GitHub Repository: fbdesignpro/sweetviz
  • Description: SweetViz is a visual exploration library that generates beautiful, high-density visualizations for both numerical and categorical data. It can create comparative reports between two datasets (see the sketch after this list).

Pandas-Profiling-Extensions:

  • GitHub Repository: macbre/pandas-profiling-extensions
  • Description: An extension to pandas_profiling that includes additional profiling options and visualizations.

Autoviz:

  • GitHub Repository: AutoViML/AutoViz
  • Description: AutoViz automatically visualizes any dataset, choosing the most relevant charts and visualizations based on the data type and distribution.

ExploriPy:

  • GitHub Repository: nabeelsaleem/exploripy
  • Description: ExploriPy is designed for exploratory data analysis and data profiling. It provides visualizations and insights into datasets.
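
As a taste of how lightweight these alternatives can be, here is a minimal SweetViz sketch (assuming pip install sweetviz and a DataFrame df like the one loaded earlier):

import sweetviz as sv

# Analyze a single DataFrame and render the report as an HTML page
report = sv.analyze(df)
report.show_html('sweetviz_report.html')

# SweetViz can also compare two datasets, e.g. train vs. test:
# comparison = sv.compare([train_df, 'Train'], [test_df, 'Test'])
# comparison.show_html('comparison_report.html')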

Remember to check the documentation and features of each package to determine which one aligns best with your specific needs and preferences. Depending on your dataset and analysis goals, you may find one of these alternatives more suitable.

In conclusion, data profiling stands as a cornerstone in the analytics workflow, offering a comprehensive understanding of datasets. Leveraging Python and tools like pandas_profiling empowers data practitioners to conduct efficient and insightful data profiling, paving the way for robust analytics and informed decision-making.

