Data Profiling At Scale

Single line of code data profiling with Spark

The great debut of pandas-profiling into the big data landscape

Fabiana Clemente
The Techlife

Introduction

Data profiling is a core step in the process of building quality data flows that impact the business in a positive manner. It is the first step, and without a doubt the most important one, as the health of your data depends on how well you profile it.

With the ever-growing volumes of data being produced within organizations every day, another burden has been put on data teams: ensuring the profiling and quality of data at scale.

Profiling is a time-consuming process, even for small datasets, due to the need to continuously explore the data and try to make sense of it to derive meaningful insights.

When Big Data comes into play, things only get worse, as huge volumes of data need to be processed and analyzed swiftly and efficiently.

The technical debt of data teams keeps growing, as development efforts demand more time, resources, people, and expertise. Oftentimes, this leads to data profiling and quality being overlooked by data scientists and engineers.

The result: Unreliable analytics and data pipelines that are a nightmare to manage.

Throughout this blog post, I’ll go through the challenges of effectively manipulating and analyzing big data structures, the strategies data teams currently use and their respective pitfalls, and introduce you to the ydata-profiling package that (spoiler alert) will solve your problems in one line of code!

“A subset of data will do!” No, it won’t.

We have all faced the dilemma of a dataset that needs more memory than our computer has available: “How can I start exploring this dataset?” and “What should I do with it?”

The conclusion of this battle of thoughts? A sample will do.

After all, Pandas is the technology we are most familiar with: it is quick, flexible, and will get us the insights that we need to start with. But can we draw reliable conclusions from a sample?

Well, if you ask me, several additional questions arise: “Are these snapshots representative of the domain?”, “Are discovered trends generalizable for the overall domain?”, “Are we even looking at the ‘right chunk’ of data?”

Creating models and making decisions based on a subset of data can easily turn into a fool’s errand, with serious consequences for businesses and organizations.

It’s like looking at the world through a funnel: we may be missing the bigger picture entirely. This absolutely calls for scale!

However, whereas scalability is currently enabled by large-scale data processing engines, such as the well-known Spark framework, interpretability comes from data profiling tools, and for big data the available options remain limited.

What both data scientists and data engineers need is data profiling at scale. The answer? Support for data profiling with Spark.

Spark Profiling: “We’ve been eagerly waiting for it!”

Pandas dataframes are the bread and butter of data manipulation and analysis that data scientists have grown accustomed to using. However, they do not scale to increasing volumes of data and hence do not serve the processing needs of data engineers.

As an alternative, one of the most popular solutions for big data is using Spark dataframes, as they enable the processing of large datasets in a distributed computing environment (although Spark also runs in single-node environments!). The adoption of big data engines saves both time and money.
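
If you want to try the snippets in this article without a cluster, a single-node setup is enough; here is a minimal sketch of a local-mode session (the app name is just illustrative):

from pyspark.sql import SparkSession

# Local mode: Spark uses the cores of the current machine, no cluster required
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("single-node-profiling") \
    .getOrCreate()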

We seem to have a conundrum on our hands:

  • On the one hand, Pandas dataframes are supported by nearly every data profiling tool out there, including the well-known pandas-profiling package, beloved by the data science community. Yet, Pandas dataframes simply can’t cut it in the big data world;
  • On the other hand, Spark dataframes are the most popular data structures for big data scenarios, but they are scarcely supported by data profiling solutions and have a steeper learning curve to master.

To bypass the limitations of Pandas dataframes when working with larger volumes of data, data scientists and engineers have been exploring some creative, though time-consuming, strategies, with the two most common being:

  • Performing EDA on a Spark DataFrame using custom solutions;
  • Subsampling a Spark DataFrame into a Pandas DataFrame to leverage the features of a data profiling tool.

Let’s see how these operate and why they are somewhat faulty or impractical. As a daily Python user, I’ll be using pyspark==3.0.0 and Python 3.10 in all the examples depicted below.

Performing EDA on a Spark DataFrame with custom code

The dataset used in this article can be found on Kaggle: the NYC Yellow Taxi Trip dataset (License: ODbL, Open Database License).

Similar to Pandas, PySpark also offers a describe() method with a very familiar and convenient interface:

from pyspark.sql import SparkSession

# In case a Spark session is not yet created
spark = SparkSession \
    .builder \
    .appName("Python Spark profiling example") \
    .getOrCreate()
# Read the data, letting Spark infer the schema so the numeric columns are typed correctly
df = spark.read.csv("{insert-csv-file-path}", header=True, inferSchema=True)

# Inspect the inferred data schema
df.printSchema()

# Compute the basic statistics
df.describe().show()

NYC Yellow taxi dataset schema. Screenshot by author.

But as with Pandas, although this is a very handy method to leverage, it is often not enough. Beyond overall data descriptors, a standard EDA process involves thoroughly summarizing other data characteristics (e.g., feature range and deviation, number of categorical/numeric features) as well as checking for inconsistencies such as duplicate or missing values, skewed data distributions, and uneven feature categories.
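
Even a simple duplicate-record check already means writing extra custom code. A minimal sketch, reusing the df loaded above:

# Count exact duplicate records, one of the checks describe() does not cover
total_rows = df.count()
distinct_rows = df.distinct().count()
print(f"Duplicate rows: {total_rows - distinct_rows}")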

Let’s say that I want to perform a detailed analysis of the number of missing values per feature in the data:

from pyspark.sql.functions import col, count, when, isnull, isnan
from pyspark.sql.types import StringType, DoubleType

# For a generic validation of null values
col_missing_count_df = df.select([count(when(isnull(c), c)).alias(c) for c in df.columns])
col_missing_count_df.show()

# In case we want to consider other values for missing data

# Select only string and double types, as timestamps cannot be checked with isnan
cols = [f.name for f in df.schema.fields
        if isinstance(f.dataType, (StringType, DoubleType))]

null_records = df.select([count(when(col(c).contains('None') |
                                     col(c).contains('NULL') |
                                     (col(c) == '') |
                                     col(c).isNull() |
                                     isnan(c), c)).alias(c)
                          for c in cols])

null_records.show()

Analysis of missing values per feature. Screenshot by author.

In case you want to compute a histogram for a numerical variable, you will probably leverage something like this:

import pandas as pd

# RDDs offer a native function to compute histograms per column: here, trip_distance with 10 bins
hist = df.select('trip_distance').rdd.flatMap(lambda x: x).histogram(10)

# Load the computed histogram into a Pandas DataFrame for plotting
hist_df = pd.DataFrame(list(zip(*hist)), columns=['bin', 'frequency']).set_index('bin')
hist_df.plot(kind='bar')

Trip distance histogram. Screenshot by the author.

Although this is possible, this type of profiling is time-consuming and becomes impractical when one has to continuously analyze snapshots of data to map out results over time.

More importantly, the outcome is somewhat dependent on the skill of the data scientist or data engineer conducting the analysis and on how knowledgeable they are about the domain. It is also limited: as more details need to be added or new questions answered, the analysis quickly amounts to a lot of loops and print statements.
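
To give an idea of what that looks like in practice, here is a minimal sketch of the kind of ad hoc loop this approach tends to produce, reusing the same df (the cardinality threshold of 10 is arbitrary):

from pyspark.sql.functions import countDistinct, desc

for column in df.columns:
    n_distinct = df.select(countDistinct(column).alias("n")).first()["n"]
    print(f"{column}: {n_distinct} distinct values")
    # For low-cardinality columns, also print the category frequencies
    if n_distinct <= 10:
        df.groupBy(column).count().orderBy(desc("count")).show()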

Creating a Pandas DataFrame subsample and leveraging Data Profiling Tools

Another common workaround is taking advantage of data profiling tools such as Pandas Profiling by subsetting the data into a smaller chunk and profiling only that subsample. The idea is straightforward — get that smaller sample of the data into a Pandas DataFrame and proceed as usual. The following snippet of code shows how you can profile a random sample taken from a Spark DataFrame:

from ydata_profiling import ProfileReport

# Reading parquet files from a remote location
df = spark.read.parquet("s3://{insert-files-path}")

# Get 10% of the total records of the dataset and collect the results as a Pandas DataFrame
# Note that I'm not specifying the slice from where the sample is extracted,
# although that definition matters, as it impacts which records get selected
pd_df = df.sample(fraction=0.1).toPandas()

report = ProfileReport(pd_df, title="Sample profiling")

# Export the profile as an HTML report
report.to_file('sample_profile.html')

We could also try a stratified sampling approach using the sampleBy method instead, but would that really make a difference?
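
For reference, sampleBy expects the stratification column and a fraction per stratum; here is a minimal sketch, assuming we stratify the taxi data on the payment_type column and keep 10% of each stratum:

# Build the per-stratum fractions from the distinct values of the (assumed) payment_type column
strata = [row[0] for row in df.select("payment_type").distinct().collect()]
fractions = {value: 0.1 for value in strata}

stratified_pd_df = df.sampleBy("payment_type", fractions=fractions, seed=42).toPandas()
report = ProfileReport(stratified_pd_df, title="Stratified sample profiling")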

The truth is that, although ingenious, this patch comes with some hidden costs.

First, profiling a subset of data requires additional computations to be performed in the background (e.g., calculating standard statistics such as features’ range, determining the number of observations, or analyzing duplicate records).

Secondly, and most importantly, there is no way to guarantee that the subsample is representative of the overall domain, and therefore, results taken from this subsample may differ from those that would be returned if all the available data were analyzed instead.
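
A quick way to see this in practice is to compute the same summary statistics on a 10% sample and on the full DataFrame and compare; a minimal sketch, using the trip_distance column from earlier:

from pyspark.sql.functions import mean, stddev

# The sample statistics will generally deviate from the full-data statistics
full_stats = df.agg(mean("trip_distance"), stddev("trip_distance")).first()
sample_stats = df.sample(fraction=0.1).agg(mean("trip_distance"), stddev("trip_distance")).first()

print("full data: ", full_stats)
print("10% sample:", sample_stats)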

One Package to rule them all?

This is precisely the main advantage introduced with the new release of the most adored Python profiling package of all time: the ability to get useful insights from large volumes of data in a visual report with a single line of code.

The well-established pandas-profiling package, now renamed ydata-profiling since it supports data structures other than Pandas DataFrames, opens the door to data profiling at scale by supporting Spark DataFrames!

The code snippet below depicts an example of how to profile data from a CSV while leveraging Pyspark and ydata-profiling.

The integration of ydata-profiling’s ProfileReport into your existing Spark flows can be done seamlessly by providing a Spark DataFrame as input. Based on the input type, the package then decides whether to leverage Pandas or Spark computations for the profile calculation, with no need to set up a dedicated SparkSession for it.

This allows you to keep your current session configuration without any changes, which eases the integration process and maximizes the use of your Spark cluster setup.

from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport

#create or use an existing Spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark profiling example") \
    .getOrCreate()

df = spark.read.csv("{insert-csv-file-path}", header=True, inferSchema=True)
df.printSchema()

report = ProfileReport(df, title="Profiling pyspark DataFrame")
report.to_file('profile.html')
Profiling Report: Data Quality Alerts. Image by Author.
Profiling Report: Summary Statistics. Image by Author.

The interpretability cherry on top of the big data cake!

This new release opens the door to exciting opportunities for data profiling.

One is next-level data profiling: with support for Spark DataFrames in addition to Pandas DataFrames, this new release allows you to take your profiling to the next level by exploring larger volumes of data in a hassle-free way. If you’ve been an avid user of the previous version, get ready to take version 4.0.0 for a spin in the big data landscape with:

  • Quick Exploratory Data Analysis and Visualization: The visual support for Spark DataFrames fosters a straightforward data understanding and exploration in a single line of code, without the need to constantly write additional code to fit your EDA needs as you go;
  • Efficient Data Management and Troubleshooting: The standardized profiling allows you to validate daily snapshots of data and troubleshoot constantly changing data requirements, sources, schemas, and formats, significantly easing hard and tedious ETL processes;
  • Automatic Data Quality Control: The data quality checks provide immediate feedback on potential critical issues that need to be assessed in the data pipelines, such as rare events, data drifts, fairness constraints, or misalignments with project goals;
  • Transparent and Clear Communication of Results: The comprehensive and flexible report makes it effortless to share your results within the team, without the need to produce additional material to discuss insights and align decisions (see the short sketch right after this list).
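
As a small illustration of those last two points, the report object created in the snippet above can also be rendered inline in a notebook or exported to JSON for automated checks; a minimal sketch:

# Reuse the report object from the earlier snippet
report.to_notebook_iframe()      # render the report inline in a notebook for quick sharing
report.to_file('profile.json')   # machine-readable output, handy for automated quality checks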

The other is integration: This update also boosts the development of continuous practices of data profiling by enabling a seamless integration with platforms already leveraging Spark. And don’t worry: nothing is changed in your session configuration! We know how setting it all up is time-consuming and prone to errors and we don’t want to add any additional fuss.

ydata-profiling can be installed as an external package across different data processing platforms, so extracting insights from big data becomes as simple as pip install ydata-profiling on your preferred Spark-based solution. That’s it, no additional hacks required!

Final Thoughts

The Spark support added in this new release eases the burden of working with larger volumes of data and unleashes the power of data profiling for big data use cases.

Throughout this article, we’ve covered how to go from time-consuming or limited practices for manipulating and exploring Spark dataframes to a quick-and-easy solution for big data profiling in one line of code using ydata-profiling.

Thanks to the support of the community and the extraordinary efforts of all contributors, the most requested feature on our roadmap has become a reality and we can’t wait to hear all about the exciting use cases you’ll try out with this new version!

In the next releases, we’ll keep improving this beloved data profiling package towards data scale and volume, so feel free to follow our updates on GitHub and request additional features through our Discord server.

Happy (big) data profiling!

About me

Passionate about data. Striving to develop data quality solutions that help data scientists adopt data-centric AI.

LinkedIn | Data-Centric AI Community
