Data Profiling using Apache Spark

Sajjad Sarwar
5 min read · Feb 6, 2020


Ingesting data with quality from external sources is really challenging, particularly when you are not aware of what the data looks like or are unsure about its contents. That's where data profiling comes in: you can use it to get insights into a data source and figure out what the data looks like.

What is Data Profiling?

Data Profiling is the process of running analysis on source data to understand its structure and content. You can get the following insights by profiling a new dataset (a small plain-Spark sketch after the list shows how a few of them can be computed by hand):

  • Structure Discovery: Number of columns and their names in the source data
  • Content Discovery: Data types of the columns and which columns are nullable
  • Cardinality of Data: Number of Distinct Values in each column
  • Statistical Analysis: Min / Max / Mean / Std Dev of numerical columns
  • Value Histograms: Frequency of values in low-cardinality columns
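
For intuition, here is a minimal plain-Spark sketch (not the profiler itself) of how a few of these insights could be computed by hand; the file path and the snowfall column name are only placeholders:

    import org.apache.spark.sql.functions.{col, count, countDistinct}

    // Load a CSV with inferred types (placeholder path)
    val df = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/weather.csv")

    // Structure discovery: column names and inferred data types
    df.printSchema()

    // Completeness and cardinality for one column ("snowfall" is assumed to exist)
    val total = df.count()
    df.select(
      (count(col("snowfall")) / total.toDouble).alias("completeness"),   // count() skips nulls
      countDistinct(col("snowfall")).alias("distinct_values")
    ).show()

    // Statistical analysis: count / mean / stddev / min / max for each column
    df.describe().show()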

Why Data Profiling?

Data profiling is typically needed to address the following points:

  • Use data profiling before beginning to ingest data from a new source to discover whether the data is in a suitable form, and make a “go / no go” decision on the dataset.
  • Identify and correct data quality issues in source data, even before starting to move it into the target database.
  • Reconcile vendor specifications with the real data.
  • Identify data quality issues that can be corrected by Extract-Transform-Load (ETL) while data is moved from source to target. Data profiling can uncover whether additional processing is needed before feeding data to the pipeline.

Profiling a Sample Dataset

Before taking you through the technical implementation, I would like to show you the output of the Data Profiling code that I wrote. You can run this code yourself as well by following the instructions in the next section.

We’ll use a sample CSV file for weather data downloaded from a Kaggle dataset available at this link. The data has a station number and values of different sensors measured on a particular date. Following is a snippet of the dataset.

Preview of the Sample Dataset

After running the Data Profiler on this dataset, we get the following output:

Data Profiler Output

The above output shows the following information for each of the columns present in the data:

  • Completeness (i.e. the fraction of non-null values in the column, ranging from 0 to 1)
  • Number of Distinct Values in the column
  • Data Type of the column (i.e. String, Integer or Fractional)
  • Statistics of Numerical Columns (Min / Max / Mean / Std Dev)

You might find it strange that some of the columns that are supposed to be numeric are marked as String by the profiler (e.g. the snowfall and wetbulb columns). This is because some of the values in these columns are strings. Let's investigate this further. If we move down the output of the Data Profiler, we can see the value histograms of the columns. This is what the histogram for the snowfall column looks like:

As you can see, the String value ‘M’ appears in this column 35.75% of the time. Hence this column is marked as having the String data type because of this particular value. This is where the value of data profiling comes in: it shows us what anomalies are present in the dataset and what data cleaning steps need to be performed before ingestion. For example, in this sample dataset, we need to clean the data by replacing ‘M’ with null values to represent missing values.
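
As a sketch only (assuming the raw data is loaded in a DataFrame named df, and reusing the snowfall and wetbulb column names from the sample dataset), the cleaning step could look like this in Spark:

    import org.apache.spark.sql.functions.{col, when}

    // Treat the placeholder 'M' as missing: keep real values, null out 'M', then cast to numeric.
    // "snowfall" and "wetbulb" are column names from the sample weather dataset; adjust to your data.
    val cleaned = df
      .withColumn("snowfall", when(col("snowfall") =!= "M", col("snowfall")).cast("double"))
      .withColumn("wetbulb", when(col("wetbulb") =!= "M", col("wetbulb")).cast("double"))

Because the when expression has no otherwise branch, rows carrying ‘M’ fall through to null, which the profiler will then report as reduced completeness rather than a type anomaly.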

Lastly, the Data Profiler also provides the Spark StructType code that this dataset should have. This code snippet, after careful review of course, can be copied into the pipeline code to define a schema on top of the dataset. In a later post, I shall explain how we can leverage this StructType snippet to perform data quality checks on the datasets. Here's the Spark StructType code proposed by the Data Profiler based on the input data:
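
Purely as an illustration of the shape of such a snippet (the column names, types, nullability and path below are assumptions based on the sample weather dataset, not the profiler's actual output), a generated StructType typically looks like this:

    import org.apache.spark.sql.types._

    // Illustrative schema only; the real snippet is generated by the profiler from your data.
    val schema = StructType(Array(
      StructField("station", IntegerType, nullable = false),
      StructField("date", StringType, nullable = false),
      StructField("snowfall", StringType, nullable = true),   // stays String until 'M' is cleaned
      StructField("wetbulb", StringType, nullable = true)
    ))

    // Applying a schema like this when reading the dataset:
    val typedDf = spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/path/to/weather.csv")   // hypothetical path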

In addition to the above insights, you can also spot potential skewness in the data by looking at the histogram values. There might be some values that are off the typical pattern, or that appear more often than others. This might result in misrepresented data, or might reflect a flaw in the data creation process that can be addressed on a case-by-case basis.

How to Run the Data Profiler?

The Data Profiler that I wrote is essentially a Spark Zeppelin notebook. You can download the notebook from this link and import it into Zeppelin. Before using it, change the ‘inputDataPaths’ variable to the path of your desired dataset.
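
For example, the first paragraph of the notebook sets something like the following (the path is made up, and depending on the notebook version the variable may also accept a comma-separated list of paths):

    // Point this at your own dataset before running the notebook (example path only)
    val inputDataPaths = "/data/weather/weather_sample.csv"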

This notebook uses a Spark library named Deequ to do data profiling of columnar sources. The documentation of Deequ is accessible at this link.
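
At its core, the profiling boils down to a call to Deequ's ColumnProfilerRunner. The sketch below follows Deequ's documented API and assumes the spark session provided by Zeppelin; the notebook's actual code adds output formatting and the schema-generation step on top of this:

    import com.amazon.deequ.profiles.{ColumnProfilerRunner, NumericColumnProfile}

    // Load the dataset (inputDataPaths is the notebook variable mentioned above)
    val df = spark.read.option("header", "true").csv(inputDataPaths)

    // Profile every column of the DataFrame
    val result = ColumnProfilerRunner()
      .onData(df)
      .run()

    result.profiles.foreach { case (name, profile) =>
      // Completeness, approximate distinct count and inferred data type per column
      println(s"Column '$name': completeness=${profile.completeness}, " +
        s"approxDistinct=${profile.approximateNumDistinctValues}, dataType=${profile.dataType}")

      // Min / Max / Mean / Std Dev are only available for numeric columns
      profile match {
        case p: NumericColumnProfile =>
          println(s"  min=${p.minimum.getOrElse("n/a")}, max=${p.maximum.getOrElse("n/a")}, " +
            s"mean=${p.mean.getOrElse("n/a")}, stdDev=${p.stdDev.getOrElse("n/a")}")
        case _ =>
      }

      // Value histogram for low-cardinality columns (this is where the 'M' placeholder shows up)
      profile.histogram.foreach { distribution =>
        distribution.values.foreach { case (value, stats) =>
          println(f"  $value%-12s ${stats.ratio * 100}%6.2f%%")
        }
      }
    }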

Before running the notebook, you will have to add Deequ's jar file to Zeppelin's Spark interpreter. Follow these steps to add the jar:

  • Download the jar from this link
  • Place it on the Zeppelin machine
  • Make sure the jar is readable by the Zeppelin user (you can do that using the bash command: chmod 644 /path/to/jar)
  • Go to the Zeppelin web UI, open the menu at the top right and select “Interpreters”
  • Look for the ‘spark’ interpreter, go to its Dependencies section, and enter the full path to the Deequ jar file (or a Maven coordinate, as shown below)
  • Save the settings
  • Restart the interpreter
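
As an alternative to a local jar path, Zeppelin's dependency field also accepts Maven coordinates, so you can let it pull Deequ from Maven Central instead. The version below is only an example; pick the release that matches your Spark version:

    com.amazon.deequ:deequ:1.0.2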

After that, go back to the Zeppelin notebook and run it. It will execute all the paragraphs one by one and present the output of the data profiling.

Got questions? Feel free to reach out to me at sajjadsarwar1@gmail.com

Happy Data Profiling! ☺
