Data profiling in the age of big data

Dipayan Chattopadhyay
May 31, 2019 · 4 min read


What is big data and why should I care?

In today's world of increasing connectivity, there is no denying that the amount of data being generated is enormous. Add to that how inexpensive storage has become and how fast networks are, and it is no wonder that we are talking about big data.

Big data generally refers to a collection of large datasets that hold enormous amounts of information and hidden patterns which, when harvested in the proper way, provide insight into existing problems and, if we are lucky, into what the future holds. Some of the attributes we generally associate with big data have been captured in the picture above.

What does the big-data profiler do?

Before we can start doing anything with all the data we collect, it is essential to know what we are dealing with, for the simple reason that our insights and decisions can only be as good as the quality of the data. Beyond the quality aspect, knowing your dataset helps you figure out what resources are needed to handle its scale and what treatment can be applied. This is why we at Nordstrom have built a big data profiler.

This profiler runs on our datasets, systematically asserts defined quality check rules, and produces reports that help in performing audits. The profiler is also used for ad-hoc data analysis and helps validate schema changes.
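To give a rough sense of what asserting a quality check rule looks like, here is a minimal PySpark sketch. The rule, the column names, the threshold, and the dataset path are all hypothetical and only meant to convey the idea, not the profiler's actual implementation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quality-check-sketch").getOrCreate()

# Hypothetical rule: no more than 1% of rows may have a null customer_id.
NULL_THRESHOLD = 0.01

df = spark.read.parquet("s3://example-bucket/orders/")  # illustrative path

total = df.count()
nulls = df.filter(F.col("customer_id").isNull()).count()
null_ratio = nulls / total if total else 0.0

report = {
    "dataset": "orders",
    "row_count": total,
    "customer_id_null_ratio": null_ratio,
    "rule_passed": null_ratio <= NULL_THRESHOLD,
}
print(report)  # in practice this would feed an audit report or an alert
```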

Data landscape at Nordstrom

Data for analytics purposes at Nordstrom used to be stored in a legacy data platform (LDP). Utmost care was taken to store customer data securely and in an anonymized fashion. The data could be structured, semi-structured, or unstructured. This meant you could store all of your data without careful upfront design or the need to know what questions you might want answered in the future, which resulted in agility and faster data acquisition.

However, this also created several challenges. Unlike traditional data warehouses, there were no fixed column constraints and no single schema imposed on the data. The LDP promoted schema on read in place of schema on write. As a result, data scientists spent a lot of time discovering, interpreting, and cleaning the data. The onus shifted to the application reading the data to perform data quality checks and schema validations. This is where the data profiler comes in handy, since it generalizes the problem and can be reused by several applications during a data read. The diagram below illustrates the flow.
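To make the schema-on-read idea concrete, the sketch below shows how an application reading semi-structured data might supply its own expected schema at read time and set aside records that do not conform. The schema and paths are assumptions for illustration only.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# With schema on read, the reader (not the writer) declares the shape it expects.
expected_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("_corrupt_record", StringType()),  # collects malformed rows
])

df = (spark.read
      .schema(expected_schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("s3://example-bucket/raw/orders/"))     # illustrative path

df.cache()  # cache so the corrupt-record column can be queried on its own
bad_rows = df.filter(F.col("_corrupt_record").isNotNull()).count()
print(f"rows that did not match the expected schema: {bad_rows}")
```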

Introduction to big data profiler

Nordstrom uses a big data infrastructure with frameworks like Apache Spark, Apache Hive, and Presto. The big data profiler uses Apache Spark to profile the data and provides the end user with configurable parameters to run the profiler for their use cases.

In its current form, it supports the following features out of the box:

  • Schema validation if a schema is registered with a schema repository.
  • Some useful meta information about the data, such as counts, sums, and distinct counts.
  • Custom SQL queries on top of the data for data quality checks (see the sketch after this list).
  • Reporting integration with the Datadog monitoring system.
  • Ability to assert values and raise alarms when there is some ambiguity in the profile.
  • Powerful visualization and automated report generation support via Jupyter.
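As an illustration of how a custom SQL check combined with Datadog reporting could look, here is a minimal PySpark sketch. The query, metric name, table, and keys are assumptions, and the Datadog Python client is used in its generic form rather than as the profiler's actual integration.

```python
import time
from pyspark.sql import SparkSession
from datadog import initialize, api  # official Datadog Python client

spark = SparkSession.builder.appName("sql-check-sketch").getOrCreate()
initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")  # placeholder keys

df = spark.read.parquet("s3://example-bucket/orders/")  # illustrative path
df.createOrReplaceTempView("orders")

# Hypothetical data quality check expressed as plain SQL.
negative_amounts = spark.sql(
    "SELECT COUNT(*) AS bad FROM orders WHERE amount < 0"
).first()["bad"]

# Report the result as a Datadog metric so dashboards and monitors can use it.
api.Metric.send(
    metric="profiler.orders.negative_amounts",
    points=[(time.time(), float(negative_amounts))],
)

if negative_amounts > 0:
    raise AssertionError(f"{negative_amounts} orders have a negative amount")
```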

All one has to do is pass in a configuration in the form of JSON that drives the profiler utility. More on how to run the utility by passing this JSON configuration can be found here.
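The exact configuration format is documented in the project itself; the snippet below is only a hypothetical illustration of the kind of JSON a profiler run might be driven by, with every field name invented for the example.

```python
import json

# Hypothetical profiler configuration -- field names are illustrative only.
config = json.loads("""
{
  "dataset_path": "s3://example-bucket/orders/",
  "format": "parquet",
  "checks": [
    {"type": "row_count_min", "value": 1000},
    {"type": "sql", "query": "SELECT COUNT(*) FROM orders WHERE amount < 0", "expect": 0}
  ],
  "report": {"sink": "datadog", "metric_prefix": "profiler.orders"}
}
""")
print(config["checks"])
```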

How is the big data profiler put together?

The big data profiler is a Jupyter notebook that takes several parameters as inputs, connects to a remote Apache Spark cluster using Sparkmagic, profiles the data, and reports results to Datadog. Datadog is integrated with PagerDuty, which sends out an alert whenever the results don't match expectations. The notebook is executed daily with varying parameters with the help of a library called Papermill. Runs are scheduled using a workflow orchestrator, Apache Airflow.
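Here is a minimal sketch of how a parameterized notebook run might be scheduled, assuming a notebook named profiler.ipynb and an Airflow deployment; the DAG name, task name, and parameters are invented for the example and do not come from the actual project.

```python
from datetime import datetime
import papermill as pm
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_profiler(**_):
    # Execute the notebook with today's parameters; Papermill injects them
    # into a "parameters" cell and saves the executed copy for auditing.
    pm.execute_notebook(
        "profiler.ipynb",
        "profiler_output.ipynb",
        parameters={
            "dataset_path": "s3://example-bucket/orders/",
            "run_date": datetime.utcnow().strftime("%Y-%m-%d"),
        },
    )

with DAG(
    dag_id="daily_data_profile",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="profile_orders", python_callable=run_profiler)
```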

As part of the data profiling process, we have also integrated data schema validation. We use a common schema repository as a schema registration and versioning system. All upstream systems register their data schemas with this repository. The consuming downstream system is then expected to either pin a specific version of the schema or use the latest one. In most cases, a new version of a schema is backward compatible. The data profiler of the downstream system makes sure that the data coming from upstream conforms to the schema registered in the repository.
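A rough sketch of what such a check could look like, assuming a hypothetical HTTP schema repository that serves Spark-compatible schemas as JSON; the endpoint and helper function are invented for illustration.

```python
import requests
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("schema-validation-sketch").getOrCreate()

def fetch_registered_schema(subject: str, version: str = "latest") -> StructType:
    # Hypothetical registry endpoint returning a Spark schema as JSON.
    resp = requests.get(f"https://schema-repo.example.com/{subject}/{version}")
    resp.raise_for_status()
    return StructType.fromJson(resp.json())

registered = fetch_registered_schema("orders")
observed = spark.read.parquet("s3://example-bucket/orders/").schema  # illustrative path

# Flag any fields that the registered schema requires but the data lacks.
missing = {f.name for f in registered} - {f.name for f in observed}
if missing:
    raise AssertionError(f"data is missing registered fields: {sorted(missing)}")
```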

What’s Next

The data profiler in its current form only supports data on filesystems like HDFS and S3, or data on SQL query engines that support JDBC. We are currently working on adding capabilities to also profile Kafka streams and other data formats not yet supported. Work is also being done to extend the profiler to run an arbitrary number of SQL queries over a dataset and to add functions that provide metrics about data distribution and column correlations.

Where can I get it?

The big data profiler is now available on GitHub. Click on this link to learn more about installation instructions and a real-world sample data profile.
