Data profiling in the age of big data

What is big data and why should I care?

Big data generally refers to collections of large datasets that hold enormous amounts of information and hidden patterns which, when harvested properly, provide insight into existing problems and, with some luck, into what the future holds. Some of the attributes we generally associate with big data are captured in the picture above.

What does the big-data profiler do?

The profiler runs on our datasets, systematically asserts defined quality-check rules, and produces reports that help in performing audits. It is also used for ad-hoc data analysis and for validating schema changes.
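As a rough sketch of what asserting one such quality-check rule can look like, assuming a Spark-backed profiler (the dataset path, column, and rule name below are hypothetical, not the profiler's actual code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("profiler-rule-sketch").getOrCreate()

# Hypothetical dataset; in practice the profiler reads from the data lake.
orders = spark.read.parquet("s3://example-bucket/orders/")

# One defined quality-check rule: order_id must never be null.
violations = orders.filter(orders["order_id"].isNull()).count()
total = orders.count()

# Assert the rule and emit a line for the audit report.
passed = violations == 0
print(f"rule=non_null_order_id passed={passed} violations={violations}/{total}")
```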

Data landscape at Nordstrom

However, this also created several challenges. Unlike traditional data warehouses, there were no fixed column constraints and no single imposed schema. The LDP promoted schema-on-read in place of schema-on-write. As a result, data scientists spent a lot of time discovering, interpreting, and cleaning the data. The onus shifted to the application reading the data to perform data quality checks and schema validations. This is where the data profiler comes in handy: it generalizes the problem and can be reused by several applications during a data read. The diagram below illustrates the flow.
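To make the schema-on-read idea concrete, here is a minimal PySpark sketch of validating data at read time rather than at write time; the expected schema and path are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# The schema the *reading* application expects; nothing was enforced at write time.
expected = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("order_total", LongType(), nullable=True),
])

# Apply the schema at read time (schema-on-read): malformed records surface
# here, at the consumer, instead of being rejected at ingestion.
df = (spark.read
      .schema(expected)
      .option("mode", "PERMISSIVE")  # keep rows, null out unparseable fields
      .json("s3://example-bucket/raw-events/"))
df.printSchema()
```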

Introduction to the big data profiler

In its current form, it supports the following features out of the box:

  • Schema validation if a schema is registered with a schema repository.
  • Some useful meta information about the data, such as counts, sums, and distinct values.
  • Custom SQL queries on top of the data for data quality checks.
  • Reporting integration with the Datadog monitoring system.
  • Ability to assert values and raise alarms when there is some ambiguity in the profile (a sketch combining these two features follows this list).
  • Powerful visualization and automated report generation via Jupyter.
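As an illustration of the assertion-and-alarm combination, here is a minimal sketch using the official datadog Python client; the rule name, counts, and API keys below are hypothetical:

```python
from datadog import initialize, api

# Assumed credentials; in practice these come from the environment or a secret store.
initialize(api_key="API_KEY", app_key="APP_KEY")

def assert_and_alert(rule_name, observed, expected):
    """Assert a profiled value and raise a Datadog event if it drifts."""
    if observed != expected:
        api.Event.create(
            title=f"Data profile check failed: {rule_name}",
            text=f"expected={expected} observed={observed}",
            alert_type="error",
        )
        return False
    return True

# Hypothetical usage: the distinct store count is expected to stay at 380.
assert_and_alert("distinct_store_count", observed=379, expected=380)
```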

All one has to do is pass in a JSON configuration that drives the profiler utility. More on how to run the utility by passing this JSON configuration can be found here.
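The post does not spell out the configuration contract; purely as a hypothetical illustration, a JSON configuration tying the features above together might look like this (every field name below is an assumption, not the profiler's actual schema):

```json
{
  "dataset": "s3://example-bucket/orders/",
  "schema": { "subject": "orders", "version": "latest" },
  "checks": [
    {
      "name": "non_null_order_id",
      "sql": "SELECT COUNT(*) FROM orders WHERE order_id IS NULL",
      "expect": 0
    }
  ],
  "reporting": { "datadog": true }
}
```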

How is the big data profiler put together?

As part of the data profiling process, we have also integrated schema validation. We use a common schema repository as a schema registration and versioning system. All upstream systems register their data schemas with this repository. A consuming downstream system is then expected to use either a pinned version of the schema or the latest one. In most cases, a new version of a schema is backward compatible. The downstream system's data profiler makes sure that the data coming from upstream is bound to the schema registered in the repository.
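A minimal sketch of that binding step, assuming a hypothetical HTTP schema-repository endpoint and response shape (neither is specified in this post):

```python
import requests
from pyspark.sql import SparkSession

# Hypothetical schema-repository endpoint; the real service and its response
# format are not described here.
SCHEMA_REPO = "https://schema-repo.example.com"

def fetch_registered_fields(subject, version="latest"):
    resp = requests.get(f"{SCHEMA_REPO}/subjects/{subject}/versions/{version}")
    resp.raise_for_status()
    return {f["name"]: f["type"] for f in resp.json()["fields"]}

spark = SparkSession.builder.appName("schema-validation-sketch").getOrCreate()
df = spark.read.parquet("s3://example-bucket/orders/")  # illustrative path

registered = fetch_registered_fields("orders")
actual = {f.name: f.dataType.simpleString() for f in df.schema.fields}

# Flag any registered field that is missing or has drifted in type.
drift = {name: t for name, t in registered.items() if actual.get(name) != t}
if drift:
    raise ValueError(f"Schema mismatch against registry: {drift}")
```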

What’s Next

Where can I get it?

Written by Dipayan Chattopadhyay

Tech at Nordstrom

We create digitally connected experiences through people and technology. Help us dream about the customer shopping experience of the future. Code your career here: http://bit.ly/NordstromTechJobs