Data Profiling

Key to Data Project Success


Users always complain about the cost and timeline for data projects. They can view and analyze their data in Excel immediately. Data issues are addressed eventually. On IT projects, this is reversed — first solve all the data issues and only then get value. Otherwise you will end up with missing records, joins that don’t work, and ultimately end user mistrust.

So what’s to be done to keep the business happy? Help the users understand the additional effort! Show them what data anomalies you are tackling and get them involved in determining which need to be addressed and which can be overlooked (atleast initially).

Understanding Your Data

Most often, data projects map what data needs to go into which columns and then you start coding away, to load the data as desired. Your SMEs will call out some key data conditions that need to be handled, and then off you go. It’s like being handed the keys, a navigator, asked to drive a car once you’ve learned a few driving rules.

Data profiling is the task of understanding the data prior to actually working with it. This should include:

  1. What are the various required fields without which I shouldn’t use the data? Ex. Having a customer record with a blank name and email can be useless in a marketing context.
  2. Which are allowed to contain blanks?
  3. How do I map similar data? “CA” and “Ca” mean the same thing. So do “Active” and “A”.
  4. Which fields contain repeating values? Are all the values in some master dataset? Are there additional values that can’t be linked?
  5. Do certain values have special meaning? DOB of 1/1/1900 may have been used by the onboarding group to mean the user didn’t want to give their DOB, while 1/1/1920 may mean the data was brought in from another system where DOB wasn’t being captured.

Being able to quickly give the business users these data questions to answer helps them understand the work in data projects. It also prevents surprises later during testing when data is being rejected, or bad data is causing your queries to miss records and users are getting frustrated why they can’t see all their data.

Data Profiling in Qvikly

You can generate data profile reports from Qvikly that you can review with your users in minutes. Import the data you’ve gotten from your source providers into Qvikly, run the “Profile & Link” function, and then download the data profiling report. Some of the key things it will highlight:

  1. Histogram of how many of each value when there’s repeating values
  2. Blanks are called out
  3. For dates, ranges are provided
  4. For numbers, ranges as well as average and std. deviations are provided
  5. Skips long text (things like customer comment fields)
  6. Links with other data that you’ve already loaded into the project — to reconcile if the values are all matching or there’s some gaps

You should ensure then that you are handling all the repeating values. If there’s repeating values that are similar, then those have the chance to be mapped into the same final value. Any values of which there’s a high count may have special meaning, so be sure to confirm with the business.

Simple Steps to Build Confidence

These steps, with the help of a data profiling tool like Qvikly, build confidence with the business. This is work that they can understand needs to be done. It also builds confidence that IT is following a repeatable process and there will be fewer surprises later. Remember one of the biggest issues with data projects is delays due to “bad data” or rather “poorly understood data” and associated cost impacts. Allowing the project team to understand the data quality will mean the harder planning decisions can be made earlier about how much to scope in, what data cleanup is required, how much time to budget for testing based on the data, etc..

Read more about data profiling on our site at http://qvikly.com/data_profiling

Like what you read? Give Rajat Gupta a round of applause.

From a quick cheer to a standing ovation, clap to show how much you enjoyed this story.