Data Profiling using Dataplex in GCP

Gaurav Khandelwal
Google Cloud - Community
Dec 14, 2022

Data profiling is the process of examining, analyzing, reviewing and summarizing data sets to gain insight into them. It helps you discover and understand the structure, quality and content of your source data. A user or organization can use these insights for faster analysis and decision-making.

Dataplex Data Profile

Data users can now easily understand common characteristics of their data in GCP by using the Data Profile feature of Dataplex. You can analyze the profile of your Dataplex-managed data by configuring and scheduling scans over it. A scan identifies common statistical characteristics of your data, such as the percentage of NULL values, the percentage of unique values, and value distributions.
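To make the NULL % and unique % figures concrete, here is a minimal Python sketch that computes them for a single column. It is only an illustration and not how Dataplex computes its results: the function name, the sample values, and the definition of unique % (distinct non-NULL values divided by all rows) are assumptions made for this example.

```python
from typing import Any, Optional, Sequence


def null_and_unique_percent(values: Sequence[Optional[Any]]) -> dict:
    """Return the share of NULLs and of distinct non-NULL values, in percent."""
    total = len(values)
    if total == 0:
        return {"null_pct": 0.0, "unique_pct": 0.0}
    non_null = [v for v in values if v is not None]
    null_pct = 100.0 * (total - len(non_null)) / total
    # Assumption: unique % = distinct non-NULL values over all rows.
    unique_pct = 100.0 * len(set(non_null)) / total
    return {"null_pct": null_pct, "unique_pct": unique_pct}


# Made-up sample column; None stands in for a NULL.
sample_payment_type = ["Cash", "Cash", None, "Credit Card", "Cash"]
stats = null_and_unique_percent(sample_payment_type)
print(stats)
```

A profile scan reports figures of this kind per column, so you do not have to write such checks by hand.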

A data profiling scan is associated with one BigQuery table and scans the table to generate the profiling results. The results of the scan are available as part of every scan execution.


*Please note: at the time of writing, Dataplex Data Profiling is in public preview with a limited feature set.*

Demo to Create Data Profile Scan

Follow the simple steps below to scan your first BigQuery table in GCP.

Step 1

In Dataplex, you will find a new link labeled Profile (PREVIEW). Click it to create a new data profile scan.

Step 2

Enter the required details, such as the display name and ID.

Step 3

Select the table to be scanned. The BigQuery table must be part of a Dataplex lake.

Step 4

Select the scope as “Entire data” or “Incremental”. You can choose between the “Repeat” and “On-Demand” scheduling options. Running the scan on a schedule keeps the profile up to date as new data arrives.
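If you would rather script these choices than click through the console, the sketch below builds the kind of JSON body a scan definition might use. Treat every field name here (`data.resource`, `dataProfileSpec`, `executionSpec.trigger`, `executionSpec.field`) as an assumption based on my reading of the Dataplex DataScan REST resource, and check it against the current Google documentation before use. The project, dataset, table, and column names are hypothetical placeholders, and the code only builds a dictionary without calling any API.

```python
import json
from typing import Optional


def build_profile_scan_body(table_resource: str,
                            cron: Optional[str] = None,
                            incremental_field: Optional[str] = None) -> dict:
    """Build a request body for a data profile scan (field names are assumptions)."""
    execution_spec: dict = {}
    if cron is None:
        # "On-Demand" option: the scan runs only when triggered manually.
        execution_spec["trigger"] = {"onDemand": {}}
    else:
        # "Repeat" option: the scan runs on a cron schedule.
        execution_spec["trigger"] = {"schedule": {"cron": cron}}
    if incremental_field is not None:
        # "Incremental" scope: column used to pick up only newer rows.
        execution_spec["field"] = incremental_field
    return {
        "data": {"resource": table_resource},
        "dataProfileSpec": {},
        "executionSpec": execution_spec,
    }


# Hypothetical table and column names, for illustration only.
body = build_profile_scan_body(
    "//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/taxi_trips",
    cron="0 6 * * *",
    incremental_field="trip_start_timestamp",
)
print(json.dumps(body, indent=2))
```

Leaving out `cron` gives the on-demand variant, and leaving out `incremental_field` corresponds to the “Entire data” scope.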

Step 5

Once you click the “Create” button, your profile scan is created.

Click the job name to see the details of your profile job, such as the table name and schedule time.

Step 6

Your scan starts at the scheduled time, with its status showing as “Running”. Once it completes, the status changes to “Succeeded” and you can view the job details and scan results.

Data Profiling Results

The results surface useful facts about your data, which speeds up analytics and decision-making. For example, the dataset used in this scan was cab trip data with fields such as company name, fare and payment_type. Below are some observations.

Observation 1: The entire dataset contains data for 3 companies, with 303 Taxi being the leader.

Observation 2: Fares range from a minimum of 0.01 to a maximum of 107.6.

Observation 3: 82.3% of payments are made in cash.

Observation 4: Other statistics are also reported, such as an average trip distance of 3.8 miles.
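The observations above are the kind of figures you could also derive by hand. The following Python sketch shows one way to compute top-value shares and a min/max/average summary for a column. The helper names and the sample values are made up for illustration, so the numbers it prints are not the results of the scan described here.

```python
from collections import Counter
from statistics import mean


def top_value_shares(values, n=3):
    """Return the n most common non-NULL values with their share in percent."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(value, round(100.0 * count / total, 1))
            for value, count in counts.most_common(n)]


def numeric_summary(values):
    """Return min, max and average of the non-NULL numeric values."""
    nums = [v for v in values if v is not None]
    return {"min": min(nums), "max": max(nums), "avg": mean(nums)}


# Made-up sample columns; None stands in for a NULL.
payment_type = ["Cash", "Cash", "Cash", "Cash", "Credit Card"]
fare = [2.0, None, 4.0, 6.0]

shares = top_value_shares(payment_type)
summary = numeric_summary(fare)
print(shares)
print(summary)
```

A profile scan gives you these summaries for every column at once, which is what makes it quicker than writing one query per question.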

Summary:

Data profiling helps you understand your source data and analyze it more effectively. Dataplex data profiling lets you identify common statistical characteristics of the columns of your BigQuery tables.

For more details about this feature and the latest updates, refer to the Google documentation.
