Building High-Quality Data Pipelines With Collibra DQ and Cloudera DataFlow

Milind Pandit
4 min read · May 22, 2023


Introduction

Data quality is critical for any organization that works with data: it directly affects business performance, decision-making, and customer experience. Building data quality pipelines helps ensure that data is accurate, complete, and consistent. In this article, we discuss the business value of data quality pipelines built with tools such as Cloudera DataFlow, powered by Apache NiFi, and Collibra Data Quality.

The Importance of Data Quality Pipelines

A data quality pipeline is a process for collecting, processing, and analyzing data to ensure that it meets data quality criteria. Data engineers build the infrastructure required to deliver reliable, high-quality data to data consumers. As data quality fundamentals shift with data volume, sources, storage, and the desired state, data engineers find it challenging to keep data pipelines and data products healthy. Data observability monitors the quality and reliability of data pipelines and helps remediate anomalies rapidly to deliver reliable, trusted data products. The following are some scenarios where Collibra Data Quality & Observability and Cloudera DataFlow help deliver data health:

  • Delivering trusted data with auto-generated rules
  • Managing data lake health efficiently
  • Accelerating cloud data migration

Building High-Quality Data Pipelines

Investing in data quality pipelines built with tools like Collibra Data Quality and Cloudera DataFlow helps ensure that data is of high quality, which leads to better business performance, improved decision-making, and enhanced customer experiences. Predictive, self-service data quality is one use case that demonstrates the combined power of Cloudera DataFlow and Collibra Data Quality.

Use Case Scenario

The use case combines Cloudera DataFlow and Collibra Data Quality to build a predictive, self-service data quality pipeline. Event triggers fire when data arrives in cloud-native storage or JDBC data sources. Data quality scans run on the newly arrived data, and dynamic decisions are made based on the data quality score: if the score is good, the data is moved to the landing zone; if it is low, a Slack message is sent with detailed data quality results. This pattern can scale to thousands of datasets at big data scale.

Collibra Data Quality API

Collibra Data Quality is integrated with Cloudera DataFlow via its REST API. Collibra Data Quality exposes a product API, which is the official and supported way for end users to interact with the platform. The API is used to authenticate against the DQ instance, run a DQ scan on a specific dataset, wait for job completion, and fetch the DQ scan results.

https://productresources.collibra.com/docs/collibra/latest/Content/DataQuality/DQApis/to_rest-apis.htm

The following APIs are used in the demo.

DQ instance authentication API to get a bearer token:

http://host:<port>/auth/signin
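As a rough illustration, here is a minimal Python sketch of this call using the requests library; the host, port, credentials, and the "token" response field are assumptions for illustration, not values confirmed by the Collibra docs:

import requests

CDQ_HOST = "http://host:9000"  # hypothetical host and port

# Post credentials to obtain a bearer token for the subsequent API calls.
# The "username"/"password" body fields and the "token" response field
# are assumptions based on typical signin endpoints.
resp = requests.post(
    f"{CDQ_HOST}/auth/signin",
    json={"username": "dq_user", "password": "dq_password"},
)
resp.raise_for_status()
token = resp.json()["token"]
headers = {"Authorization": f"Bearer {token}"}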

Run a DQ scan on a specific dataset (runs a Spark job or a pushdown job):

http://host:<port>/v3/jobs/run?dataset={}&runDate={}&agentName={}
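Continuing the sketch above (reusing CDQ_HOST and headers), a hedged example of triggering a scan; the HTTP method, the sample parameter values, and the "jobId" response field are assumptions:

# Trigger a DQ scan; dataset, runDate, and agentName are passed as
# query parameters per the endpoint above. The sales dataset comes
# from the demo; POST and the "jobId" field are assumptions.
resp = requests.post(
    f"{CDQ_HOST}/v3/jobs/run",
    params={"dataset": "sales", "runDate": "2023-05-22", "agentName": "agent1"},
    headers=headers,
)
resp.raise_for_status()
job_id = resp.json()["jobId"]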

Wait for job completion:

http://host:<port>/v3/jobs/{}/waitForCompletion
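Again continuing the sketch, a call that blocks until the job finishes (the HTTP method is an assumption):

# Wait for the scan to complete; the endpoint name suggests it
# returns only once the job has finished.
resp = requests.get(
    f"{CDQ_HOST}/v3/jobs/{job_id}/waitForCompletion",
    headers=headers,
)
resp.raise_for_status()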

Get the DQ scan results (JSON payload):

http://host:<port>/v3/jobs/{}/findings
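And a sketch of retrieving the results; the structure of the JSON payload and the "score" field name are assumptions:

# Fetch the scan results as JSON. The "score" field name is an
# assumption; Collibra DQ returns a detailed findings payload.
resp = requests.get(
    f"{CDQ_HOST}/v3/jobs/{job_id}/findings",
    headers=headers,
)
resp.raise_for_status()
findings = resp.json()
score = findings.get("score")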

Cloudera DataFlow Details

Cloudera DataFlow is designed to accelerate flow development, and its interactive test session makes it extremely easy to build and unit test flows quickly. The demo flow leverages various out-of-the-box (OOTB) processors and integrates with Collibra Data Quality at various stages of DQ scanning. On successful execution of the DQ job, it extracts the DQ results and makes a dynamic decision: good data is pushed to an S3 bucket, while a low DQ score sends the results to a Slack channel.

The flow uses the following processors (a Python sketch of the routing decision follows this list):

  • GenerateFlowFile (GoodDQDemo & LowDQDemo): sample flow event generators that trigger the flow every 15 minutes. GoodDQDemo runs the sales dataset from a Postgres DB, and LowDQDemo runs the patient dataset.
  • InvokeHTTP (CollibraDQAuth, RunDQScan, GetDQScanResults)
  • EvaluateJsonPath (ExtractAuthToken, ExtractDQScore)
  • UpdateAttribute
  • RouteOnAttribute
  • ExtractText
  • PutS3Object
  • PutSlack
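To make the decision step concrete, here is a minimal Python analogue of ExtractDQScore followed by RouteOnAttribute; the 75-point threshold and the helper functions standing in for PutS3Object and PutSlack are hypothetical, since the article only distinguishes good from low scores:

GOOD_SCORE_THRESHOLD = 75  # hypothetical cutoff; the demo only labels scores good or low

def put_s3_object(payload: bytes) -> None:
    # Stand-in for the PutS3Object processor: land good data in S3.
    print(f"Uploading {len(payload)} bytes to the landing-zone bucket")

def put_slack(message: str) -> None:
    # Stand-in for the PutSlack processor: alert the channel with DQ results.
    print(f"Posting to Slack: {message}")

def route_on_score(score: float, payload: bytes, findings: dict) -> str:
    # Mirrors RouteOnAttribute: good data goes to S3, low scores go to Slack.
    if score >= GOOD_SCORE_THRESHOLD:
        put_s3_object(payload)
        return "good"
    put_slack(f"Low DQ score {score}: {findings}")
    return "low"

route_on_score(score=95.0, payload=b"good rows", findings={})  # routes to S3
route_on_score(score=40.0, payload=b"bad rows", findings={})   # alerts Slack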

DataFlow Template:

(Screenshots: the Cloudera DataFlow canvas, the Collibra DQ results, and the DQ Slack alert.)

Conclusion

In conclusion, building data quality pipelines with tools like Collibra Data Quality and Cloudera DataFlow helps ensure that data is accurate, complete, and consistent, which leads to better business performance, improved decision-making, and enhanced customer experiences. Data observability monitors the quality and reliability of data pipelines and helps remediate anomalies rapidly to deliver reliable, trusted data products. The predictive, self-service data quality use case shown here demonstrates the power of combining Cloudera DataFlow and Collibra Data Quality. By leveraging these tools, organizations can ensure their data is of high quality, with a positive impact on business outcomes.

https://www.collibra.com/us/en/products/data-quality-and-observability
