Scale Data Quality effortlessly on Google Cloud: Building a federated DQ framework empowered by Dataplex AutoDQ and BigQuery

Mansi Maharana
Google Cloud - Community
10 min read · May 8, 2024

The realm of data management is rapidly evolving, and enterprises are grappling with the challenge of ensuring data quality at scale. Data quality is the cornerstone of reliable information: it minimizes data incidents and supports regulatory compliance. The Data Mesh approach, a recently adopted paradigm in data management, emphasizes uniform data quality as a crucial factor for success. However, many organizations struggle to achieve consistent data quality because siloed data, fragmented implementations, and a lack of self-service infrastructure impede data quality efforts at scale.

Dataplex AutoDQ addresses these challenges by offering a comprehensive solution that integrates with BigQuery and other Google Cloud services, providing a centralized, scalable, and self-service data quality framework. BigQuery’s scalable storage and its federation capability, which connects to external data sources, make it a natural foundation for the framework. Fig. 1 below outlines AutoDQ’s capabilities and features.

Fig 1: Dataplex Auto DQ Features

This framework enables each domain to define data quality objectives and measurable methodologies, and it offers the following key features:

  • Centralized Management: Consistent data quality standards and measurement methodologies are established for the organization.
  • Self-Service Capabilities: Teams can independently define and manage their data quality processes and metrics. This eliminates the need for each team to maintain its own infrastructure or processes for data quality management.
  • Scalability: The framework can accommodate growth and the integration of new data sources.

In this article, we will delve into the technical aspects of a federated and scalable data quality framework powered by Dataplex AutoDQ and BigQuery, in conjunction with other Google Cloud services.

The Federated DQ Framework: Balance is a must

A federated DQ framework can provide a solution by empowering organizations to manage DQ in a centralized and consistent manner, while still allowing for domain-specific autonomy.

A key aspect of a federated DQ framework is the balance between centralization and autonomy. On the one hand, it is important to have a centralized infrastructure and service to ensure consistency and prevent data quality silos. On the other hand, it is equally important to empower domain teams with the autonomy to define and manage DQ assessments specific to their domain.

Fig 2: Federated DQ Framework Conceptual Architecture

Centralized Infrastructure and Service

The centralized infrastructure and service should provide core DQ-as-a-service (DQaaS) functionalities such as data profiling and analysis, data cleansing and transformation, and data quality monitoring and reporting. This centralized infrastructure should be managed by a dedicated data quality team that is responsible for setting and enforcing data quality standards across the organization.
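As an illustration of how a central team might expose such a service, the following sketch wraps the Dataplex DataScan API in a small helper that provisions a data profiling scan for any BigQuery table a domain team registers. It is a minimal sketch, assuming the google-cloud-dataplex Python client and illustrative project, location, and table names; it is not the only way to build a DQaaS layer.

```python
from google.cloud import dataplex_v1


def create_profile_scan(project_id: str, location: str, dataset: str, table: str,
                        scan_id: str) -> None:
    """Provision a Dataplex data profiling scan for a BigQuery table (illustrative helper)."""
    client = dataplex_v1.DataScanServiceClient()

    scan = dataplex_v1.DataScan()
    # Point the scan at the BigQuery table using its full resource name.
    scan.data.resource = (
        f"//bigquery.googleapis.com/projects/{project_id}"
        f"/datasets/{dataset}/tables/{table}"
    )
    # An empty DataProfileSpec requests profiling of all columns.
    scan.data_profile_spec = dataplex_v1.DataProfileSpec()

    operation = client.create_data_scan(
        parent=f"projects/{project_id}/locations/{location}",
        data_scan=scan,
        data_scan_id=scan_id,
    )
    print("Created scan:", operation.result().name)


# Hypothetical usage by the central DQaaS team:
# create_profile_scan("central-dq-project", "us-central1", "sales", "orders", "orders-profile")
```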

Domain-Owned Data Assessments

Domain teams should be empowered to define and manage DQ assessments specific to their domain. This allows them to tailor DQ rules to their unique needs while still adhering to overall data quality standards. Domain teams should also be responsible for monitoring and remediating data quality issues, and they should be measured against the corresponding KPIs.

Why This Balance Matters

Inconsistency in data quality across domains can cripple effective data management and hinder organizational goals. A centralized yet democratized Data Quality Framework addresses this by:

  • Establishing Common Data Quality Metrics
  • Streamlining Data Quality Management
  • Supporting Federated Rule Management
  • Ensuring Scalability and Collaboration
  • Providing AI Assistance for DQ Rule Recommendation and Binding

This balanced approach entrusts domain teams with ownership while reducing administrative burden and infrastructure complexity, and it ensures consistent data quality throughout the organization.

The Federated DQ Framework: Reference Architecture

Fig 3: Federated DQ Framework Technical Reference Architecture

Fig 3 illustrates the technical end-to-end architecture of the federated Data Quality framework and incorporates the following components:

  1. Defining/developing DQ rules: Dataplex provides multiple interfaces to help data product owners define and modify data quality (DQ) rules. DQ rules can be specified using JSON or YAML files, enabling users to focus on formalizing validation rules without writing code. Users can also define reusable rules based on a standard template. Currently, BigQuery Machine Learning (BQML) models can be embedded within DQ rules; for instance, anomaly detection BQML models can be integrated into a rule. Additionally, rules can be made reusable by defining common rules as user-defined functions (UDFs). In the future, AI will assist with advanced rule recommendations. Learn more about rule definitions here. (A minimal rule-definition sketch follows this list.)
  2. Review & approval: Data stewards ensure that DQ rules meet organizational requirements, especially for critical and sensitive data elements. They verify that the required DQ rules are present, properly defined, and aligned with best practices. In the future, AI can assist with this validation. Existing review and approval processes can be leveraged for this purpose.
  3. DQ-as-code: Data quality configurations can be managed as code using Terraform, GitHub, and Cloud Build. This enables version control and seamless promotion of rules between environments. You can learn more about it here.
  4. Data store: Automatic data quality (AutoDQ) is supported for BigQuery tables and views, as well as BigLake and external tables created on Google Cloud Storage (GCS) data. BigQuery’s extensive capabilities as a data repository allow it to connect efficiently to external data sources, making data movement unnecessary and facilitating seamless interoperability. Through BigLake, data quality rules can also be executed against various storage formats such as Parquet, Avro, and Iceberg.
  5. Orchestration: You can schedule data quality checks through the serverless scheduler in Dataplex, or invoke the Dataplex API from external schedulers such as Cloud Composer for pipeline integration. The Composer DAG, which can be created as part of the framework, orchestrates the data quality execution and the subsequent tagging workflow based on the schedule and dependency details supplied by the data product owner. Because a Dataplex DQ job can easily be an upstream or downstream dependency of a data engineering job, incorporating it into your data pipelines is straightforward. Note that Composer itself can be used as a federated component: you can use either a central operations Composer instance or a domain-specific one. (A sample DAG sketch follows this list.)
  6. Data quality execution engine: Dataplex DQ engines execute data quality rules against data in Google Cloud Storage and BigQuery (and can extend to other data sources through federation), requiring no infrastructure setup. They are fully managed, serverless, and support incremental checks.
  7. Incident management & actions: Dataplex DQ logs are forwarded to Cloud Logging, and failed checks can be monitored with Cloud Monitoring. Notification channels alert the relevant parties through email, pagers, Slack, or webhooks. Additional actions, such as filing and assigning bugs, can also be automated. Learn more about monitoring and alerting here. (A log-filtering sketch follows this list.)
  8. Analysis & reporting: Data quality results generated by the Data Quality service can be published to BigQuery directly or extracted using the API. This enables the organization to democratize this data, making it available for programmatic reporting and further analysis, such as time-series analysis. Additionally, column- and row-level security controls can be implemented at the domain level to ensure appropriate access control. (A reporting query sketch follows this list.)
  9. Actionable dashboard: You can provide an actionable dashboard to all end users so they can monitor and assess data quality across multiple dimensions. Any visualization tool can be used on the DQ results stored in BigQuery. Google Cloud also offers Looker Studio, which is a simpler, free option you can try at any time.
Example 1: Data quality statistics as part of your centralized data products dashboard.
Example 2: Data quality statistics can provide more detailed, time-series-based insights.

  10. Auto tagging for discovery: Data quality results can be published to the catalog via AutoDQ, and indexed custom tags can be attached to datasets. This facilitates the search for and discovery of high-quality data products. Incorporating data quality metrics into metadata helps foster trust in data and enables users to consume it with confidence and effortless searchability. (A tagging sketch follows below.)

DQ tag displaying overall and dimension-specific quality scores
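The sketches below illustrate a few of the components above. To make component 1 (and the execution engine in component 6) more concrete, here is a minimal sketch of defining two DQ rules and attaching them to an AutoDQ scan with the google-cloud-dataplex Python client. The project, table, and column names are illustrative, and the same rules could equally be expressed in YAML as described above.

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataScanServiceClient()

# Two illustrative rules: a NOT NULL check and a range check on an "amount" column.
rules = [
    dataplex_v1.DataQualityRule(
        column="order_id",
        dimension="COMPLETENESS",
        non_null_expectation=dataplex_v1.DataQualityRule.NonNullExpectation(),
        threshold=1.0,  # 100% of rows must pass
    ),
    dataplex_v1.DataQualityRule(
        column="amount",
        dimension="VALIDITY",
        range_expectation=dataplex_v1.DataQualityRule.RangeExpectation(
            min_value="0", max_value="100000"
        ),
        threshold=0.99,  # allow up to 1% outliers
    ),
]

scan = dataplex_v1.DataScan(
    data=dataplex_v1.DataSource(
        resource="//bigquery.googleapis.com/projects/sales-domain/datasets/sales/tables/orders"
    ),
    data_quality_spec=dataplex_v1.DataQualitySpec(rules=rules),
)

client.create_data_scan(
    parent="projects/sales-domain/locations/us-central1",
    data_scan=scan,
    data_scan_id="orders-dq",
)
```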
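For component 5, the sketch below shows what a domain team’s Composer (Airflow) DAG might look like when a Dataplex DQ check gates a downstream job. The operator comes from the Google Airflow provider package, and the scan and project identifiers are illustrative; check the Airflow and provider versions in your Composer environment, since operator availability and parameters vary.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.google.cloud.operators.dataplex import (
    DataplexRunDataQualityScanOperator,
)

with DAG(
    dag_id="orders_dq_gate",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",  # daily, after the nightly load
    catchup=False,
) as dag:
    # Trigger the AutoDQ scan and wait for the result; operator flags can
    # additionally make the task fail when DQ rules fail.
    run_dq_scan = DataplexRunDataQualityScanOperator(
        task_id="run_orders_dq_scan",
        project_id="sales-domain",
        region="us-central1",
        data_scan_id="orders-dq",
    )

    # Placeholder for the downstream job that should only run on healthy data.
    publish_data_product = EmptyOperator(task_id="publish_data_product")

    run_dq_scan >> publish_data_product
```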
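For component 7, the following sketch lists recent Dataplex data scan log entries from Cloud Logging so they can be routed into an incident workflow. The google-cloud-logging client call is standard, but the filter string (resource type and fields) is an assumption for illustration; verify the exact names against your own log entries or the Dataplex logging documentation.

```python
from datetime import datetime, timedelta, timezone

from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="sales-domain")

# Illustrative filter: Dataplex data-scan job logs from the last 24 hours.
# The resource type shown here is an assumption and should be verified.
cutoff = (datetime.now(timezone.utc) - timedelta(days=1)).isoformat()
log_filter = (
    f'resource.type="dataplex.googleapis.com/DataScan" AND timestamp>="{cutoff}"'
)

for entry in client.list_entries(
    filter_=log_filter, order_by=cloud_logging.DESCENDING, max_results=20
):
    print(entry.timestamp, entry.payload)
```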
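For components 8 and 9, once scan results are exported to BigQuery, reporting becomes a plain SQL problem. The sketch below computes a per-scan pass rate over the last 30 days; the export dataset, table, and column names are hypothetical stand-ins, so map them to the schema your export actually produces.

```python
from google.cloud import bigquery

client = bigquery.Client(project="central-dq-project")

# Hypothetical export table and columns; adjust to the schema of your DQ results export.
query = """
SELECT
  data_quality_scan_id,
  DATE(job_start_time) AS run_date,
  COUNTIF(rule_passed) / COUNT(*) AS pass_rate
FROM `central-dq-project.dq_results.rule_level_results`
WHERE job_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY data_quality_scan_id, run_date
ORDER BY run_date DESC
"""

for row in client.query(query).result():
    print(row.data_quality_scan_id, row.run_date, round(row.pass_rate, 3))
```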
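Component 10 can be handled automatically by AutoDQ publishing to the catalog, but if you want to attach your own quality-score tag to a catalog entry, a sketch along these lines with the Data Catalog Python client is one option. The tag template, field names, score values, and table resource below are illustrative assumptions, and the template is assumed to already exist.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Find the catalog entry for the BigQuery table (illustrative resource name).
entry = client.lookup_entry(
    request={
        "linked_resource": (
            "//bigquery.googleapis.com/projects/sales-domain/datasets/sales/tables/orders"
        )
    }
)

# Attach a quality-score tag; assumes a "data_quality" tag template with these fields exists.
tag = datacatalog_v1.Tag(
    template="projects/central-dq-project/locations/us-central1/tagTemplates/data_quality",
    fields={
        "overall_score": datacatalog_v1.TagField(double_value=0.97),
        "completeness_score": datacatalog_v1.TagField(double_value=0.99),
    },
)
client.create_tag(parent=entry.name, tag=tag)
```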

The Federated DQ Framework: Key Benefits

  • Improved data quality: A federated DQ framework helps improve data quality by ensuring that all data sources are subject to the same DQ standards. This leads to improved decision-making and better business outcomes.
  • Reduced costs: A federated DQ framework helps reduce costs by eliminating the need for duplicate DQ tools and processes. This also improves operational efficiency.
  • Increased agility: A federated DQ framework helps increase agility by enabling organizations to respond quickly to changing data quality requirements. This can give organizations a competitive advantage in today’s fast-paced business environment.

The Federated DQ Framework: Guiding tenets

As you start adopting a federated data quality approach, here are a few guiding principles to abide by:

  • Create a unified data quality framework: This framework serves as a guiding principle for all data initiatives within the organization and provides a consistent approach to data quality measurement, ensuring that all data is evaluated using the same standards and metrics.
  • Objectively measure data quality: Data quality should be measured objectively using quantifiable metrics such as accuracy, completeness, consistency, and timeliness. This ensures that data quality is not based on subjective opinions or perceptions.
  • Align data quality with business objectives: Data quality objectives should be aligned with the organization’s overall business goals and objectives. This ensures that data quality efforts are focused on improving the areas that matter most to the business.
  • Clearly communicate data quality objectives and metrics: Data quality objectives and metrics should be clearly communicated to all stakeholders, including data producers, data consumers, and data stewards. This ensures that everyone is aware of the expectations for data quality and can take appropriate actions to meet them.
  • Provide training on data quality: Data producers and consumers should be given training on data quality best practices, covering topics such as data cleansing, data validation, and data standardization.
  • Define clear data quality scores for each data subject or domain: Data quality scores should be defined for each data subject or domain within the scope of the data mesh, based on the data quality objectives and metrics that have been established.
  • Use data quality scores to monitor progress: Data quality scores can be used to monitor progress over time and identify areas where improvement is needed. This information can be used to make informed decisions about data quality initiatives.
  • Central operations should provide a standard, reusable, rules-based template: This template allows domain teams to define customized data quality rules. It should include a set of common data quality rules that can be applied to all data, as well as a mechanism for defining custom rules for specific data domains (see the sketch after this list).
  • Enable domain teams to define customized rules: Domain teams should be able to define customized data quality rules that are specific to their needs. This flexibility is important for ensuring that data quality rules are tailored to the unique characteristics of each data domain.
  • Each critical data element or field should have its own set of data quality rules: These rules can be applied to individual rows or summarized at the table level, and they should be designed to ensure that data is accurate, complete, consistent, and timely.
  • Data quality rules should be enforced at the point of data entry: Enforcing rules at the point of entry prevents low-quality data from entering the system. This can be done through a variety of methods, such as data validation checks, data cleansing routines, and data standardization procedures.
  • Aim for high-performing data stewardship within the organization: High-performing data stewardship is essential for maintaining data quality over time. Data stewards should be responsible for monitoring data quality, identifying and resolving data quality issues, and promoting a culture of data quality within their organization.
  • Train data stewards per domain: Data stewards should be trained on data quality best practices and the specific data quality requirements of their domain. This training enables data stewards to manage data quality effectively within their domain.
  • Central operations should work with domain teams to identify applicable data quality metrics: This collaboration ensures that data quality efforts are focused on the metrics that are most relevant to each team’s business needs.
  • Support domain teams in defining rules for establishing a data quality baseline: Central operations should support domain teams in defining the rules that will be used to establish a data quality baseline, for example by providing guidance on best practices and developing tools and templates to help domain teams define their rules.
  • Facilitate the adoption of data quality improvements: Central operations should facilitate the adoption of data quality improvements by providing training, resources, and support to domain teams, helping them implement improvements quickly and effectively.
  • Data quality is an ever-evolving process: Data quality is influenced by changes in the data landscape, so it is important to adopt an iterative approach to data quality management.
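To illustrate the reusable, rules-based template mentioned above, here is a minimal sketch of how central operations might ship a baseline rule set that domain teams extend with their own rules before creating a scan. The helper name, baseline rules, and column names are illustrative assumptions rather than a prescribed standard.

```python
from google.cloud import dataplex_v1


def baseline_rules(key_column: str) -> list[dataplex_v1.DataQualityRule]:
    """Common rules every data product starts from (illustrative baseline)."""
    return [
        # The primary key must never be null ...
        dataplex_v1.DataQualityRule(
            column=key_column,
            dimension="COMPLETENESS",
            non_null_expectation=dataplex_v1.DataQualityRule.NonNullExpectation(),
            threshold=1.0,
        ),
        # ... and must be unique.
        dataplex_v1.DataQualityRule(
            column=key_column,
            dimension="UNIQUENESS",
            uniqueness_expectation=dataplex_v1.DataQualityRule.UniquenessExpectation(),
        ),
    ]


# A domain team extends the baseline with a domain-specific rule.
rules = baseline_rules("order_id") + [
    dataplex_v1.DataQualityRule(
        column="currency",
        dimension="VALIDITY",
        set_expectation=dataplex_v1.DataQualityRule.SetExpectation(
            values=["USD", "EUR", "GBP"]
        ),
        threshold=1.0,
    )
]
```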

Conclusion

The federated and scalable data quality framework is a powerful tool that can empower enterprises to manage data quality at scale, reduce costs, and increase agility. By leveraging the combined strengths of Dataplex AutoDQ, BigQuery, and other Google Cloud services, organizations can gain a comprehensive understanding of their data landscape, identify and resolve data quality issues efficiently, and drive better decision-making.

What’s next?

If you would like to learn more, contact me at manaswini.maharana@gmail.com.
