Data Quality in the Lakehouse

Data Engineering is all about Data Quality Management

Frank Munz
Google Cloud - Community
2 min read · Apr 29, 2022

My From Zero to Hero article about the Databricks Lakehouse on GCP turned out to be more successful than I ever expected. So here is another, more conceptual article about data quality that applies to the Databricks Lakehouse on any cloud.

The major goal of modern data engineering is to distill data with a quality that is good enough for downstream analytics and AI. The role of the data engineer is to build and run the machinery that creates this high-fidelity data product, all the way from ingestion to monetization. Data quality and the Lakehouse are deeply interwoven. Within the Lakehouse, data quality is achieved on several levels:

  • On a technical level, data quality is guaranteed by enforcing and evolving schemas for data storage (such as Delta table schema enforcement and schema evolution) and ingestion (for example, Auto Loader with schema detection).
  • On an architectural level, data quality is often achieved by implementing a medallion architecture. A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture, e.g. from Bronze to Silver to Gold layer tables.
    [Figure: Databricks Medallion Architecture]
  • Data flow pipelines in the medallion architecture can use expectations to enforce and observe data quality.
  • Data governance implemented by the Databricks Unity Catalog comes with robust data quality management with built-in quality controls, testing, monitoring, and enforcement to ensure accurate and useful data is available for downstream BI, analytics, and machine learning workloads.
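
To make the first three levels a bit more tangible, here is a minimal sketch in plain Python. It deliberately does not use the Delta Lake, Auto Loader, or Delta Live Tables APIs. The schema, the rule names, and the functions enforce_schema, apply_expectations, and run_pipeline are all hypothetical and exist only for this illustration. The idea it shows: rows that violate the schema never enter the Bronze layer, expectations drop and count bad rows on the way to Silver, and Gold holds a business-level aggregate.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical schema for this illustration: column name -> required type.
SCHEMA = {"order_id": int, "amount": float, "country": str}


def enforce_schema(row: dict) -> dict:
    """Reject rows whose columns or types differ from SCHEMA."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(row)}")
    for column, expected_type in SCHEMA.items():
        if not isinstance(row[column], expected_type):
            raise TypeError(f"{column} must be {expected_type.__name__}")
    return row


@dataclass
class Expectation:
    """A named row-level rule, loosely modeled on pipeline expectations."""
    name: str
    predicate: Callable[[dict], bool]


def apply_expectations(rows, expectations):
    """Keep rows that pass every rule and count the failures per rule."""
    failures = {e.name: 0 for e in expectations}
    kept = []
    for row in rows:
        failed = [e.name for e in expectations if not e.predicate(row)]
        for name in failed:
            failures[name] += 1
        if not failed:
            kept.append(row)
    return kept, failures


def run_pipeline(raw_rows):
    # Bronze: rows as ingested, checked only against the schema.
    bronze, rejected = [], 0
    for row in raw_rows:
        try:
            bronze.append(enforce_schema(row))
        except (ValueError, TypeError):
            rejected += 1

    # Silver: cleaned rows; violating rows are dropped and counted.
    silver, failures = apply_expectations(bronze, [
        Expectation("positive_amount", lambda r: r["amount"] > 0),
        Expectation("known_country", lambda r: r["country"] in {"DE", "US"}),
    ])

    # Gold: business-level aggregate, here revenue per country.
    gold = {}
    for row in silver:
        gold[row["country"]] = gold.get(row["country"], 0.0) + row["amount"]
    return gold, failures, rejected


if __name__ == "__main__":
    sample = [
        {"order_id": 1, "amount": 10.0, "country": "DE"},
        {"order_id": 2, "amount": -5.0, "country": "US"},   # fails positive_amount
        {"order_id": "3", "amount": 3.0, "country": "US"},  # wrong type for order_id
    ]
    print(run_pipeline(sample))
```

The per-rule failure counts are the "observe" half of expectations: even when bad rows are dropped, the pipeline still reports how many rows each rule rejected, so quality can be tracked over time. In a real Lakehouse these jobs would be handled by the platform features named above rather than by hand-written checks like these.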

Follow me here on Medium and clap for this article if you enjoyed reading it. If you would like more content on cloud-based data science, data engineering, and AI/ML, feel free to follow me on Twitter (or LinkedIn).

Frank Munz
Google Cloud - Community

Cloudy things, large-scale data & compute. Twitter @frankmunz. Former Tech Evangelist @awscloud, now Principal @Databricks. Personal opinions here. #devrel ❤️.