The Medallion Architecture

Omar LARAQUI
Sep 7, 2022

Data is a hot topic in the business world. Everyone wants to talk about the insights and value they can derive from data. There’s a good reason for that: data is one of the most valuable resources available to today’s companies.

🧞 Who Rules The Data, Rules The World.

As data volumes grow, data processing and analysis become more and more time-consuming. Companies are looking beyond traditional data architectures to meet their on-demand analytical needs.

Databricks tackles the problem by combining the Delta Lake framework with the Medallion Architecture.

The Delta Lake framework

Delta Lake is an open-source project that enables building a Lakehouse architecture on top of data lakes. It provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing on top of existing data lakes such as S3, ADLS, GCS, and HDFS.

Specifically, Delta Lake offers:

  • ACID transactions on Spark, with serializable isolation levels that ensure readers never see inconsistent data.
  • Scalable metadata handling, which leverages Spark’s distributed processing power to manage table metadata.
  • Streaming and batch unification: a table in Delta Lake is a batch table as well as a streaming source.
  • Schema enforcement, which automatically handles schema variations to prevent bad records from being inserted during ingestion.
  • Time travel: data versioning enables rollbacks and full historical audit trails.
  • Merge, update, and delete operations, which enable complex use cases such as change data capture, slowly changing dimensions, and streaming upserts.
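To make the last two points more concrete, here is a minimal sketch in plain Python of what "merge" (upsert) and "time travel" mean for a table. It is only a toy model of the idea, not the Delta Lake API: the `VersionedTable` class, its `merge` and `as_of` methods, and the sample customer rows are all hypothetical names invented for this illustration.

```python
from copy import deepcopy


class VersionedTable:
    """Toy in-memory table that keeps one snapshot per committed version.

    Hypothetical helper for illustration only; not part of Delta Lake.
    """

    def __init__(self, key):
        self.key = key
        # Version 0 is the empty table; every merge appends a new snapshot.
        self._snapshots = [{}]

    def merge(self, rows):
        """Upsert rows by key and commit the result as a new version."""
        current = deepcopy(self._snapshots[-1])
        for row in rows:
            k = row[self.key]
            # Matched keys are updated, unmatched keys are inserted.
            current[k] = {**current.get(k, {}), **row}
        self._snapshots.append(current)
        return len(self._snapshots) - 1

    def as_of(self, version):
        """Read the table as it was at a given version (time travel)."""
        snapshot = self._snapshots[version]
        return [deepcopy(row) for _, row in sorted(snapshot.items())]


customers = VersionedTable(key="id")
v1 = customers.merge([{"id": 1, "city": "Paris"}, {"id": 2, "city": "Rabat"}])
v2 = customers.merge([{"id": 2, "city": "Lyon"}, {"id": 3, "city": "Oslo"}])

# The latest version holds the upserted rows; the older version is still readable.
latest_rows = customers.as_of(v2)
previous_rows = customers.as_of(v1)
```

Reading `previous_rows` after the second merge is what a rollback or an audit relies on: earlier versions stay intact while new ones are committed. This toy copies whole snapshots to stay short; it says nothing about how Delta Lake stores versions internally.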

The Medallion Architecture

Creating a multi-layer lakehouse allows companies to improve data quality from one layer to the next while fulfilling their business needs. Unstructured and dirty data is ingested through a scalable and secure pipeline that outputs high-quality, enriched data. From bronze to gold, data is collected, cleaned, enhanced, and aggregated to give the most valuable insights to the business. It’s a massive improvement over traditional data architectures.

Let’s jump into the architecture and look at each stage in detail.

Figure: The Medallion Architecture

🥉 The bronze layer

This layer contains raw data ingested from various sources. Whether the sources are batch or streaming, the layer stores their data in a distributed file store that can hold high volumes of large files in various formats. Data may be timestamped in this layer or saved as is.

Figure: The Bronze Layer
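As a rough illustration of the bronze step, the sketch below wraps each incoming payload with ingestion metadata and otherwise leaves it untouched. It uses plain Python lists instead of a distributed file store, and the function name `ingest_to_bronze`, the field names, and the `orders_api` source are assumptions made up for the example.

```python
from datetime import datetime, timezone


def ingest_to_bronze(raw_lines, source, now=None):
    """Store each raw payload as is, adding only ingestion metadata.

    Hypothetical helper: nothing is parsed, validated, or dropped here.
    """
    ingested_at = (now or datetime.now(timezone.utc)).isoformat()
    return [
        {"source": source, "ingested_at": ingested_at, "raw": line}
        for line in raw_lines
    ]


# A batch with one valid JSON payload and one malformed line.
raw_batch = [
    '{"order_id": 1, "amount": "19.90", "country": "FR"}',
    "not-json",
]
bronze = ingest_to_bronze(raw_batch, source="orders_api")
```

Note that the malformed line is kept. In this sketch, bronze is a faithful copy of what arrived, and deciding what counts as "bad" is left to the next layer.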

🥈 The silver layer

Getting the most value out of data means cleaning and enriching it. In this layer, business logic is applied to clean the data, join it with lookup tables, replace null values, and filter records before the enriched data is saved.

Figure: The Silver Layer
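The four operations named above (clean, join, replace nulls, filter) can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the `bronze_to_silver` function, the `COUNTRY_LOOKUP` table, the order fields, and the choice to replace a missing amount with `0.0` are hypothetical business rules, not something prescribed by the architecture.

```python
import json

# Hypothetical lookup table mapping country codes to names.
COUNTRY_LOOKUP = {"FR": "France", "MA": "Morocco"}


def bronze_to_silver(bronze_rows, lookup):
    """Parse raw bronze payloads and apply simple cleaning rules."""
    silver = []
    for row in bronze_rows:
        try:
            record = json.loads(row["raw"])
        except json.JSONDecodeError:
            continue  # clean: skip payloads that are not valid JSON
        if not isinstance(record, dict) or record.get("order_id") is None:
            continue  # filter: an order without an id is not usable
        amount = record.get("amount")
        silver.append({
            "order_id": record["order_id"],
            # replace nulls: assume a missing amount means 0.0
            "amount": float(amount) if amount is not None else 0.0,
            # join: enrich the country code from the lookup table
            "country": lookup.get(record.get("country"), "Unknown"),
        })
    return silver


bronze_rows = [
    {"raw": '{"order_id": 1, "amount": "19.90", "country": "FR"}'},
    {"raw": '{"order_id": 2, "amount": null, "country": "MA"}'},
    {"raw": '{"order_id": null, "amount": "5.00", "country": "FR"}'},
    {"raw": "not-json"},
]
silver = bronze_to_silver(bronze_rows, COUNTRY_LOOKUP)
```

Of the four bronze rows, only the first two survive: the third has no order id and the fourth is not valid JSON.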

🥇 The gold layer

Insights are valuable when they answer business questions. This layer aggregates data from the silver tables and serves it to BI tools, ad hoc reporting, and machine learning applications.

Figure: The Gold Layer
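A gold table is typically a small, aggregated view built for one business question. The sketch below answers a made-up one, "how many orders and how much revenue per country?", using plain Python; the `silver_to_gold` function and the sample rows are hypothetical.

```python
from collections import defaultdict


def silver_to_gold(silver_rows):
    """Aggregate cleaned orders into one summary row per country."""
    totals = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for row in silver_rows:
        bucket = totals[row["country"]]
        bucket["orders"] += 1
        bucket["revenue"] += row["amount"]
    # Sort by country so the output order is stable for reporting.
    return [
        {"country": country, "orders": t["orders"], "revenue": round(t["revenue"], 2)}
        for country, t in sorted(totals.items())
    ]


silver_rows = [
    {"order_id": 1, "amount": 20.0, "country": "France"},
    {"order_id": 2, "amount": 5.0, "country": "Morocco"},
    {"order_id": 3, "amount": 10.5, "country": "France"},
]
gold = silver_to_gold(silver_rows)
```

The result has one row per country, which is the shape a dashboard or a feature table would read directly instead of scanning every order.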

As data volume and variety continue to rise, it is crucial to bring reliability and improved performance to data lakes by providing ACID transactions and unifying streaming and batch processing on top of existing cloud data stores. As connectors are built for the most popular compute engines and technologies, the appeal of Delta Lake will continue to increase, driving more growth in the community and rapid adoption of the technology across the most innovative and largest enterprises in the world.

Enjoy learning 💡

