Databricks Medallion Architecture

Ritu Prajapati
3 min read · Oct 14, 2022

A huge amount of data is generated every second, and for many businesses that data is among their most valuable assets.

Managing a history of structured, semi-structured, and unstructured data, and streaming or querying it, keeps getting more complex.

Databricks aims to make this easier. Its Medallion Architecture simplifies storing and querying complex data.

The Medallion Architecture is also known as the Delta Architecture or the Multi-Hop Architecture in Databricks.

Its purpose is to organize data logically in a lakehouse.

It leverages the tight integration between Delta Lake and Spark Structured Streaming to allow efficient incremental data processing.

If you want a fully replayable history of your data in different states, and you want to query it at different levels of refinement, the multi-hop architecture is a good choice. It reduces the manual time spent on hard-to-manage data. It also lets Data Analysts, Data Engineers, and Data Scientists work at different levels and integrate their work easily.

So, let's talk about how it works:

The medallion architecture is divided into three layers:

1. BRONZE LAYER:

It is the first layer, where data is ingested directly from the source (JSON files, RDBMS data, IoT data, etc.). The intention here is to maintain a fully unprocessed history of the data.

  • It is a copy of the raw data, ingested directly.
  • It provides efficient storage and querying.
  • It replaces the traditional data lake.
  • In simple terms, it records what data was loaded into the lakehouse, when, and from where.
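
As a rough illustration of the "what, when, and from where" idea, here is a minimal sketch in plain Python rather than Spark. The function `to_bronze`, the field names `raw`, `source`, and `loaded_at`, and the file name `pos_export.json` are all hypothetical, chosen only for this example. The point is that the Bronze step stores each record untouched and only adds load metadata.

```python
from datetime import datetime, timezone

def to_bronze(raw_lines, source):
    """Wrap each raw line, unparsed, with hypothetical load metadata."""
    loaded_at = datetime.now(timezone.utc).isoformat()  # when it was loaded
    return [
        {"raw": line, "source": source, "loaded_at": loaded_at}
        for line in raw_lines
    ]

# Even a malformed line is kept as-is: Bronze does no validation.
bronze = to_bronze(
    ['{"store": "A", "sales": 10.0, "units": 2}', "not json at all"],
    source="pos_export.json",  # hypothetical source name
)
```

Because nothing is parsed or dropped at this stage, later layers can always be rebuilt from the Bronze rows.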

2. SILVER LAYER:

It is the second layer, where data is ingested from the Bronze layer and given a more refined view. Here we can join fields from various tables to enrich streaming records.

  • It minimizes data storage complexity, latency, and redundancy.
  • Data is clean and has ACID transaction guarantees.
  • It optimizes ETL throughput and analytic query performance.
  • It preserves the grain of the original data (no aggregations).
  • It eliminates duplicate records.
  • It applies data quality checks and quarantines corrupt data.
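
The Silver responsibilities listed above (quality checks, quarantine, de-duplication, same grain) can be sketched in plain Python. This is an illustration under assumptions, not Databricks code: `to_silver`, the `order_id` key, and the rule that `units` must be positive are hypothetical choices made for the example.

```python
import json

REQUIRED = ("order_id", "store", "sales", "units")  # hypothetical schema

def to_silver(bronze_rows):
    """Return (clean, quarantined): one clean record per order_id, no aggregation."""
    clean, quarantined, seen_ids = [], [], set()
    for row in bronze_rows:
        try:
            rec = json.loads(row["raw"])
            if not isinstance(rec, dict) or any(k not in rec for k in REQUIRED):
                raise ValueError("missing required field")
            if rec["units"] <= 0:
                raise ValueError("units must be positive")
        except (ValueError, TypeError) as err:
            # Corrupt data is set aside with the reason, not silently dropped.
            quarantined.append({**row, "error": str(err)})
            continue
        if rec["order_id"] in seen_ids:
            continue  # duplicate record, already kept once
        seen_ids.add(rec["order_id"])
        clean.append(rec)
    return clean, quarantined

bronze_rows = [
    {"raw": '{"order_id": 1, "store": "A", "sales": 10.0, "units": 2}'},
    {"raw": '{"order_id": 1, "store": "A", "sales": 10.0, "units": 2}'},  # duplicate
    {"raw": "not json"},                                                  # corrupt
    {"raw": '{"order_id": 2, "store": "B", "sales": 5.0, "units": 0}'},   # fails check
]
clean, quarantined = to_silver(bronze_rows)
```

Here `clean` keeps one row per order, at the same grain as the source, while the two bad rows land in `quarantined` for later inspection.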

3. GOLD LAYER:

It is the third layer, which contains fully refined data for analysis. Data here provides business-level aggregates, often used for reporting.

  • It holds cleaned data used to create standard summary statistics.
  • The data in this layer is typically aggregated.
  • It reduces strain on production systems.
  • The data is used for ML applications, reporting, dashboards, and ad hoc analytics.
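
To make "business-level aggregation" concrete, here is a small plain-Python sketch of a Gold step that sums sales per store. The function `to_gold` and the sample rows are hypothetical. In Spark the same idea would be a `groupBy` followed by an aggregation.

```python
from collections import defaultdict

def to_gold(silver_rows):
    """Aggregate clean Silver records into total sales per store."""
    totals = defaultdict(float)
    for rec in silver_rows:
        totals[rec["store"]] += rec["sales"]
    return dict(totals)

silver_rows = [
    {"store": "A", "sales": 10.0},
    {"store": "B", "sales": 5.0},
    {"store": "A", "sales": 2.5},
]
gold = to_gold(silver_rows)  # one row per store instead of one row per sale
```

Dashboards and reports can read this small summary instead of scanning every individual sale.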

Now, let's look at an example in which a Structured Streaming query performs each hop:

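1. Ingesting data from Raw to Bronze:

# PySpark; rawSalesLocation and checkpointPath are assumed to be defined elsewhere.
# Add .format(...) and .schema(...) to match the raw files.
(spark
    .readStream
    .load(rawSalesLocation)
    .writeStream
    .option("checkpointLocation", checkpointPath)  # each stream needs its own checkpoint directory
    .outputMode("append")
    .toTable("uncleanedSales"))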

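2. Ingesting data from Bronze to Silver:

from pyspark.sql.functions import col

# Read the Bronze table as a stream; spark.table alone returns a batch DataFrame,
# which cannot be written with writeStream.
(spark
    .readStream
    .table("uncleanedSales")
    .withColumn("avgPrice", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .toTable("cleanedSales"))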

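3. Ingesting data from Silver to Gold:

from pyspark.sql import functions as F

# Stream from the Silver table and keep a running total per store.
(spark
    .readStream
    .table("cleanedSales")
    .groupBy("store")
    .agg(F.sum("sales").alias("totalSales"))  # alias gives the column a plain name
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")  # rewrite the full aggregate on every trigger
    .toTable("aggregatedSales"))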
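
The first two hops use the "append" output mode and the last one uses "complete". The difference can be sketched with a plain-Python simulation of micro-batches. This is only an analogy for how the result table evolves, not how Spark is implemented, and `run_append`, `run_complete`, and the sample batches are hypothetical.

```python
from collections import defaultdict

def run_append(batches):
    """Append mode: each micro-batch only adds its new rows to the table."""
    table = []
    for batch in batches:
        table.extend(batch)
    return table

def run_complete(batches):
    """Complete mode: the whole aggregated result is rewritten after each batch."""
    totals = defaultdict(float)  # running state kept across batches
    table = {}
    for batch in batches:
        for store, sales in batch:
            totals[store] += sales
        table = dict(totals)  # full result replaces the previous one
    return table

batches = [[("A", 10.0)], [("A", 5.0), ("B", 1.0)]]
appended = run_append(batches)
completed = run_complete(batches)
```

Append suits row-level tables such as Bronze and Silver, where existing rows never change. An aggregate such as total sales per store changes as new data arrives, which is why the Gold hop rewrites its result.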

Conclusion:
The Databricks Medallion Architecture stores and processes data in multiple layers: Bronze, Silver, and Gold. Data moves from its raw form, through cleaning, to the final business level. This helps you go back and query the data at any stage without losing much time on manual data preparation.

The source for this article is Databricks.
I hope you found this useful. Stay tuned for more articles on data.
Happy learning ☺️
