Exploring the Medallion Architecture in Microsoft Fabric

Mariusz Kujawski
Feb 5, 2024 · 9 min read


The Medallion architecture stands out as one of the most popular frameworks for constructing a data lake or lakehouse. Its core concept is a three-layer structure, often referred to as the “multi-hop” architecture. The primary objective of this design is to organize data logically and physically, enhancing its quality and structure as it moves through each layer.

In addition to the foundational Bronze, Silver, and Gold layers, the Medallion architecture is flexible enough to accommodate custom layers tailored to specific needs. For instance, one might add a landing zone before the Bronze layer to streamline initial data ingestion, or introduce an additional restricted layer to protect sensitive information.

It’s essential to view the Medallion architecture as a versatile framework that can be adapted and refined to suit diverse requirements. Embracing this pattern while remaining open to adjustments ensures the successful implementation of a robust data management system.

Medallion Architecture

Bronze

The Bronze layer serves as the initial repository for data collected from external source systems. The table structures within this layer mirror those of the source systems in their original state, augmented with additional metadata columns capturing details like the load date/time and process ID.
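A couple of lines in the ingestion notebook are enough to add such metadata columns. The sketch below assumes df is the incoming raw DataFrame; the column names and process ID value are only examples.

from pyspark.sql.functions import current_timestamp, lit

# df is assumed to be the raw DataFrame just read from the source (hypothetical column names)
df = df.withColumn("load_ts", current_timestamp())
df = df.withColumn("process_id", lit("manual-run-001"))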

Silver

Within the Silver layer of the data lake or lakehouse architecture, data originating from the Bronze layer undergoes processes of matching, merging, conforming, and cleansing. These preparations enable the Silver layer to offer a comprehensive “Enterprise view” of crucial business entities, concepts, and transactions — such as master customers, stores, non-duplicated transactions, and cross-reference tables.

The Silver layer consolidates data from diverse sources into a unified Enterprise view, empowering self-service analytics such as ad hoc reporting, advanced analytics, and machine learning. It serves as a primary resource for departmental analysts, data engineers, and data scientists, enabling them to start projects and analyses aimed at solving business problems at both the enterprise and departmental level.

Gold

Data in this layer is organized in a consumption-ready format designed to fulfill reporting needs. Here, we can store modeled data as a Kimball-style star schema or as Inmon-style data marts, retaining fact and dimension tables much as in a traditional data warehouse. The purpose of this layer is to support reporting and BI tools.

Data organization in Data Lake / Data Lakehouse

Below, you can find a diagram showing how we can organize data in a data lakehouse. As mentioned above, the Bronze layer stores data as-is, which means it can hold structured, semi-structured, or unstructured formats. When moving data to the Silver layer, one of the requirements is to convert it to a unified format, such as Delta or, optionally, Parquet, while keeping the same hierarchy as in Bronze. Finally, when moving data to the Gold layer, we use ETL (Extract, Transform, Load) processes to convert it into a business-oriented data model containing facts and dimensions. This transformation changes the physical organization of the data into a data-product and entity hierarchy.

Data organization in Lakehouse

Microsoft Fabric

Fabric is Microsoft’s new Software-as-a-Service (SaaS) analytics platform. It serves as an all-in-one environment for a data platform, covering data ingestion, visualization, and analytics for everyone from data engineers to Power BI users. Fabric consolidates the capabilities of Azure Data Factory, Azure Synapse Analytics, and Power BI into a single platform, eliminating work such as resource setup, maintenance, and configuration. With Fabric, you can create an end-to-end analytics solution quickly, with built-in capabilities ranging from data integration and engineering to data science and real-time analytics.

Through Fabric, Microsoft adopts the Data Lakehouse architecture with a vision of a Mesh structure. OneLake serves as the foundation of this architecture, offering the ability to categorize data using ‘Domains’ and ‘Workspaces.’ Under the hood, OneLake is built on Azure Data Lake Storage Gen2. With support for the open-source Delta Lake format, Fabric makes it easy to build a Lakehouse platform.

Data Integration in Fabric

Fabric facilitates integration with data sources through Data Factory pipelines and the Copy activity, enabling connections to various sources such as relational databases and public cloud storage (e.g., Amazon S3, Azure Storage, Google Cloud Storage). Alternatively, shortcuts can be used to establish connections to other cloud storage solutions like Azure Storage or AWS S3. These features allow required data to be imported into the Bronze layer using pipelines or Apache Spark notebooks.
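As a rough sketch of the notebook route, the snippet below reads a file exposed through a shortcut and lands an unmodified copy in the Bronze Lakehouse. The shortcut name, file format, and paths are only examples and depend on how your Lakehouse and shortcuts are set up.

# Read a file through a shortcut to external storage (hypothetical shortcut name "s3_sales")
df = spark.read.format("parquet").load("Files/s3_sales/transactions/")

# Land a raw copy in the Bronze area of the Lakehouse (hypothetical path)
df.write.mode("overwrite").format("parquet").save("Files/raw/transactions/")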

Data Flow Demo

In this demonstration, I’ll explain how to create a data flow process across layers using Data Factory for orchestration, Apache Spark notebooks for data cleaning, and T-SQL for the ETL process. Fabric Lakehouse will be used to store the data for the individual layers and organize the lakehouse according to the Medallion architecture. While the gold layer in Fabric can be created using either a Lakehouse or a Warehouse, I’ll opt for Warehouse because of its T-SQL capabilities. This will showcase the various options available for creating an ETL process using SQL or PySpark to populate a data model.

Fabric Data Flow

In the initial step, I used the “New” button in the workspace menu to create the raw and silver Lakehouses and the gold Warehouse.

The Bronze Layer in Fabric

For the purpose of this post, let’s assume I have an external application that pushes files to the landing area I created in my Bronze layer.

The raw lakehouse

Within this layer, I store incoming files containing information about clients, transactions, applications, products, etc. It serves as the Bronze layer where data is initially stored, and I’ve also set up a shortcut to another storage location.

To transition data to the Silver layer, I’ve developed a notebook containing PySpark code. This code is responsible for cleaning the data, removing sensitive information such as card numbers, and converting columns to the required data types.

Branch

import pandas as pd

# Read the Excel file with pandas, convert it to a Spark DataFrame, drop the pandas index column, and save it as a Delta table
df_branch = spark.createDataFrame(pd.read_excel("/lakehouse/default/Files/langing/branch.xlsx"))
df_branch = df_branch.drop("Unnamed: 0")
df_branch.write.mode("overwrite").saveAsTable("silver.branch")

Transactions

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Mask the middle digits of a card number; null or too-short values are returned unchanged
def mask_func(colVal):
    if colVal is not None:
        if len(colVal) >= 16:
            charList = list(colVal)
            charList[4:12] = 'x' * 8
            return "".join(charList)
        else:
            return colVal
    else:
        return colVal

mask_func_udf = udf(mask_func, StringType())

# Clean the card transactions: mask card numbers and cast columns to the required types
df = spark.read.format("csv").option("header", "true").load("Files/langing/card_operations.csv")
df = df.withColumn("card_number", mask_func_udf(df["card_number"]))
df = df.withColumn("transaction_amount", df["transaction_amount"].cast("decimal"))
df = df.withColumn("transaction_date", df["transaction_date"].cast("date"))

df.write.mode("overwrite").saveAsTable("silver.transactions")

Applications

df = spark.read.format("csv").option("header","true").load("Files/langing/client_applications.csv")
df = df.withColumn("application_date",df["application_date"].cast("date"))

df.write.mode("overwrite").saveAsTable("silver.applications")

Clients

df = spark.read.format("csv").option("header","true").load("Files/langing/clients.csv")

df = df.withColumn("credit_card_number",mask_func_udf(df["credit_card_number"]))
df = df.withColumn("birth_date", df["birth_date"].cast("date"))

df.write.mode("overwrite").saveAsTable("silver.clients")

As illustrated in the code snippets, the code reads data from the landing directory in the Bronze Lakehouse, performs the necessary transformations, and saves the result as a table in the Silver Lakehouse. The “saveAsTable” command stores the data in Delta format within the Silver Lakehouse.

Alternatively, the table can be created with a SparkSQL statement, which allows the schema to be defined explicitly, ensuring proper data types and structure.

%%sql
CREATE TABLE silver.branch (
    branch STRING,
    manager STRING
) USING DELTA;
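With the schema defined up front, the cleaned DataFrame can be appended into the table instead of letting saveAsTable infer the schema. A minimal sketch, reusing the df_branch DataFrame from the Branch snippet above (insertInto matches columns by position, so the column order must match the table definition):

# Append the branch data into the pre-defined Delta table
df_branch.write.mode("append").insertInto("silver.branch")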

The Silver Layer in Fabric

Following the execution of the notebook, I observed new tables in my Silver Lakehouse Explorer. These tables are now ready for querying.

the silver layer lakehouse

Fabric Lakehouse provides two options for accessing these tables. The first option is through PySpark, where I can read data from a table, manipulate it, transform it, and perform various other operations.

df = spark.sql("SELECT * FROM silver.applications LIMIT 1000")
display(df)

# OR

df = spark.read.table("silver.applications")
display(df)

Another option supported by Fabric is the SQL endpoint, which is automatically generated for each Lakehouse in Fabric. Using this SQL endpoint, we can query data from the silver layer using T-SQL.

SQL endpoint

As presented in the screenshot below, the structure of my silver database looks much like a SQL Server database. For those not familiar with Spark, this SQL-style access provides a familiar way to interact with the data.

SQL endpoint query result
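To give a concrete flavour of this SQL-style access, a query along the following lines can be run directly against the silver endpoint (the database and schema names follow this demo; the date filter is only an illustration):

SELECT TOP (100)
    client_num,
    transaction_amount,
    transaction_date,
    status
FROM silver.dbo.transactions
WHERE transaction_date >= '2024-01-01';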

The Gold Layer in Fabric

For the gold layer of my lakehouse, I opted to utilize Fabric Warehouse, which supports T-SQL and doesn’t require Spark for data querying and transformations. With Fabric Warehouse, I can create my ETL process in a more traditional fashion using SQL stored procedures and SQL language.

To simplify access to data in the Silver layer, I added its SQL endpoint to my gold Warehouse. This simple step lets me access silver data using cross-database queries, both for ad hoc querying and for creating new tables in the gold layer with CREATE TABLE AS SELECT (CTAS) or INSERT INTO statements. In essence, I can use the silver layer as the staging area of a classic data warehouse setup.
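As a minimal sketch of that idea, a client dimension could be seeded from the Silver endpoint with CTAS. The dimension and column names below are assumptions for this demo rather than a fixed design.

CREATE TABLE dbo.client
AS
SELECT
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS client_id, -- surrogate key
    client_number,
    birth_date
FROM silver.dbo.clients;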

In my case, I initially created tables using the CREATE TABLE command to establish my facts and dimension tables, following the Kimball star schema. This schema will support my analytics queries and Power BI Reports.

It’s worth mentioning that Warehouse supports column-level security, row-level security, and dynamic data masking, which can help enhance security.
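For instance, a sensitive column in an existing gold table could be masked with a statement along these lines (a sketch; the masking function arguments are just an example):

ALTER TABLE [dbo].[transation]
ALTER COLUMN [client_num] ADD MASKED WITH (FUNCTION = 'partial(0,"xxxx",4)');

Returning to the model itself, the transaction fact table is defined as follows: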

CREATE TABLE [dbo].[transation]
(
    [transaction_id] [bigint] NOT NULL,
    [client_id] [bigint] NOT NULL,
    [product_id] [bigint] NOT NULL,
    [merchant_id] [bigint] NOT NULL,
    [bk_transaction_id] [varchar](8000) NOT NULL,
    [client_num] [varchar](8000) NOT NULL,
    [transaction_amount] [decimal](18,0) NOT NULL,
    [transaction_date] [date] NOT NULL,
    [status] [varchar](8000) NOT NULL
)
GO

When my tables were in place, I proceeded to create stored procedures using T-SQL to populate my new tables.

CREATE PROCEDURE [dbo].[load_transation]
AS
INSERT INTO transation
(
    [transaction_id],
    [client_id],
    [product_id],
    [merchant_id],
    [bk_transaction_id],
    [client_num],
    [transaction_amount],
    [transaction_date],
    [status]
)
SELECT
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS transaction_id,
    b.client_id,
    c.product_id,
    d.merchant_id,
    a.transaction_id AS bk_transaction_id,
    a.client_num,
    CAST(a.transaction_amount AS decimal) AS transaction_amount,
    CAST(a.transaction_date AS date) AS transaction_date,
    a.status
FROM
    silver.dbo.transactions a
    LEFT JOIN client b ON a.client_num = b.client_number
    LEFT JOIN application c ON b.client_id = c.client_id
    LEFT JOIN merchant d ON a.merchant = d.merchant
WHERE
    NOT EXISTS (
        SELECT 1 FROM gold.dbo.transation tra WHERE a.transaction_id = tra.transaction_id
    )

As demonstrated in the code snippet, the procedure generates surrogate keys, normalizes data, and inserts new rows into the transaction table. The statement closely resembles SQL Server T-SQL, making it familiar to anyone coming from SQL Server or Synapse dedicated pools.

After I created all the tables and procedures, they were visible in the Warehouse Explorer.

The gold layer in Fabric Warehouse

To orchestrate the entire process, Data Factory can be used. Through Data Factory pipelines, we can easily configure activities that execute the procedures created in the Gold layer using the Stored procedure activity.

Data Factory Pipelines
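Before wiring a procedure into a pipeline, it can also be run manually from the Warehouse query editor to validate the load, e.g.:

EXEC [dbo].[load_transation];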

Summary

The demo highlighted how smoothly the Medallion architecture can be implemented in Microsoft Fabric. Users can quickly create the essential resources and focus on script development without the complexities of managing connections and resource setup. Fabric’s versatility offers various ways of accessing data: users adept in Spark can use PySpark for table and script creation, while those versed in T-SQL can opt for Warehouse and SQL-based solutions. While Warehouse shares similarities with engines like SQL Server or Synapse Dedicated Pools, it’s important to note that it stores data as Delta files, which has implications for performance.

If you found this article insightful, I invite you to express your appreciation by liking it on LinkedIn and clicking the ‘clap’ button. Your support is greatly valued. For any questions or advice, feel free to reach out to me on LinkedIn.
