MultiHop Architecture in Azure Databricks
Prerequisites: The reader of this article should have an understanding of the ETL process, Azure Databricks, and Azure Data Factory (ADF).
Multi-hop architecture is a design approach for organizing data in a Delta warehouse. The architecture consists of three main layers: Bronze, Silver, and Gold. A pre-landing layer can also be included if the data needs to be copied from the client system into the Delta platform.
This approach is quite similar to the traditional ETL approach, in which data travels through layers such as pre-landing, staging, and landing.
Considerations: This article gives an overview of multi-hop architecture and does not cover detailed implementation steps or hands-on exercises.
Advantages of multi-hop architecture:
- Streaming and batch loads can be combined into a single pipeline
- Implements the ETL process in the cloud using fast processing and memory optimization techniques
- Supports Change Data Capture (CDC)
- Better data management and security features at each layer
- User level access at different layers
Pre-Landing Layer:
The purpose of this layer is to fetch the data in raw form. Data is collected from different sources, such as CSV files and databases, and loaded into Azure Data Lake Storage Gen2 using ADF or Databricks workflows.
Data from the source is extracted in Parquet format. The folder structure can be designed around the source name. Parquet offers several advantages, such as efficient compression and decompression, higher data throughput, and better read performance.
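As an illustration of a source-based folder structure, the following is a minimal sketch of a path builder. The root folder name, the table level, and the load_date partition are assumptions made for this example; the article only states that the layout can follow the source name.

```python
from datetime import date
from pathlib import PurePosixPath


def prelanding_path(source_name: str, table_name: str, load_date: date,
                    root: str = "prelanding") -> str:
    """Build a folder path organized by source, then table, then load date.

    The layout (root/source/table/load_date=YYYY-MM-DD) is hypothetical.
    """
    path = PurePosixPath(root) / source_name / table_name
    return str(path / f"load_date={load_date.isoformat()}")


# Hypothetical source and table names, for illustration only.
example_path = prelanding_path("sales_db", "orders", date(2024, 1, 15))
print(example_path)  # prelanding/sales_db/orders/load_date=2024-01-15
```

Keeping one folder per source makes it straightforward to reload or purge a single source without touching the others.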
This layer is entirely optional, as data from the source can be inserted directly into the bronze layer. However, adding it helps guard against job failures while reading from the source and against locks applied on the source, because downstream loads read from the landed copy rather than from the source itself.
Bronze Layer:
Raw data from the pre-landing layer or the source is inserted directly into the bronze layer in Delta format. When loading from the pre-landing layer, the data must be converted from Parquet to Delta format. When loading from the source, the data can be written directly to the Delta table.
Data in this layer follows a truncate-and-load pattern. The layer therefore holds only the latest data fetched from the source, whether that fetch is incremental or a full load.
Transformations such as the Julian check, null check, and deduplication are performed in this layer.
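The checks listed above can be sketched in plain Python to show the logic; in Databricks they would normally be written as DataFrame or SQL operations. This sketch assumes that "Julian check" means validating an ordinal date in YYYYDDD form, which the article does not spell out. The column names id and order_date are hypothetical.

```python
import calendar


def is_valid_julian(value) -> bool:
    """Return True if value is a valid YYYYDDD ordinal date string.

    Assumption: the Julian check validates this seven-digit format.
    """
    if not isinstance(value, str) or len(value) != 7 or not value.isdecimal():
        return False
    year, day = int(value[:4]), int(value[4:])
    days_in_year = 366 if calendar.isleap(year) else 365
    return 1 <= day <= days_in_year


def clean_bronze(rows, key="id", required=("id", "order_date")):
    """Apply the null check, Julian check, and deduplication in one pass."""
    seen = set()
    cleaned = []
    for row in rows:
        # Null check: skip rows missing any required column.
        if any(row.get(col) is None for col in required):
            continue
        # Julian check: skip rows whose date is not a valid ordinal date.
        if not is_valid_julian(row["order_date"]):
            continue
        # Deduplication: keep only the first row seen for each key.
        if row[key] in seen:
            continue
        seen.add(row[key])
        cleaned.append(row)
    return cleaned


raw_rows = [
    {"id": 1, "order_date": "2024060"},
    {"id": 1, "order_date": "2024060"},  # duplicate key
    {"id": 2, "order_date": None},       # null in a required column
    {"id": 3, "order_date": "2023366"},  # day 366 in a non-leap year
    {"id": 4, "order_date": "2024366"},  # valid: 2024 is a leap year
]
bronze_rows = clean_bronze(raw_rows)
print([row["id"] for row in bronze_rows])  # [1, 4]
```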
Silver Layer:
The silver layer consolidates the received data. Whether the load type is full or incremental, data from the bronze layer is merged directly into the silver layer using MERGE queries.
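The merge semantics can be illustrated with a small pure-Python sketch of an upsert: rows whose key already exists in silver are updated, and new keys are inserted. In Databricks this would typically be expressed as a Delta Lake MERGE INTO statement; the function below only mimics that behavior on lists of dictionaries, and the id key and status column are hypothetical.

```python
def merge_into_silver(silver, bronze, key="id"):
    """Upsert bronze rows into silver rows, matching on the key column.

    Matched keys are overwritten by the bronze row; unmatched keys are inserted.
    """
    merged = {row[key]: dict(row) for row in silver}
    for row in bronze:
        merged[row[key]] = dict(row)
    return list(merged.values())


silver_rows = [
    {"id": 1, "status": "new"},
    {"id": 2, "status": "new"},
]
bronze_batch = [
    {"id": 2, "status": "shipped"},  # existing key: update
    {"id": 3, "status": "new"},      # new key: insert
]
silver_after = merge_into_silver(silver_rows, bronze_batch)
print(silver_after)
```

Because the merge is keyed, the same function covers both load types: a full load updates every matching key, while an incremental load touches only the keys present in the batch.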
Gold Layer:
The gold layer acts as the data fabrication layer. It provides the aggregations and additional calculations required by each source.
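A gold-layer aggregation might look like the following sketch, which groups silver rows and computes a total, a count, and an average per group. The region and amount columns and the chosen metrics are assumptions for illustration; the actual aggregations depend on the source requirements.

```python
from collections import defaultdict


def aggregate_gold(silver_rows, group_col="region", value_col="amount"):
    """Compute total, count, and average of value_col for each group_col value."""
    totals = defaultdict(lambda: {"total": 0.0, "count": 0})
    for row in silver_rows:
        bucket = totals[row[group_col]]
        bucket["total"] += row[value_col]
        bucket["count"] += 1
    return {
        group: {**stats, "average": stats["total"] / stats["count"]}
        for group, stats in totals.items()
    }


silver_sales = [
    {"region": "east", "amount": 100},
    {"region": "east", "amount": 50},
    {"region": "west", "amount": 30},
]
gold_summary = aggregate_gold(silver_sales)
print(gold_summary["east"])  # {'total': 150.0, 'count': 2, 'average': 75.0}
```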
Conclusion:
Adding an extra layer such as pre-landing at the start gives more flexibility to implement business logic. However, several scenarios do not require this layer. Whether to use the extra layer depends entirely on the requirements.