Azure Synapse Analytics — Key Considerations while building your data warehouse
A collection of best practices for Azure Synapse Analytics compiled in one location for quick reference.
Azure Synapse Analytics (Previously SQL Datawarehouse) offers Petabytes of scaling and unifies enterprise Data Warehousing and Big Data Analytics.
Synapse SQL uses MPP (Massively Parallel Processing) Architecture leveraging on SQL Pools which are a collection of analytic resources and the size of the pools is determined by DWU(Datawarehouse Units) and this also dictates the pricing. SQL Synapse also brings the ability to pause compute and only keep the storage active when needed.
Data in a table is split across 60 distributions and the distribution strategy can be Round-robin, Hash-distributed or Replicated. Then number of compute nodes can range from 1 to 60, depending on the level of performance required. As the number of compute nodes or DWU increases, it results in less distributions per compute node improving the overall performance.
The documentation for Synapse SQL is quite exhaustive and it can be found here, with the complete pdf spanning over 450 pages. I have consolidated the links below that have helped me understand the best practices and considerations while designing the data warehouse as quick reference guides.
Cheatsheet
Best Practices
Designing Tables
Indexing Tables
Heaps
Clustered Columnstore Index
- https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview?view=sql-server-ver15#can-i-combine-rowstore-and-columnstore-on-the-same-table
Designing Distributed Tables — Round Robin vs Hash vs Replicated
- https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute
- https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-overview#common-distribution-methods-for-tables
Hash-Distribution — Use this distribution methodology for large fact tables with clustered columnstore index.
Replicated — Use this distribution strategy for smaller tables (<2GB), which can be replicated to each compute node, typically used for Dimension Tables.
Round-Robin — Use this strategy for staging tables or for loading data or when there is no clear choice for distribution.
Load data from external tables using polybase
I hope this article is useful and can provide a useful starting point to data engineers and architects alike while working on their Synapse SQL warehouse.