How to choose the right data distribution for a table in a dedicated SQL pool in Azure Synapse Analytics
Azure Synapse Analytics is a limitless analytics service from Microsoft Azure that brings together SQL for data warehousing, Spark for parallel processing of big data, and Data Explorer pools for log and time-series analytics. It also integrates pipelines for data integration with Azure ML, Power BI, SQL, and other front-end services.
Synapse SQL pools support T-SQL, which enables data warehousing and data virtualization along with streaming and machine learning scenarios. Synapse SQL offers two resource models: serverless and dedicated. One advantage of a Synapse SQL pool is built-in streaming support for landing data from different cloud services into SQL tables, with AI/ML then integrating with SQL through T-SQL. Synapse also provides notebooks that run on Apache Spark pools, where you can create tables using Spark SQL and build ETL/ELT processes with data pipelines.
This article covers designing hash-distributed, round-robin distributed, and replicated tables in dedicated SQL pools. For more information about Synapse SQL pools, refer to the Azure documentation. Below, I discuss a few scenarios where each type of data distribution fits.
In a dedicated SQL pool, end-user queries are received by the control node, which builds a parallel plan and hands the work to the compute nodes. The data is sharded into 60 distributions, and each query runs as 60 smaller parallel queries, one per distribution. With a single compute node, that node manages all 60 distributions; with the maximum of 60 compute nodes, each node manages exactly one distribution. The diagram here shows 6 compute nodes managing 60 distributions, with each compute node handling 10 distributions.
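As a quick way to inspect this layout in your own pool, you can query the documented sys.dm_pdw_nodes DMV. A minimal sketch:

    -- Count the control and compute nodes currently backing the pool
    SELECT type, COUNT(*) AS node_count
    FROM sys.dm_pdw_nodes
    GROUP BY type;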
Table distribution plays an important role in the Synapse SQL architecture and directly affects query performance, so it is worth learning when to use each distribution type.
Azure Synapse SQL provides three types of distribution: (1) hash, (2) round-robin, and (3) replicated.
Hash-distributed tables: A hash-distributed table uses a deterministic hash function to assign each row to one of the 60 distributions. With a good candidate column for hashing, a hash-distributed table can deliver the highest query performance for joins and aggregations on a large fact table. This distribution type is designed to minimize data movement across distributions: because rows with the same value always land on the same distribution, data re-shuffling between compute nodes is reduced.
A large fact table in a star schema is a good choice for a hash-distributed table. A table that is more than 2 GB on disk and has frequent INSERT, UPDATE, and DELETE operations is also a good candidate for hash distribution.
A hash-distributed table is distributed on a single column, called the hash key or distribution column. To get the best performance, choose a good candidate column, since it determines how the rows are spread; ideally, all distributions hold approximately the same number of rows. To balance parallel processing, select a distribution column that distributes data evenly across distributions and minimizes data movement during query processing on the compute nodes.
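As a minimal sketch (the table and column names are hypothetical), a large fact table distributed on a frequently joined key might be created like this:

    CREATE TABLE dbo.FactSales
    (
        SaleKey     BIGINT        NOT NULL,
        CustomerKey INT           NOT NULL,
        ProductKey  INT           NOT NULL,
        SaleDate    DATE          NOT NULL,
        Amount      DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey), -- rows with the same CustomerKey land on the same distribution
        CLUSTERED COLUMNSTORE INDEX       -- the default index type, well suited to large fact tables
    );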
The following properties can help you determine a good distribution column:
· Has many unique values. The column can have duplicate values; all rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can have more than one unique value while others may end up with zero values.
· Does not have NULLs or has only a few NULLs. For an extreme example, if all values in the column are NULL, all the rows are assigned to the same distribution. As a result, query processing is skewed to one distribution and does not benefit from parallel processing.
· Is not a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions does all the processing work.
· Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. When a table is not used in joins, consider distributing the table on a column that is frequently used in the GROUP BY clause.
· Is not used in WHERE clauses. A filter on the distribution column could narrow the query so that it does not run on all the distributions, limiting parallelism.
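After loading a table, you can check whether the chosen column spreads rows evenly. A minimal sketch using the documented DBCC PDW_SHOWSPACEUSED command (the table name is hypothetical):

    -- Returns one row per distribution (60 rows); large differences
    -- in ROWS across distributions indicate data skew.
    DBCC PDW_SHOWSPACEUSED('dbo.FactSales');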
Round-robin distributed tables: Round-robin is the default distribution type for a table in an Azure Synapse SQL pool. Once a table is created, its distribution cannot be changed, but you can re-create the table with CREATE TABLE AS SELECT (CTAS), as in the sketch below.
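A minimal sketch of re-creating a round-robin table as hash-distributed via CTAS (the table and column names are hypothetical):

    -- Re-create the table with a new distribution, then swap the names
    CREATE TABLE dbo.FactOrders_New
    WITH
    (
        DISTRIBUTION = HASH(OrderKey),
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT * FROM dbo.FactOrders;

    -- Swap names so existing queries keep working
    RENAME OBJECT dbo.FactOrders TO FactOrders_Old;
    RENAME OBJECT dbo.FactOrders_New TO FactOrders;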
A round-robin distributed table spreads a table's rows evenly across all distributions, and the assignment of rows to distributions is random. Similar records are not guaranteed to land on the same distribution, which sometimes forces data re-shuffling across distributions at query time.
Temporary staging tables and tables with no obvious joining key are good choices for round-robin distribution, since round-robin tables load very quickly.
Round-Robin Distribution can be used in the following scenarios:
· When getting started as a simple starting point since it is the default
· If there is no obvious joining key
· If there is no good candidate column for hash distributing the table
· If the table does not share a common join key with other tables
· If the join is less significant than other joins in the query
· When the table is a temporary staging table (see the sketch after this list)
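As a minimal sketch (the table name is hypothetical), a staging table created with round-robin distribution and a heap, a common pattern for fast loads:

    CREATE TABLE dbo.StageSales
    (
        SaleKey     BIGINT        NOT NULL,
        CustomerKey INT           NOT NULL,
        Amount      DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN, -- no distribution column is needed
        HEAP                        -- heaps typically load faster than columnstore for staging
    );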
Replicated tables: Replicated tables provide the fastest query performance for small tables. With replicated distribution, a full copy of the table is cached on each compute node. Small dimension tables, and tables less than 2 GB on disk, are good candidates for replication.
If small dimension tables are replicated on every compute node, then when the distributed data of the fact table is joined with the dimension tables, each compute node already has the required dimension data available locally. This improves join performance, at the slight overhead of copying all the data to every compute node.
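A minimal sketch (the table and column names are hypothetical) of a small dimension table created as replicated, and a join against the hash-distributed fact table sketched earlier:

    CREATE TABLE dbo.DimProduct
    (
        ProductKey  INT           NOT NULL,
        ProductName NVARCHAR(100) NOT NULL,
        Category    NVARCHAR(50)  NOT NULL
    )
    WITH
    (
        DISTRIBUTION = REPLICATE -- a full copy is cached on each compute node
    );

    -- This join needs no data movement for the dimension side,
    -- since every compute node holds the whole dimension locally.
    SELECT p.Category, SUM(f.Amount) AS TotalAmount
    FROM dbo.FactSales AS f
    JOIN dbo.DimProduct AS p
        ON p.ProductKey = f.ProductKey
    GROUP BY p.Category;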
This article has provided some ideas about where and when each type of table distribution can be used.