Dedicated SQL pools in Azure Synapse Analytics

Eleonora Fontana
Published in Betacom
Apr 6, 2021

Introduction

In this article we will discuss the core of Azure Synapse Analytics: dedicated SQL pools. In the first section we will look at their architecture; in the remaining three sections we will go through the different properties a table in a dedicated SQL pool can have.

Architecture

Dedicated SQL pools provide the enterprise data warehousing features available in Azure Synapse Analytics. Their architecture is node-based and can be described as follows.

Source: Dedicated SQL pool (formerly SQL DW) architecture — Azure Synapse Analytics

The Control Node is the brain of the dedicated SQL pool and the front end that applications and connections talk to. The MPP (Massively Parallel Processing) engine runs on the Control Node to optimize and coordinate the parallel execution of queries. Compute Nodes execute the queries, with the work distributed evenly among them. There can be from 1 to 60 of them, depending on the Data Warehouse Units (DWU) chosen. Recall that a DWU is an abstract, normalized measure of compute resources and performance; the services offered by Azure Synapse Analytics are paid for in terms of DWUs.

DMS stands for Data Movement Service: the system-level internal service that moves data between nodes so that queries can run in parallel and return accurate results.

Each query is divided into 60 parts, each executed by a distribution: the basic unit of storage and processing for the parallel queries that run on distributed data.
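The fixed count of 60 can be checked directly from a pool you are connected to. As a small sketch, the `sys.pdw_distributions` catalog view lists one row per distribution:

```sql
-- One row per distribution; on a dedicated SQL pool this returns 60.
SELECT COUNT(*) AS distribution_count
FROM sys.pdw_distributions;
```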

Sharding Patterns

Data in the dedicated SQL pool is distributed in shards in order to optimize system performance. It is possible to choose which sharding pattern to use when creating a table:

  • Replicated,
  • Round-Robin (default),
  • Hash.

In replicated tables a complete copy of the table is stored on each Compute Node, so no data needs to be transferred between nodes before queries run. However, additional storage space is required and there is overhead on every write, which makes replication impractical for large tables.


In a Round-Robin distributed table the data is distributed evenly, but without further optimization: a starting distribution is chosen at random, and buffers of rows are then assigned to distributions sequentially. It is the simplest pattern, tables are fast to create, and loads are fast, which makes it a good choice for staging tables. However, queries can take additional time, since joins on round-robin tables typically require data movement.

In a Hash-distributed table definition, one of the columns is designated as a distribution column and the hash function uses the values of that column to assign each row to a distribution. A hash-distributed table provides maximum query performance for joins and aggregations on large tables.
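As a sketch, here is how the three patterns are declared when a table is created; the table and column names are illustrative, not from the article:

```sql
-- Round-robin (the default): rows are spread evenly, no distribution column.
CREATE TABLE dbo.StageSales
(
    SaleId BIGINT NOT NULL,
    Amount DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN);

-- Replicated: a full copy on every Compute Node; suited to small tables.
CREATE TABLE dbo.DimCountry
(
    CountryId   INT          NOT NULL,
    CountryName NVARCHAR(50) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);

-- Hash: each row is assigned to a distribution by hashing the chosen column.
CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT NOT NULL,
    CustomerId INT    NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerId));
```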


A good distribution key:

  • distributes data evenly;
  • is used in aggregations (all the records that will be grouped together should sit in the same distribution);
  • is used in JOINs (this is especially important for large fact tables);
  • cannot be changed (to change it, you must recreate the table and reload the data);
  • has many more than 60 distinct values, so that all distributions receive a similar amount of data.
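One way to check whether a chosen key actually distributes data evenly is the `DBCC PDW_SHOWSPACEUSED` command, which reports rows and space per distribution (the table name below is a placeholder):

```sql
-- Rows and space used in each of the 60 distributions;
-- large differences in the ROWS column indicate data skew.
DBCC PDW_SHOWSPACEUSED("dbo.FactSales");
```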

How to choose which sharding pattern to use?

Table types

The main indexing options for a table are:

  • Clustered columnstore indexes (default),
  • Heap tables,
  • Clustered index.

Clustered columnstore indexes are the best choice for large tables, but are not recommended for tables with fewer than 60 million rows.

  • Tables are organized by column, which allows for high compression.
  • Great for read-heavy operations.
  • Secondary indexes cannot be defined.
  • A rowgroup is a group of rows compressed into the columnstore format, and ideally contains about 1 million rows. The data for a single column within a rowgroup is called a column segment.

Heap tables are recommended for temporary or staging tables and for small lookup tables with fewer than 60 million rows.

  • The data is not sorted and in fact there is no index.
  • There is no data compression.
  • The definition of secondary indexes is allowed.

A clustered index, also called a clustered B-tree, is the most common type of index. The table rows are stored physically in the order of the index key.

  • There is no compression.
  • It makes the execution of queries very fast with very selective filters.
  • The definition of secondary indexes is allowed.
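The three options correspond to clauses in the table definition. A minimal sketch, with illustrative table and column names:

```sql
-- Clustered columnstore index (the default if nothing is specified).
CREATE TABLE dbo.FactSales
(
    SaleId BIGINT NOT NULL,
    Amount DECIMAL(18, 2) NOT NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX);

-- Heap: no index at all, a good fit for staging tables.
CREATE TABLE dbo.StageSales
(
    SaleId BIGINT NOT NULL,
    Amount DECIMAL(18, 2) NOT NULL
)
WITH (HEAP);

-- Clustered index: rows stored sorted by the index key.
CREATE TABLE dbo.DimCustomer
(
    CustomerId INT           NOT NULL,
    FullName   NVARCHAR(100) NOT NULL
)
WITH (CLUSTERED INDEX (CustomerId));
```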

How to choose the type of table?

Partitioning

A partitioned table is a table where data is divided into smaller groups, or data buckets. Usually the partition column is a date, such as the date on which records were loaded into the table.

The advantages of using partitions are:

  • Increased efficiency and performance in data loading, thanks to the deletion, switching and merging of partitions;
  • Better performance in query execution;
  • Easy load and unload;
  • Easy maintenance, rebuilding and organization.

There are, however, some practices you should follow when defining partitions:

  • do not create tables with too many partitions, since in some cases this can degrade performance;
  • a good partition scheme has tens to a few hundred partitions, not thousands;
  • for clustered columnstore tables, it is important to consider how many rows end up in each partition.

⚠️ ATTENTION ⚠️
Before any partitions are created, every table is already divided into 60 distributions by the MPP engine.

Let’s see an example. Partitioning a clustered columnstore table by the day of the year on which records were inserted leads to 365 partitions. The table is then split into:

60 distributions * 365 partitions = 21,900 data buckets

and filling every rowgroup to its ideal size would require:

21,900 data buckets * 1M rows (ideal rowgroup size) = 21,900,000,000 rows

Unless the table really holds tens of billions of rows, it is therefore recommended to use a coarser granularity, such as weekly or monthly partitions.
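A monthly scheme can be sketched as follows; the table name, columns, and boundary dates are illustrative. With three boundary values, `RANGE RIGHT` produces four partitions, multiplied by the 60 distributions:

```sql
-- Monthly partitions on the load date (hypothetical table and boundaries).
CREATE TABLE dbo.FactSales
(
    SaleId   BIGINT NOT NULL,
    LoadDate DATE   NOT NULL,
    Amount   DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(SaleId),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (LoadDate RANGE RIGHT FOR VALUES
        ('2021-01-01', '2021-02-01', '2021-03-01'))
);
```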

Conclusion

To sum up, let’s recap how to apply these principles to fact and dimension tables.

FACT TABLES

  • If large, it is recommended to choose a clustered columnstore index;
  • Use a hash distribution, on a key that splits the data evenly across all shards;
  • Partition the table only if it is large enough to fill each column segment.

DIMENSION TABLES

  • If there is no join key candidate, use a hash or round-robin distribution;
  • Choose the clustered columnstore index only if the table is large;
  • Add secondary indexes for alternate join columns;
  • Partitioning is not recommended.

You should now be able to define your own tables in a dedicated SQL pool. Please refer to this tutorial for further information about creating a dedicated SQL pool and loading data into it.
