Azure SQL Data Warehouse: a deep dive into data distribution

Ian Choy
9 min read · Sep 17, 2019

In this article, we will discuss data distribution in SQL Data Warehouse and learn how to explore distribution via the explain plan. The goal is to provide an initial understanding of the kind of distribution operations being run behind the scenes and what action we can take to improve query performance.

Let’s start with an architecture overview of Microsoft’s Azure SQL Data Warehouse, whose fundamentals are very similar to those of other MPP databases on the market.

Azure SQL Data Warehouse architecture

The Control Node is where the user/application connects to SQL Data Warehouse via its supported drivers (ADO.NET, ODBC, JDBC, etc.) and a connection string. Similar to SQL Server, it uses port 1433 by default. All queries from the user connection are then passed from the Control Node to the Compute Nodes for parallel query execution, hence the term “Massively Parallel Processing”. Each Compute Node consists of multiple distributions, which are SQL databases that process the data and pass it back to the Control Node for aggregation. The Data Movement Service (DMS) is a Windows service within each node that performs all these data movements.

What makes SQL Data Warehouse special is that the actual data files (.mdf) are stored in Azure Premium Blob storage, de-coupled from the Compute Nodes. Because storage is de-coupled, users can scale the Compute Nodes, or pause them entirely, without affecting the data files.

Since SQL Data Warehouse is charged by the number of Compute Nodes running, it’s a common best practice to automate the scaling and start-up of the Compute Nodes based on business needs to manage cost, such as scaling up during month-end reporting or pausing the nodes after hours.
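
Scaling can be scripted in T-SQL by changing the service objective. Here is a minimal sketch, assuming the database is named StackOverflow (as later in this article) and that ‘DW400’ is the desired target; pausing, by contrast, is done through the portal, PowerShell, or the REST API rather than T-SQL:

-- Run against the master database; scales compute to DWU 400
ALTER DATABASE StackOverflow MODIFY (SERVICE_OBJECTIVE = 'DW400');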

Data is distributed between nodes using either hash-distributed or round-robin tables. Data can also be replicated to all nodes using replicated tables. Understanding and planning where the data is stored is important for achieving high query performance in an MPP database.

In hash distribution, a column is designated as the distribution column during table creation. Rows with the same distribution column value are placed in the same distribution whenever possible, keeping related data close together. In the example below, ProductKey is selected as the distribution column for both the Store_Sales and Web_Sales data. When a join happens between these two tables, data can be accessed within each distribution with minimal data movement.

Hash distributed table example

In round-robin distribution, rows are assigned to the node distributions sequentially, spreading the data evenly without regard to its values. Replicated tables are basically a full copy of the table on each compute node.
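
To make the three options concrete, here is a minimal sketch of how each is declared at table creation (the schemas are illustrative, not from the Stack Overflow dataset):

-- Hash distributed: rows placed by a hash of ProductKey
CREATE TABLE Store_Sales (ProductKey int NOT NULL, SalesAmount money)
WITH (DISTRIBUTION = HASH(ProductKey));

-- Round-robin: rows spread evenly, with no placement guarantee
CREATE TABLE Staging_Sales (ProductKey int NOT NULL, SalesAmount money)
WITH (DISTRIBUTION = ROUND_ROBIN);

-- Replicated: a full copy of the table on every compute node
CREATE TABLE Dim_Product (ProductKey int NOT NULL, ProductName nvarchar(100))
WITH (DISTRIBUTION = REPLICATE);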

Let’s dive in deeper to see how this works, based on the Stack Overflow database I loaded into SQL Data Warehouse. All tables are initially created using round-robin distribution. The query below finds the number of questions posted by each user and the number of comments within each post.

select count(p.Id) posts, sum(p.CommentCount) comments,
u.Id, u.DisplayName, u.Age, u.CreationDate, u.AboutMe
from Users u inner join Posts p on u.Id = p.OwnerUserId
where p.PostTypeId = 1
group by u.Id, u.DisplayName, u.Age, u.CreationDate, u.AboutMe
order by posts desc

A clustered index table is created instead of a clustered columnstore index table, which is normally the preferred choice for large tables. I was not able to go with columnstore due to the limitation below.

Statement fails due to nvarchar(max)

LOB data types such as nvarchar(max) pose this issue. The reason given by Microsoft is that a data warehouse shouldn’t be storing such data. No matter, we will continue with a clustered index. A user account is created to perform the query using a higher resource class.

-- Connect to the MASTER database and create a login
CREATE LOGIN QueryUser WITH PASSWORD = '<password>';
CREATE USER QueryUser FOR LOGIN QueryUser;
-- Connect to the SQL DW database and create a database user (1 USER per 1 LOGIN)
CREATE USER QueryUser FOR LOGIN QueryUser;
-- Grant permission
GRANT CONTROL ON DATABASE::StackOverflow TO QueryUser;
-- Increase resource class
EXEC sp_addrolemember 'largerc', 'QueryUser';

The admin account is limited to 100MB of memory per query and this cannot be changed, hence a separate user account must be created to run queries, otherwise performance can be seriously impacted. A “largerc” resource class provides 800MB of memory per distribution (there are always 60 distributions in SQL DW) at DWU 400. For data loading, a user with at least “mediumrc” should be used to ensure enough writers (60+) are available.
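
As a quick check, resource classes are implemented as database roles, so the standard catalog views can confirm which accounts have been granted a larger class. A minimal sketch:

-- List users assigned to the non-default dynamic resource classes
SELECT r.name AS resource_class, m.name AS member_name
FROM sys.database_role_members rm
JOIN sys.database_principals r ON rm.role_principal_id = r.principal_id
JOIN sys.database_principals m ON rm.member_principal_id = m.principal_id
WHERE r.name IN ('mediumrc', 'largerc', 'xlargerc');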

Memory allocation per distribution

DWUs (Data Warehouse Units) are the compute capacity assigned to SQL Data Warehouse. DWU 400 means 4 compute nodes are assigned, while DWU 1000 means 10. The higher the DWU, the more compute nodes are assigned and hence the higher the compute performance, but this also equates to higher cost.

Statistics are also created on all the tables before the query is run. We need to do this because, unlike SQL Server, SQL DW doesn’t automatically create statistics for you.

Create statistics
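
A minimal sketch of the statistics DDL, covering the join, filter, and group-by columns used in the query above (the statistic names are my own):

-- Single-column statistics on the columns the optimizer needs most
CREATE STATISTICS stat_Users_Id ON Users (Id);
CREATE STATISTICS stat_Posts_OwnerUserId ON Posts (OwnerUserId);
CREATE STATISTICS stat_Posts_PostTypeId ON Posts (PostTypeId);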

The query took about 6 minutes 40 seconds on DWU 400. Let’s look at the execution plan for this query.

Results returned

Unfortunately, there’s no graphical execution plan available for SQL Data Warehouse like the one SQL Server Management Studio (SSMS) provides for SQL Server. Instead, the EXPLAIN keyword can be used to output the plan in XML format. By the way, as of this writing, although SSMS can be used to connect to and query SQL Data Warehouse, it cannot produce a link to the XML explain plan. This only works in Visual Studio for now.

Explain select statement
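
Generating the plan is just a matter of prefixing the statement, as in this sketch using the query from earlier:

-- Returns the distributed query plan as XML instead of running the query
EXPLAIN
select count(p.Id) posts, sum(p.CommentCount) comments,
u.Id, u.DisplayName, u.Age, u.CreationDate, u.AboutMe
from Users u inner join Posts p on u.Id = p.OwnerUserId
where p.PostTypeId = 1
group by u.Id, u.DisplayName, u.Age, u.CreationDate, u.AboutMe
order by posts desc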

Once the query is executed, the first thing to do is to search for “dsql_operation operation_type”. This makes it easier to pinpoint the activities of interest happening within SQL Data Warehouse. Right at the top, it states the number of compute nodes used is 4 (number_nodes="4"). There are always 60 distributions in SQL Data Warehouse, so in this case that is 15 distributions per node (60 distributions / 4 nodes = 15 per node). The query that ran is also shown in the plan. There is a total of 9 operations being executed (total_number_operations="9").

Looking at the second highlighted dsql_operation operation_type="ON", a temp table named TEMP_ID_203 is created on all compute nodes. This temp table is used as the destination table in the next operation.

Explain plan dsql_operation_type="ON"

The dsql_operation “BROADCAST_MOVE” copies the data returned by the source statement to all nodes, essentially duplicating it. This operation would not be required if these tables were configured as replicated tables. Duplicating small dimension tables this way is a good means of minimizing data movement between distributions, although it requires more storage space.
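
For example, a table can be rebuilt as a replicated table via CTAS, as sketched below. The table names here are hypothetical, and this only makes sense for small, dimension-style tables:

-- Rebuild a small table as a replicated table to avoid broadcast moves
CREATE TABLE Dim_Small_Replicated
WITH (DISTRIBUTION = REPLICATE,
CLUSTERED INDEX (Id))
AS
SELECT * FROM Dim_Small;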

The next dsql_operation, “SHUFFLE_MOVE”, means data is being prepared to be moved between distributions. This usually happens during a join or aggregation where the tables are deemed “not distribution compatible”, i.e. their data is distributed on columns that don’t line up between the tables.

Generally, we try to eliminate this type of operation where possible, since moving data around affects performance. In reality, because many different kinds of queries usually hit the data warehouse, there’s a trade-off to be made between keeping data distributed and reducing shuffle moves. The shuffle column, which is the distribution column for the destination table, is also shown; in this case, it’s the Users table’s Id column.

Explain plan operation_type="SHUFFLE_MOVE"

Finally, there’s the operation to drop the temp tables, and the “RETURN” operation, which executes the query and returns the results to the client.

This query, which ran for more than 6 minutes, should be improvable just by making the Posts table (the larger of the two) a hash-distributed table. There’s no ALTER TABLE statement in SQL DW to change distribution, hence a CTAS (CREATE TABLE AS SELECT) operation is used. The Id column, which holds distinct values, is used as the distribution column. This ensures data is distributed evenly across the nodes, avoiding data skew (a quick way to verify this is shown after the CTAS below).

-- CTAS re-create table with hash distribution column & clustered index (time=1:27 hour)
-- columnstore index can't be used due to nvarchar(max)
CREATE TABLE Posts_Id
WITH (CLUSTERED INDEX(Id),
DISTRIBUTION = HASH(Id))
AS
SELECT * FROM Posts OPTION (LABEL = 'CTAS : Posts_Id');
-- Rename the tables
RENAME OBJECT Posts TO Posts_bk;
RENAME OBJECT Posts_Id TO Posts;
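
To confirm the rows really did land evenly, row counts per distribution can be checked with DBCC PDW_SHOWSPACEUSED:

-- Shows rows and space used per distribution; lopsided counts indicate skew
DBCC PDW_SHOWSPACEUSED("dbo.Posts");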

The same query is run again, and this time it took only 2:40 minutes to complete. Looking at the explain plan again, there’s nothing noticeably different from the last one. Shuffling of data is still required, because the shuffle column is the Users table’s Id column (for the Group By) rather than the Posts table’s Id column, which was selected as the distribution column. A broadcast move operation is still performed by SQL DW. Regardless, just by making Posts a hash-distributed table, query time is cut down significantly for this query.

Let’s explore another query, quite similar to the last one. It also uses the Posts table but joins to the Votes table instead. The Votes table is distributed using round-robin, while the Posts table is the same hash-distributed table we created previously.

select count(v.Id) votes, p.Title, p.ViewCount, p.CreationDate
from Posts p inner join Votes v on p.Id = v.PostId
where p.PostTypeId = 1
group by p.Title, p.ViewCount, p.CreationDate
order by votes desc

The explain plan shows there are two shuffle moves being performed. The first shuffle operation is done on the Votes table using its PostId column, and the second is on the inner select statement, using the Posts table’s Title column as the shuffle column.

Explain plan showing 2 shuffle move operations

It shows the first Shuffle_Move operation is used to move Votes data into the same distributions as the Posts data, based on the PostId join key. If the Votes table were distributed on PostId, its data would already reside in the same distributions as the Posts table, eliminating the need for a shuffle move. To test this theory, the Votes table is changed to a hash-distributed table, distributed by its PostId column, to eliminate this Shuffle_Move operation.

-- CTAS re-create table with hash distribution column & columnstore index (time= x min)
CREATE TABLE Votes_PostId
WITH (CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(PostId))
AS
SELECT * FROM Votes OPTION (LABEL = 'CTAS : Votes_PostId');

The execution plan now shows only one Shuffle_Move operation; shuffling of the Votes table is no longer required. The group by still requires a shuffle move because the group by column itself is not distribution compatible. A hash match is likely done using the group by column as the key to build a hash table. Having one less Shuffle_Move operation does improve query performance, although only slightly (a couple of seconds) for this particular query.

Explain plan showing 1 shuffle move operation

Hopefully, the examples above, executing two different queries and going over their explain plans, give us a good initial understanding of managing distributed data. Understanding how data is distributed is important to ensure data is properly divided across multiple nodes to take advantage of MPP database scalability.

This fundamental knowledge holds regardless of vendor offering, and it is what differentiates MPP databases from SMP (Symmetric Multi-Processing) databases like SQL Server.

Thanks, claps and cheers.
