Making Data Flow Faster

Carl Follows
Ricoh Digital Services
4 min read · Jun 9, 2020

Size your connections to get the most from Synapse Analytics

A new host

Over the years of building data platforms I’ve used lots of technologies to load and transform data, but the data warehouse itself has remained pretty constant. Most of the time I hosted it on SQL Server, then subsequently on Azure SQL Database, which I saw mostly as a way to reduce the cost of ownership and simplify disaster recovery.

When Microsoft first converted their Parallel Data Warehouse product into a cloud offering (originally called Azure SQL Data Warehouse) I didn’t switch across. This was for several reasons, which could probably be summarised as:

Too much technical challenge for the benefit?

But some of the features of what’s now called Azure Synapse Analytics have at last drawn me in: the ability of PolyBase to integrate with the data lake, combined with the separation of compute costs from storage costs.
With support for database projects added in Visual Studio 2019, Synapse has now become my default DW hosting option.

There are still occasional technical challenges, but they seem more surmountable: how to make the most of the platform, and how to make the data flow fastest for the smallest running cost.

Scale Up Vs Scale Out

There are two standard approaches to handling increasing scale: buy bigger (scale up) or buy more (scale out).

Hosting in the cloud means that scaling up is only a financial challenge: there’s no need to replatform, you just slide the lever up. The question any decent architect (or budget holder) will be asking is

Why do I need to pay more?

Scaling out means distributing (sharding) your data so that different fractions of it reside on different databases (e.g. split by customer). This allows the processing load to be shared out, but it typically isn’t trivial to set up or maintain.

With Synapse you can take advantage of both approaches. Choosing a good distribution for your data gives you the benefit of scaling out, whilst the ability to scale, pause and resume means you only pay for the compute you need.
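As a rough sketch of the two levers (the table and pool names below are hypothetical), a hash-distributed table spreads rows across Synapse’s 60 distributions by a chosen column, and the service level can be changed with a single statement run against the master database:

    -- Distribute the (hypothetical) fact table by CustomerKey so related rows
    -- land on the same distribution and joins on that key avoid data movement.
    CREATE TABLE dbo.FactSales
    (
        SaleId      BIGINT        NOT NULL,
        CustomerKey INT           NOT NULL,
        SaleAmount  DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),
        CLUSTERED COLUMNSTORE INDEX
    );

    -- Scale the dedicated SQL pool up or down (connected to master);
    -- pause/resume is driven from the portal, PowerShell or the REST API.
    ALTER DATABASE MySynapsePool MODIFY (SERVICE_OBJECTIVE = 'DW400c');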

So you’ve configured your optimal distribution, but are you using that compute effectively?

Connection Power

Understanding what limits the database engine’s ability to execute queries would traditionally start with looking at the slow queries and understanding which indexes support them. With Synapse Analytics there can also be problems with skew in the table distribution, but we should also consider how much compute resource (i.e. memory) is allocated to each query.
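If you suspect skew, a quick first check is DBCC PDW_SHOWSPACEUSED, which reports row counts per distribution; the table name here is just an example:

    -- Rows per distribution for a (hypothetical) table; a handful of
    -- distributions holding most of the rows indicates skew.
    DBCC PDW_SHOWSPACEUSED("dbo.FactSales");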

The memory question arises because, unlike in Azure SQL, we need to tell Synapse what type of queries we’re expecting to execute on the connection: it allocates resources to the command before attempting to execute it.

Without scaling up there is a limited set of resources, which means…

Trade-off between available memory and concurrency.

Effectively we need to make the Scale Up vs Scale Out decision for each of our connections. We tell Synapse Analytics this by allocating each user to a resource class; these resource classes indicate what percentage of memory is allocated to each query the user executes.
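A resource class is just a database role, so assigning a user to one is a single statement. A minimal sketch, with hypothetical login and user names:

    -- In the master database: create a login for the loading workload.
    CREATE LOGIN LoaderLogin WITH PASSWORD = '<strong password here>';

    -- In the dedicated SQL pool: create the user and add it to a resource class.
    CREATE USER LoaderUser FOR LOGIN LoaderLogin;
    EXEC sp_addrolemember 'largerc', 'LoaderUser';

    -- To fall back to the default smallrc, remove the role membership:
    -- EXEC sp_droprolemember 'largerc', 'LoaderUser';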

So for a Synapse Analytics instance running at Service Level DW400c, the memory allocations for the dynamic resource classes are:

  • smallrc 6.25%
  • mediumrc 10%
  • largerc 22%
  • xlargerc 70%

There is a temptation to think that in a data warehouse the queries churn through lots of data and therefore need lots of resources, so they should be assigned to xlargerc. But with 70% of the resources allocated to each query, that effectively makes the workload single threaded. Even lowering this to largerc means only 4 queries can execute concurrently.
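You can watch this trade-off happening by querying the resource waits DMV, which shows how many concurrency slots each request holds and which requests are queued waiting for a slot. A minimal sketch:

    -- Granted requests are running; queued requests are waiting for
    -- enough concurrency slots to become free.
    SELECT  w.session_id,
            w.request_id,
            w.resource_class,
            w.concurrency_slots_used,
            w.state
    FROM    sys.dm_pdw_resource_waits AS w
    ORDER BY w.request_time;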

So what’s best?

To get the most data through the system, we first need to understand:

Will a few powerful connections move the data faster than many small ones?

Unfortunately the answer is, as often with performance, “it depends”.
Fortunately the monitoring built into the Azure Portal helps us.

Synapse Monitoring in Azure Portal

Here you can see multiple queries being thrown at Synapse running at performance level DW400c. I’ve set the grain of the charts to 1 hour, and with many of my queries taking over a minute to run I don’t need to worry about queries that start and finish within the same minute.
The connections are logged on with a user assigned to largerc, and Azure Data Factory is used to execute many queries in parallel. Despite this, the top right chart shows that we only get 4 concurrent queries, briefly 5 when a simple query was executed by a smallrc user.

This shows that when we try to push more data through the platform, the queries just start queuing up. What’s more, our DWU (Data Warehouse Units) usage maxes out at around 230 despite us paying for 400. This suggests we should reduce the memory allocated to our connection to allow greater concurrency, or even use a mediumrc connection for smaller queries to increase concurrency and largerc for those which demand more resource.

As you’re beginning to see, reaching the optimum tuning requires plenty of patience to rerun the workload and analyse the improvement. Whilst I haven’t got a go-fast button, hopefully this has helped you start to understand how your workload is being throttled. To really tune it, you’ll need to invest some time in understanding the inbuilt Dynamic Management Views (DMVs).
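As a starting point, a sketch like the one below lists running and queued requests together with the resource class they were given, which is usually enough to spot a queue building up behind a few heavyweight queries:

    -- Active workload: running requests plus those suspended in the queue.
    SELECT  r.request_id,
            r.status,              -- e.g. Running, Suspended
            r.resource_class,
            r.submit_time,
            r.start_time,
            r.total_elapsed_time,
            r.command
    FROM    sys.dm_pdw_exec_requests AS r
    WHERE   r.status IN ('Running', 'Suspended')
    ORDER BY r.submit_time;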
