Avoid These 3 Mistakes When Working with a Synapse Dedicated SQL Pool
As we showed in our recent cloud data warehouse benchmark, the Synapse dedicated SQL pool is one of the most powerful and cost-effective cloud warehousing solutions on the market, but getting optimal performance and value from it requires careful consideration at every step of the workflow.
Below, I’ll explore three points where an improper configuration can lead to sub-optimal utilization of resources, increasing costs unnecessarily. Luckily, these mistakes are all avoidable if you know where to look and what to do. Let me show you how.
1. Using inappropriate table distribution methods
As our aforementioned benchmark revealed, selecting the correct table distribution methods in a Synapse dedicated SQL pool is essential for optimizing performance. The two main considerations when choosing a distribution type are parallel processing and data shuffling.
When thinking about parallel processing, you must ensure that each node works on a similarly sized chunk of the data to promote fast query response times and improve overall system performance. The round-robin method distributes rows evenly among the nodes, ensuring equal-sized chunks for each node. It’s easy to set up, but it might not be appropriate for tables that are frequently joined, as related data might end up on different nodes and data shuffling would be needed.
To avoid data shuffling (moving data between nodes) during frequently occurring joins, setting up hash distribution for the tables on their common columns is a powerful option. This distribution type applies a hash function to the selected column(s) to ensure that the same values from multiple tables end up on the same node. This drastically reduces processing overhead, as the joins can be executed locally within the nodes. A minimal sketch of both options follows below.
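The distribution method is chosen in the table’s WITH clause at creation time. The table and column names below (dbo.WebClicks, dbo.FactSales, dbo.DimCustomer, CustomerId) are hypothetical placeholders to illustrate the pattern:

```sql
-- Round-robin: rows are spread evenly across nodes. Easy default,
-- but related rows from different tables may land on different nodes.
CREATE TABLE dbo.WebClicks
(
    ClickId    BIGINT    NOT NULL,
    CustomerId INT       NOT NULL,
    ClickedAt  DATETIME2 NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

-- Hash distribution on the common join column: rows with the same
-- CustomerId from both tables land on the same node, so the join
-- can run locally without data movement.
CREATE TABLE dbo.FactSales
(
    SaleId     INT            NOT NULL,
    CustomerId INT            NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE dbo.DimCustomer
(
    CustomerId   INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);
```

If one side of a frequent join is a small dimension table, replicating it (DISTRIBUTION = REPLICATE) is another option worth considering, as it also keeps the joined rows co-located.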
2. Under- or over-configuring workload management
Workload management in Synapse dedicated SQL pools is done by setting up workload groups that enable you to control the amount of system resources allocated to different user or application workloads. This is to ensure that critical workloads receive the necessary resources and performance while preventing resource contention among concurrent queries.
One way under-configured workload groups can lead to problems is with ETL processes that load data into clustered columnstore indexes. This operation is memory-intensive due to the data compression involved, so you should always make sure that enough memory is granted to the loading requests. Memory pressure during this process can lead to diminished index quality, which in turn will make subsequent queries less efficient. To estimate the memory requirements for this process, follow the instructions here; a rough sketch of a workload group sized for loads follows below.
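As a sketch, a dedicated workload group can guarantee a sufficiently large memory grant to the loading user. The group name, user name and percentages here are hypothetical and should be sized to your own service level and load volumes:

```sql
-- Workload group that guarantees a generous memory grant to data loads,
-- so columnstore compression does not run under memory pressure.
CREATE WORKLOAD GROUP wgDataLoads
WITH
(
    MIN_PERCENTAGE_RESOURCE            = 30,  -- reserved for this group
    CAP_PERCENTAGE_RESOURCE            = 60,  -- upper bound for the group
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 30   -- per-request memory grant
);

-- Route the ETL user's requests into the group.
CREATE WORKLOAD CLASSIFIER wcEtlUser
WITH
(
    WORKLOAD_GROUP = 'wgDataLoads',
    MEMBERNAME     = 'etl_user'
);

-- Optional check after a load: many trimmed or undersized row groups
-- are a sign of memory pressure during compression.
SELECT object_id, state_desc, trim_reason_desc, total_rows
FROM sys.dm_pdw_nodes_db_column_store_row_group_physical_stats;
```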
An example of over-configured workload management is when resources are reserved for a workload group even when it has no active requests. When setting up a workload group, you can define the minimum percentage of resources that is always reserved for the group. This is very useful when you have to ensure that SLAs are met, but it should always be done with caution: if such strict isolation of resources is not required, it’s better to use a shared resource pool in combination with different workload importances, as sketched below.
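As a minimal sketch of the latter approach (the group, classifier and user names are hypothetical), you can leave the resource reservation at zero and differentiate workloads through importance instead, so an idle group does not hold resources hostage:

```sql
-- No reserved slice: the group competes for the shared pool,
-- but its requests are scheduled ahead of lower-importance work.
CREATE WORKLOAD GROUP wgExecDashboards
WITH
(
    MIN_PERCENTAGE_RESOURCE            = 0,    -- nothing reserved while idle
    CAP_PERCENTAGE_RESOURCE            = 100,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 3,
    IMPORTANCE                         = HIGH
);

CREATE WORKLOAD CLASSIFIER wcExecDashboards
WITH
(
    WORKLOAD_GROUP = 'wgExecDashboards',
    MEMBERNAME     = 'dashboard_user'
);
```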
3. Using clustered columnstore index for staging
By default, a dedicated SQL pool creates tables with clustered columnstore indexes. These are highly compressed, column-based data structures optimized for analytical workloads on large tables, but they can be expensive to build.
When loading data into a clustered columnstore index, the rows are first split into row groups (batches), which are then separated into column segments, and each of these segments gets compressed.
Because building a clustered columnstore index is a resource-intensive process, using one for a temporary staging table creates unnecessary overhead without providing any benefit. When comparing the throughput of bulk loading into clustered columnstore indexes and into heap tables, a 2–3x difference in favor of heaps can be observed (link). A typical staging pattern is sketched below.
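As a sketch of that pattern (table names are hypothetical and the storage path is a placeholder), you can land raw data in a heap and only build the columnstore once, when writing the final table:

```sql
-- Staging table as a heap: no columnstore build cost during the load.
CREATE TABLE stg.Sales
(
    SaleId     INT            NOT NULL,
    CustomerId INT            NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);

-- Bulk load the raw files into the heap (placeholder storage path).
COPY INTO stg.Sales
FROM 'https://<storage-account>.blob.core.windows.net/sales/*.csv'
WITH (FILE_TYPE = 'CSV', FIRSTROW = 2);

-- Transform into the final, hash-distributed columnstore table via CTAS.
CREATE TABLE dbo.FactSales
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT SaleId, CustomerId, Amount
FROM stg.Sales;
```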
While avoiding the aforementioned three mistakes when working with a Synapse dedicated SQL pool can help you save costs by utilizing your resources more efficiently, finding the right combination of tools for your business needs can be challenging. We at Starschema can help you identify the technologies that will best serve your use cases and fine-tune them for optimal performance. Reach out — we’d love to talk.
To learn how Synapse, Databricks, Snowflake, Redshift and BigQuery fare and scale in terms of query performance, cost per performance and differentiating feature value, see the results of our extensive testing in this white paper:
About the author
Marton is a highly skilled data professional with experience in commercial aviation and electronics manufacturing. He holds a Master’s degree in data science from Tilburg University and previously worked as a data scientist with the top management of a major European airline, where his transformational work helped save millions of euros for the company. In his current work as a data engineer at Starschema, he helps Fortune 500 companies build data platforms to unlock greater value from their data. Connect with Marton on LinkedIn.